# Multi-Model Cost Optimization with Snowflake Cortex and OpenClaw

In Part 1 I connected OpenClaw to Snowflake Cortex as the LLM backend. Enterprise-grade security, unified billing, data stays in Snowflake. So far so good.
But after running it for a while I noticed something: most of the tokens OpenClaw burns through aren't from the main agent doing complex reasoning. They're from subagents doing file searches, reading docs, and grepping through code. Simple stuff. And all of that was running on the same expensive model as the main agent.
My naive approach was to use claude-sonnet-4-5 for everything, and, you guessed it, the bill reflected that.
## The Key Insight: Not Every Task Needs a $3 Model
OpenClaw uses a hierarchical architecture. The main agent does the thinking - planning, reasoning, decision-making. But it delegates a lot of the grunt work to subagents: searching files, reading documentation, generating boilerplate code. These tasks are straightforward enough that a $0.03 model handles them just fine.

The main agent stays on the best model. The subagents run on whatever is cheapest for their specific job.
## What Cheap Models Can Handle
I spent some time testing which models are "good enough" for different subagent tasks. Here's what I found:
**Ultra-cheap ($0.03-$0.06)** works perfectly for file search, grep, listing directories, and reading/summarizing docs. These are essentially pattern-matching tasks. llama3.1-8b at $0.03/$0.03 per 1M tokens is my go-to here. For doc summarization, openai-gpt-5-nano at $0.06/$0.44 does a solid job.

**Budget models ($0.12-$0.25)** are good for boilerplate code generation, test scaffolding, simple refactoring, and config file generation. snowflake-llama-3.3-70b at $0.12/$0.12 is particularly interesting here because Snowflake tuned it specifically for their workloads. llama3.1-70b at $0.25/$0.25 handles general code generation well.

**Mid-tier ($1.00-$1.25)** is where you go when you actually need reasoning: code reviews, bug analysis, API integrations. claude-haiku-4-5 at $1.00/$5.00 or openai-gpt-5 at $1.25/$10.00.

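These tiers map naturally onto a routing table. Here's a sketch in Python; the model IDs are the Cortex names used throughout this post, while the task keys and the `model_for` helper are my own illustration, not part of OpenClaw:

```python
# Task-to-model routing table for subagent work.
# Model IDs are Cortex model names; the task groupings are illustrative.
ROUTING = {
    # Ultra-cheap: pattern-matching tasks
    "file_search": "llama3.1-8b",              # $0.03/$0.03 per 1M tokens
    "grep": "llama3.1-8b",
    "doc_summary": "openai-gpt-5-nano",        # $0.06/$0.44
    # Budget: mechanical code generation
    "boilerplate": "snowflake-llama-3.3-70b",  # $0.12/$0.12
    "codegen": "llama3.1-70b",                 # $0.25/$0.25
    # Mid-tier: tasks that need actual reasoning
    "code_review": "claude-haiku-4-5",         # $1.00/$5.00
    "bug_analysis": "openai-gpt-5",            # $1.25/$10.00
}

def model_for(task: str, default: str = "claude-haiku-4-5") -> str:
    """Pick the cheapest model known to handle a task; fall back to a safe tier."""
    return ROUTING.get(task, default)
```

The fallback matters: an unrecognized task type should land on a capable model, not the cheapest one.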
## The Numbers Don't Lie
Let's do the math on a typical exploration-heavy session. Say 100K input tokens and 50K output tokens total, with about 80% of that going to subagents (which is realistic for codebase exploration).

| Configuration | Main Agent Cost | Subagent Cost | Total |
|---|---|---|---|
| All Sonnet | $0.21 | $0.84 | $1.05 |
| Sonnet + Haiku | $0.21 | $0.28 | $0.49 |
| Sonnet + llama3.1-8b | $0.21 | $0.004 | $0.214 |
| Sonnet + llama3.1-70b | $0.21 | $0.03 | $0.24 |

That's an 80% cost reduction going from all-Sonnet to Sonnet + llama3.1-8b subagents, and the subagent portion alone drops by over 99%. Per token, llama3.1-8b is also 97% cheaper than Haiku on input and 99% cheaper on output.
What a difference.
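If you want to sanity-check the arithmetic yourself, here's a small calculator, assuming the 80/20 subagent/main split applies equally to input and output tokens:

```python
# Per-1M-token prices (input, output) from the Cortex model list above.
PRICES = {
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
    "llama3.1-8b": (0.03, 0.03),
    "llama3.1-70b": (0.25, 0.25),
}

def session_cost(main_model, sub_model, input_tokens=100_000,
                 output_tokens=50_000, subagent_share=0.8):
    """Split tokens between main agent and subagents, then price each side."""
    def cost(model, inp, out):
        price_in, price_out = PRICES[model]
        return (inp * price_in + out * price_out) / 1_000_000
    main = cost(main_model, input_tokens * (1 - subagent_share),
                output_tokens * (1 - subagent_share))
    sub = cost(sub_model, input_tokens * subagent_share,
               output_tokens * subagent_share)
    return round(main, 4), round(sub, 4), round(main + sub, 4)

print(session_cost("claude-sonnet-4-5", "claude-sonnet-4-5"))  # → (0.21, 0.84, 1.05)
print(session_cost("claude-sonnet-4-5", "llama3.1-8b"))        # → (0.21, 0.0036, 0.2136)
```

The main-agent cost is identical in every configuration; the entire difference comes from the subagent side.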
## Configuration
The setup lives in two places: `~/.openclaw/openclaw.json` for the provider config and `~/.openclaw/agents/main/agent/models.json` for model definitions. Here's the exploration-heavy config I use most of the time:
```json
{
  "providers": {
    "cortex": {
      "baseUrl": "https://<org>-<account>.snowflakecomputing.com/api/v2/cortex/v1",
      "apiKey": "<your-pat-token>",
      "api": "openai-completions",
      "models": [
        {
          "id": "claude-sonnet-4-5",
          "name": "Cortex Claude Sonnet 4.5",
          "reasoning": true,
          "input": ["text", "image"],
          "contextWindow": 200000,
          "maxTokens": 16384,
          "cost": {"input": 3.00, "output": 15.00, "cacheRead": 0.30, "cacheWrite": 3.75},
          "compat": {"supportsDeveloperRole": false, "maxTokensField": "max_completion_tokens"}
        },
        {
          "id": "llama3.1-8b",
          "name": "Cortex Llama 3.1 8B",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 32000,
          "maxTokens": 8192,
          "cost": {"input": 0.03, "output": 0.03, "cacheRead": 0, "cacheWrite": 0},
          "compat": {"supportsDeveloperRole": false, "maxTokensField": "max_completion_tokens"}
        }
      ]
    }
  },
  "agents": {
    "defaults": {
      "model": {"primary": "cortex/claude-sonnet-4-5"},
      "subagents": {
        "maxConcurrent": 8,
        "model": "cortex/llama3.1-8b"
      }
    }
  }
}
```
The important bit is the `cost` field on each model. With those configured, the OpenClaw dashboard tracks your spend per model, so you can see exactly where the money goes.
## Workflow-Specific Setups
I switch between a few configurations depending on what I'm doing:
**Exploration & file search:** Sonnet main + llama3.1-8b subagents ($0.03/$0.03). This is my default. 97% cheaper than Haiku subagents and perfectly fine for grepping, finding files, and navigating codebases.

**Documentation & research:** Sonnet main + openai-gpt-5-nano subagents ($0.06/$0.44). Slightly more capable for summarization tasks but still 94% cheaper on input than Haiku.

**Code generation:** Sonnet main + llama3.1-70b subagents ($0.25/$0.25). For when the subagents need to write actual code rather than just find files. A balanced quality/cost trade-off.

**Snowflake-specific work:** Sonnet main + snowflake-llama-3.3-70b subagents ($0.12/$0.12). Snowflake's own tuned model. 88% cheaper than Haiku and optimized for SQL generation and data tasks. I use this when working on Snowflake projects specifically.

**Maximum quality:** claude-opus-4-6 main + claude-sonnet-4-5 subagents, for when I'm working on critical production code or complex architecture and cost doesn't matter. Premium everything. But honestly, I rarely need this.
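Rather than hand-editing the JSON every time I switch workflows, a few lines of scripting do the swap. The config path is the one from above; the `set_subagent_model` helper is my own, not an OpenClaw feature:

```python
import json
from pathlib import Path

# Location of the provider/agent config described earlier in this post.
CONFIG = Path.home() / ".openclaw" / "openclaw.json"

def set_subagent_model(config: dict, model_id: str) -> dict:
    """Point all default subagents at a different Cortex model."""
    config["agents"]["defaults"]["subagents"]["model"] = f"cortex/{model_id}"
    return config

# Usage: switch to the code-generation setup before a refactoring session.
# cfg = json.loads(CONFIG.read_text())
# CONFIG.write_text(json.dumps(set_subagent_model(cfg, "llama3.1-70b"), indent=2))
```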
## Monitoring What You Spend
Beyond the OpenClaw dashboard, you can query actual consumption directly from Snowflake:
```sql
SELECT
    MODEL_NAME,
    SUM(INPUT_TOKENS) AS total_input_tokens,
    SUM(OUTPUT_TOKENS) AS total_output_tokens,
    SUM(CREDITS_USED) AS total_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_REST_API_USAGE_HISTORY
WHERE START_TIME >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY MODEL_NAME
ORDER BY total_credits DESC;
```
This gives you the ground truth on what's actually being consumed. I run this weekly to make sure my assumptions about subagent token distribution still hold.
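The weekly check itself is easy to automate: feed the query results into a small function that computes each model's token share. The row shape mirrors the columns of the query above; the `summarize_usage` function is my own sketch, and the wiring to snowflake-connector-python is left out:

```python
def summarize_usage(rows):
    """rows: (model_name, input_tokens, output_tokens, credits) tuples,
    matching the columns of the CORTEX_REST_API_USAGE_HISTORY query."""
    total = sum(inp + out for _, inp, out, _ in rows)
    return {
        model: {
            "tokens": inp + out,
            "share": round((inp + out) / total, 3),  # fraction of all tokens
            "credits": credits,
        }
        for model, inp, out, credits in rows
    }

# Example with made-up numbers matching the 80/20 split assumed in this post.
usage = summarize_usage([
    ("claude-sonnet-4-5", 20_000, 10_000, 0.9),
    ("llama3.1-8b", 80_000, 40_000, 0.04),
])
print(usage["llama3.1-8b"]["share"])  # → 0.8
```

If the cheap model's share drifts well below 80%, my cost assumptions no longer hold and it's time to look at what the main agent is doing.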
But you don't always want to write SQL just to check how things are going. OpenClaw itself ships with a usage view that breaks down token consumption and cost per model, per session. Once you have the `cost` fields configured in your model definitions (as shown above), the dashboard picks them up automatically and gives you a nice overview of where your tokens are going. In my case the split is pretty obvious: the main agent shows up as one big chunk, and the subagent calls are spread across dozens of small, cheap requests.

What I like about this view is that you can immediately see if a subagent task is burning more tokens than expected. If one of the llama3.1-8b calls suddenly shows high token counts, that's usually a sign that the task is too complex for the cheap model and should be bumped up a tier. Most of the time though, the numbers confirm what you'd expect: the majority of subagent calls are tiny and cheap.
## Available Models at a Glance
Cortex currently offers 22 models through the REST API. Here are the ones I find most relevant for OpenClaw setups, grouped by what I'd actually use them for:

| Tier | Model | Input/Output ($/1M tokens) | Use Case |
|---|---|---|---|
| Premium | claude-opus-4-6 | $5.00/$25.00 | Main agent (max quality) |
| Standard | claude-sonnet-4-5 | $3.00/$15.00 | Main agent (recommended) |
| Budget | claude-haiku-4-5 | $1.00/$5.00 | Quality subagents |
| Budget | llama3.1-70b | $0.25/$0.25 | Code generation subagents |
| Ultra-Budget | snowflake-llama-3.3-70b | $0.12/$0.12 | Snowflake-specific subagents |
| Ultra-Budget | openai-gpt-5-nano | $0.06/$0.44 | Doc summarization subagents |
| Ultra-Budget | llama3.1-8b | $0.03/$0.03 | File search subagents |
| Ultra-Budget | mistral-7b | $0.03/$0.03 | Pattern matching subagents |

There are more models available (deepseek-r1, llama4-maverick, mistral-large2, openai-o4-mini, etc.) but these are the ones I actually use regularly.
## Practical Tips
**Start ultra-cheap.** Use llama3.1-8b for all subagents first. Upgrade individual task types only when you notice quality issues. You'd be surprised how rarely that happens for file search and navigation tasks.

**Match model to task, not to habit.** It's tempting to use Haiku everywhere because it's "the cheap Claude model." But for most subagent tasks, it's leaving money on the table. A $0.03 model that searches files is just as good as a $1.00 model for that specific job.

**Use prompt caching.** Cortex supports prompt caching for OpenAI and Anthropic models. For OpenAI models it's implicit (it kicks in at 1024+ tokens). For Anthropic models you need to add cache points in the request. Either way, it cuts repeated context costs dramatically.
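For the Anthropic models, the cache point is marked on a content block. The shape below follows Anthropic's `cache_control` convention; whether Cortex's REST API accepts exactly this field is an assumption worth verifying against the Cortex docs, so treat the payload as a sketch:

```python
# Anthropic-style prompt caching: mark the large, stable prefix (here a system
# prompt) as cacheable so repeated requests pay the cacheRead rate for it
# instead of the full input rate.
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<large, rarely-changing project context goes here>",
            "cache_control": {"type": "ephemeral"},  # cache point
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the open TODOs."}
    ],
}
```

Everything before the cache point must be byte-identical across requests for the cache to hit, so put the stable context first and the per-request question last.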
**Run subagents in parallel.** With `maxConcurrent: 8` and $0.03 subagents, you can do a lot of exploration in parallel for almost nothing. Much better than sequentially running one expensive agent.
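To put a number on that: eight parallel 10K-token llama3.1-8b lookups cost less than a tenth of a single Sonnet call of the same size (input tokens only, for simplicity):

```python
# Eight parallel cheap subagent calls vs. one Sonnet call of the same size.
PRICE = {"llama3.1-8b": 0.03, "claude-sonnet-4-5": 3.00}  # $/1M input tokens

def input_cost(model, tokens):
    return tokens * PRICE[model] / 1_000_000

parallel_cheap = 8 * input_cost("llama3.1-8b", 10_000)  # 8 concurrent subagents
one_sonnet = input_cost("claude-sonnet-4-5", 10_000)    # 1 main-agent call
print(f"${parallel_cheap:.4f} vs ${one_sonnet:.4f}")  # → $0.0024 vs $0.0300
```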
## Wrapping Up
The main takeaway: most of the tokens your AI coding agent burns through go to simple tasks. File search, grep, doc reading: these don't need expensive models. By routing them to $0.03 models via Snowflake Cortex, you keep the quality where it matters (the main agent) while cutting the subagent spend, which dominates the bill, by over 99%, for a roughly 80% reduction overall.
And because everything runs through Cortex, you get unified billing, enterprise security, and the ability to monitor actual usage through Snowflake's account usage views. No separate API keys to manage, no surprise bills from different providers.



