# Multi-Model Cost Optimization with Snowflake Cortex and OpenClaw

In Part 1 I connected OpenClaw to Snowflake Cortex as the LLM backend. Enterprise-grade security, unified billing, data stays in Snowflake. So far so good.
But after running it for a while I noticed something: most of the tokens OpenClaw burns through aren't from the main agent doing complex reasoning. They're from subagents doing file searches, reading docs, and grepping through code. Simple stuff. And all of that was running on the same expensive model as the main agent.
My naive approach was to use claude-sonnet-4-5 for everything, and, you guessed it, the bill reflected that.
## The Key Insight: Not Every Task Needs a $3 Model
OpenClaw uses a hierarchical architecture. The main agent does the thinking - planning, reasoning, decision-making. But it delegates a lot of the grunt work to subagents: searching files, reading documentation, generating boilerplate code. These tasks are straightforward enough that a $0.03 model handles them just fine.

The main agent stays on the best model. The subagents run on whatever is cheapest for their specific job.
## What Cheap Models Can Handle
I spent some time testing which models are "good enough" for different subagent tasks. Here's what I found:
**Ultra-cheap ($0.03-$0.06)** works perfectly for file search, grep, listing directories, and reading/summarizing docs. These are essentially pattern-matching tasks. llama3.1-8b at $0.03/$0.03 per 1M tokens is my go-to here. For doc summarization, openai-gpt-5-nano at $0.06/$0.44 does a solid job.

**Budget models ($0.12-$0.25)** are good for boilerplate code generation, test scaffolding, simple refactoring, and config file generation. snowflake-llama-3.3-70b at $0.12/$0.12 is particularly interesting here because Snowflake tuned it specifically for their workloads. llama3.1-70b at $0.25/$0.25 handles general code generation well.

**Mid-tier ($1.00-$1.25)** is where you go when you actually need reasoning: code reviews, bug analysis, API integrations. claude-haiku-4-5 at $1.00/$5.00 or openai-gpt-5 at $1.25/$10.00.

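These tiers map naturally onto a routing table. Here's a sketch in Python; the model IDs are the Cortex names used throughout this post, while the task keys and the `model_for` helper are my own illustration, not part of OpenClaw:

```python
# Task-to-model routing table for subagent work.
# Model IDs are Cortex model names; the task groupings are illustrative.
ROUTING = {
    # Ultra-cheap: pattern-matching tasks
    "file_search": "llama3.1-8b",              # $0.03/$0.03 per 1M tokens
    "grep": "llama3.1-8b",
    "doc_summary": "openai-gpt-5-nano",        # $0.06/$0.44
    # Budget: mechanical code generation
    "boilerplate": "snowflake-llama-3.3-70b",  # $0.12/$0.12
    "codegen": "llama3.1-70b",                 # $0.25/$0.25
    # Mid-tier: tasks that need actual reasoning
    "code_review": "claude-haiku-4-5",         # $1.00/$5.00
    "bug_analysis": "openai-gpt-5",            # $1.25/$10.00
}

def model_for(task: str, default: str = "claude-haiku-4-5") -> str:
    """Pick the cheapest model known to handle a task; fall back to a safe tier."""
    return ROUTING.get(task, default)
```

The fallback matters: an unrecognized task type should land on a capable model, not the cheapest one.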
## The Numbers Don't Lie
Let's do the math on a typical exploration-heavy session. Say 100K input tokens and 50K output tokens total, with about 80% of that going to subagents (which is realistic for codebase exploration).

| Configuration | Main Agent Cost | Subagent Cost | Total |
|---|---|---|---|
| All Sonnet | $0.21 | $0.84 | $1.05 |
| Sonnet + Haiku | $0.21 | $0.28 | $0.49 |
| Sonnet + llama3.1-8b | $0.21 | $0.004 | $0.214 |
| Sonnet + llama3.1-70b | $0.21 | $0.03 | $0.24 |

That's an 80% cost reduction going from all-Sonnet to Sonnet + llama3.1-8b subagents, and the subagent portion alone drops by over 99%. Per token, llama3.1-8b is also 97% cheaper than Haiku on input and 99% cheaper on output.
What a difference.
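If you want to sanity-check the arithmetic yourself, here's a small calculator, assuming the 80/20 subagent/main split applies equally to input and output tokens:

```python
# Per-1M-token prices (input, output) from the Cortex model list above.
PRICES = {
    "claude-sonnet-4-5": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
    "llama3.1-8b": (0.03, 0.03),
    "llama3.1-70b": (0.25, 0.25),
}

def session_cost(main_model, sub_model, input_tokens=100_000,
                 output_tokens=50_000, subagent_share=0.8):
    """Split tokens between main agent and subagents, then price each side."""
    def cost(model, inp, out):
        price_in, price_out = PRICES[model]
        return (inp * price_in + out * price_out) / 1_000_000
    main = cost(main_model, input_tokens * (1 - subagent_share),
                output_tokens * (1 - subagent_share))
    sub = cost(sub_model, input_tokens * subagent_share,
               output_tokens * subagent_share)
    return round(main, 4), round(sub, 4), round(main + sub, 4)

print(session_cost("claude-sonnet-4-5", "claude-sonnet-4-5"))  # → (0.21, 0.84, 1.05)
print(session_cost("claude-sonnet-4-5", "llama3.1-8b"))        # → (0.21, 0.0036, 0.2136)
```

The main-agent cost is identical in every configuration; the entire difference comes from the subagent side.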
## Configuration
The setup lives in two places: `~/.openclaw/openclaw.json` for the provider config and `~/.openclaw/agents/main/agent/models.json` for model definitions. Here's the exploration-heavy config I use most of the time:
```json
{
  "providers": {
    "cortex": {
      "baseUrl": "https://<org>-<account>.snowflakecomputing.com/api/v2/cortex/v1",
      "apiKey": "<your-pat-token>",
      "api": "openai-completions",
      "models": [
        {
          "id": "claude-sonnet-4-5",
          "name": "Cortex Claude Sonnet 4.5",
          "reasoning": true,
          "input": ["text", "image"],
          "contextWindow": 200000,
          "maxTokens": 16384,
          "cost": {"input": 3.00, "output": 15.00, "cacheRead": 0.30, "cacheWrite": 3.75},
          "compat": {"supportsDeveloperRole": false, "maxTokensField": "max_completion_tokens"}
        },
        {
          "id": "llama3.1-8b",
          "name": "Cortex Llama 3.1 8B",
          "reasoning": false,
          "input": ["text"],
          "contextWindow": 32000,
          "maxTokens": 8192,
          "cost": {"input": 0.03, "output": 0.03, "cacheRead": 0, "cacheWrite": 0},
          "compat": {"supportsDeveloperRole": false, "maxTokensField": "max_completion_tokens"}
        }
      ]
    }
  },
  "agents": {
    "defaults": {
      "model": {"primary": "cortex/claude-sonnet-4-5"},
      "subagents": {
        "maxConcurrent": 8,
        "model": "cortex/llama3.1-8b"
      }
    }
  }
}
```
The important bit is the `cost` field on each model. With those configured, the OpenClaw dashboard tracks your spend per model, so you can see exactly where the money goes.
## Workflow-Specific Setups
I switch between a few configurations depending on what I'm doing:
**Exploration & file search:** Sonnet main + llama3.1-8b subagents ($0.03/$0.03). This is my default. 97% cheaper than Haiku subagents and perfectly fine for grepping, finding files, and navigating codebases.

**Documentation & research:** Sonnet main + openai-gpt-5-nano subagents ($0.06/$0.44). Slightly more capable for summarization tasks but still 94% cheaper on input than Haiku.

**Code generation:** Sonnet main + llama3.1-70b subagents ($0.25/$0.25). For when the subagents need to write actual code rather than just find files. A balanced quality/cost trade-off.

**Snowflake-specific work:** Sonnet main + snowflake-llama-3.3-70b subagents ($0.12/$0.12). Snowflake's own tuned model. 88% cheaper than Haiku and optimized for SQL generation and data tasks. I use this when working on Snowflake projects specifically.

**Maximum quality:** claude-opus-4-6 main + claude-sonnet-4-5 subagents, for when I'm working on critical production code or complex architecture and cost doesn't matter. Premium everything. But honestly, I rarely need this.
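Rather than hand-editing the JSON every time I switch workflows, a few lines of scripting do the swap. The config path is the one from above; the `set_subagent_model` helper is my own, not an OpenClaw feature:

```python
import json
from pathlib import Path

# Location of the provider/agent config described earlier in this post.
CONFIG = Path.home() / ".openclaw" / "openclaw.json"

def set_subagent_model(config: dict, model_id: str) -> dict:
    """Point all default subagents at a different Cortex model."""
    config["agents"]["defaults"]["subagents"]["model"] = f"cortex/{model_id}"
    return config

# Usage: switch to the code-generation setup before a refactoring session.
# cfg = json.loads(CONFIG.read_text())
# CONFIG.write_text(json.dumps(set_subagent_model(cfg, "llama3.1-70b"), indent=2))
```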
## Monitoring What You Spend
Beyond the OpenClaw dashboard, you can query actual consumption directly from Snowflake:
```sql
SELECT
    MODEL_NAME,
    SUM(INPUT_TOKENS) AS total_input_tokens,
    SUM(OUTPUT_TOKENS) AS total_output_tokens,
    SUM(CREDITS_USED) AS total_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_REST_API_USAGE_HISTORY
WHERE START_TIME >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY MODEL_NAME
ORDER BY total_credits DESC;
```
This gives you the ground truth on what's actually being consumed. I run this weekly to make sure my assumptions about subagent token distribution still hold.
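The weekly check itself is easy to automate: feed the query results into a small function that computes each model's token share. The row shape mirrors the columns of the query above; the `summarize_usage` function is my own sketch, and the wiring to snowflake-connector-python is left out:

```python
def summarize_usage(rows):
    """rows: (model_name, input_tokens, output_tokens, credits) tuples,
    matching the columns of the CORTEX_REST_API_USAGE_HISTORY query."""
    total = sum(inp + out for _, inp, out, _ in rows)
    return {
        model: {
            "tokens": inp + out,
            "share": round((inp + out) / total, 3),  # fraction of all tokens
            "credits": credits,
        }
        for model, inp, out, credits in rows
    }

# Example with made-up numbers matching the 80/20 split assumed in this post.
usage = summarize_usage([
    ("claude-sonnet-4-5", 20_000, 10_000, 0.9),
    ("llama3.1-8b", 80_000, 40_000, 0.04),
])
print(usage["llama3.1-8b"]["share"])  # → 0.8
```

If the cheap model's share drifts well below 80%, my cost assumptions no longer hold and it's time to look at what the main agent is doing.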
But you don't always want to write SQL just to check how things are going. OpenClaw itself ships with a usage view that breaks down token consumption and cost per model, per session. Once you have the `cost` fields configured in your model definitions (as shown above), the dashboard picks them up automatically and gives you a nice overview of where your tokens are going. In my case the split is pretty obvious: the main agent shows up as one big chunk, and the subagent calls are spread across dozens of small, cheap requests.

What I like about this view is that you can immediately see if a subagent task is burning more tokens than expected. If one of the llama3.1-8b calls suddenly shows high token counts, that's usually a sign that the task is too complex for the cheap model and should be bumped up a tier. Most of the time though, the numbers confirm what you'd expect: the majority of subagent calls are tiny and cheap.
## Available Models at a Glance
Cortex currently offers 22 models through the REST API. Here are the ones I find most relevant for OpenClaw setups, grouped by what I'd actually use them for:

| Tier | Model | Input/Output ($/1M tokens) | Use Case |
|---|---|---|---|
| Premium | claude-opus-4-6 | $5.00/$25.00 | Main agent (max quality) |
| Standard | claude-sonnet-4-5 | $3.00/$15.00 | Main agent (recommended) |
| Budget | claude-haiku-4-5 | $1.00/$5.00 | Quality subagents |
| Budget | llama3.1-70b | $0.25/$0.25 | Code generation subagents |
| Ultra-Budget | snowflake-llama-3.3-70b | $0.12/$0.12 | Snowflake-specific subagents |
| Ultra-Budget | openai-gpt-5-nano | $0.06/$0.44 | Doc summarization subagents |
| Ultra-Budget | llama3.1-8b | $0.03/$0.03 | File search subagents |
| Ultra-Budget | mistral-7b | $0.03/$0.03 | Pattern matching subagents |

There are more models available (deepseek-r1, llama4-maverick, mistral-large2, openai-o4-mini, etc.) but these are the ones I actually use regularly.
## Practical Tips
**Start ultra-cheap.** Use llama3.1-8b for all subagents first. Upgrade individual task types only when you notice quality issues. You'd be surprised how rarely that happens for file search and navigation tasks.

**Match model to task, not to habit.** It's tempting to use Haiku everywhere because it's "the cheap Claude model." But for most subagent tasks, it's leaving money on the table. A $0.03 model that searches files is just as good as a $1.00 model for that specific job.

**Use prompt caching.** Cortex supports prompt caching for OpenAI and Anthropic models. For OpenAI models it's implicit (it kicks in at 1024+ tokens). For Anthropic models you need to add cache points in the request. Either way, it cuts repeated context costs dramatically.
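For the Anthropic models, the cache point is marked on a content block. The shape below follows Anthropic's `cache_control` convention; whether Cortex's REST API accepts exactly this field is an assumption worth verifying against the Cortex docs, so treat the payload as a sketch:

```python
# Anthropic-style prompt caching: mark the large, stable prefix (here a system
# prompt) as cacheable so repeated requests pay the cacheRead rate for it
# instead of the full input rate.
payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<large, rarely-changing project context goes here>",
            "cache_control": {"type": "ephemeral"},  # cache point
        }
    ],
    "messages": [
        {"role": "user", "content": "Summarize the open TODOs."}
    ],
}
```

Everything before the cache point must be byte-identical across requests for the cache to hit, so put the stable context first and the per-request question last.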
**Run subagents in parallel.** With `maxConcurrent: 8` and $0.03 subagents, you can do a lot of exploration in parallel for almost nothing. Much better than sequentially running one expensive agent.
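To put a number on that: eight parallel 10K-token llama3.1-8b lookups cost less than a tenth of a single Sonnet call of the same size (input tokens only, for simplicity):

```python
# Eight parallel cheap subagent calls vs. one Sonnet call of the same size.
PRICE = {"llama3.1-8b": 0.03, "claude-sonnet-4-5": 3.00}  # $/1M input tokens

def input_cost(model, tokens):
    return tokens * PRICE[model] / 1_000_000

parallel_cheap = 8 * input_cost("llama3.1-8b", 10_000)  # 8 concurrent subagents
one_sonnet = input_cost("claude-sonnet-4-5", 10_000)    # 1 main-agent call
print(f"${parallel_cheap:.4f} vs ${one_sonnet:.4f}")  # → $0.0024 vs $0.0300
```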
## Wrapping Up
The main takeaway: most of the tokens your AI coding agent burns through go to simple tasks. File search, grep, doc reading: these don't need expensive models. By routing them to $0.03 models via Snowflake Cortex, you keep the quality where it matters (the main agent) while cutting the subagent spend, which dominates the bill, by over 99%, for a roughly 80% reduction overall.
And because everything runs through Cortex, you get unified billing, enterprise security, and the ability to monitor actual usage through Snowflake's account usage views. No separate API keys to manage, no surprise bills from different providers.



