The Hidden Trap in Snowflake's INFER_SCHEMA

And How SQL Functions Save the Day


I've always been a fan of Snowflake's INFER_SCHEMA() function. Partly because it saves you from manually typing out column definitions, but mostly because the whole idea of letting Snowflake figure out your schema from actual data feels like the right level of automation. Recently, though, I had to learn its limitations the hard way.

The Problem That Bit Me

Here's what happened: I loaded a sample file with IDs like 1, 2, 3 — single-digit integers. INFER_SCHEMA() happily inferred NUMBER(1,0) for the column. Makes sense, right? Minimal precision for single digits.

Then production data arrived with IDs in the millions.

Numeric value '1234567' is out of range

What a bummer.

The issue is that INFER_SCHEMA() is too precise. It looks at your sample data and picks the tightest type that fits. A file with values like 1.5 and 2.3 gets inferred as NUMBER(2,1) — which explodes when 123.456 shows up later.
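You can reproduce the trap without any staged files at all; the table and values below are made up purely for illustration:

```sql
-- A column typed the way INFER_SCHEMA would from a single-digit sample:
CREATE OR REPLACE TABLE demo (id NUMBER(1,0));

-- Fine for values in the sample range...
INSERT INTO demo VALUES (7);

-- ...but a production-sized ID overflows the declared precision:
INSERT INTO demo VALUES (1234567);  -- Numeric value '1234567' is out of range
```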

The Template Pattern Nobody Talks About

Here's the thing: when you create a table using USING TEMPLATE, you're not locked into the raw output of INFER_SCHEMA(). The template is just SQL — and you can transform it however you want.

Most tutorials show you this basic pattern:

CREATE TABLE my_table
USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(...))
);

So far so good. But that OBJECT_CONSTRUCT(*) is where the magic happens — or doesn't happen if you're just passing everything through unchanged.
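Before transforming anything, it helps to look at what INFER_SCHEMA actually emits. Each row describes one inferred column; the stage and format names here are placeholders:

```sql
-- Inspect the inferred schema before feeding it into a template
SELECT COLUMN_NAME, TYPE, NULLABLE, ORDER_ID
FROM TABLE(INFER_SCHEMA(
    LOCATION    => '@my_stage/data/',
    FILE_FORMAT => 'my_csv_format'
));
```

These are the fields the templates below construct explicitly, instead of letting OBJECT_CONSTRUCT(*) pass everything through.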

SQL Functions to the Rescue

You can use any SQL function inside the template to transform column names, types, or other properties. Let me show you what I mean.

Uppercasing column names:

CREATE TABLE my_table
USING TEMPLATE (
    SELECT ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'COLUMN_NAME', UPPER(COLUMN_NAME),  -- Force uppercase
            'TYPE', TYPE,
            'NULLABLE', NULLABLE,
            'ORDER_ID', ORDER_ID
        )
    )
    WITHIN GROUP (ORDER BY ORDER_ID)
    FROM TABLE(INFER_SCHEMA(...))
);

Why would you want this? Schema evolution creates columns in UPPERCASE. If your initial columns are lowercase from the CSV header but new columns come in as uppercase, you end up with case mismatches. Normalizing upfront avoids the headache.
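A quick sketch of that pain, assuming the template created a case-sensitive lowercase column "id" from the CSV header:

```sql
SELECT id FROM my_table;    -- fails: unquoted identifiers resolve to uppercase ID
SELECT "id" FROM my_table;  -- works, but now every query needs quotes
```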

Broadening numeric types:

Here's the pattern I use now for every INFER_SCHEMA workflow:

CREATE TABLE my_table
USING TEMPLATE (
    SELECT ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'COLUMN_NAME', UPPER(COLUMN_NAME),
            'TYPE', CASE 
                WHEN REGEXP_LIKE(TYPE, 'NUMBER\\([0-9]+,\\s?0\\)') THEN 'NUMBER(38, 0)'
                WHEN STARTSWITH(TYPE, 'NUMBER(') THEN 'DOUBLE'
                ELSE TYPE
            END,
            'NULLABLE', NULLABLE,
            'ORDER_ID', ORDER_ID
        )
    )
    WITHIN GROUP (ORDER BY ORDER_ID)
    FROM TABLE(INFER_SCHEMA(...))
);

Let me break down that CASE statement:

NUMBER(X, 0) becomes NUMBER(38, 0). Integers get inferred with minimal precision — NUMBER(1,0) for single digits. Broadening to NUMBER(38,0) handles any integer.

NUMBER(X, Y) becomes DOUBLE. Decimals like NUMBER(3,2) overflow with larger values. DOUBLE provides flexibility for real-world data.

Everything else is kept as-is. Strings, timestamps, and booleans don't have this problem.

The regex NUMBER\\([0-9]+,\\s?0\\) matches integer types (scale of 0), while STARTSWITH(TYPE, 'NUMBER(') catches the remaining numeric types — those with decimal places. The order matters: the integer check has to come first, because STARTSWITH would match integer types too.
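You can sanity-check the regex against a few representative type strings directly in Snowflake; note that REGEXP_LIKE implicitly matches the entire subject string:

```sql
SELECT inferred_type,
       REGEXP_LIKE(inferred_type, 'NUMBER\\([0-9]+,\\s?0\\)') AS is_integer_type
FROM (VALUES ('NUMBER(1,0)'), ('NUMBER(3,2)'), ('TEXT')) AS t(inferred_type);
-- NUMBER(1,0) matches; NUMBER(3,2) and TEXT do not
```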

The Full Recipe

Here's the complete pattern I use for CSV files:

-- Create file format that reads headers
CREATE OR REPLACE FILE FORMAT my_csv_format
    TYPE = 'CSV'
    PARSE_HEADER = TRUE
    ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE;

-- Create table with broadened types
CREATE OR REPLACE TABLE my_table
USING TEMPLATE (
    SELECT ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'COLUMN_NAME', UPPER(COLUMN_NAME),
            'TYPE', CASE 
                WHEN REGEXP_LIKE(TYPE, 'NUMBER\\([0-9]+,\\s?0\\)') THEN 'NUMBER(38, 0)'
                WHEN STARTSWITH(TYPE, 'NUMBER(') THEN 'DOUBLE'
                ELSE TYPE
            END,
            'NULLABLE', NULLABLE,
            'ORDER_ID', ORDER_ID
        )
    )
    WITHIN GROUP (ORDER BY ORDER_ID)
    FROM TABLE(
        INFER_SCHEMA(
            LOCATION => '@my_stage/data/',
            FILE_FORMAT => 'my_csv_format'
        )
    )
)
ENABLE_SCHEMA_EVOLUTION = TRUE;

The ENABLE_SCHEMA_EVOLUTION = TRUE is critical if you want new columns to be added automatically — but that's a story for the next post.
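For completeness, loading CSVs through a PARSE_HEADER = TRUE format requires matching columns by name rather than by position. A sketch, reusing the stage and format names from the recipe above:

```sql
-- PARSE_HEADER = TRUE only works with name-based column matching
COPY INTO my_table
FROM @my_stage/data/
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

CASE_INSENSITIVE matching is also what makes the uppercase normalization above harmless at load time.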

What About Parquet?

Good news: Parquet files have type information embedded in their metadata, so the inferred types are much more accurate. You typically don't need the numeric broadening trick.

But you should still uppercase the column names for consistency with schema evolution:

CREATE TABLE my_table
USING TEMPLATE (
    SELECT ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'COLUMN_NAME', UPPER(COLUMN_NAME),  -- Still important!
            'TYPE', TYPE,
            'NULLABLE', NULLABLE,
            'ORDER_ID', ORDER_ID
        )
    )
    WITHIN GROUP (ORDER BY ORDER_ID)
    FROM TABLE(INFER_SCHEMA(...))
);

The Takeaway

INFER_SCHEMA() is fantastic for prototyping and development. But for production, don't trust it blindly — post-process the template using SQL functions to broaden numeric types and normalize column names. Your future self (and your pipelines) will thank you.


Next up: How to combine this pattern with Snowpipe for automatic schema evolution. Stay tuned.