
Inferring Schema from VARIANT Fields in Snowflake

How to automatically discover and flatten JSON schema from VARIANT columns in Snowflake using recursive FLATTEN and TYPEOF, without manually writing extraction logic for every field.


I get asked about this a lot. Someone lands JSON from a REST API into a VARIANT column, and now they want proper columns without manually writing json_data:field1::VARCHAR, json_data:field2::NUMBER for every single path. And you guessed it - there's no built-in INFER_SCHEMA for VARIANT columns like there is for staged files.

The reason? It's genuinely hard. Unlike staged files where Snowflake can sample a few files upfront, VARIANT columns can contain wildly different structures across rows, nested objects go arbitrarily deep, and arrays make things exponentially messier. So far so good - but that doesn't mean we can't build something ourselves.

The Core Trick: Recursive FLATTEN + TYPEOF

Here's the discovery query that makes everything possible:

SELECT DISTINCT
    f.path AS original_path,
    UPPER(REPLACE(f.path, '.', '_')) AS column_name,
    TYPEOF(f.value) AS data_type
FROM my_table,
LATERAL FLATTEN(INPUT => json_data, RECURSIVE => TRUE) f
WHERE TYPEOF(f.value) NOT IN ('OBJECT')
  AND f.path NOT LIKE '%[%'
ORDER BY column_name;

FLATTEN with RECURSIVE => TRUE walks the entire JSON tree and returns every path. TYPEOF() tells us what's at each path. The NOT LIKE '%[%' filter excludes array contents - I'll explain why in a moment.

Run this against a typical API response and you'll see something like:

original_path    column_name      data_type
address.city     ADDRESS_CITY     VARCHAR
address.geo.lat  ADDRESS_GEO_LAT  DOUBLE
balance          BALANCE          DECIMAL
is_active        IS_ACTIVE        BOOLEAN
orders           ORDERS           ARRAY
tags             TAGS             ARRAY
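Each discovered row maps one-to-one onto an extraction expression. A quick sanity check against two of the sample paths above, reusing the my_table / json_data names from the discovery query:

```sql
-- Manual extraction for two discovered paths.
-- GET_PATH navigates dotted paths, so nested fields need no special handling.
SELECT
    GET_PATH(json_data, 'address.geo.lat')::DOUBLE AS address_geo_lat,
    GET_PATH(json_data, 'is_active')::BOOLEAN      AS is_active
FROM my_table;
```

The automation below does exactly this, just for every discovered path at once.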

Why Arrays Stay as VARIANT

My first instinct was to recursively explode everything. But that creates a mess:

  1. Arrays can have different lengths per row - do you create ITEM_0, ITEM_1, ITEM_2... how many?

  2. Nested arrays explode your row count exponentially

  3. The resulting schema becomes unpredictable

Instead, the approach I landed on keeps top-level arrays as VARIANT columns. You can still query them with LATERAL FLATTEN when you need to, but your base schema stays stable. The people who need to dig into arrays can handle that downstream.
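When you do need the elements, a plain LATERAL FLATTEN over just that path does the job without touching the base schema - a sketch against the tags array from the sample output:

```sql
-- One row per tag, only for the queries that actually need it;
-- the flattened base schema keeps tags as a single VARIANT column.
SELECT f.value::VARCHAR AS tag
FROM my_table,
LATERAL FLATTEN(INPUT => json_data:tags) f;
```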

Wrapping It in a Procedure

Once you've got the discovery query working, wrapping it in a stored procedure makes it reusable:

CREATE OR REPLACE PROCEDURE discover_json_schema(
    source_table VARCHAR,
    variant_column VARCHAR
)
RETURNS ARRAY
LANGUAGE SQL
AS
DECLARE
    schema_array ARRAY;
BEGIN
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(
        'original_path', original_path,
        'column_name', column_name,
        'sql_type', sql_type
    )) WITHIN GROUP (ORDER BY column_name)
    INTO :schema_array
    FROM (
        SELECT DISTINCT
            f.path AS original_path,
            UPPER(REPLACE(f.path, '.', '_')) AS column_name,
            CASE
                WHEN TYPEOF(f.value) = 'INTEGER' THEN 'NUMBER'
                WHEN TYPEOF(f.value) IN ('DOUBLE', 'DECIMAL') THEN 'FLOAT'
                WHEN TYPEOF(f.value) = 'BOOLEAN' THEN 'BOOLEAN'
                WHEN TYPEOF(f.value) = 'ARRAY' THEN 'VARIANT'
                ELSE 'VARCHAR'
            END AS sql_type
        FROM IDENTIFIER(:source_table),
        LATERAL FLATTEN(INPUT => IDENTIFIER(:variant_column), RECURSIVE => TRUE) f
        WHERE TYPEOF(f.value) NOT IN ('OBJECT')
          AND f.path NOT LIKE '%[%'
    );

    RETURN schema_array;
END;

Call it like this:

CALL discover_json_schema('my_db.my_schema.raw_api_data', 'json_data');

Returns an array of objects you can loop through to generate DDL.

[
  {
    "column_name": "EMAIL",
    "original_path": "email",
    "sql_type": "VARCHAR"
  },
  {
    "column_name": "IS_ACTIVE",
    "original_path": "is_active",
    "sql_type": "BOOLEAN"
  },
-- [... and so on ]
]

Generating Views Automatically

The natural next step - a procedure that creates a view with all discovered columns:

CREATE OR REPLACE PROCEDURE generate_flattened_view(
    source_table VARCHAR,
    variant_column VARCHAR,
    target_view VARCHAR
)
RETURNS VARCHAR
LANGUAGE SQL
AS
DECLARE
    ddl_statement VARCHAR;
    select_cols VARCHAR;
    schema_array ARRAY;
BEGIN
    CALL discover_json_schema(:source_table, :variant_column) INTO :schema_array;

    SELECT LISTAGG(
        'GET_PATH(' || :variant_column || ', ''' || s.value:original_path::VARCHAR || ''')::' ||
        s.value:sql_type::VARCHAR || ' AS "' || s.value:column_name::VARCHAR || '"',
        ', '
    ) WITHIN GROUP (ORDER BY s.value:column_name::VARCHAR)
    INTO :select_cols
    FROM TABLE(FLATTEN(INPUT => :schema_array)) s;

    ddl_statement := 'CREATE OR REPLACE VIEW ' || :target_view ||
                     ' AS SELECT ' || select_cols ||
                     ' FROM ' || :source_table;
    EXECUTE IMMEDIATE ddl_statement;

    RETURN 'Created view: ' || :target_view;
END;

Now one call flattens your entire JSON structure:

CALL generate_flattened_view(
    'raw_api_data',
    'json_data',
    'api_data_flat'
);
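For the sample schema from earlier, the DDL this builds comes out roughly like the following (hand-reconstructed from the LISTAGG logic, not captured output):

```sql
CREATE OR REPLACE VIEW api_data_flat AS SELECT
    GET_PATH(json_data, 'address.city')::VARCHAR    AS "ADDRESS_CITY",
    GET_PATH(json_data, 'address.geo.lat')::FLOAT   AS "ADDRESS_GEO_LAT",
    GET_PATH(json_data, 'balance')::FLOAT           AS "BALANCE",
    GET_PATH(json_data, 'is_active')::BOOLEAN       AS "IS_ACTIVE",
    GET_PATH(json_data, 'orders')::VARIANT          AS "ORDERS",
    GET_PATH(json_data, 'tags')::VARIANT            AS "TAGS"
FROM raw_api_data;
```

Note how the two arrays survive as VARIANT columns, ready for downstream LATERAL FLATTEN when someone needs them.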

Dynamic Tables for Auto-Refresh

Same idea, but with Dynamic Tables for continuously refreshing data:

CREATE OR REPLACE PROCEDURE generate_flattened_dynamic_table(
    source_table VARCHAR,
    variant_column VARCHAR,
    target_dt VARCHAR,
    warehouse VARCHAR DEFAULT 'COMPUTE_WH',
    target_lag VARCHAR DEFAULT '1 hour'
)
RETURNS VARCHAR
LANGUAGE SQL
AS
DECLARE
    ddl_statement VARCHAR;
    select_cols VARCHAR;
    schema_array ARRAY;
BEGIN
    CALL discover_json_schema(:source_table, :variant_column) INTO :schema_array;

    SELECT LISTAGG(
        'GET_PATH(' || :variant_column || ', ''' || s.value:original_path::VARCHAR || ''')::' ||
        s.value:sql_type::VARCHAR || ' AS "' || s.value:column_name::VARCHAR || '"',
        ', '
    ) WITHIN GROUP (ORDER BY s.value:column_name::VARCHAR)
    INTO :select_cols
    FROM TABLE(FLATTEN(INPUT => :schema_array)) s;

    ddl_statement := 'CREATE OR REPLACE DYNAMIC TABLE ' || :target_dt ||
                     ' TARGET_LAG = ''' || :target_lag || '''' ||
                     ' WAREHOUSE = ' || :warehouse ||
                     ' AS SELECT ' || select_cols ||
                     ' FROM ' || :source_table;
    EXECUTE IMMEDIATE ddl_statement;

    RETURN 'Created dynamic table: ' || :target_dt;
END;
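Invocation mirrors the view generator; the last two arguments have defaults, so a minimal call only needs the first three (the target name here is just an example):

```sql
-- Explicit warehouse and lag shown for illustration; omit the last
-- two arguments to fall back to COMPUTE_WH and '1 hour'.
CALL generate_flattened_dynamic_table(
    'raw_api_data',
    'json_data',
    'api_data_flat_dt',
    'COMPUTE_WH',
    '15 minutes'
);
```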

Caveats

This is a starting point, not a production-ready solution. Things you'll likely need to adjust:

  • Type conflicts: If the same path carries different types across rows (e.g. an API returning null vs 0), SELECT DISTINCT emits one row per (path, type) pair, so a conflicted path shows up more than once and produces duplicate column definitions. You might want majority-wins logic or explicit overrides.

  • Column name collisions: user.id and user_id both become USER_ID. Add disambiguation if your data has this, for example by replacing dots with a different separator such as '__'.

  • Schema evolution: New fields in source JSON won't automatically appear. Re-run the procedure or build a scheduled task to detect drift.

  • Performance: Scanning the entire table for schema discovery is expensive. Consider sampling with TABLESAMPLE or LIMIT for large tables.
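On the performance point, one way to cap discovery cost is to run the same discovery query over a row sample instead of a full scan - a sketch, reusing the my_table / json_data names from earlier:

```sql
-- Discover schema from roughly 1% of rows instead of the whole table.
-- Rare fields may be missed, so widen the sample if coverage matters.
SELECT DISTINCT
    f.path AS original_path,
    TYPEOF(f.value) AS data_type
FROM my_table SAMPLE (1),
LATERAL FLATTEN(INPUT => json_data, RECURSIVE => TRUE) f
WHERE TYPEOF(f.value) NOT IN ('OBJECT')
  AND f.path NOT LIKE '%[%';
```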

When to Use What

  • One-off exploration: Run the discovery query directly, eyeball the results

  • Stable schema, needs to stay current: Generate a View

  • Performance-critical queries on semi-structured data: Generate a Dynamic Table

  • Complex transformation logic: Use the Python variant of the procedure and add your business rules

The full notebook with all procedures and sample data is available - drop me a line if you want it.