Inferring Schema from VARIANT Fields in Snowflake
How to automatically discover and flatten JSON schema from VARIANT columns in Snowflake using recursive FLATTEN and TYPEOF, without manually writing extraction logic for every field.

I get asked about this a lot. Someone lands JSON from a REST API into a VARIANT column, and now they want proper columns without manually writing json_data:field1::VARCHAR, json_data:field2::NUMBER for every single path. And you guessed it - there's no built-in INFER_SCHEMA for VARIANT columns like there is for staged files.
The reason? It's genuinely hard. Unlike staged files, where Snowflake can sample a few files upfront, VARIANT columns can contain wildly different structures across rows, nested objects go arbitrarily deep, and arrays make things exponentially messier. Fair enough - but that doesn't mean we can't build something ourselves.
The Core Trick: Recursive FLATTEN + TYPEOF
Here's the discovery query that makes everything possible:
SELECT DISTINCT
    f.path AS original_path,
    UPPER(REPLACE(f.path, '.', '_')) AS column_name,
    TYPEOF(f.value) AS data_type
FROM my_table,
    LATERAL FLATTEN(json_data, RECURSIVE => TRUE) f
WHERE TYPEOF(f.value) NOT IN ('OBJECT')
    AND f.path NOT LIKE '%[%'
ORDER BY column_name;
FLATTEN with RECURSIVE => TRUE walks the entire JSON tree and returns every path. TYPEOF() tells us what's at each path. The NOT LIKE '%[%' filter excludes array contents - I'll explain why in a moment.
Run this against a typical API response and you'll see something like:
| original_path | column_name | data_type |
| --- | --- | --- |
| address.city | ADDRESS_CITY | VARCHAR |
| address.geo.lat | ADDRESS_GEO_LAT | DOUBLE |
| balance | BALANCE | DECIMAL |
| is_active | IS_ACTIVE | BOOLEAN |
| orders | ORDERS | ARRAY |
| tags | TAGS | ARRAY |
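Each discovered row maps one-to-one to an extraction expression. As a sketch, the address.geo.lat path from the table above would be pulled out like this (GET_PATH handles the dotted path, and ::DOUBLE comes straight from the discovered type):

SELECT GET_PATH(json_data, 'address.geo.lat')::DOUBLE AS ADDRESS_GEO_LAT
FROM my_table;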
Why Arrays Stay as VARIANT
My first instinct was to recursively explode everything. But that creates a mess:
Arrays can have different lengths per row - do you create ITEM_0, ITEM_1, ITEM_2... and if so, how many?
Nested arrays explode your row count exponentially
The resulting schema becomes unpredictable
Instead, the approach I landed on keeps top-level arrays as VARIANT columns. You can still query them with LATERAL FLATTEN when you need to, but your base schema stays stable. The people who need to dig into arrays can handle that downstream.
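Keeping an array like orders as VARIANT loses nothing - consumers can still explode it on demand. A sketch, assuming each element of the orders array is an object with an id field (that field isn't shown in the sample output, so adjust to your data):

SELECT
    t.json_data:email::VARCHAR AS email,
    o.value:id::NUMBER AS order_id
FROM my_table t,
    LATERAL FLATTEN(t.json_data:orders) o;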
Wrapping It in a Procedure
Once you've got the discovery query working, wrapping it in a stored procedure makes it reusable:
CREATE OR REPLACE PROCEDURE discover_json_schema(
    source_table VARCHAR,
    variant_column VARCHAR
)
RETURNS ARRAY
LANGUAGE SQL
AS
DECLARE
    schema_array ARRAY;
BEGIN
    -- Aggregate the discovered paths into one array; WITHIN GROUP makes the
    -- ordering deterministic (an ORDER BY in the subquery would not survive
    -- aggregation). Note the colon prefix on :schema_array - Snowflake
    -- Scripting requires it when binding a variable inside a SQL statement.
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(
        'original_path', original_path,
        'column_name', column_name,
        'sql_type', sql_type
    )) WITHIN GROUP (ORDER BY column_name)
    INTO :schema_array
    FROM (
        SELECT DISTINCT
            f.path AS original_path,
            UPPER(REPLACE(f.path, '.', '_')) AS column_name,
            CASE
                WHEN TYPEOF(f.value) = 'INTEGER' THEN 'NUMBER'
                WHEN TYPEOF(f.value) IN ('DOUBLE', 'DECIMAL') THEN 'FLOAT'
                WHEN TYPEOF(f.value) = 'BOOLEAN' THEN 'BOOLEAN'
                WHEN TYPEOF(f.value) = 'ARRAY' THEN 'VARIANT'
                ELSE 'VARCHAR'
            END AS sql_type
        FROM IDENTIFIER(:source_table),
            LATERAL FLATTEN(IDENTIFIER(:variant_column), RECURSIVE => TRUE) f
        WHERE TYPEOF(f.value) NOT IN ('OBJECT')
            AND f.path NOT LIKE '%[%'
    );
    RETURN schema_array;
END;
Call it like this:
CALL discover_json_schema('my_db.my_schema.raw_api_data', 'json_data');
Returns an array of objects you can loop through to generate DDL.
[
{
"column_name": "EMAIL",
"original_path": "email",
"sql_type": "VARCHAR"
},
{
"column_name": "IS_ACTIVE",
"original_path": "is_active",
"sql_type": "BOOLEAN"
},
-- [... and so on ]
]
Generating Views Automatically
The natural next step - a procedure that creates a view with all discovered columns:
CREATE OR REPLACE PROCEDURE generate_flattened_view(
    source_table VARCHAR,
    variant_column VARCHAR,
    target_view VARCHAR
)
RETURNS VARCHAR
LANGUAGE SQL
AS
DECLARE
    ddl_statement VARCHAR;
    select_cols VARCHAR;
    schema_array ARRAY;
BEGIN
    -- Variables bound inside SQL statements need the colon prefix,
    -- including the INTO targets.
    CALL discover_json_schema(:source_table, :variant_column) INTO :schema_array;
    -- Build one "GET_PATH(col, 'path')::TYPE AS "NAME"" expression per
    -- discovered field and join them with commas.
    SELECT LISTAGG(
        'GET_PATH(' || :variant_column || ', ''' || s.value:original_path::VARCHAR || ''')::' ||
        s.value:sql_type::VARCHAR || ' AS "' || s.value:column_name::VARCHAR || '"',
        ', '
    ) WITHIN GROUP (ORDER BY s.value:column_name::VARCHAR)
    INTO :select_cols
    FROM TABLE(FLATTEN(:schema_array)) s;
    ddl_statement := 'CREATE OR REPLACE VIEW ' || :target_view ||
        ' AS SELECT ' || select_cols ||
        ' FROM ' || :source_table;
    EXECUTE IMMEDIATE ddl_statement;
    RETURN 'Created view: ' || :target_view;
END;
Now one call flattens your entire JSON structure:
CALL generate_flattened_view(
'raw_api_data',
'json_data',
'api_data_flat'
);
Dynamic Tables for Auto-Refresh
Same idea, but with Dynamic Tables for continuously refreshing data:
CREATE OR REPLACE PROCEDURE generate_flattened_dynamic_table(
    source_table VARCHAR,
    variant_column VARCHAR,
    target_dt VARCHAR,
    warehouse VARCHAR DEFAULT 'COMPUTE_WH',
    target_lag VARCHAR DEFAULT '1 hour'
)
RETURNS VARCHAR
LANGUAGE SQL
AS
DECLARE
    ddl_statement VARCHAR;
    select_cols VARCHAR;
    schema_array ARRAY;
BEGIN
    CALL discover_json_schema(:source_table, :variant_column) INTO :schema_array;
    SELECT LISTAGG(
        'GET_PATH(' || :variant_column || ', ''' || s.value:original_path::VARCHAR || ''')::' ||
        s.value:sql_type::VARCHAR || ' AS "' || s.value:column_name::VARCHAR || '"',
        ', '
    ) WITHIN GROUP (ORDER BY s.value:column_name::VARCHAR)
    INTO :select_cols
    FROM TABLE(FLATTEN(:schema_array)) s;
    ddl_statement := 'CREATE OR REPLACE DYNAMIC TABLE ' || :target_dt ||
        ' TARGET_LAG = ''' || :target_lag || '''' ||
        ' WAREHOUSE = ' || :warehouse ||
        ' AS SELECT ' || select_cols ||
        ' FROM ' || :source_table;
    EXECUTE IMMEDIATE ddl_statement;
    RETURN 'Created dynamic table: ' || :target_dt;
END;
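Called the same way as the view generator, with the warehouse and lag as optional overrides. The target name api_data_flat_dt here is just an example:

CALL generate_flattened_dynamic_table(
    'raw_api_data',
    'json_data',
    'api_data_flat_dt',
    'COMPUTE_WH',
    '15 minutes'
);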
Caveats
This is a starting point, not a production-ready solution. Things you'll likely need to adjust:
Type conflicts: If the same path has different types across rows (e.g. APIs returning null vs 0), the discovery picks one. You might want majority-wins logic or explicit overrides.
Column name collisions: user.id and user_id both become USER_ID. Add disambiguation if your data has this - for example, use a different separator.
Schema evolution: New fields in source JSON won't automatically appear. Re-run the procedure or build a scheduled task to detect drift.
Performance: Scanning the entire table for schema discovery is expensive. Consider sampling with TABLESAMPLE or LIMIT for large tables.
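On the performance point: the discovery query works just as well over a sample. A sketch using a 1% row sample - tune the percentage up until repeated runs return a stable set of paths:

SELECT DISTINCT
    f.path AS original_path,
    TYPEOF(f.value) AS data_type
FROM my_table SAMPLE (1),
    LATERAL FLATTEN(json_data, RECURSIVE => TRUE) f
WHERE TYPEOF(f.value) NOT IN ('OBJECT')
    AND f.path NOT LIKE '%[%';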
When to Use What
One-off exploration: Run the discovery query directly, eyeball the results
Stable schema, needs to stay current: Generate a View
Performance-critical queries on semi-structured data: Generate a Dynamic Table
Complex transformation logic: Use the Python variant of the procedure and add your business rules
The full notebook with all procedures and sample data is available - drop me a line if you want it.




