[Small Talk] Converting CSV files with many columns to JSON and loading them #SnowflakeDB
## Introduction
I had an opportunity to load data files with many columns in Snowflake, and I'm sharing what I tried in this article.
## Conclusion of this article
If you can convert the data files beforehand, transforming them into a semi-structured format such as JSON and loading them into a single VARIANT column can be expected to improve loading performance.
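As a preview of the pattern verified below, here is a minimal sketch with illustrative names (the table `survey_raw` and the staged file `@my_stage/survey_data.json` are placeholders, not the objects used later in this article):

```sql
-- Minimal sketch: load the whole JSON document for each record
-- into a single VARIANT column instead of defining thousands of CSV columns.
CREATE OR REPLACE TABLE survey_raw (v VARIANT);

COPY INTO survey_raw
FROM @my_stage/survey_data.json
FILE_FORMAT = (
    TYPE = 'JSON'
    ,STRIP_OUTER_ARRAY = TRUE
);
```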
## What I'll try in this article
I'll try the following approaches for loading data files with many columns:
- Create data files (CSV) with many columns
  - Maximum number of columns set to 20,000
- Load the data files (execute COPY command)
  - Loading in CSV format
    - Define a table with a maximum of 20,000 columns
  - Convert data files to JSON format and load into one column of VARIANT type
- Expand vertically within Snowflake after loading
- Expand horizontally within Snowflake after loading

## Preparation
Sample data creation and CSV to JSON file conversion were done using Python.
The various Python functions used during verification are as follows. The creation of these functions was handled by generative AI.
:::details Function to create sample data
```python
import pandas as pd
import numpy as np
import sys
def generate_survey_data(rows: int, cols: int):
"""
Function that generates a sample of survey response data with the number of rows and columns
specified in the arguments and outputs it as a CSV
Args:
rows (int): Number of rows (number of respondents)
cols (int): Number of columns (USERID + number of questions)
"""
if cols < 1:
print("Please specify 1 or more for the number of columns.")
return
# Generate column names
columns = ['USERID']
for i in range(1, cols):
columns.append(f'Q-{i}')
# Create data excluding USERID column (randomly generate 0, 1)
data = np.random.randint(0, 2, size=(rows, cols - 1))
# Create DataFrame
df = pd.DataFrame(data, columns=columns[1:])
# Add USERID column
df.insert(0, 'USERID', range(1, rows + 1))
# Output as CSV file
file_name = f'survey_data_{rows}x{cols}.csv'
df.to_csv(file_name, index=False)
print(f'CSV file generated: {file_name}')
if __name__ == '__main__':
if len(sys.argv) != 3:
print("Usage: python generate_survey_data.py <rows> <columns>")
else:
try:
rows = int(sys.argv[1])
cols = int(sys.argv[2])
generate_survey_data(rows, cols)
except ValueError:
print("Please specify integers as arguments.")
```
:::

:::details Function to create DDL statements for tables with many columns
```python
import sys
def generate_ddl_to_file(table_name: str, num_questions: int, output_file: str):
"""
Function that generates Snowflake CREATE TABLE DDL for the specified table name
and number of questions, and writes it to a file.
Args:
table_name (str): Name of the table to create.
num_questions (int): Number of question columns (Q-1, Q-2...).
output_file (str): Filename to write the DDL statement to.
"""
if num_questions < 1:
print("Error: Please specify an integer of 1 or greater for the number of questions.")
return
columns = ['\t"USERID" NUMBER(2,0)']
for i in range(1, num_questions + 1):
columns.append(f'\t"Q-{i}" NUMBER(1,0)')
columns_str = ",\n".join(columns)
ddl = f"create or replace TABLE {table_name} (\n{columns_str}\n);"
try:
with open(output_file, 'w') as f:
f.write(ddl)
print(f"DDL successfully written to '{output_file}'.")
except Exception as e:
print(f"Error: Failed to write to file. Details: {e}")
if __name__ == "__main__":
# Check number of arguments (script name + number of questions = 2)
if len(sys.argv) != 2:
print("Usage: python generate_snowflake_ddl.py <number of questions>")
sys.exit(1)
num_questions_str = sys.argv[1]
try:
num_questions = int(num_questions_str)
if num_questions < 1:
raise ValueError("Please specify an integer of 1 or greater for the number of questions.")
# Auto-generate table name and output filename
total_columns = num_questions + 1
table_name = f'MY_SURVEY_TABLE_{total_columns}'
output_file = f'{table_name}.sql'
generate_ddl_to_file(table_name, num_questions, output_file)
except ValueError as e:
print(f"Error: {e}")
```
:::

:::details Function to Convert CSV Files to JSON Format
```python
import csv
import json
import sys
def csv_to_json(csv_file_path, json_file_path):
"""
Function that reads a specified CSV file and outputs it as a JSON file.
Args:
csv_file_path (str): Path to the CSV file to read.
json_file_path (str): Path to the JSON file to output.
"""
data = []
try:
with open(csv_file_path, 'r') as csv_file:
csv_reader = csv.DictReader(csv_file)
for row in csv_reader:
data.append(row)
except FileNotFoundError:
print(f"Error: File '{csv_file_path}' not found.")
return
except Exception as e:
print(f"Error: Failed to read CSV file. Details: {e}")
return
try:
with open(json_file_path, 'w', encoding='utf-8') as json_file:
json.dump(data, json_file, indent=4, ensure_ascii=False)
print(f"Read '{csv_file_path}' and converted it to JSON file as '{json_file_path}'.")
except Exception as e:
print(f"Error: Failed to write JSON file. Details: {e}")
if __name__ == '__main__':
if len(sys.argv) != 3:
print("Usage: python csv_to_json_converter.py <input CSV filename> <output JSON filename>")
sys.exit(1)
input_csv = sys.argv[1]
output_json = sys.argv[2]
csv_to_json(input_csv, output_json)
```
:::

### Sample Data
Using the above function, I created the following data files:

```
# 1000 columns
$ python generate_sample_data.py 10 1000
CSV file generated: survey_data_10x1000.csv
# 10,000 columns
$ python generate_sample_data.py 10 10000
CSV file generated: survey_data_10x10000.csv
# 20,000 columns
$ python generate_sample_data.py 10 20000
CSV file generated: survey_data_10x20000.csv
```

Each file contains 10 records, and the number of columns is 1,000, 10,000, and 20,000 respectively. Here is a partial view of one of the data files:

```
USERID,Q-1,Q-2,...,Q-997,Q-998,Q-999
1,0,1,...,1,0,1
2,0,1,...,1,0,1
3,1,1,...,0,0,1
.
.
.
8,1,1,...,1,0,1
9,1,1,...,0,0,1
10,1,1,...,1,1,1
```

### Snowflake Side: Creating Various Objects
Let's create a database, schema, internal stage, and other objects for verification.
```sql
CREATE DATABASE IF NOT EXISTS test_db;
USE SCHEMA PUBLIC;
--Create internal stage
CREATE STAGE my_int_stage;
```

I created two file formats. The one with `PARSE_HEADER = TRUE` will be used when defining tables through schema detection.

```sql
--Create file formats
CREATE OR REPLACE FILE FORMAT my_csv_format_parse_header
TYPE = CSV
FIELD_DELIMITER = ','
PARSE_HEADER = TRUE
EMPTY_FIELD_AS_NULL = true
COMPRESSION = NONE;
CREATE OR REPLACE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
EMPTY_FIELD_AS_NULL = true
COMPRESSION = NONE;
```

Let's create virtual warehouses for verification. I've prepared two different sizes here.

```sql
--Virtual warehouse
USE ROLE SYSADMIN;
CREATE WAREHOUSE xs_wh
WAREHOUSE_SIZE = XSMALL
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;
CREATE WAREHOUSE large_wh
WAREHOUSE_SIZE = LARGE
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;
```

I've also disabled the query result cache.

```sql
ALTER ACCOUNT SET USE_CACHED_RESULT = false;
```

The sample data files created above have been placed in the internal stage.

```sql
>ls @my_int_stage;
+---------------------------------------+--------+----------------------------------+-------------------------------+
| name | size | md5 | last_modified |
|---------------------------------------+--------+----------------------------------+-------------------------------|
| my_int_stage/survey_data_10x1000.csv | 25904 | 6c12874310c2c463e72cf8b541f69a60 | Sun, 31 Aug 2025 06:10:26 GMT |
| my_int_stage/survey_data_10x10000.csv | 268896 | d206836f0772efaa56cd121058528c08 | Sun, 31 Aug 2025 06:10:26 GMT |
| my_int_stage/survey_data_10x20000.csv | 548896 | 106958550d2deb822149657d885cabe3 | Sun, 31 Aug 2025 06:10:26 GMT |
+---------------------------------------+--------+----------------------------------+-------------------------------+
```

## Let's Try It
Let's load the CSV data files with different numbers of columns.

### 1000 Columns
Due to the large number of columns, I defined the table using the schema detection feature.
```sql
--Define table using schema detection feature: 1000 columns
CREATE OR REPLACE TABLE MY_SURVEY_TABLE_1000
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@my_int_stage/survey_data_10x1000.csv',
FILE_FORMAT=>'my_csv_format_parse_header'
)
)
);
--Confirm the number of columns
>DESC TABLE MY_SURVEY_TABLE_1000
->> SELECT count(*) FROM $1;
+----------+
| COUNT(*) |
|----------|
| 1000 |
+----------+
--Check USERID and 1000 columns
>DESC TABLE MY_SURVEY_TABLE_1000
->> SELECT "name" FROM $1 LIMIT 3;
+--------+
| name |
|--------|
| USERID |
| Q-1 |
| Q-2 |
+--------+
>DESC TABLE MY_SURVEY_TABLE_1000
->> SELECT "name" FROM $1 LIMIT 3 OFFSET 997;
+-------+
| name |
|-------|
| Q-997 |
| Q-998 |
| Q-999 |
+-------+
```

Let's load the data. With this number of columns I only tested the XS size; the load completed in about 5 seconds.

```sql
>ALTER WAREHOUSE xs_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE xs_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO MY_SURVEY_TABLE_1000
FROM @my_int_stage/survey_data_10x1000.csv
FILE_FORMAT=(FORMAT_NAME = 'my_csv_format');
+--------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|--------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| my_int_stage/survey_data_10x1000.csv | LOADED | 10 | 10 | 1 | 0 | NULL | NULL | NULL | NULL |
+--------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
```

### 10000 Columns CSV
Next, let's try a CSV data file with 10,000 columns. Since there are many columns, I again defined the table using the schema detection feature.
```sql
--Define table using schema detection feature: 10000 columns
CREATE OR REPLACE TABLE MY_SURVEY_TABLE_10000
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@my_int_stage/survey_data_10x10000.csv',
FILE_FORMAT=>'my_csv_format_parse_header'
)
)
);
--Check the number of columns
>DESC TABLE MY_SURVEY_TABLE_10000
->> SELECT count(*) FROM $1;
+----------+
| COUNT(*) |
|----------|
| 10000 |
+----------+
```

From here, we'll try loading with each warehouse size.

- XS size

```sql
>ALTER WAREHOUSE xs_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE xs_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO MY_SURVEY_TABLE_10000
FROM @my_int_stage/survey_data_10x10000.csv
FILE_FORMAT=(FORMAT_NAME = 'my_csv_format');
+---------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|---------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| my_int_stage/survey_data_10x10000.csv | LOADED | 10 | 10 | 1 | 0 | NULL | NULL | NULL | NULL |
+---------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
```
- L size
```sql
TRUNCATE TABLE MY_SURVEY_TABLE_10000;
>ALTER WAREHOUSE large_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE large_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO MY_SURVEY_TABLE_10000
FROM @my_int_stage/survey_data_10x10000.csv
FILE_FORMAT=(FORMAT_NAME = 'my_csv_format');
+---------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|---------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| my_int_stage/survey_data_10x10000.csv | LOADED | 10 | 10 | 1 | 0 | NULL | NULL | NULL | NULL |
+---------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
```

The above shows the result of a single trial for each size; each load took approximately 10 minutes. Since only a single file is being loaded, there is no visible benefit from changing the warehouse size.
*Details are explained later.

### 20000 Columns CSV
Let's try loading a CSV file with an increased number of columns, 20,000 this time. When I attempted to define the table using schema detection, an error occurred.

```sql
>CREATE OR REPLACE TABLE MY_SURVEY_TABLE_20000
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@my_int_stage/survey_data_10x20000.csv',
FILE_FORMAT=>'my_csv_format_parse_header'
)
)
);
100316 (22000): Column position is 10000 but max column position allowed is 9999
```

According to the error message, the maximum column position is 9,999, which means schema detection supports at most 10,000 columns. Since the 10,000-column table was defined successfully in the previous step, I also tried a data file with 10,001 columns, which produced the same error.

```
$ python generate_sample_data.py 10 10001
```

This produces an error:

```sql
>CREATE OR REPLACE TABLE MY_SURVEY_TABLE_10001
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@my_int_stage/survey_data_10x10001.csv',
FILE_FORMAT=>'my_csv_format_parse_header'
)
)
);
100316 (22000): Column position is 10000 but max column position allowed is 9999
```

Although schema detection hits this limit, Snowflake itself still allows us to define the table directly. I generated the DDL statement using the function prepared earlier and defined the table by executing that file.

```
$ python generate_ddl.py 19999
DDL has been successfully written to 'MY_SURVEY_TABLE_20000.sql'.
```

Part of the generated DDL statement:

```sql
create or replace TABLE MY_SURVEY_TABLE_20000 (
"USERID" NUMBER(2,0),
"Q-1" NUMBER(1,0),
"Q-2" NUMBER(1,0),
・
・
・
"Q-19997" NUMBER(1,0),
"Q-19998" NUMBER(1,0),
"Q-19999" NUMBER(1,0)
);
```

I placed the SQL file in the stage and executed it:

```sql
--Define the table
>EXECUTE IMMEDIATE FROM @my_int_stage/MY_SURVEY_TABLE_20000.sql;
+---------------------------------------------------+
| status |
|---------------------------------------------------|
| Table MY_SURVEY_TABLE_20000 successfully created. |
+---------------------------------------------------+
--Confirm the number of columns
>DESC TABLE MY_SURVEY_TABLE_20000
->> SELECT count(*) FROM $1;
+----------+
| COUNT(*) |
|----------|
| 20000 |
+----------+
```

Let's try loading with different warehouse sizes.

- XS size

```sql
>ALTER WAREHOUSE xs_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE xs_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO MY_SURVEY_TABLE_20000
FROM @my_int_stage/survey_data_10x20000.csv
FILE_FORMAT=(FORMAT_NAME = 'my_csv_format');
>SELECT COUNT(*) FROM MY_SURVEY_TABLE_20000;
+----------+
| COUNT(*) |
|----------|
| 10 |
+----------+
```

- L size

```sql
TRUNCATE TABLE MY_SURVEY_TABLE_20000;
>ALTER WAREHOUSE large_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE large_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO MY_SURVEY_TABLE_20000
FROM @my_int_stage/survey_data_10x20000.csv
FILE_FORMAT=(FORMAT_NAME = 'my_csv_format');
>SELECT COUNT(*) FROM MY_SURVEY_TABLE_20000;
```

Both loads took more than 30 minutes to complete. As before, no benefit was observed from increasing the warehouse size.

### Converting to JSON Format and Loading
Let's convert the CSV files to JSON and load them into a single column of VARIANT type. I'll try this with the files that originally had 10,000 and 20,000 columns.
First, convert the data files to JSON format:

```
$ python csv_to_json.py survey_data_10x10000.csv survey_data_10x10000.json
Successfully read 'survey_data_10x10000.csv' and converted it to JSON file 'survey_data_10x10000.json'.
$ python csv_to_json.py survey_data_10x20000.csv survey_data_10x20000.json
Successfully read 'survey_data_10x20000.csv' and converted it to JSON file 'survey_data_10x20000.json'.
```

The contents of the JSON file look like this:

```json
[
{
"USERID": "1",
"Q-1": "1",
"Q-2": "0",
・
・
・
"Q-9998": "1",
"Q-9999": "1"
},
{
"USERID": "2",
"Q-1": "1",
"Q-2": "0",
・
・
・
"Q-9998": "0",
"Q-9999": "0"
}
]
```

We'll place the created JSON files in the internal stage.

### Converting 10000-column CSV to JSON format and loading
I'll try this with different warehouse sizes as well.

- XS size

```sql
--Define the table
CREATE OR REPLACE TABLE test_10000 (v VARIANT);
>ALTER WAREHOUSE xs_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE xs_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO test_10000
FROM @my_int_stage/survey_data_10x10000.json
FILE_FORMAT = (
TYPE = 'JSON'
,STRIP_OUTER_ARRAY = TRUE
);
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| my_int_stage/survey_data_10x10000.json | LOADED | 10 | 10 | 1 | 0 | NULL | NULL | NULL | NULL |
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
```

- L size

```sql
TRUNCATE TABLE test_10000;
>ALTER WAREHOUSE large_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE large_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO test_10000
FROM @my_int_stage/survey_data_10x10000.json
FILE_FORMAT = (
TYPE = 'JSON'
,STRIP_OUTER_ARRAY = TRUE
);
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| my_int_stage/survey_data_10x10000.json | LOADED | 10 | 10 | 1 | 0 | NULL | NULL | NULL | NULL |
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
```
In this case, even with XS size, it loaded in about 2 seconds. After loading, it looks like the image below.
### Converting 20000-column CSV to JSON format and loading
I tried this with different sizes as well.
- XS size
```sql
--Define the table
CREATE OR REPLACE TABLE test_20000 (v VARIANT);
>ALTER WAREHOUSE xs_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE xs_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO test_20000
FROM @my_int_stage/survey_data_10x20000.json
FILE_FORMAT = (
TYPE = 'JSON'
,STRIP_OUTER_ARRAY = TRUE
);
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| my_int_stage/survey_data_10x20000.json | LOADED | 10 | 10 | 1 | 0 | NULL | NULL | NULL | NULL |
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
```

- L size

```sql
TRUNCATE TABLE test_20000;
>ALTER WAREHOUSE large_wh RESUME;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>USE WAREHOUSE large_wh;
+----------------------------------+
| status |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
>COPY INTO test_20000
FROM @my_int_stage/survey_data_10x20000.json
FILE_FORMAT = (
TYPE = 'JSON'
,STRIP_OUTER_ARRAY = TRUE
);
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| my_int_stage/survey_data_10x20000.json | LOADED | 10 | 10 | 1 | 0 | NULL | NULL | NULL | NULL |
+----------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
```
It loaded within 2 seconds even with XS size.
### Let's try transformation processing after loading
Since we loaded data into a VARIANT type column, the loaded data looks as shown in the figure below. Each record stores survey responses for each user in a key-value format.

We can convert it to a vertical format with the following query. First, let's try with data originally from a CSV file with 10,000 columns that was loaded in JSON format.
- XS size
```sql
ALTER WAREHOUSE xs_wh RESUME;
USE WAREHOUSE xs_wh;
SELECT
t.v:"USERID"::NUMBER AS userid,
f.key::VARCHAR AS key,
f.value::VARCHAR AS value
FROM
test_10000 AS t,
LATERAL FLATTEN(input => t.v) AS f
WHERE
f.key <> 'USERID'
ORDER BY
userid,
key;
```

The query details are as shown below. It completed in less than a second.
- L size
The query details are as shown below. Similarly, it completed in less than a second.
Next, let's try with data originally from a CSV file with 20,000 columns that was loaded in JSON format.
- XS size
Query details
- L size
This also completed in about 1 second. If it isn't essential to keep more than 10,000 columns expanded side by side, the vertical (key-value) format is likely to improve both performance and ease of analysis.
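For example, typical survey aggregations become simple once the data is handled in this key-value form. A minimal sketch against the `test_10000` table used above, assuming the 0/1 answers generated earlier:

```sql
-- Count how many respondents answered 1 for each question,
-- flattening the VARIANT column on the fly.
SELECT
    f.key::VARCHAR       AS question,
    SUM(f.value::NUMBER) AS answered_1,
    COUNT(*)             AS respondents
FROM
    test_10000 AS t,
    LATERAL FLATTEN(input => t.v) AS f
WHERE
    f.key <> 'USERID'
GROUP BY
    question
ORDER BY
    question;
```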
### Expand Columns
Finally, after loading the 10,000-column CSV into Snowflake in JSON format, I tried expanding it back into 10,000 columns within Snowflake.
Below is an example run on the L-size warehouse; it took more than 10 minutes to process, similar to the initial CSV load time.

```sql
CREATE OR REPLACE TABLE test_10000_flat AS
SELECT
*
FROM
(
SELECT
t.v:"USERID"::NUMBER AS userid,
f.key::VARCHAR AS key,
f.value::VARCHAR AS value
FROM
test_10000 AS t,
LATERAL FLATTEN(input => t.v) AS f
WHERE
f.key <> 'USERID'
) AS source_data
PIVOT(
MAX(value) FOR key IN (ANY)
) AS pivoted_data
ORDER BY
userid;
--Check the number of columns
DESC TABLE test_10000_flat
->> SELECT count(*) FROM $1;
+----------+
| COUNT(*) |
|----------|
| 10000 |
+----------+
```

Query details
## Considerations for Tables with Many Columns
First, regarding the relationship between the number of data files and the warehouse configuration during loading, the following information is relevant:
In a Snowflake virtual warehouse, each thread ingests a single file at a time.
An XS size warehouse provides 8 threads, and the number of available threads doubles with each increase in warehouse size. Therefore, you can benefit from scaling up when loading a large number of files.
On the other hand, in cases like this one where the number of files is small, increasing the size won't change the processing speed, as each thread processes a single file.
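To actually benefit from a larger warehouse in a case like this, the source data would need to be split into multiple files so that several threads can load in parallel. A hedged sketch, assuming the JSON had been split into hypothetical part files such as `survey_part_01.json`, `survey_part_02.json`, and so on in the same stage:

```sql
-- One COPY statement can load many files at once; with multiple files,
-- each warehouse thread picks up a different file in parallel.
COPY INTO test_10000
FROM @my_int_stage
PATTERN = '.*survey_part_.*[.]json'
FILE_FORMAT = (
    TYPE = 'JSON'
    ,STRIP_OUTER_ARRAY = TRUE
);
```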
Additionally, the relationship between number of columns and performance is explained in detail here:
Tables with many columns experience decreased performance due to reduced compression rates and increased metadata processing during compilation. In particular, compilation time significantly affects query execution speed.
For reference, below is a comparison of compilation times for the relevant queries. There are three records; from top to bottom, the queries are:
- SELECT from a table storing a 10,000-column CSV converted to JSON in a single VARIANT type column
- SELECT from a table where a single VARIANT column was expanded to 10,000 columns within Snowflake
- Creating a 10,000-column table within Snowflake from a table with a single VARIANT column (CTAS)
```sql
>SELECT
QUERY_TEXT
,QUERY_TYPE
,TOTAL_ELAPSED_TIME
,BYTES_SCANNED,COMPILATION_TIME
,EXECUTION_TIME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_ID IN ('01beddb4-0002-c33c-0002-c186000471da','01beddae-0002-c2d9-0002-c186000442aa','01bedd84-0002-c2c4-0002-c1860003c3fe')
ORDER BY start_time DESC;
+------------------------------------------------+------------------------+--------------------+---------------+------------------+----------------+
| QUERY_TEXT | QUERY_TYPE | TOTAL_ELAPSED_TIME | BYTES_SCANNED | COMPILATION_TIME | EXECUTION_TIME |
|------------------------------------------------+------------------------+--------------------+---------------+------------------+----------------|
| select * from test_10000; | SELECT | 831 | 403488 | 184 | 508 |
| select * from test_10000_flat; | SELECT | 108843 | 3183136 | 107624 | 1082 |
| CREATE OR REPLACE TABLE test_10000_flat AS | CREATE_TABLE_AS_SELECT | 764722 | 403488 | 709523 | 54311 |
| SELECT | | | | | |
| * | | | | | |
| FROM | | | | | |
| ( | | | | | |
| SELECT | | | | | |
| t.v:"USERID"::NUMBER AS userid, | | | | | |
| f.key::VARCHAR AS key, | | | | | |
| f.value::VARCHAR AS value | | | | | |
| FROM | | | | | |
| test_10000 AS t, | | | | | |
| LATERAL FLATTEN(input => t.v) AS f | | | | | |
| WHERE | | | | | |
| f.key <> 'USERID' | | | | | |
| ) AS source_data | | | | | |
| PIVOT( | | | | | |
| MAX(value) FOR key IN (ANY) | | | | | |
| ) AS pivoted_data | | | | | |
| ORDER BY | | | | | |
| userid; | | | | | |
+------------------------------------------------+------------------------+--------------------+---------------+------------------+----------------+
```

Compared to the first query, which references the JSON directly (184 ms compilation), the query against the table expanded to 10,000 columns within Snowflake has an overwhelmingly longer compilation time. In particular, the CREATE TABLE AS SELECT that builds the 10,000-column table exceeded 700,000 ms of compilation time, likely because of the massive amount of metadata that has to be processed.
## Conclusion
In this article, I tried loading CSV files with many columns into Snowflake.
I hope this information will be useful as a reference.