Loading data into Snowflake and performance of large joins

Published on March 16, 2018
Updated on May 14, 2024

Introduction

In this blog post we will load a large dataset into Snowflake and then evaluate the performance of joins in Snowflake.

Loading large data into Snowflake

Dataset

The dataset we will load is hosted on Kaggle and contains checkout records from the Seattle library from 2006 until 2017. You can also download the data and see some samples here.
The dataset consists of two main file types: Checkouts and the Library Collection Inventory.
To start off the process we will create tables in Snowflake for those two files.
The DDL statements are:

create table checkouts (
    BibNumber NUMBER,
    ItemBarcode NUMBER,
    ItemType varchar(30),
    Collection varchar(30),
    CallNumber varchar(70),
    CheckoutDateTime varchar(40)
);
create table inventory (
  BibNumber NUMBER,
  Title varchar(200),
  Author varchar(200),
  ISBN varchar(50),
  PublicationYear varchar(20),
  Publisher varchar(100),
  Subjects varchar(100),
  ItemType varchar(10),
  ItemCollection varchar(10),
  FloatingItem varchar(10),
  ItemLocation varchar(10),
  ReportDate varchar(35),
  ItemCount NUMBER
);

A detail to notice is that the book in each checkout event can be linked to the inventory table using the BibNumber column.

Loading the dataset into Snowflake

To load the dataset from our local machine into Snowflake, we will use SnowSQL, a command line application developed by Snowflake for loading data. SnowSQL is the tool we use here to upload data from a local machine into Snowflake’s staging area (the PUT command cannot be run from a worksheet in the web interface).
SnowSQL can be used to fully automate the loading procedure. It handles user authentication, and its command line interface is one of the best we’ve seen, mainly because of its excellent autocompletion support.
The process will be as follows:

Use the PUT command to upload the file(s) into the Snowflake staging area
Use the COPY INTO command to populate the tables we defined with the data in the Snowflake staging area

Uploading files to Snowflake staging area

Once we download the data from Kaggle (2GB compressed, 6GB uncompressed), we can start with the uploading process.
First we will define a stage (staging area) on Snowflake. During the definition of a stage, it’s usually also good to specify the default file format.
Our data is in CSV format with commas (‘,’) being the field delimiter. The columns containing string values in the Inventory file have quotes around them so we will add another parameter to handle this. The file format can be defined as follows:

CREATE OR REPLACE file format load_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  ;

Now we can define the stage:

CREATE OR REPLACE stage library_load
  file_format = load_csv_format
;

Copying the data from the local machine into the stage doesn’t require much validation. The file format is not checked during the execution of the PUT command but only during the COPY INTO command.
To copy the files we will use the following statements, which also use wildcard characters in the file names:

put file:///Checkouts_By_Title_Data_Lens_20* @library_load
  auto_compress = true;
put file:///Library_Collection_Inventory.csv @library_load
  auto_compress = true;

In this case the compressed files that get uploaded to Snowflake have an average size of ~200MB. Ideally we would split them into even smaller files of 10-100MB so that the COPY INTO command can be parallelized better. You can read more about these considerations in Snowflake’s manual.
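The splitting step mentioned above could be done with a small script before running PUT. The following is only a minimal sketch, not part of the original workflow: the function name split_csv and the part naming scheme are hypothetical, the size limit applies to uncompressed bytes (the sizes quoted above are compressed), and it assumes the CSV has no line breaks inside quoted fields and that a header row, if present, is handled separately.

```python
import os


def split_csv(path, out_dir, max_bytes):
    """Split a CSV file into parts of roughly max_bytes each.

    Rows are never cut in half: a new part is started before a row
    that would push the current part over the limit.
    Assumption: no line breaks inside quoted fields.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    parts = []
    out = None
    written = 0
    with open(path, "rb") as src:
        for line in src:
            # Open the first part, or roll over when the limit would be exceeded.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"{base}_part{len(parts):04d}.csv")
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Each resulting part can then be uploaded with the same PUT statement; because the part names keep the original file name as a prefix, a wildcard such as the one used above would still pick them up.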

Copying the data into tables

Copying the data into tables is usually the more complicated step, and how complicated depends on the quality of your data.
We are now going to copy the staged files into the corresponding tables in Snowflake.
For the checkouts table, the statement is as follows:

copy into checkouts
   from @library_load
   pattern = '.*Checkouts_By_Title_Data_Lens_20.*'
   file_format = (format_name = load_csv_format)
   on_error = 'skip_file_1%';
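Unlike the wildcard used with PUT, the pattern option of COPY INTO is a regular expression. The sketch below illustrates which staged files such a pattern would select. It is an approximation only: the file names are hypothetical, Python's re module stands in for Snowflake's regex engine, and it assumes the pattern has to match the whole staged file name, which is why the pattern starts and ends with `.*`.

```python
import re

# The same regular expression that is passed to COPY INTO via `pattern`.
PATTERN = r".*Checkouts_By_Title_Data_Lens_20.*"

# Hypothetical staged file names, for illustration only.
staged_files = [
    "Checkouts_By_Title_Data_Lens_2005.csv.gz",
    "Checkouts_By_Title_Data_Lens_2017.csv.gz",
    "Library_Collection_Inventory.csv.gz",
]


def matching_files(pattern, names):
    """Return the names that the pattern matches in full (assumed semantics)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


# Only the two checkouts files are selected; the inventory file is left out.
selected = matching_files(PATTERN, staged_files)
```

Under the full-match assumption, a pattern without the surrounding `.*` (for example just `Checkouts`) would select nothing.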

We used the ON_ERROR clause because some rows from 2005 and 2006 were missing the date column. You could also ignore those errors, but we decided to leave out the rows that didn’t have values. Loading the “Library_Collection_Inventory.csv” file was slightly more difficult because the string columns were enclosed in double quotes. This was dealt with in the definition of the file format. Another issue is that some string columns have extreme lengths (1000+ characters), so we decided to truncate those values, i.e. cut them down to the length of the target column. We added an extra TRUNCATECOLUMNS clause to deal with this.
The statement we used is the following:

copy into inventory
     from @library_load pattern='.*Library_Collection.*'
     file_format = (format_name = load_csv_format)
     on_error = 'skip_file_1%'
     TRUNCATECOLUMNS = TRUE;

Before doing the actual loading it’s good practice to run the COPY INTO statement in validation mode by adding the VALIDATION_MODE clause. In validation mode Snowflake does not load the rows into the corresponding table; instead it parses the data (according to the schema of your table) and returns the rows that were parsed successfully. With RETURN_100_ROWS, as used below, it validates the first 100 rows and returns them if no errors are found. It’s a great way to check the quality of your data before actually loading it.

copy into inventory
     from @library_load pattern='.*Library_Collection.*'
     file_format = (format_name = load_csv_format)
     on_error = 'skip_file_1%'
     VALIDATION_MODE = RETURN_100_ROWS;

Joins on large tables

Now that we managed to load all the files into Snowflake let’s check the size of our table containing checkouts:

SELECT COUNT(*) FROM checkouts

The query returned 91,931,494, so we have almost 92 million rows.
Now we are going to test how Snowflake handles a join between two large transaction tables with roughly 92M rows each. The tests will be done on the smallest (XS) virtual warehouse.
We will join the checkouts table with itself on two fields: CheckoutDateTime and ItemBarcode.

Attempt 1

We will join the table with itself and see how the data is loaded. Specifically, we want to know whether the data is loaded from remote storage or from the cache:

SELECT *
FROM checkouts c1
JOIN checkouts c2
    ON c1.CheckoutDateTime = c2.CheckoutDateTime
    AND c1.ItemBarcode = c2.ItemBarcode

We started the query with clean caches, i.e. the data had not yet been loaded into the local machine cache.
In the resulting execution plan, the TableScan[1] operator, which took 14.5% of the time, loaded the data exclusively from remote storage, while the TableScan[2] operator loaded it from the cache.
The most time-consuming operation in the query was the Result operator, which wrote the result set into Snowflake’s proprietary key-value store.
The Join operator was the second most time-consuming operator because it spilled 4.83GB of data to disk (compared to an output of 2.55GB).

Attempt 2

We will now create a clone of the checkouts table and then join it with the original table.

CREATE TABLE checkouts_clone CLONE checkouts

Now let’s join them:

SELECT *
FROM checkouts c1
JOIN checkouts_clone c2
    ON c1.CheckoutDateTime = c2.CheckoutDateTime
    AND c1.ItemBarcode = c2.ItemBarcode

In the resulting execution plan we can see that, again, only half of the data was loaded from remote storage and the other half was loaded from the cache. We assume this happens because one side of the join first loads the data from remote storage into the local machine, and because cloned tables share the same underlying partitions, the optimizer can use the cached data for the other side of the join.
For the same reasons as in Attempt 1, the Result and Join operators took about the same amount of time as before.

Attempt 3

Now we will try to create an actual copy of the table using the CREATE TABLE AS statement:

CREATE TABLE checkouts2 AS
    SELECT * FROM checkouts

The query is again:

SELECT *
FROM checkouts c1
JOIN checkouts2 c2
    ON c1.CheckoutDateTime = c2.CheckoutDateTime
    AND c1.ItemBarcode = c2.ItemBarcode

In the execution plan for this attempt, the data was fully loaded from remote storage for both TableScan operators.
It’s important to notice, though, that this didn’t affect the overall execution time of the query that much: 132 seconds, compared to 132 and 133 seconds for the first two attempts.
Because we saw the Join operator spilling data to disk, it is a good idea to try a larger Virtual Warehouse size.

Increasing the size of our Virtual Warehouse

So far we have used the XS virtual warehouse; now we will try the S size. The XS warehouse uses a single (1) server per cluster (we use only a single cluster) and is billed at 1 credit per hour. The S warehouse uses two (2) servers per cluster and is billed at 2 credits per hour.
We will keep the same query from Attempt 3.
In the resulting execution plan we can see that this time the query finished much faster: 77 seconds compared to 132 seconds. This is mainly because the Join operator didn’t have to spill anything to disk. Larger Virtual Warehouses get more memory, so the join could be completed in memory.
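To put the two runs side by side in terms of cost, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions: the helper credits_used is hypothetical, billing is assumed to be per second at the hourly rates quoted above, and idle time before the warehouse suspends is ignored.

```python
def credits_used(credits_per_hour, seconds):
    """Credits consumed by a warehouse that runs for `seconds`.

    Assumption: per-second billing at the given hourly rate,
    ignoring idle time and any minimum charge.
    """
    return credits_per_hour * seconds / 3600


xs_credits = credits_used(1, 132)  # XS warehouse, Attempt 3 (132 seconds)
s_credits = credits_used(2, 77)    # S warehouse, same query (77 seconds)

speedup = 132 / 77                    # how much faster the S run was
cost_ratio = s_credits / xs_credits   # how much more it cost
```

Under these assumptions the S warehouse ran the query about 1.7 times faster while using roughly 17% more credits (about 0.043 versus 0.037), so doubling the warehouse size did not double the cost of this query.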

We created the content in partnership with Snowflake.
