Comparing Snowflake cloud data warehouse to AWS Athena query service.

November 27, 2018

In this blog post we look at the commonalities and differences between the Snowflake cloud data warehouse and the AWS Athena query service. AWS Athena is built on top of open source technology Presto DB. As of the day of publishing, it does not support all of the features of Presto.

Table of Contents

Utility billing for data analysis

Both platforms implement a design that separates compute from storage. One can be scaled without having to scale the other. This is very different to a traditional MPP database such as Redshift, Teradata, Vertica etc. I have discussed the differences between the two approaches in detail in my post SQL on Hadoop, BigQuery, or Exadata. Please don’t call them MPP.
Both services follow a pay as you go model. You are only billed for the resources you use up. It’s like your electricity bill. This pricing model for data analysis was pioneered by Google (BigQuery) and Snowflake. Amazon then adopted the model and released AWS Athena in early 2017. It is now also implemented by Oracle for their autonomous data warehouse.
Snowflake charges you for storage and compute. Pricing starts at $2 per hour (minimum of 1 minute billed; by the second thereafter. Pricing for regions and editions differs). With Athena you pay for storage and the amount of data that is scanned during a query. For Athena you are charged $5 per TB scanned with a minimum scanning of 10 MB ($0.00004768). In other words you are punished for running queries over small data sets. Google BigQuery offers a similar price tag: $5 per Terabyte of scanned data. However, with BigQuery you are charged for the raw/uncompressed data whereas for Athena you pay for the data (either compressed or in raw format depending on your scenario). BigQuery gives you 1 TB for free and there is no minimum query charge. Assuming 5x compression, Google BigQuery on the surface is 5 times more expensive than Athena.

The multi-million dollar question

To forecast your annual consumption for both Athena and Snowflake you need to have a fair understanding of your use case, the workload and the number of users.

But how can we compare pricing between the Snowflake model and the Athena model?

We ran a quick test fully scanning a 392 GB table on Snowflake with SELECT * using a Medium sized virtual warehouse. The query executed in 316s. Extrapolated to 1 TB that gives us 827s or ~14 minutes. Translated into USD that gives us $1.86 to scan 1 TB of data (Medium instance costs are $8 for one hour).
Any pricing comparison should also take into account factors such as performance, management overhead, features etc.

If you are struggling with pricing feel free to reach out to us for help. We also have a set of guidelines on how to bring down the costs of your cloud data warehouse.

Management

Both Snowflake and Athena claim to follow a serverless model. In other words it’s a self managed service. Everything is taken care of for you. There are no knobs to tune or administer. Compare this to other data warehouse platforms. Just scrolling through the list of parameters you can set in Hive (a popular data warehouse on Hadoop) takes a few minutes 🙂
With Snowflake there isn’t really a single parameter you have to tune or any maintenance you have to perform. No vacuum (re-sorting your data), no distribution keys, no stats collection, no data redistribution, no upgrades. You don’t need any DBAs either. There are no backups to manage as you are continously protected.

Let’s have a look at Athena. Upgrades, backups etc. are all taken care off. Combined with AWS Glue, Athena is even able to auto-generate schemas for you. You can achieve something similar on Snowflake using Fivetran and similar tools. There aren’t too many knobs or parameters to set or tune. However, as Athena does not come with a cost based optimizer there is a good bit of manual effort required to get good performance for your SQL queries. We also need to write ETL to convert our data to Parquet or ORC and to create partitions etc. We discuss the limitations of this further down in the post.

Data warehouse versus query service

Snowflake is a proper relational data warehouse platform and supports transactions, different isolation levels, ACID, read consistency and multi version concurrency control (MVCC). In other words the stuff you would expect from a mature data warehouse platform. Snowflake has its roots in the world of relational databases.

Athena is positioned as a query service for running queries against data that already sits on S3. The use case is very limited. It shouldn’t come as a surprise then that Athena does not have any of the mature features you would expect from a relational data warehouse platform such as ACID, transactions etc. Athena is just an SQL query engine. It is based on Presto DB, a SQL engine with its roots in the Hadoop world.

Storage on Athena is immutable, which means that you can’t update, upsert, or delete individual records. This may be considered as a huge limitation. However, as I have outlined in my article on dimensional modeling in the era of big data there are ways to work around this limitation. It just adds extra effort and overhead 🙁

SQL

Both engines offer a mature set of SQL features. We see Snowflake slightly ahead, e.g. it offers User Defined Functions (UDF) and User Defined Aggregate Functions. We have a comparison of SQL support of various vendors on our website for download.
On the other hand, Athena supports geo spatial data types, which makes it convenient to work with maps and geo data. Currently, there is no support for geo-spatial data in Snowflake.

Performance

Scalability

You scale Snowflake by selecting an instance type aka virtual warehouse. Instance types come in different sizes starting with XS and ranging to 4XL. The virtual warehouse represents the compute component of the platform. If you want to cut your execution time in half just double the size of your warehouse by the click of a button. The cost stays constant. Quite amazing. You get stuff done 2x, 4x, 8x, 16x etc. faster and don’t pay a cent more. Optionally you can set your warehouse to auto-scale and Snowflake automatically increases the warehouse size when there is increased demand on the platform.

Apart from scaling an individual virtual warehouse you can scale Snowflake by adding additional warehouses to the mix. This is particularly useful for physically separating different types of warehouse consumers from each other, e.g. you could spin up a virtual warehouse for a demanding ETL job or another one for your power users. It’s a bit like on the schoolyard where you separate kids by age and maturity.

You don’t have any of these options in Athena. Athena is a blackbox when it comes to scaling and concurrency. We could not find any way to scale a query by throwing hardware/money at the problem. Based on the documentation it seems to autoscale with an increase in the concurrent load. This is quite similar to the BigQuery model where the user does not have any control over scaling individual queries.

Concurrency

Snowflake has unlimited concurrency. You can run as many sessions in parallel as you would like by either scaling the size of a virtual warehouse or by adding additional warehouse instances.
Athena’s architecture also allows for unlimited concurrency. However, in practice AWS limits the concurrency to 20 parallel SQL queries (of type DML). You can request an increase of this limit by contacting AWS support.

Optimizer

Snowflake comes with a mature cost based optimizer (CBO). As per Snowflake philosophy everything is taken care off for you. You don’t collect statistics after data has been loaded or significant changes have been made to data in a table. The result of CBO planning can be examined using query profiles.

You don’t need to gather statistics in Athena either. But for a different reason. It is hard to believe, but Athena does not come with a CBO. In this article on Athena best practices I have come across stuff like this:
“When you join two tables, specify the larger table on the left side of join and the smaller table on the right side of the join. Presto distributes the table on the right to worker nodes, and then streams the table on the left to do the join. If the table on the right is smaller, then there is less memory used and the query runs faster”.

LOL. It’s reminiscient of the 1990ies. It’s probably best to avoid table joins altogether, in particular joining two large tables.
A third party CBO developed by Starburst in co-operation with Teradata is available for Presto DB but has not made its way into Athena (yet?).

Caching

Snowflake caches data you query on SSDs on the compute nodes. If the same data or a subset of the data is needed for a different query then the data is retrieved from cache. If you suspend your virtual warehouse, the data continues to reside on the compute nodes for some time. There is a good chance that the data is still cached when you spin up your virtual warehouse at a later stage.

Athena does not seem to have a data caching feature. At least we could not find any mention of such a feature.

The other type of caching is result caching. Both Snowflake and Athena (to a much lesser degree) support it. For each query the result is physically set down to storage and if the same query is run again is satisfied from the cache. For Snowflake this happens automatically. For Athena it is a manual process. Query results are stored in buckets on S3. Re-using the result set for another query requires us to create DDL against that table. We also need to point the query to it. In reality all this effort makes this feature unusable on Athena. However, if you just want to download the results of a query then this may help you.

Partitioning/Clustering

Partitioning and clustering (similar in concept is bucketing on Athena) are manual processes on Athena. Both are fully automated on Snowflake.

Security

Both Snowflake and Athena offer the usual security features such as encryption, authentication and authorisation.

Snowflake has a security feature that allows you to secure data at the column and row level.

With Snowflake you can set up a multi-tenant environment and only expose certain data sets to some of your users. Secure views can also be used with Snowflake’s data sharing feature. Athena does not have any similar features. You can secure objects at the less granular table level though.

Data loading

Combining AWS Glue crawlers with Athena is a nice feature to auto generate a schema for querying your data on S3 as it takes away the pain of defining DDL for your data sets. You also have this option in Snowflake using third party tools such as Fivetran.

Paradoxically, data loading is Athena’s biggest weakness and in my opinion its ultimate downfall as a query service (nevermind a data warehouse). We have already learned that Athena is positioned as a query service with schema on read and a NoETL engine. In theory you just dump your data on S3 and query away. In practice it is more difficult. For efficient querying we want our data in Athena partitioned, bucketed, compressed, and in columnar format (Parquet or ORC). As per the Athena best practices guide you can achieve cost savings of 99%+ following these steps. So on the one hand you save a lot of money by applying these steps, but on the other all the nice benefits of schema on read and NoETL go out the window. You have to write a lot of code and run lengthy ETL pipelines. Sorry to burst the schema on read bubble. And did I tell you about optimal file sizes and the small files problem? Well, if your files are smaller than the blocksize (Athena documentation mentions 128 MB, which seems quite low) you will need to run some extra code to merge your files. The last problem you will encounter with schema on read and Athena is data quality. If your files do not fully conform to what you specify then querying will fail :-(.

Even though Snowflake is a schema on write engine it gets a lot closer to the NoETL ideal than Athena. Yes, in Snowflake you have to load your data and create a schema, but the platform takes care of everything else for you. You don’t have to create partitions or buckets. You don’t have to worry about ORC or Parquet, or compression. It is all done for you. Tools such as Fivetran or our own tool Flexter for XML and JSON also take care of target schema creation.

Snowflake also handles data quality issues by validating your data upload, optionally writing rogue errors to an error hospital during bulk loads.

Other stuff

Eco-system.

There are fewer tools that allow you to work with Athena than with Snowflake. Athena is mainly supported by AWS tools such as Quicksight etc.

Semi-structured data

Both platforms support JSON. Snowflake supports XML as a preview feature. Using SQL for querying semi-structured data works reasonably well for simple scenarios with low or medium data volumes.
Support for XML/JSON as data types and XQuery etc. to convert semi-structured data to tables is nice. But it will not get you very far in terms of automating the whole conversion process from XML/JSON to relational data.

Data analysts still need to understand the structure of the XML/JSON and the schema definition.
They still need to make sense of the XML elements and types and their relationships.
Data modelers need to create a physical target model.
Data engineers need to write code to populate the target schema and create universally unique IDs.
If you want to use JSON SQL extensions and XQuery to normalize JSON/XML and build a relational model table by table then you will face the problem of maintaining links between those tables. You would be forced to either use natural keys (they rarely exist in JSON/XML) or denormalize, which is impossible for deep or wide (M:N) structures. All of this limits the usefulness of SQL extension to practically very few, simple scenarios.
Changes to the source schema(s) can not be detected automagically. Rogue XML files may break processing.
Upgrading from one version of the schema to another version may require significant refactoring effort.

We have developed Flexterto address all of these limitations. Flexter automatically converts JSON/XML to a relational format in Snowflake or any other relational database. It also supports Hadoop (ORC, Parquet, Avro) and text (CSV etc.).

Lock in

Athena is built on top of Presto DB and could in theory be installed in your own data centre. Snowflake is only available in the cloud on AWS and Azure.

Connectivity

Both Snowflake and Athena come with SDKs. They both support JDBC and ODBC. Snowflake also ships connectors for Spark and Python and drivers for Node.js, .Net, and Go.

Time travel, data cloning, data sharing, materialised views

Snowflake has some cool features such as point in time queries, no cost data cloning, data sharing, materialised views that don’t have an equivalent feature in Athena.

The verdict. Unqueryable query service?

As we have seen, Athena does not compare favourably as a data warehouse platform to Snowflake. I think we can all agree on this point. As a matter of fact, AWS don’t position it as a data warehouse. They market it as a query service for data on S3. It tightly integrates with the AWS Glue Catalog to detect and create schemas (DDL). Nice! In theory you should be able to query away to your heart’s content. In practice however, you first need to convert your data to Parquet or ORC, partition, bucket, compress, adapt its file size etc. before you are ready to rock. And even then you better be careful with your queries. It’s probably best to avoid joins of large tables altogether.

Is Athena useful as a query service? Personally, I would use Snowflake any time for this use case until AWS Athena adds more automation and lives up to the promises of schema on read. With Snowflake you have to perform considerably less ETL work. At the same time you get superior functionality and performance for a similar or even lower price tag.

Reach out to me with any questions. Also make sure to have a look at our data architecture advisory services.

Have you had any experience with AWS Athena, PrestoDB, or Snowflake? I would love to hear from you in the comments (scroll down).

Enjoyed this post? Have a look at the other posts on our blog.

We created the content in partnership with Snowflake.

Back to Blog

Cookie	Duration	Description
__cfruid	session	Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences	1 month	This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent	1 month	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	2 years	Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
mgref	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
G	1 year	Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID	session	This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1	1 minute	Set by Google to distinguish users.
_gat_UA-44804396-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC	6 Months	The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AN	1 month
AS	session
ebEventToTrack	1 month
eblang	1 year
SNID	2 years	This cookie is set by the Google. This cookie is used by the map which helps visitors to identify and reach the facility.
SP	session
SS	session

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

From Code to Clarity: Visualizing SQL code for Documentation and Debugging

Comparing Snowflake cloud data warehouse to AWS Athena query service.

Utility billing for data analysis

The multi-million dollar question

Management

Data warehouse versus query service

SQL

Performance

Scalability

Concurrency

Optimizer

Caching

Partitioning/Clustering

Security

Data loading

Other stuff

Eco-system.

Semi-structured data

Lock in

Connectivity

Time travel, data cloning, data sharing, materialised views

The verdict. Unqueryable query service?

Comparing Snowflake cloud data warehouse to AWS Athena query service.

Utility billing for data analysis

The multi-million dollar question

Management

Data warehouse versus query service

SQL

Performance

Scalability

Concurrency

Optimizer

Caching

Partitioning/Clustering

Security

Data loading

Other stuff

Eco-system.

Semi-structured data

Lock in

Connectivity

Time travel, data cloning, data sharing, materialised views

The verdict. Unqueryable query service?

Related Articles

Benchmarking Snowflake Native Apps

Optimal Usage of Snowflake Virtual Warehouses: Breaking Down Parallelism & Concurrency

7 guardrails against common mistakes that inflate Snowflake credit usage

Cookies consent