In this blog post we look at the commonalities and differences between the Snowflake cloud data warehouse and the AWS Athena query service. AWS Athena is built on top of open source technology Presto DB. As of the day of publishing, it does not support all of the features of Presto.
Both platforms implement a design that separates compute from storage. One can be scaled without having to scale the other. This is very different to a traditional MPP database such as Redshift, Teradata, Vertica etc. I have discussed the differences between the two approaches in detail in my post SQL on Hadoop, BigQuery, or Exadata. Please don’t call them MPP.
Both services follow a pay as you go model. You are only billed for the resources you use up. It’s like your electricity bill. This pricing model for data analysis was pioneered by Google (BigQuery) and Snowflake. Amazon then adopted the model and released AWS Athena in early 2017. It is now also implemented by Oracle for their autonomous data warehouse.
Snowflake charges you for storage and compute. Pricing starts at $2 per hour (minimum of 1 minute billed; by the second thereafter. Pricing for regions and editions differs). With Athena you pay for storage and the amount of data that is scanned during a query. For Athena you are charged $5 per TB scanned with a minimum scanning of 10 MB ($0.00004768). In other words you are punished for running queries over small data sets. Google BigQuery offers a similar price tag: $5 per Terabyte of scanned data. However, with BigQuery you are charged for the raw/uncompressed data whereas for Athena you pay for the data (either compressed or in raw format depending on your scenario). BigQuery gives you 1 TB for free and there is no minimum query charge. Assuming 5x compression, Google BigQuery on the surface is 5 times more expensive than Athena.
To forecast your annual consumption for both Athena and Snowflake you need to have a fair understanding of your use case, the workload and the number of users.
But how can we compare pricing between the Snowflake model and the Athena model?
We ran a quick test fully scanning a 392 GB table on Snowflake with SELECT * using a Medium sized virtual warehouse. The query executed in 316s. Extrapolated to 1 TB that gives us 827s or ~14 minutes. Translated into USD that gives us $1.86 to scan 1 TB of data (Medium instance costs are $8 for one hour).
Any pricing comparison should also take into account factors such as performance, management overhead, features etc.
If you are struggling with pricing feel free to reach out to us for help. We also have a set of guidelines on how to bring down the costs of your cloud data warehouse.
Both Snowflake and Athena claim to follow a serverless model. In other words it’s a self managed service. Everything is taken care of for you. There are no knobs to tune or administer. Compare this to other data warehouse platforms. Just scrolling through the list of parameters you can set in Hive (a popular data warehouse on Hadoop) takes a few minutes 🙂
With Snowflake there isn’t really a single parameter you have to tune or any maintenance you have to perform. No vacuum (re-sorting your data), no distribution keys, no stats collection, no data redistribution, no upgrades. You don’t need any DBAs either. There are no backups to manage as you are continously protected.
Let’s have a look at Athena. Upgrades, backups etc. are all taken care off. Combined with AWS Glue, Athena is even able to auto-generate schemas for you. You can achieve something similar on Snowflake using Fivetran and similar tools. There aren’t too many knobs or parameters to set or tune. However, as Athena does not come with a cost based optimizer there is a good bit of manual effort required to get good performance for your SQL queries. We also need to write ETL to convert our data to Parquet or ORC and to create partitions etc. We discuss the limitations of this further down in the post.
Snowflake is a proper relational data warehouse platform and supports transactions, different isolation levels, ACID, read consistency and multi version concurrency control (MVCC). In other words the stuff you would expect from a mature data warehouse platform. Snowflake has its roots in the world of relational databases.
Athena is positioned as a query service for running queries against data that already sits on S3. The use case is very limited. It shouldn’t come as a surprise then that Athena does not have any of the mature features you would expect from a relational data warehouse platform such as ACID, transactions etc. Athena is just an SQL query engine. It is based on Presto DB, a SQL engine with its roots in the Hadoop world.
Storage on Athena is immutable, which means that you can’t update, upsert, or delete individual records. This may be considered as a huge limitation. However, as I have outlined in my article on dimensional modeling in the era of big data there are ways to work around this limitation. It just adds extra effort and overhead 🙁
Both engines offer a mature set of SQL features. We see Snowflake slightly ahead, e.g. it offers User Defined Functions (UDF) and User Defined Aggregate Functions. We have a comparison of SQL support of various vendors on our website for download.
On the other hand, Athena supports geo spatial data types, which makes it convenient to work with maps and geo data. Currently, there is no support for geo-spatial data in Snowflake.
You scale Snowflake by selecting an instance type aka virtual warehouse. Instance types come in different sizes starting with XS and ranging to 4XL. The virtual warehouse represents the compute component of the platform. If you want to cut your execution time in half just double the size of your warehouse by the click of a button. The cost stays constant. Quite amazing. You get stuff done 2x, 4x, 8x, 16x etc. faster and don’t pay a cent more. Optionally you can set your warehouse to auto-scale and Snowflake automatically increases the warehouse size when there is increased demand on the platform.
Apart from scaling an individual virtual warehouse you can scale Snowflake by adding additional warehouses to the mix. This is particularly useful for physically separating different types of warehouse consumers from each other, e.g. you could spin up a virtual warehouse for a demanding ETL job or another one for your power users. It’s a bit like on the schoolyard where you separate kids by age and maturity.
You don’t have any of these options in Athena. Athena is a blackbox when it comes to scaling and concurrency. We could not find any way to scale a query by throwing hardware/money at the problem. Based on the documentation it seems to autoscale with an increase in the concurrent load. This is quite similar to the BigQuery model where the user does not have any control over scaling individual queries.
Snowflake has unlimited concurrency. You can run as many sessions in parallel as you would like by either scaling the size of a virtual warehouse or by adding additional warehouse instances.
Athena’s architecture also allows for unlimited concurrency. However, in practice AWS limits the concurrency to 20 parallel SQL queries (of type DML). You can request an increase of this limit by contacting AWS support.
Snowflake comes with a mature cost based optimizer (CBO). As per Snowflake philosophy everything is taken care off for you. You don’t collect statistics after data has been loaded or significant changes have been made to data in a table. The result of CBO planning can be examined using query profiles.
You don’t need to gather statistics in Athena either. But for a different reason. It is hard to believe, but Athena does not come with a CBO. In this article on Athena best practices I have come across stuff like this:
“When you join two tables, specify the larger table on the left side of join and the smaller table on the right side of the join. Presto distributes the table on the right to worker nodes, and then streams the table on the left to do the join. If the table on the right is smaller, then there is less memory used and the query runs faster”.
LOL. It’s reminiscient of the 1990ies. It’s probably best to avoid table joins altogether, in particular joining two large tables.
A third party CBO developed by Starburst in co-operation with Teradata is available for Presto DB but has not made its way into Athena (yet?).
Snowflake caches data you query on SSDs on the compute nodes. If the same data or a subset of the data is needed for a different query then the data is retrieved from cache. If you suspend your virtual warehouse, the data continues to reside on the compute nodes for some time. There is a good chance that the data is still cached when you spin up your virtual warehouse at a later stage.
Athena does not seem to have a data caching feature. At least we could not find any mention of such a feature.
The other type of caching is result caching. Both Snowflake and Athena (to a much lesser degree) support it. For each query the result is physically set down to storage and if the same query is run again is satisfied from the cache. For Snowflake this happens automatically. For Athena it is a manual process. Query results are stored in buckets on S3. Re-using the result set for another query requires us to create DDL against that table. We also need to point the query to it. In reality all this effort makes this feature unusable on Athena. However, if you just want to download the results of a query then this may help you.
Partitioning and clustering (similar in concept is bucketing on Athena) are manual processes on Athena. Both are fully automated on Snowflake.
Both Snowflake and Athena offer the usual security features such as encryption, authentication and authorisation.
Snowflake has a security feature that allows you to secure data at the column and row level.
With Snowflake you can set up a multi-tenant environment and only expose certain data sets to some of your users. Secure views can also be used with Snowflake’s data sharing feature. Athena does not have any similar features. You can secure objects at the less granular table level though.
Combining AWS Glue crawlers with Athena is a nice feature to auto generate a schema for querying your data on S3 as it takes away the pain of defining DDL for your data sets. You also have this option in Snowflake using third party tools such as Fivetran.
Paradoxically, data loading is Athena’s biggest weakness and in my opinion its ultimate downfall as a query service (nevermind a data warehouse). We have already learned that Athena is positioned as a query service with schema on read and a NoETL engine. In theory you just dump your data on S3 and query away. In practice it is more difficult. For efficient querying we want our data in Athena partitioned, bucketed, compressed, and in columnar format (Parquet or ORC). As per the Athena best practices guide you can achieve cost savings of 99%+ following these steps. So on the one hand you save a lot of money by applying these steps, but on the other all the nice benefits of schema on read and NoETL go out the window. You have to write a lot of code and run lengthy ETL pipelines. Sorry to burst the schema on read bubble. And did I tell you about optimal file sizes and the small files problem? Well, if your files are smaller than the blocksize (Athena documentation mentions 128 MB, which seems quite low) you will need to run some extra code to merge your files. The last problem you will encounter with schema on read and Athena is data quality. If your files do not fully conform to what you specify then querying will fail :-(.
Even though Snowflake is a schema on write engine it gets a lot closer to the NoETL ideal than Athena. Yes, in Snowflake you have to load your data and create a schema, but the platform takes care of everything else for you. You don’t have to create partitions or buckets. You don’t have to worry about ORC or Parquet, or compression. It is all done for you. Tools such as Fivetran or our own tool Flexter for XML and JSON also take care of target schema creation.
Snowflake also handles data quality issues by validating your data upload, optionally writing rogue errors to an error hospital during bulk loads.
There are fewer tools that allow you to work with Athena than with Snowflake. Athena is mainly supported by AWS tools such as Quicksight etc.
Both platforms support JSON. Snowflake supports XML as a preview feature. Using SQL for querying semi-structured data works reasonably well for simple scenarios with low or medium data volumes.
Support for XML/JSON as data types and XQuery etc. to convert semi-structured data to tables is nice. But it will not get you very far in terms of automating the whole conversion process from XML/JSON to relational data.
- Data analysts still need to understand the structure of the XML/JSON and the schema definition.
- They still need to make sense of the XML elements and types and their relationships.
Data modelers need to create a physical target model.
- Data engineers need to write code to populate the target schema and create universally unique IDs.
- If you want to use JSON SQL extensions and XQuery to normalize JSON/XML and build a relational model table by table then you will face the problem of maintaining links between those tables. You would be forced to either use natural keys (they rarely exist in JSON/XML) or denormalize, which is impossible for deep or wide (M:N) structures. All of this limits the usefulness of SQL extension to practically very few, simple scenarios.
- Changes to the source schema(s) can not be detected automagically. Rogue XML files may break processing.
- Upgrading from one version of the schema to another version may require significant refactoring effort.
We have developed Flexterto address all of these limitations. Flexter automatically converts JSON/XML to a relational format in Snowflake or any other relational database. It also supports Hadoop (ORC, Parquet, Avro) and text (CSV etc.).
Athena is built on top of Presto DB and could in theory be installed in your own data centre. Snowflake is only available in the cloud on AWS and Azure.
Both Snowflake and Athena come with SDKs. They both support JDBC and ODBC. Snowflake also ships connectors for Spark and Python and drivers for Node.js, .Net, and Go.
Snowflake has some cool features such as point in time queries, no cost data cloning, data sharing, materialised views that don’t have an equivalent feature in Athena.
As we have seen, Athena does not compare favourably as a data warehouse platform to Snowflake. I think we can all agree on this point. As a matter of fact, AWS don’t position it as a data warehouse. They market it as a query service for data on S3. It tightly integrates with the AWS Glue Catalog to detect and create schemas (DDL). Nice! In theory you should be able to query away to your heart’s content. In practice however, you first need to convert your data to Parquet or ORC, partition, bucket, compress, adapt its file size etc. before you are ready to rock. And even then you better be careful with your queries. It’s probably best to avoid joins of large tables altogether.
Is Athena useful as a query service? Personally, I would use Snowflake any time for this use case until AWS Athena adds more automation and lives up to the promises of schema on read. With Snowflake you have to perform considerably less ETL work. At the same time you get superior functionality and performance for a similar or even lower price tag.
Have you had any experience with AWS Athena, PrestoDB, or Snowflake? I would love to hear from you in the comments (scroll down).