What is Snowflake DataWarehouse and how does it work - The Architecture
Now that we all agree, Hadoop and big data solutions are complicated enough, that companies spend a fortune on hiring people and needs a lot of tuning to make them stable and work really well. With growing data needs, companies need a highly scalable and cost-efficient way to store their huge piles of data.
Snowflake is DataWarehouse on the Cloud. At the time of writing this article, Snowflake is available on Amazon's AWS and Microsoft Azure clouds. Snowflake, a company founded by former employees of Oracle, Microsoft, and Vectorwise, offers a scalable data warehouse as a service from the cloud. Their selling point is the separation of database disk capacity from CPU performance, and they offer this at a cost-effective price. You only pay for the capacity and performance you use. Snowflake stores the data in micro-partitions, compressed, encrypted and in columnar format.
Wait, Is it a real deal?
I tried their solution in a project and here's what I loved:
- Maintenance - This is my favorite part. There are no PRIMARY KEYS, INDEXES, COLLECT STATS or other table tuning required in Snowflake. You simply create a table and start loading data. And for better, you don't need the infrastructure team and the DBA's to maintain it
- Simplicity - The solution is so simple and intuitive. If you already know SQL, there is a very tiny learning curve. And there's nothing much changed from the developer's perspective. It supports ANSI SQL Standards and most of your existing reports should work fine without any issue
- Unlimited Capacity (Storage) - Unlike traditional databases, you don't have to allocate space to the databases, you simply create a database and tables and start loading data. The capacity will not run out. They never even ask me how much capacity I need. And actually, I don’t even care, since capacity is inexpensive. All too inexpensive
- Unlimited Scalable Compute - In Snowflake, The Virtual warehouses come in T-shirt sizes, you can choose an XS / Small / Medium / XL / 2XL ...4XL. And you can enable autoscaling to scale proportionately to your load. And you only pay for the active usage, when you no more need them you can suspend them
Snowflake charms with its simplicity
- Compression is enabled by default
- All the data is encrypted by default (Even Amazon/Snowflake itself can't read your data)
- No more bothering with Indexes, Collect stats, Primary keys. They aren't even available in Snowflake. You simply load the data and Snowflake just performs without any of that geeky stuff
- Snowflake is Columnar by default, so it only cares about that 5 columns in your report even if your table has a zillion columns
- Your ETL / Testing teams need a clone of Production data for testing? No problem its just a command away (and it does that without even duplicating the data)
How does Snowflake really work?
Snowflake applies the best practices of AWS and has built a very cost-effective and scalable service on top of them. The S3 service is inexpensive, stable and scalable for storing large volumes of data, and launching EC2 instances in the cloud on an as-needed basis makes a “pay-per-use” model possible
There are three layers in Snowflake (All in the cloud):
- Storage layer: Data in Snowflake is micro-partitioned, compressed and encrypted in the AWS S3 service and connected to through virtual data warehouses (Scales horizontally)
- Virtual warehouse layer: This is the compute layer of the snowflake. Virtual data warehouses retrieve and review the data and return a result set to the user. The technology applied by the virtual data warehouses is covered by Snowflake’s IPR, but in practice, the warehouses are EC2 instances in Auto Scaling groups that have huge amounts of RAM enabling caching of large masses of data. You can use a Virtual warehouse for your ETL needs and a separate Virtual warehouse for your reporting, the data is shared between the two warehouses but not the computing power.
- Services layer: This is the snowflake's secret sauce and what they claim as their brain. The services layer for Snowflake authenticates user sessions, provides management, enforces security functions, performs query compilation and optimization, and coordinates all transactions. The services layer is constructed of stateless compute resources, running across multiple availability zones and utilizing a highly available, distributed metadata store for global state management
Not convinced yet, look at their client list: