Snowflake is a Bozeman, Montana-based cloud-based data warehousing platform founded in 2012 by three partners, two of which were ex-Oracle data architects. The Snowflake platform provides a fast, scalable solution for storing, processing, and analyzing large volumes of data. Snowflake is a fully managed SaaS product, but the customer can choose the cloud provider and the region where the Snowflake instance will be running. Snowflake is available on Google Cloud Compute (GCP), Amazon Web Services (AWS), and Microsoft Azure (Azure). Since Snowflake is fully managed software, it involves almost no maintenance, and many companies run Snowflake without a DBA (database administrator.) Snowflake’s pricing model is also pay-as-you-go for both computing and storage. This means you can pay for storage and computing power separately and only pay for what you use.
Most traditional databases force computing memory and storage to be scaled together with the database’s server. These databases use virtual (or physical) server resources like computing power (CPU), memory (RAM), and storage (Disk Space). If additional storage, memory, or computational power is needed, the server has to be taken offline for maintenance (or at least restarted), and additional resources must be added while the server is offline.
Snowflake avoids this problem by grouping compute, memory, and temp storage into a “warehouse” and separating all other non-temporary storage. This allows for the ability to access data through 1 or more warehouses (allowing for scale-out and scale-up). This also means that warehouses can be scaled up or down with basically zero downtime. You can also have many separate warehouses so the warehouse can be matched to the workload needed for a particular tool or task. Need something fast now? You can use a larger warehouse right now. Want to constraint a task or tool? - use the smallest warehouse size.
Separation of storage from the compute and memory layer is one of the most compelling features of Snowflake. The warehouse architecture (separate compute) gives lots of flexibility and the convenience of easy scaling, depending on your needs. The separate storage also provides flexibility and has the nice side effect of being extremely affordable, not much more than the underlying storage layer used by Snowflake (i.e., Azure Blob, Amazon s3, or Google Cloud Storage).
Snowflake optimizes queries automatically. Snowflake can automatically compile and optimize queries and perform automatic data clustering, caching, and query profiling. These features maximize query performance automatically, reducing the need for manual intervention from database admins or developers.
Snowflake has granular access controls with row-level security, data encryption at rest and in transit, and native integration with security services such as Duo for multi-factor authentication. Snowflake is also SOC 2, PCI DSS, HIPAA, and GDPR compliant.
Snowflake and Redshift can both handle large amounts of data, and they both have columnar data architectures. In a columnar database, queries run faster when you select fewer columns in your query, which is NOT true of a traditional RDBMS like MySQL. While both tools are equipped for big data, Snowflake has several advantages:
As mentioned previously, YES, you can store semi-structured data. Snowflake natively supports JSON, JSON-B (Binary), Parquet, Avro, XML, and ORC. This is especially helpful if you have committed to an ELT (Extract Load Transform) architecture. Since Snowflake can store semi-structured data, you can more easily ingest and store data in its original format without pre-processing or transforming it. Snowflake also performs late schema binding, which means it can automatically recognize and handle the structure of the data at query time. This means if a number is stored in JSON as text, it can be easily cast to numeric at query runtime without any query performance penalty.
Snowflake’s query optimizer also takes advantage of the underlying metadata and statistics of the semi-structured data. It leverages various techniques to eliminate unnecessary data from the query process, which improves query performance. Snowflake also stores indexing and statistics on semi-structured data that it can also utilize to optimize query plans.
Querying semi-structured data can be accomplished using Snowflake’s built-in JSONPATH and other functions designed to structure data into a tabular format. You can also join this data with other Snowflake data already in a standard SQL table or view it in the same query. These queries can be executed, and Snowflake will create efficient query plans to query both the structured and unstructured data.
Lastly, since Snowflake’s storage layer utilizes columnar storage and other compression techniques, Snowflake can more efficiently scan and retrieve relevant data, which reduces I/O operations and improves performance.
Yes - you can use Snowflake as part of a data lake-house architecture. Snowflake can be used as a data storage layer within your Data Lakehouse. As mentioned previously, Snowflake supports various data formats, including structured, semi-structured (e.g., JSON, JSON-B, Parquet), and even external tables that reference data stored in Amazon S3 or another external cloud storage system.
Integrating Snowflake with Delta Lake is one of the most interesting ways to use it as part of your data lake house. Delta Lake is an open source ACID compliant (Atomicity, Consistency, Isolation, Durability) SQL-based access layer for data stored in various locations (S3, relational databases, etc.). Snowflake supports the ability to query Delta Lake as a source via Snowflake external tables.
A few advantages of using Snowflake as part of your Data Lakehouse architecture are:
1) Single go-to source of truth for your data 2) Standardized data Security/Governance (all managed by Snowflake) 3) Ability to use Snowflake dynamic compute resources on demand and pay as you go.
Although being part of a data Lakehouse architecture isn’t a primary use-case for Snowflake, and some of the features are still new and have some work to be truly first class, this is an exciting and compelling use case for companies that can’t (or don’t want to) store all of their data in Snowflake.
Snowflake is a premier cloud-based data warehousing platform known for its scalable, fully managed, and pay-as-you-go architecture. It also separates compute from storage, automatically optimizes queries, and offers robust security features allowing for flexible scaling and minimal downtime. Snowflake differentiates itself from Redshift with more flexible scalability, query optimization, and better semi-structured data handling.
Lastly, Snowflake can be part of a data Lakehouse architecture, including integration with technologies like Delta Lake.
Snowflake’s support for semi-structured data formats enables easy storage and querying of JSON, Parquet, Avro, XML, and ORC data. It also allows users to use SQL and SQL functions to parse semi-structured data and join it with structured data for analysis and data visualization.