[2025] Understanding Snowflake Query Engine Internals
You may have heard that Snowflake's query engine is cloud-native, built from scratch rather than on top of an existing engine, and delivers strong performance, making Snowflake a strong choice as an AI data cloud platform. But have you ever wondered how Snowflake's query engine actually works? If so, this article is just for you.
Starting with the basics….
Life of a Query in Snowflake
The life of a query in Snowflake starts with the client sending a query to the Snowflake engine. The first thing the query engine does is a result cache lookup: if a cached result is found, it returns that output for the query directly.
If the result set is not available in the cache, the query planner and optimizer process the query. They use metadata to prune and filter the data, identifying exactly which records need to be processed.
After that, the virtual warehouse scans only the needed data from local SSD cache or blob storage, processes it, and returns it to the cloud services layer.
Finally, the result is stored in the result cache for future use and returned to the client.
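To make this concrete, here's a minimal sketch (using a hypothetical ORDERS table) of how the result cache comes into play: if the identical query text is re-run while the underlying data is unchanged, Snowflake can serve the answer from the result cache in the cloud services layer without touching the warehouse at all.
-- First run: compiled, executed on the warehouse, and the result is cached
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region;
-- Identical re-run (same text, unchanged data): typically answered from the
-- result cache, so no warehouse compute is needed
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region;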
How Does Query Compilation Work in Snowflake?
The query compiler is the upper layer of the query engine. The query text is given as input to the parser, which checks for syntax errors and produces output in a format called the Query Block Internal Representation.
Think of it like: Grammar check and sentence breakdown
The parser's output then goes through semantic analysis, which performs type checking and name resolution. This stage also applies row access policies and other security policies.
Think of it like: Making sure your sentence makes logical sense.
Next comes the logical rewrites stage. It optimizes the query by rewriting it into a logically equivalent but more efficient version, for instance by removing redundant expressions or flattening nested subqueries.
Think of it like: Saying the same thing more smartly.
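As a rough illustration (the exact rewrites Snowflake applies are internal; the CUSTOMERS and ORDERS tables here are hypothetical, and CUSTOMER_ID is assumed to be unique in CUSTOMERS), a nested subquery can be rewritten into a logically equivalent join-style form:
-- Original form with a nested IN subquery
SELECT c.customer_id, c.name
FROM customers c
WHERE c.customer_id IN (SELECT o.customer_id FROM orders o WHERE o.amount > 100);
-- Logically equivalent flattened form the optimizer could work with instead
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o
  ON o.customer_id = c.customer_id
WHERE o.amount > 100;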
The next stage is micro-partition pruning, handled by an internal component called the Pruner, which is invoked multiple times during compilation.
It determines which micro-partitions are relevant to your query, based on metadata like min/max values.
Think of it like: A massive performance booster that avoids unnecessary data scans.
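For example (hypothetical SALES table), a simple range filter gives the Pruner something to compare against each micro-partition's min/max metadata, so partitions whose SALE_DATE range falls entirely outside the filter are skipped without ever being scanned:
-- Only micro-partitions whose SALE_DATE min/max overlaps January 2025
-- need to be read; the rest are pruned using metadata alone
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2025-01-01' AND '2025-01-31';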
Initial plan generation builds the first logical plan using relational algebra. It's not optimized yet, just a functional breakdown of what needs to happen.
Think of it like: A raw to-do list for your query.
Plan rewrites apply internal optimizations to the initial logical plan, such as predicate pushdown, join simplification, and aggregation pushdown. These reduce data movement and improve execution performance.
Think of it like: Tidying and organizing your to-do list for max efficiency.
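Here's a conceptual sketch of predicate pushdown (hypothetical CUSTOMERS and ORDERS tables; the engine performs this rewrite internally, you don't write it yourself):
-- As written, the filter sits "above" the join
SELECT c.name, o.amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01';
-- After pushdown, ORDERS is filtered before it reaches the join,
-- which is roughly what this explicit form expresses
SELECT c.name, o.amount
FROM customers c
JOIN (
  SELECT customer_id, amount
  FROM orders
  WHERE order_date >= '2025-01-01'
) o ON o.customer_id = c.customer_id;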
Cost-Based Join Ordering evaluates multiple join orders and calculates a cost estimate. It then chooses the join order with the lowest estimated cost.
Think of it like: Choosing the fastest route based on traffic conditions.
Finally, the last stage is physical plan generation. It converts the optimized logical plan into a physical execution plan that includes the actual operators, the distribution strategy, and resource assignments.
Think of it like: Turning your travel plan into GPS directions with real roads and turns.
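You can peek at the outcome of these compilation stages yourself with the EXPLAIN command, which prints the execution plan (operators, join order, pruning estimates) without running the query. The tables below are hypothetical:
EXPLAIN
SELECT c.name, SUM(o.amount) AS total_amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.name;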
Understanding the Query Execution Layer of the Query Engine
The query execution layer is the place where the actual work happens based on the compiled and optimized query plan.
It has 3 major components:
Columnar execution:
Snowflake stores and processes data in columnar format (not row-based).
This provides faster reads when you are only querying specific columns. It also provides better compression and efficient scanning since values in a column tend to be similar.
Think of it like: Reading only the chapter you need from a book instead of the whole book.
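A quick illustration (hypothetical ORDERS table): because data is stored column by column, a query that names only the columns it needs reads far less data than a SELECT *:
-- Reads only the ORDER_ID and AMOUNT column data from each micro-partition
SELECT order_id, amount
FROM orders;
-- Has to read every column, so it scans much more data
SELECT *
FROM orders;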
Vectorized execution:
It processes data in batches instead of row by row. Each vector is a batch of thousands of values from a column. This makes execution CPU cache-friendly and enables SIMD instructions.
For instance, instead of:
amount[0] > 100
amount[1] > 100...
It does:
amount[0:1023] > 100 in one go
Think of it like: Running a washing machine full of clothes (vector) instead of washing one sock at a time (row-based)
Push-Based Flow Control:
This means data producers (like scan operators) push data downstream to consumers (like joins or aggregations) as fast as the consumers can handle it.
This is different from pull-based flow control, where consumers request data from producers. Push-based flow enables pipeline parallelism and allows backpressure handling.
Think of it like: A conveyor belt system in a factory, where items move continuously through each processing stage.
Together, columnar storage, vectorized execution, and push-based flow control make Snowflake's execution engine massively parallel, low-latency, and resource-efficient.
Query Statistics Collection in Snowflake
Snowflake automatically collects statistics (“metadata”) at multiple levels:
Table and micro-partition level
Column Level
Sub-column Level
Snowflake's statistics are always up to date and accurate. They are used as input to the optimizer's cost model and for certain optimizations like pruning and constant folding.
Column and table metadata are cached in the cloud services layer, so users do not have to manage statistics manually.
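One visible effect of this always-available metadata: simple aggregates over a table can often be answered from the statistics alone, without scanning any table data (hypothetical ORDERS table; whether a given query qualifies depends on the columns' data types and the absence of complex predicates):
-- Row counts and min/max values are tracked per micro-partition,
-- so a query like this can typically be served from metadata
SELECT COUNT(*), MIN(order_date), MAX(order_date)
FROM orders;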
Lastly, Understanding Query Plan Terminology
The query plan is a DAG (directed acyclic graph) that consists of operators connected by links.
Operators:
These are the nodes in the graph: common operators such as joins, filters, and aggregations.
Links:
The links are the edges between operators; they are responsible for data exchange and handle the parallel distribution of data.
How to get Query Information?
There are two approaches: you can get query plan information through the Snowsight UI or by using the GET_QUERY_OPERATOR_STATS() table function.
As a best practice, the larger tables should be on the right side of the DAG, meaning the plan should slope toward the right.
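As a small sketch of the second approach (the query and its tables are hypothetical; GET_QUERY_OPERATOR_STATS and LAST_QUERY_ID are standard Snowflake functions):
-- Run the query you want to inspect
SELECT c.name, SUM(o.amount) AS total_amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.name;
-- Then fetch per-operator statistics (rows produced, time spent, pruning, etc.)
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));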
If you are interested in reading about Snowflake's architecture, here's a quick blog for you:
If you are a data architect looking to implement a data lakehouse using Apache Iceberg and Snowflake, here's a quick link for you:
About Me:
Hi there! I am Divyansh Saxena
I am a Snowflake Advanced Certified Data Architect with a proven track record of success with the Snowflake AI Data Cloud. I am highly skilled in designing, implementing, and maintaining data pipelines, ETL workflows, and data warehousing solutions. With advanced knowledge of Snowflake's features and functionality, I have been a Snowflake Data Superhero since 2023. Having spent most of my career on the Snowflake Data Cloud, I have a deep understanding of cloud-native data architecture and can leverage it to deliver high-performing, scalable, and secure data solutions.
Follow me on Medium for regular updates on Snowflake Best Practices and other trending topics:
Also, I am open to connecting all data enthusiasts across the globe on LinkedIn: