Skip to main content

Micro-Partitions

Updated Sep 27, 2022 ·

Overview

Large tables in Snowflake can contain billions of rows, so queries need a way to avoid scanning everything. Instead of traditional indexes, Snowflake avoids full table scans by splitting data into micro-partitions and using metadata to skip unnecessary reads.

  • Only needed columns are read
  • Metadata is used to skip data
  • No manual indexing is required

This design helps Snowflake decide what data to read before scanning anything, which keeps large queries efficient.

Micro-Partitions

Snowflake automatically breaks data into small storage chunks when it is loaded.

  • Data is split into 50 to 500 MB chunks
  • Each column is stored separately
  • Compression is applied automatically

Micro-partitions allow Snowflake to read only the exact data needed for a query instead of scanning full rows. This makes large table queries faster and more efficient without manual tuning.

Each micro-partition stores summary information ("metadata") about its data.

  • Stores min and max values
  • Tracks row counts
  • Tracks null and distinct values

This metadata is managed automatically by Snowflake and is stored in the cloud services layer. It is used to quickly decide whether a partition should be scanned or skipped.

Partition Pruning

Snowflake uses metadata to skip unnecessary partitions during queries. This process is called partition pruning and it improves query performance.

  • Checks min and max values per partition
  • Skips partitions outside filter range
  • Reduces data scanned

In the example below, a query filters sales data for the month of May.

SELECT *
FROM sales
WHERE order_date BETWEEN '2022-05-01' AND '2022-05-31';

Snowflake evaluates each partition before scanning it and skips any partition that does not contain May data. Here, only partition P5 is scanned because its date range matches May. All other partitions are skipped based on metadata.

Note: Each row below represents a micro-partition, not individual rows.

PartitionMin DateMax DateTotal RowsRegionRemarks
P12022-01-012022-01-31120,000EastSkipped if filtering May
P22022-02-012022-02-28115,000WestSkipped if filtering May
P32022-03-012022-03-31130,000NorthSkipped if filtering May
P42022-04-012022-04-30125,000SouthSkipped if filtering May
P52022-05-012022-05-31140,000EastSelected for query
P62022-06-012022-06-30110,000WestSkipped if filtering May
P72022-07-012022-07-31118,000NorthSkipped if filtering May
P82022-08-012022-08-31122,000SouthSkipped if filtering May
P92022-09-012022-09-30119,000EastSkipped if filtering May
P102022-10-012022-10-31121,000WestSkipped if filtering May
P112022-11-012022-11-30117,000NorthSkipped if filtering May
P122022-12-012022-12-31123,000SouthSkipped if filtering May

Clustering

Clustering improves pruning by organizing how data is stored across partitions.

  • Similar values are grouped together
  • Improves filter performance
  • Works best on frequently queried columns

If data is loaded in order, such as by date, pruning works naturally. If values are scattered, pruning becomes less effective.

We can also define a clustering key to optimize future queries on specific columns. This is especially useful when the column is not naturally clustered and pruning is inefficient.

When to Use a Clustering Key

A clustering key should only be used in the following scenarios:

  1. Table is very large (100GB+)
  2. Queries frequently filter the same column
  3. Existing pruning is inefficient

Clustering is not useful for small tables or rarely queried columns. It should be applied only when it clearly improves performance.

The sweet spot for a clustering key is moderate cardinality: enough distinct values to create meaningful partition boundaries, but not so many that every row lands in its own group. Columns like dates and country codes hit that balance.

Cardinality and Cost Considerations

Note that not all columns are good clustering candidates.

  • Low distinct values reduce effectiveness
  • High cardinality improves clustering results
  • Re-clustering adds compute cost

Snowflake automatically maintains clustering in the background using its own compute resources.

This improves performance over time but also adds cost, so clustering should be used intentionally rather than by default.