Sale: Use codesave50for 50% off
Back to Blog

How to Fix Slow ClickHouse Queries

JayJay

ClickHouse will scan a billion rows in two seconds on hardware that would crush Postgres. It will also run the same query for fifteen minutes if you give it the wrong primary key. The difference between those two outcomes is almost entirely about how you laid out the table, and almost none about hardware. ClickHouse is the most performance-sensitive database to schema design that most engineers will ever touch.

This is what makes ClickHouse performance work different. There's no "add an index and the query gets fast" escape hatch. The primary index is the ORDER BY of the table, baked in at create time. If your queries don't match it, you're in for a refactor, not a tuning session. The good news: the system gives you exceptional visibility into what it's doing, with system.query_log recording every query's costs in microsecond detail. This guide walks through the diagnostic tools, the design choices that matter most, and the patterns that move queries from minutes to milliseconds.

Why ClickHouse queries get slow

Four patterns drive almost every slow-query report:

  1. The query doesn't match the primary key. ORDER BY (customer_id, created_at) lets you efficiently filter on customer_id. A query filtering on email scans every part. There is no equivalent of "add an index" that fixes this cheaply.

  2. Partition pruning isn't kicking in. A query against a time-partitioned table that doesn't include the partition column in the filter scans every partition. Wrap a date in a function and partition pruning disappears.

  3. Too many small parts. Frequent small inserts produce a flood of parts that the merge thread can't keep up with. Queries that should consider 50 parts consider 5000.

  4. A join that should have been denormalized. ClickHouse's default hash join loads the right side into memory. Joining two large tables either blows up memory or runs sequentially. Dictionaries and denormalization are the answer, not better tuning.

The diagnostic process is figuring out which of those four you're hitting, then either redesigning the table, fixing the data layout, or restructuring the query. The rest of this post walks through each in detail.

Step 1: The diagnostic stack

ClickHouse exposes everything through system tables. There's no extension to install, no profiler to enable. The visibility is the most generous of any database in this comparison series.

system.query_log

Every query the server runs lands in system.query_log with full details: the SQL, the duration, the rows and bytes read, the memory used, the ProfileEvents map. Enable it if it isn't already (it's on by default in modern versions):

SQL
SELECT * FROM system.tables WHERE name = 'query_log';

The basic top-N slow queries:

SQL
SELECT
  event_time,
  query_duration_ms,
  read_rows,
  formatReadableSize(read_bytes) AS read_size,
  formatReadableSize(memory_usage) AS mem,
  result_rows,
  query
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 20;

The type column distinguishes successful runs from failures. Worth checking:

  • QueryStart: query started.
  • QueryFinish: completed successfully.
  • ExceptionBeforeStart: failed during parsing or analysis.
  • ExceptionWhileProcessing: failed during execution.

For aggregate analysis across normalized query shapes:

SQL
SELECT
  normalized_query_hash,
  any(query) AS sample_query,
  count() AS calls,
  sum(query_duration_ms) AS total_ms,
  round(avg(query_duration_ms), 1) AS avg_ms,
  formatReadableSize(sum(read_bytes)) AS total_read
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 DAY
  AND type = 'QueryFinish'
GROUP BY normalized_query_hash
ORDER BY total_ms DESC
LIMIT 20;

normalized_query_hash (since 21.2 or thereabouts) groups queries that differ only in literal values. So all variants of SELECT ... WHERE id = ? collapse into one row. Sort by total time to find queries that are individually fast but cumulatively expensive.

ProfileEvents and Settings columns

The ProfileEvents column in system.query_log is a map of detailed counters. Useful ones:

  • Query: number of queries.
  • SelectQuery, InsertQuery: by type.
  • SelectedParts, SelectedRanges, SelectedMarks: how much of the table was scanned. This is the closest equivalent of "buffers" in Postgres.
  • FileOpen, OpenedFileCacheHits, OpenedFileCacheMisses: file I/O.
  • OSReadBytes, OSReadChars: real disk reads.
  • MarkCacheHits, MarkCacheMisses: index cache effectiveness.
  • UncompressedCacheHits, UncompressedCacheMisses: data cache.

For a specific query:

SQL
SELECT ProfileEvents
FROM system.query_log
WHERE query_id = 'abc-123' AND type = 'QueryFinish';

The Settings column shows which settings were in effect for that query. Useful when a query is fast in one client and slow in another; check whether max_threads or max_memory_usage differ.

system.processes for live queries

Queries running right now:

SQL
SELECT
  query_id,
  elapsed,
  read_rows,
  total_rows_approx,
  formatReadableSize(memory_usage) AS mem,
  substring(query, 1, 80) AS query
FROM system.processes
ORDER BY elapsed DESC;

total_rows_approx is ClickHouse's estimate of how many rows the query will end up reading. Compare to read_rows to get a rough progress percentage.

Kill a runaway:

SQL
KILL QUERY WHERE query_id = 'abc-123';

The kill is best-effort. Some operations (the synchronous part of an insert, for instance) can't be killed mid-flight.

system.metric_log for time-series telemetry

For longer-term performance analysis, enable metric_log:

XML
<metric_log>
    <database>system</database>
    <table>metric_log</table>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    <collect_interval_milliseconds>1000</collect_interval_milliseconds>
</metric_log>

It records server-level metrics every second. Useful for "what was happening around the time the slow queries spiked?" Cross-reference timestamps with query_log.

trace_log for the deepest dive

The trace_log records sampled stack traces. Useful when you have a slow query and the per-query stats don't explain why:

SQL
SELECT count(), arrayStringConcat(arrayMap(x -> demangle(addressToSymbol(x)), trace), '\n')
FROM system.trace_log
WHERE event_time > now() - INTERVAL 5 MINUTE
  AND trace_type = 'CPU'
GROUP BY trace
ORDER BY count() DESC
LIMIT 10;

You get the most-frequently-sampled call stacks. This is closer to a profiler than most databases offer. Useful for "ClickHouse itself is bottlenecking" investigations.

Send-logs-level trace from the client

For an individual query, the client can tell the server to stream back its internal logs:

BASH
clickhouse-client --send_logs_level=trace --query="SELECT ..."

You'll see which parts and granules ClickHouse opened, which skip indexes fired, where time went stage-by-stage. Verbose, but unmatched for debugging a single query end-to-end.

clickhouse-benchmark

For comparing query performance across runs or settings:

BASH
clickhouse-benchmark --concurrency 4 --iterations 100 \
  --query "SELECT count() FROM events WHERE customer_id = 12345"

You get percentile latencies, throughput, and basic statistics. Useful for confirming an optimization helped, rather than guessing from a single timed run.

Step 2: EXPLAIN in depth

ClickHouse has multiple EXPLAIN modes. Each tells you something different.

EXPLAIN AST and EXPLAIN SYNTAX

For confirming what the parser saw and how it rewrote your query:

SQL
EXPLAIN SYNTAX
SELECT customer_id, sum(total)
FROM orders
WHERE status = 'shipped'
GROUP BY customer_id;

Useful when a query is misbehaving and you suspect a rewrite the optimizer applied isn't what you intended. Rarely the first stop for performance, but worth knowing.

EXPLAIN PLAN

The logical plan:

SQL
EXPLAIN PLAN
SELECT customer_id, sum(total)
FROM orders
WHERE status = 'shipped'
GROUP BY customer_id;

Shows the sequence of logical steps. The default doesn't include estimates; pass actions = 1 for more detail.

EXPLAIN PIPELINE

The physical execution pipeline, showing how many threads each stage runs on:

SQL
EXPLAIN PIPELINE
SELECT customer_id, sum(total)
FROM orders
WHERE status = 'shipped'
GROUP BY customer_id;

If a stage runs on one thread when you expected parallelism, that's a clue. Common causes: the table is too small to parallelize, max_threads is too low, or the query needs sorted output that prevents parallel scan.

EXPLAIN indexes = 1

The most useful EXPLAIN mode for performance work. Shows which parts and granules the query selects, broken down by which index pruned them:

SQL
EXPLAIN indexes = 1
SELECT * FROM orders
WHERE customer_id = 12345
  AND created_at >= '2026-06-01';
ReadFromMergeTree (default.orders)
  Indexes:
    PrimaryKey
      Keys:
        customer_id
      Condition: (customer_id in [12345, 12345])
      Parts: 4/40
      Granules: 18/12000
    Partition
      Keys:
        toYYYYMM(created_at)
      Condition: (toYYYYMM(created_at) in [202606, 202606])
      Parts: 4/40
      Granules: 18/120
    Skip
      Name: idx_email
      Description: bloom_filter GRANULARITY 4
      Parts: 2/4
      Granules: 6/18

The two numbers per index: 4/40 parts means 4 of 40 were used. 18/12000 granules means 18 of 12000 granules (~8192 rows each by default) were read. Closer to the total means less pruning.

If the primary key shows Parts: 40/40, Granules: 12000/12000, the primary key didn't prune anything. Either the query doesn't filter on the leftmost columns of ORDER BY, or the filter is in a form ClickHouse can't push down to the index.

EXPLAIN ESTIMATE

For a quick cost estimate without running the query:

SQL
EXPLAIN ESTIMATE
SELECT count() FROM events WHERE customer_id = 12345;
default | events | parts: 4 | rows: 147456 | marks: 18

Useful for "if I add this filter, how much smaller will the read be?"

Graph format for complex pipelines

SQL
EXPLAIN PIPELINE graph = 1, compact = 0
SELECT ...;

Outputs a DOT graph that you can render with graphviz. For pipelines with 20+ stages, the graph is much more readable than the text:

BASH
clickhouse-client --query="EXPLAIN PIPELINE graph = 1 SELECT ..." | dot -Tsvg > plan.svg

Step 3: Primary key design (this is the index)

In ClickHouse, the ORDER BY of a MergeTree table is the primary key, and it's the closest equivalent of an SQL composite index. There is no separate CREATE INDEX step for the primary index. It's baked into the table at create time and changing it later means rewriting the table.

So you get one shot at this. Make it count.

How the primary index works

ClickHouse sorts rows on disk by the ORDER BY columns and creates a sparse index: one entry per 8,192 rows (one "granule") by default. To find data, the server consults the sparse index, identifies which granules might contain matches, and reads those.

Crucially: the index granularity is sparse, not per-row. The smallest unit ClickHouse reads is a granule (8K rows). For high-selectivity equality queries (returning a single row), Postgres beats ClickHouse handily because Postgres has per-row index entries. ClickHouse's strength is OLAP queries that touch many rows but a small fraction of the table.

ORDER BY equals your most common filter and join key

Pick ORDER BY to match how you query the table. For a workload that filters by customer_id and ranges over created_at:

SQL
CREATE TABLE orders (
  order_id UInt64,
  customer_id UInt64,
  status LowCardinality(String),
  created_at DateTime,
  total Decimal(10, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (customer_id, created_at);

Queries that filter by customer_id find the matching granules immediately. Queries that filter by created_at alone don't, but partition pruning saves you (because the partition key includes created_at).

Column order matters, as in a composite SQL index

Like a SQL composite index, the leftmost columns of ORDER BY are most useful for filtering. With ORDER BY (customer_id, created_at):

  • WHERE customer_id = 12345: full primary-key pruning. Fast.
  • WHERE customer_id = 12345 AND created_at >= '2026-06-01': full pruning on both. Fast.
  • WHERE created_at >= '2026-06-01': primary key doesn't help, but partition pruning does. Still fast.
  • WHERE order_id = 78910: primary key doesn't help, partition pruning doesn't help. Scans every part.

Order columns by: highest-cardinality equality column first if it's the most common filter, else the column most queries filter on.

LowCardinality is huge

For string columns with limited distinct values (statuses, country codes, event types), wrap them in LowCardinality:

SQL
status LowCardinality(String)

ClickHouse dictionary-encodes the column. Memory usage drops by 10x or more, and operations on the column (filtering, grouping) get faster because they work on integer codes instead of strings.

The "limited distinct values" threshold is roughly 10,000. Above that, the dictionary becomes large and the win disappears. Don't LowCardinality an email or user ID column.

Granule size

index_granularity defaults to 8,192 rows per granule. The trade-off:

  • Smaller granularity (say, 1024): finer-grained pruning, smaller reads per granule, but larger primary index in memory.
  • Larger granularity (say, 16384): coarser pruning, larger reads per granule, smaller index.

For point-lookup-ish workloads, smaller is better. For full-scan analytics, larger is fine. The default works for most cases. Tune only if you have a specific reason and you've measured.

Don't put high-cardinality columns first

A common mistake: ORDER BY (user_id, event_time) on a table that has billions of unique user_ids and is mostly queried by event_time. The primary index becomes huge (one entry per granule, and granules are small because user_ids are unique), and the queries don't even use it. The right structure depends on access patterns.

Step 4: Partitioning

Partitions are the unit of bulk operations (DROP PARTITION, MOVE PARTITION, attaching from another table) and the unit of partition pruning. Choose deliberately.

Partition by something coarse

The standard pattern is monthly:

SQL
PARTITION BY toYYYYMM(created_at)

That gives you 12 partitions per year. Each partition holds about 1/12 of the data. Queries with a date filter prune to the relevant partitions.

A common mistake: partitioning too finely. Daily partitions on a high-volume table give you 365 partitions per year. Hourly gives you 8,760. Each partition has overhead. The merge thread has to manage more parts. You'll hit "too many parts" errors or slow inserts.

For most workloads, monthly is right. For low-volume tables (under ~100M rows total), don't partition at all. For high-volume time-series with retention policies, daily can be worth it if you DROP PARTITION daily to age out data.

Partition pruning requires direct column comparison

The trap that catches new users:

SQL
-- Partition pruning works
SELECT count() FROM orders
WHERE created_at >= '2026-06-01' AND created_at < '2026-07-01';

-- Partition pruning fails: function on the column
SELECT count() FROM orders
WHERE toYYYYMM(created_at) = 202606;

The second form should logically use the partition. ClickHouse's planner doesn't always recognize the equivalence. Write filters on the raw column. The same applies to formatDateTime, toDate, and other wrappers.

Drop partitions to age out data

The cheapest possible delete in ClickHouse:

SQL
ALTER TABLE orders DROP PARTITION '202601';

Instant. No row-by-row work. This is one of the big reasons to partition by retention boundary. If you keep 12 months of data, partition by month and drop the 13th-month partition nightly.

ALTER TABLE ... DELETE WHERE ... does work, but it's an async, asynchronous mutation that rewrites parts. Slow, I/O-heavy. Avoid for bulk deletes.

Move partitions for storage tiering

SQL
ALTER TABLE orders MOVE PARTITION '202605' TO DISK 'cold_storage';

For tables with hot and cold data, this lets you move old partitions to cheaper storage while keeping them queryable.

Step 5: Skip indexes

For columns that aren't in the ORDER BY but get frequent equality or range filters, skip indexes prune granules before reading. They're stored alongside the primary index but at a coarser level (one entry per N granules).

SQL
ALTER TABLE orders ADD INDEX idx_status status TYPE set(0) GRANULARITY 4;
ALTER TABLE orders ADD INDEX idx_email email TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE orders MATERIALIZE INDEX idx_status;
ALTER TABLE orders MATERIALIZE INDEX idx_email;

MATERIALIZE INDEX is required for the index to be applied to existing data. New writes get the index automatically.

Skip index types

TypeGood forNotes
minmaxNumeric or date columns with localityAlmost free; usually worth adding
set(N)Low-cardinality columnsN is max distinct values per index granule; 0 means unbounded
bloom_filter(p)High-cardinality equality lookupsp is false positive rate
ngrambf_v1(n, size, hashes, seed)LIKE '%substring%' matchesLarge index; use sparingly
tokenbf_v1(size, hashes, seed)Word-based searchTokenizes on word boundaries

GRANULARITY 4 means one skip index entry per 4 granules, or about 32,768 rows. Smaller granularity is finer-grained but bigger; larger is coarser but smaller.

The defaults are conservative. For bloom_filter on a frequently-filtered column, GRANULARITY 1 plus a tight false-positive rate (0.001) gives the fastest pruning at the cost of more index size.

When skip indexes help and when they don't

Skip indexes help when:

  • The filter is selective (returns a small fraction of granules).
  • The column has locality (similar values clustered together physically).

They don't help when:

  • The filter matches a high fraction of rows.
  • Values are scattered uniformly across the table (the skip index can't prune anything).

Always check after adding: EXPLAIN indexes = 1 should show the skip index in the indexes list with non-zero pruning.

Step 6: Projections (alternate physical orderings)

Projections let one logical table maintain multiple physical orderings. Added in 21.6, mature since 22.8 or so. The use case: when one workload wants ORDER BY (customer_id, created_at) and another wants ORDER BY (status, created_at), you can have both:

SQL
ALTER TABLE orders ADD PROJECTION by_status (
  SELECT *
  ORDER BY (status, created_at)
);
ALTER TABLE orders MATERIALIZE PROJECTION by_status;

Queries that filter primarily on status will transparently use the projection. The plan shows it:

SQL
EXPLAIN
SELECT * FROM orders WHERE status = 'pending'
ORDER BY created_at;

Trade-offs:

  • Writes go to every projection. Twice the projections, twice the write cost.
  • Materialized projections take disk space proportional to the size of the projection's columns.
  • Adding a projection to an existing large table is expensive (it has to rewrite).

Projections are powerful but they're not free. Add them when a specific query pattern is critical and the table's primary key can't accommodate it.

Step 7: PREWHERE optimization

PREWHERE is a ClickHouse-specific clause that evaluates a filter on a small column first, then only reads the other columns for matching rows. Because ClickHouse is columnar, this saves I/O on wide tables:

SQL
SELECT order_id, total, large_blob
FROM orders
PREWHERE status = 'shipped'
WHERE created_at >= '2026-06-01';

status is a small LowCardinality(String) column, so reading it for every row is cheap. large_blob is only read for the rows that survive the filter.

The optimizer often does this automatically (the optimize_move_to_prewhere setting controls it, on by default). You'd expect to use PREWHERE explicitly when:

  • You have a small, selective filter and a wide projection.
  • The optimizer chose poorly (verify with EXPLAIN PIPELINE).
  • You want to force the rewrite for clarity.

Don't PREWHERE on a large column or a column that requires expensive decoding. The whole point is to filter cheaply before reading expensive columns.

Step 8: Materialized views for pre-aggregation

If a dashboard runs the same GROUP BY ... SUM(...) every 30 seconds, don't compute it from scratch. Pre-aggregate with a materialized view backed by an aggregating merge tree:

SQL
CREATE MATERIALIZED VIEW orders_daily_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (day, customer_id)
AS SELECT
  toDate(created_at) AS day,
  customer_id,
  count() AS orders_count,
  sum(total) AS total_sum
FROM orders
GROUP BY day, customer_id;

The view is populated incrementally as inserts arrive on the source table. The dashboard reads from orders_daily_mv instead of orders. Aggregation cost is amortized across writes.

Engine families

For different aggregation semantics:

  • SummingMergeTree: sums numeric columns when merging rows with the same ORDER BY key. Use for additive metrics.
  • AggregatingMergeTree: stores intermediate aggregation states; use for non-additive aggregations (counts of distinct, quantiles).
  • ReplacingMergeTree: keeps only the latest row per ORDER BY key. Use for "current state" views.
  • CollapsingMergeTree, VersionedCollapsingMergeTree: for CDC-style upserts and deletes.

The trap: rows merge eventually, not immediately. A SELECT from a SummingMergeTree may see un-merged rows until the background merge runs. Use SELECT ... FINAL to force the merge at read time (slow), or use the SimpleAggregateFunction types and sumMerge-style aggregation in the query.

When to use materialized views vs projections

Both pre-compute. The difference:

  • Materialized views are separate tables with their own schema. You read from them by name. They support different aggregations and shapes.
  • Projections are alternate orderings of the same table. The optimizer uses them transparently.

For "I need this query 1000x faster," materialized view. For "I need this table queryable in two different sort orders," projection.

Step 9: Joins (the hard one)

ClickHouse joins are not Postgres joins. The default JOIN algorithm is hash: the right table is loaded entirely into memory, then the left is streamed against it. This is fine when the right side is small. It blows up when the right side is large.

Algorithms

  • hash (default): right side in memory. Fast for small right side, memory-bound for large.
  • parallel_hash: like hash but parallelized. Default in newer versions for some configurations.
  • partial_merge: sort-merge join with sorted streams. Can spill to disk; slower than hash when memory permits.
  • grace_hash: hash join with disk spilling. Newer; useful for joins that don't fit in memory.
  • direct: lookup against a dictionary (see below).

Override the algorithm per query:

SQL
SET join_algorithm = 'parallel_hash';

Dictionaries for reference data

The right answer for joining a fact table to small reference data: dictionaries. A dictionary is a memory-resident lookup table, loaded once from another source (a table, file, MySQL, HTTP), and queried via dictGet:

SQL
CREATE DICTIONARY countries_dict (
  id UInt32, name String, region String
) PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'countries'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED());

SELECT
  customer_id,
  dictGet('countries_dict', 'name', country_id) AS country
FROM customers;

Dictionaries are dramatically faster than the equivalent JOIN. For lookup data under a few million entries, this is the right pattern. Layouts include HASHED, FLAT, CACHE, COMPLEX_KEY_HASHED, each with different memory and lookup trade-offs.

Denormalization

The other right answer: store the joined field on the fact row at write time. Disk is cheap, joins are expensive, ClickHouse is fine with redundancy. Denormalize aggressively at the source-table level. If joins on the hot path are showing up in your slow query log, the schema is asking to be denormalized.

GLOBAL JOIN for distributed queries

On a distributed cluster, a regular JOIN is evaluated locally on each shard. For most queries that's wrong: each shard only knows its own data. GLOBAL JOIN materializes the right side on the initiator and ships it to each shard:

SQL
SELECT * FROM distributed_orders
GLOBAL ANY LEFT JOIN dim_customers USING customer_id;

GLOBAL is the right default for joins on distributed tables. Without it, results can be wrong.

Use ANY LEFT JOIN when you can

ANY LEFT JOIN returns at most one matching row per left-side row, even if multiple right-side rows match. It's faster than ALL LEFT JOIN because it can short-circuit. Use it when you know the right side is unique (a dimension table), or when you genuinely want "any one matching row":

SQL
SELECT o.*, c.name
FROM orders o
ANY LEFT JOIN customers c ON o.customer_id = c.id;

If you need every match, ALL LEFT JOIN. Pick the one that fits your semantics; don't default to ALL if ANY works.

Step 10: Query rewrites

Mechanical patterns that change runtimes.

Never SELECT * on wide tables

ClickHouse is columnar. SELECT * on a 50-column table reads 50 columns from disk and decompresses them. Even if you're going to discard most columns in the client, the work has been done. Project only what you need:

SQL
-- Slow: reads 50 columns
SELECT * FROM events WHERE customer_id = 12345;

-- Fast: reads three columns
SELECT event_id, event_type, created_at
FROM events WHERE customer_id = 12345;

This is the single highest-impact rewrite for wide-table workloads.

Use sampling for ad-hoc analysis

For exploratory queries on huge tables, sampling gives statistically meaningful results in a tiny fraction of the time:

SQL
SELECT count(), avg(total)
FROM orders SAMPLE 0.1
WHERE status = 'shipped';

Requires the table to be created with a sampling expression, like SAMPLE BY cityHash64(customer_id). The 0.1 reads 10% of granules.

For "what's the rough distribution?" questions, sampling is the difference between a 30-second query and a 3-second query. Reserved for ad-hoc work; production aggregations should hit materialized views.

FINAL is a code smell

SELECT ... FINAL forces merging at query time on ReplacingMergeTree or CollapsingMergeTree. It's correct but slow because it can't parallelize as well. For high-volume queries:

  • Design the schema so aggregations tolerate duplicates.
  • Run OPTIMIZE TABLE ... FINAL periodically in background and read without FINAL (accepting brief inconsistency).
  • Use a different engine that doesn't require merge-time resolution.

Liberal use of FINAL is a sign the data model needs revisiting.

COUNT vs COUNT DISTINCT

  • count(): fast, counts rows.
  • count(column): fast, counts non-null values.
  • count(DISTINCT column): slow, requires deduplication.
  • uniqExact(column): same as count distinct, slow, exact.
  • uniq(column): approximate (HyperLogLog), much faster.
  • uniqCombined(column): similar to uniq, slightly different algorithm.
  • uniqHLL12(column): pure HLL, slightly less accurate but faster.

For dashboards, use uniq unless you absolutely need exact counts. The error is typically under 1% on cardinalities above a thousand.

Avoid OR; use IN where possible

Like Postgres, OR across different columns can confuse the planner. IN for the same column is always cheaper:

SQL
-- Slow
SELECT * FROM orders WHERE customer_id = 1 OR customer_id = 2 OR customer_id = 3;

-- Fast
SELECT * FROM orders WHERE customer_id IN (1, 2, 3);

For OR across columns, sometimes splitting into a UNION ALL helps. Verify with EXPLAIN indexes = 1 whether the primary key is used.

Subqueries vs JOIN

For "rows in table A where matching column is in a small set from table B," a subquery with IN is often faster than a join:

SQL
SELECT * FROM events
WHERE customer_id IN (
  SELECT id FROM customers WHERE plan = 'enterprise'
);

The right-hand subquery materializes and becomes a hash set. Compare to the JOIN equivalent and pick the faster.

Step 11: Insert patterns

Insert behavior is where many ClickHouse deployments first get into trouble.

Batch inserts, not row-by-row

The single most important insert rule: never insert one row at a time. ClickHouse creates one part per INSERT statement. A million single-row inserts creates a million parts. The merge thread can't keep up, and you'll see:

Too many parts (300). Merges are processing significantly slower than inserts.

Batch inserts at the application level:

  • Buffer rows in memory.
  • Insert in batches of at least 10,000 rows (preferably 100,000 or more).
  • Flush on time or size threshold.

The standard guidance is "1 insert per second per table" at most, with the largest possible batch each time.

Async inserts

If you can't batch on the client side, async inserts let ClickHouse buffer them server-side:

SQL
SET async_insert = 1;
SET wait_for_async_insert = 0;

The server buffers incoming inserts and flushes them in larger blocks. Trade-off: a brief delay between insert acknowledgment and data being queryable, and the small risk of data loss if the server crashes before the buffer flushes (mitigated by wait_for_async_insert = 1, which is safer but slower).

Async inserts have improved a lot in recent ClickHouse versions. They're now the default-recommended path for OLTP-style write patterns.

The Buffer table engine

The legacy alternative: a Buffer table is a wrapper that holds inserts in memory and flushes to a destination table on conditions:

SQL
CREATE TABLE events_buffer AS events
ENGINE = Buffer(default, events, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

Apps insert into events_buffer; the buffer flushes to events. Buffer tables have edge cases (data isn't queryable during merges in some configurations, schema changes are awkward), so async inserts are generally preferred today. Still useful for legacy setups.

Monitor part counts

SQL
SELECT
  database,
  table,
  count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

A healthy MergeTree table has parts in the hundreds, not thousands. If a table has thousands of active parts, inserts are too small or too frequent. Tune the batch size.

Step 12: Configuration

A handful of settings move the needle.

max_threads

How many threads a single query uses. Default is the number of physical CPU cores. The trade-off:

  • Higher: faster individual queries, more memory consumption, fewer concurrent queries possible.
  • Lower: slower individual queries, more concurrent capacity.

For a few-big-users workload, raise it. For a many-small-queries workload, lower it. Set per-query if both patterns coexist:

SQL
SET max_threads = 4;

max_memory_usage

Per-query memory cap. Default is 10GB. Big GROUP BY aggregations on high-cardinality columns blow through this:

SQL
SET max_memory_usage = 20000000000;

If individual queries need more memory than the host has, you have a different problem.

Spilling to disk for GROUP BY

max_bytes_before_external_group_by and max_bytes_before_external_sort let aggregations and sorts spill to disk when they exceed memory:

SQL
SET max_bytes_before_external_group_by = 10000000000;
SET max_bytes_before_external_sort = 10000000000;

The query gets slower but completes instead of failing. Set in profile settings, not per query, since you usually want this consistently.

optimize_read_in_order

Lets ClickHouse take advantage of ORDER BY matching the sorting key, avoiding a re-sort:

SQL
SET optimize_read_in_order = 1;

On by default in modern versions. Verify with EXPLAIN PIPELINE that the query stage isn't doing a redundant sort.

parts_to_throw_insert

This setting controls when ClickHouse refuses new inserts because there are too many active parts:

SQL
parts_to_throw_insert = 300

If you're hitting this regularly, bumping the threshold is a band-aid. The real fix is bigger batches.

Settings profiles

For production, define settings profiles in users.xml rather than SET-ing per session. A readonly profile, a dashboard profile with low max_threads, an etl profile with high max_memory_usage. Each user is bound to a profile, and queries inherit its limits.

Step 13: Common slow query patterns

Too many small parts

The defining ClickHouse anti-pattern. Symptom: insert errors, slow queries, the merge thread consistently behind. Cause: inserts that aren't batched. Fix: batch inserts of at least 10K rows, or enable async_insert.

Wrong ORDER BY for the access pattern

A table created with ORDER BY id doesn't help a query that filters by customer_id. EXPLAIN indexes = 1 shows full part/granule scanning. Fixes:

  • Add a projection for the alternate access pattern.
  • Rewrite the table with a better key. Painful but sometimes necessary.

Missing partition pruning

The query doesn't filter on the partition column, or filters on a function of it:

SQL
-- Partition pruning fails
WHERE toYYYYMM(created_at) = 202606

-- Partition pruning works
WHERE created_at >= '2026-06-01' AND created_at < '2026-07-01'

Always filter on raw partition columns directly. Check with EXPLAIN indexes = 1.

High-cardinality GROUP BY

A GROUP BY user_id on a billion-row table with 100M distinct users needs huge memory. Options:

  • Pre-aggregate via a materialized view.
  • Use uniqCombined instead of uniqExact for cardinality.
  • Allow spill with max_bytes_before_external_group_by.
  • Shard by the group key on a Distributed table so each shard's group set is smaller.

LowCardinality misuse

LowCardinality is a big win for under ~10,000 distinct values. It's a loss for high-cardinality columns: the dictionary grows huge, and operations on the column slow down. Check system.parts_columns for column sizes; a LowCardinality column that's larger than its raw equivalent is misconfigured.

Querying a Distributed table when a local one would do

On a sharded cluster, Distributed tables fan out to every shard. If your application knows which shard a query targets (because the data is sharded on a known key), query the local MergeTree directly. For aggregations where the data is pre-sharded by the group key, set distributed_group_by_no_merge.

Too many concurrent queries

ClickHouse threads per query are configurable, but the total thread budget is finite. A workload of many concurrent queries with max_threads = 32 each will fight for CPU. Lower max_threads for high-concurrency workloads, or queue queries at the application layer.

Joining two large tables

If you're joining two billion-row tables and the join can't be replaced with a dictionary or denormalization, the query is going to be slow. Options:

  • grace_hash algorithm for spilling.
  • partial_merge if both sides are sorted by the join key.
  • Pre-aggregate one side via a materialized view to shrink it.

There's no shortcut; large joins are expensive in ClickHouse and the design encourages denormalization for a reason.

Mutations on hot tables

ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE are async mutations that rewrite parts. They're not transactional, they're not fast, and they don't lock. For high-volume tables, design schemas to avoid mutations. For occasional fixes, monitor system.mutations for completion:

SQL
SELECT * FROM system.mutations WHERE is_done = 0;

Replication lag on ReplicatedMergeTree

If you're running a ReplicatedMergeTree cluster, check replication status:

SQL
SELECT database, table, replica_name, log_max_index, log_pointer,
       queue_size, future_parts, absolute_delay
FROM system.replicas
WHERE absolute_delay > 0;

absolute_delay is seconds behind the leader. Persistent lag points at network issues, slow merges, or undersized hardware on the replica.

Step 14: When tuning isn't enough

Sharding via Distributed tables

For data sizes that exceed one machine, shard. A Distributed table fronts shards stored on separate hosts. Pick a shard key that gives even distribution and that your queries can use for routing.

The trade-offs are real: cross-shard joins, scatter-gather queries, increased operational complexity. Don't shard prematurely; ClickHouse on one big box handles a lot.

Replication via ReplicatedMergeTree

For HA and read scaling, use ReplicatedMergeTree with ZooKeeper or ClickHouse Keeper coordinating replicas. Most production ClickHouse deployments are replicated.

ClickHouse Cloud and managed services

For teams that don't want to operate clusters, ClickHouse Cloud, Aiven, and Altinity offer managed ClickHouse. Worth considering for teams without dedicated database operations.

The tools worth installing

ToolWhat it doesCost
system.* tablesBuilt-in introspection (query_log, parts, mutations, metrics, etc.)Free, built in
clickhouse-clientOfficial CLI with rich query supportFree
clickhouse-benchmarkLoad testingFree
clickhouse-copierBulk copy between clustersFree
clickhouse-backupPoint-in-time backupsFree (Altinity)
Grafana ClickHouse data sourceDashboards and metricsFree
ClickHouse CloudManaged servicePaid
Altinity toolsOperator, support, tooling for self-hostedPaid
DBeaverGUI clientFree
TabixWeb UI for ClickHouseFree

For most teams the right starting set is: system.query_log enabled with retention, Grafana scraping system.metric_log or the Prometheus endpoint, clickhouse-backup running nightly, and the official clickhouse-client for ad-hoc work.

Quick checklist

When you encounter a slow ClickHouse query:

  1. Check system.query_log for the duration, rows read, memory used, and ProfileEvents.
  2. Run EXPLAIN indexes = 1 and look at parts/granules read vs total.
  3. Verify the query filters on the ORDER BY prefix for primary-key pruning. If not, the primary key doesn't fit the query.
  4. Check partition pruning by looking at the Partition section in EXPLAIN indexes. Functions on partition columns disable it.
  5. Check the SELECT list. Are you reading columns you don't need? Columnar databases hate SELECT *.
  6. Consider PREWHERE for a small selective filter feeding a wide projection.
  7. Look at system.parts. Thousands of active parts means inserts are too small.
  8. For joins, ask whether a dictionary or denormalization works instead.
  9. For repeated aggregations, build a materialized view.
  10. For "this query type doesn't fit the primary key," add a projection or rewrite the table.

Most slow ClickHouse queries come from a small set of structural issues: the primary key doesn't match the workload, partition pruning isn't kicking in, too many small parts, or a join that should have been denormalized. The system tables make diagnosis fast. The fixes are usually schema-level, not configuration-level. ClickHouse rewards careful design more than any other database in common use; spend the time at the table-creation step and you'll rarely need to revisit performance later.

Keep Reading