How to Fix Slow ClickHouse Queries
ClickHouse will scan a billion rows in two seconds on hardware that would crush Postgres. It will also run the same query for fifteen minutes if you give it the wrong primary key. The difference between those two outcomes is almost entirely about how you laid out the table, and almost none about hardware. ClickHouse is the most performance-sensitive database to schema design that most engineers will ever touch.
This is what makes ClickHouse performance work different. There's no "add an index and the query gets fast" escape hatch. The primary index is the ORDER BY of the table, baked in at create time. If your queries don't match it, you're in for a refactor, not a tuning session. The good news: the system gives you exceptional visibility into what it's doing, with system.query_log recording every query's costs in microsecond detail. This guide walks through the diagnostic tools, the design choices that matter most, and the patterns that move queries from minutes to milliseconds.
Why ClickHouse queries get slow
Four patterns drive almost every slow-query report:
-
The query doesn't match the primary key.
ORDER BY (customer_id, created_at)lets you efficiently filter oncustomer_id. A query filtering onemailscans every part. There is no equivalent of "add an index" that fixes this cheaply. -
Partition pruning isn't kicking in. A query against a time-partitioned table that doesn't include the partition column in the filter scans every partition. Wrap a date in a function and partition pruning disappears.
-
Too many small parts. Frequent small inserts produce a flood of parts that the merge thread can't keep up with. Queries that should consider 50 parts consider 5000.
-
A join that should have been denormalized. ClickHouse's default hash join loads the right side into memory. Joining two large tables either blows up memory or runs sequentially. Dictionaries and denormalization are the answer, not better tuning.
The diagnostic process is figuring out which of those four you're hitting, then either redesigning the table, fixing the data layout, or restructuring the query. The rest of this post walks through each in detail.
Step 1: The diagnostic stack
ClickHouse exposes everything through system tables. There's no extension to install, no profiler to enable. The visibility is the most generous of any database in this comparison series.
system.query_log
Every query the server runs lands in system.query_log with full details: the SQL, the duration, the rows and bytes read, the memory used, the ProfileEvents map. Enable it if it isn't already (it's on by default in modern versions):
The basic top-N slow queries:
The type column distinguishes successful runs from failures. Worth checking:
QueryStart: query started.QueryFinish: completed successfully.ExceptionBeforeStart: failed during parsing or analysis.ExceptionWhileProcessing: failed during execution.
For aggregate analysis across normalized query shapes:
normalized_query_hash (since 21.2 or thereabouts) groups queries that differ only in literal values. So all variants of SELECT ... WHERE id = ? collapse into one row. Sort by total time to find queries that are individually fast but cumulatively expensive.
ProfileEvents and Settings columns
The ProfileEvents column in system.query_log is a map of detailed counters. Useful ones:
Query: number of queries.SelectQuery,InsertQuery: by type.SelectedParts,SelectedRanges,SelectedMarks: how much of the table was scanned. This is the closest equivalent of "buffers" in Postgres.FileOpen,OpenedFileCacheHits,OpenedFileCacheMisses: file I/O.OSReadBytes,OSReadChars: real disk reads.MarkCacheHits,MarkCacheMisses: index cache effectiveness.UncompressedCacheHits,UncompressedCacheMisses: data cache.
For a specific query:
The Settings column shows which settings were in effect for that query. Useful when a query is fast in one client and slow in another; check whether max_threads or max_memory_usage differ.
system.processes for live queries
Queries running right now:
total_rows_approx is ClickHouse's estimate of how many rows the query will end up reading. Compare to read_rows to get a rough progress percentage.
Kill a runaway:
The kill is best-effort. Some operations (the synchronous part of an insert, for instance) can't be killed mid-flight.
system.metric_log for time-series telemetry
For longer-term performance analysis, enable metric_log:
It records server-level metrics every second. Useful for "what was happening around the time the slow queries spiked?" Cross-reference timestamps with query_log.
trace_log for the deepest dive
The trace_log records sampled stack traces. Useful when you have a slow query and the per-query stats don't explain why:
You get the most-frequently-sampled call stacks. This is closer to a profiler than most databases offer. Useful for "ClickHouse itself is bottlenecking" investigations.
Send-logs-level trace from the client
For an individual query, the client can tell the server to stream back its internal logs:
You'll see which parts and granules ClickHouse opened, which skip indexes fired, where time went stage-by-stage. Verbose, but unmatched for debugging a single query end-to-end.
clickhouse-benchmark
For comparing query performance across runs or settings:
You get percentile latencies, throughput, and basic statistics. Useful for confirming an optimization helped, rather than guessing from a single timed run.
Step 2: EXPLAIN in depth
ClickHouse has multiple EXPLAIN modes. Each tells you something different.
EXPLAIN AST and EXPLAIN SYNTAX
For confirming what the parser saw and how it rewrote your query:
Useful when a query is misbehaving and you suspect a rewrite the optimizer applied isn't what you intended. Rarely the first stop for performance, but worth knowing.
EXPLAIN PLAN
The logical plan:
Shows the sequence of logical steps. The default doesn't include estimates; pass actions = 1 for more detail.
EXPLAIN PIPELINE
The physical execution pipeline, showing how many threads each stage runs on:
If a stage runs on one thread when you expected parallelism, that's a clue. Common causes: the table is too small to parallelize, max_threads is too low, or the query needs sorted output that prevents parallel scan.
EXPLAIN indexes = 1
The most useful EXPLAIN mode for performance work. Shows which parts and granules the query selects, broken down by which index pruned them:
ReadFromMergeTree (default.orders)
Indexes:
PrimaryKey
Keys:
customer_id
Condition: (customer_id in [12345, 12345])
Parts: 4/40
Granules: 18/12000
Partition
Keys:
toYYYYMM(created_at)
Condition: (toYYYYMM(created_at) in [202606, 202606])
Parts: 4/40
Granules: 18/120
Skip
Name: idx_email
Description: bloom_filter GRANULARITY 4
Parts: 2/4
Granules: 6/18
The two numbers per index: 4/40 parts means 4 of 40 were used. 18/12000 granules means 18 of 12000 granules (~8192 rows each by default) were read. Closer to the total means less pruning.
If the primary key shows Parts: 40/40, Granules: 12000/12000, the primary key didn't prune anything. Either the query doesn't filter on the leftmost columns of ORDER BY, or the filter is in a form ClickHouse can't push down to the index.
EXPLAIN ESTIMATE
For a quick cost estimate without running the query:
default | events | parts: 4 | rows: 147456 | marks: 18
Useful for "if I add this filter, how much smaller will the read be?"
Graph format for complex pipelines
Outputs a DOT graph that you can render with graphviz. For pipelines with 20+ stages, the graph is much more readable than the text:
Step 3: Primary key design (this is the index)
In ClickHouse, the ORDER BY of a MergeTree table is the primary key, and it's the closest equivalent of an SQL composite index. There is no separate CREATE INDEX step for the primary index. It's baked into the table at create time and changing it later means rewriting the table.
So you get one shot at this. Make it count.
How the primary index works
ClickHouse sorts rows on disk by the ORDER BY columns and creates a sparse index: one entry per 8,192 rows (one "granule") by default. To find data, the server consults the sparse index, identifies which granules might contain matches, and reads those.
Crucially: the index granularity is sparse, not per-row. The smallest unit ClickHouse reads is a granule (8K rows). For high-selectivity equality queries (returning a single row), Postgres beats ClickHouse handily because Postgres has per-row index entries. ClickHouse's strength is OLAP queries that touch many rows but a small fraction of the table.
ORDER BY equals your most common filter and join key
Pick ORDER BY to match how you query the table. For a workload that filters by customer_id and ranges over created_at:
Queries that filter by customer_id find the matching granules immediately. Queries that filter by created_at alone don't, but partition pruning saves you (because the partition key includes created_at).
Column order matters, as in a composite SQL index
Like a SQL composite index, the leftmost columns of ORDER BY are most useful for filtering. With ORDER BY (customer_id, created_at):
WHERE customer_id = 12345: full primary-key pruning. Fast.WHERE customer_id = 12345 AND created_at >= '2026-06-01': full pruning on both. Fast.WHERE created_at >= '2026-06-01': primary key doesn't help, but partition pruning does. Still fast.WHERE order_id = 78910: primary key doesn't help, partition pruning doesn't help. Scans every part.
Order columns by: highest-cardinality equality column first if it's the most common filter, else the column most queries filter on.
LowCardinality is huge
For string columns with limited distinct values (statuses, country codes, event types), wrap them in LowCardinality:
ClickHouse dictionary-encodes the column. Memory usage drops by 10x or more, and operations on the column (filtering, grouping) get faster because they work on integer codes instead of strings.
The "limited distinct values" threshold is roughly 10,000. Above that, the dictionary becomes large and the win disappears. Don't LowCardinality an email or user ID column.
Granule size
index_granularity defaults to 8,192 rows per granule. The trade-off:
- Smaller granularity (say, 1024): finer-grained pruning, smaller reads per granule, but larger primary index in memory.
- Larger granularity (say, 16384): coarser pruning, larger reads per granule, smaller index.
For point-lookup-ish workloads, smaller is better. For full-scan analytics, larger is fine. The default works for most cases. Tune only if you have a specific reason and you've measured.
Don't put high-cardinality columns first
A common mistake: ORDER BY (user_id, event_time) on a table that has billions of unique user_ids and is mostly queried by event_time. The primary index becomes huge (one entry per granule, and granules are small because user_ids are unique), and the queries don't even use it. The right structure depends on access patterns.
Step 4: Partitioning
Partitions are the unit of bulk operations (DROP PARTITION, MOVE PARTITION, attaching from another table) and the unit of partition pruning. Choose deliberately.
Partition by something coarse
The standard pattern is monthly:
That gives you 12 partitions per year. Each partition holds about 1/12 of the data. Queries with a date filter prune to the relevant partitions.
A common mistake: partitioning too finely. Daily partitions on a high-volume table give you 365 partitions per year. Hourly gives you 8,760. Each partition has overhead. The merge thread has to manage more parts. You'll hit "too many parts" errors or slow inserts.
For most workloads, monthly is right. For low-volume tables (under ~100M rows total), don't partition at all. For high-volume time-series with retention policies, daily can be worth it if you DROP PARTITION daily to age out data.
Partition pruning requires direct column comparison
The trap that catches new users:
The second form should logically use the partition. ClickHouse's planner doesn't always recognize the equivalence. Write filters on the raw column. The same applies to formatDateTime, toDate, and other wrappers.
Drop partitions to age out data
The cheapest possible delete in ClickHouse:
Instant. No row-by-row work. This is one of the big reasons to partition by retention boundary. If you keep 12 months of data, partition by month and drop the 13th-month partition nightly.
ALTER TABLE ... DELETE WHERE ... does work, but it's an async, asynchronous mutation that rewrites parts. Slow, I/O-heavy. Avoid for bulk deletes.
Move partitions for storage tiering
For tables with hot and cold data, this lets you move old partitions to cheaper storage while keeping them queryable.
Step 5: Skip indexes
For columns that aren't in the ORDER BY but get frequent equality or range filters, skip indexes prune granules before reading. They're stored alongside the primary index but at a coarser level (one entry per N granules).
MATERIALIZE INDEX is required for the index to be applied to existing data. New writes get the index automatically.
Skip index types
| Type | Good for | Notes |
|---|---|---|
minmax | Numeric or date columns with locality | Almost free; usually worth adding |
set(N) | Low-cardinality columns | N is max distinct values per index granule; 0 means unbounded |
bloom_filter(p) | High-cardinality equality lookups | p is false positive rate |
ngrambf_v1(n, size, hashes, seed) | LIKE '%substring%' matches | Large index; use sparingly |
tokenbf_v1(size, hashes, seed) | Word-based search | Tokenizes on word boundaries |
GRANULARITY 4 means one skip index entry per 4 granules, or about 32,768 rows. Smaller granularity is finer-grained but bigger; larger is coarser but smaller.
The defaults are conservative. For bloom_filter on a frequently-filtered column, GRANULARITY 1 plus a tight false-positive rate (0.001) gives the fastest pruning at the cost of more index size.
When skip indexes help and when they don't
Skip indexes help when:
- The filter is selective (returns a small fraction of granules).
- The column has locality (similar values clustered together physically).
They don't help when:
- The filter matches a high fraction of rows.
- Values are scattered uniformly across the table (the skip index can't prune anything).
Always check after adding: EXPLAIN indexes = 1 should show the skip index in the indexes list with non-zero pruning.
Step 6: Projections (alternate physical orderings)
Projections let one logical table maintain multiple physical orderings. Added in 21.6, mature since 22.8 or so. The use case: when one workload wants ORDER BY (customer_id, created_at) and another wants ORDER BY (status, created_at), you can have both:
Queries that filter primarily on status will transparently use the projection. The plan shows it:
Trade-offs:
- Writes go to every projection. Twice the projections, twice the write cost.
- Materialized projections take disk space proportional to the size of the projection's columns.
- Adding a projection to an existing large table is expensive (it has to rewrite).
Projections are powerful but they're not free. Add them when a specific query pattern is critical and the table's primary key can't accommodate it.
Step 7: PREWHERE optimization
PREWHERE is a ClickHouse-specific clause that evaluates a filter on a small column first, then only reads the other columns for matching rows. Because ClickHouse is columnar, this saves I/O on wide tables:
status is a small LowCardinality(String) column, so reading it for every row is cheap. large_blob is only read for the rows that survive the filter.
The optimizer often does this automatically (the optimize_move_to_prewhere setting controls it, on by default). You'd expect to use PREWHERE explicitly when:
- You have a small, selective filter and a wide projection.
- The optimizer chose poorly (verify with
EXPLAIN PIPELINE). - You want to force the rewrite for clarity.
Don't PREWHERE on a large column or a column that requires expensive decoding. The whole point is to filter cheaply before reading expensive columns.
Step 8: Materialized views for pre-aggregation
If a dashboard runs the same GROUP BY ... SUM(...) every 30 seconds, don't compute it from scratch. Pre-aggregate with a materialized view backed by an aggregating merge tree:
The view is populated incrementally as inserts arrive on the source table. The dashboard reads from orders_daily_mv instead of orders. Aggregation cost is amortized across writes.
Engine families
For different aggregation semantics:
- SummingMergeTree: sums numeric columns when merging rows with the same
ORDER BYkey. Use for additive metrics. - AggregatingMergeTree: stores intermediate aggregation states; use for non-additive aggregations (counts of distinct, quantiles).
- ReplacingMergeTree: keeps only the latest row per
ORDER BYkey. Use for "current state" views. - CollapsingMergeTree, VersionedCollapsingMergeTree: for CDC-style upserts and deletes.
The trap: rows merge eventually, not immediately. A SELECT from a SummingMergeTree may see un-merged rows until the background merge runs. Use SELECT ... FINAL to force the merge at read time (slow), or use the SimpleAggregateFunction types and sumMerge-style aggregation in the query.
When to use materialized views vs projections
Both pre-compute. The difference:
- Materialized views are separate tables with their own schema. You read from them by name. They support different aggregations and shapes.
- Projections are alternate orderings of the same table. The optimizer uses them transparently.
For "I need this query 1000x faster," materialized view. For "I need this table queryable in two different sort orders," projection.
Step 9: Joins (the hard one)
ClickHouse joins are not Postgres joins. The default JOIN algorithm is hash: the right table is loaded entirely into memory, then the left is streamed against it. This is fine when the right side is small. It blows up when the right side is large.
Algorithms
hash(default): right side in memory. Fast for small right side, memory-bound for large.parallel_hash: like hash but parallelized. Default in newer versions for some configurations.partial_merge: sort-merge join with sorted streams. Can spill to disk; slower than hash when memory permits.grace_hash: hash join with disk spilling. Newer; useful for joins that don't fit in memory.direct: lookup against a dictionary (see below).
Override the algorithm per query:
Dictionaries for reference data
The right answer for joining a fact table to small reference data: dictionaries. A dictionary is a memory-resident lookup table, loaded once from another source (a table, file, MySQL, HTTP), and queried via dictGet:
Dictionaries are dramatically faster than the equivalent JOIN. For lookup data under a few million entries, this is the right pattern. Layouts include HASHED, FLAT, CACHE, COMPLEX_KEY_HASHED, each with different memory and lookup trade-offs.
Denormalization
The other right answer: store the joined field on the fact row at write time. Disk is cheap, joins are expensive, ClickHouse is fine with redundancy. Denormalize aggressively at the source-table level. If joins on the hot path are showing up in your slow query log, the schema is asking to be denormalized.
GLOBAL JOIN for distributed queries
On a distributed cluster, a regular JOIN is evaluated locally on each shard. For most queries that's wrong: each shard only knows its own data. GLOBAL JOIN materializes the right side on the initiator and ships it to each shard:
GLOBAL is the right default for joins on distributed tables. Without it, results can be wrong.
Use ANY LEFT JOIN when you can
ANY LEFT JOIN returns at most one matching row per left-side row, even if multiple right-side rows match. It's faster than ALL LEFT JOIN because it can short-circuit. Use it when you know the right side is unique (a dimension table), or when you genuinely want "any one matching row":
If you need every match, ALL LEFT JOIN. Pick the one that fits your semantics; don't default to ALL if ANY works.
Step 10: Query rewrites
Mechanical patterns that change runtimes.
Never SELECT * on wide tables
ClickHouse is columnar. SELECT * on a 50-column table reads 50 columns from disk and decompresses them. Even if you're going to discard most columns in the client, the work has been done. Project only what you need:
This is the single highest-impact rewrite for wide-table workloads.
Use sampling for ad-hoc analysis
For exploratory queries on huge tables, sampling gives statistically meaningful results in a tiny fraction of the time:
Requires the table to be created with a sampling expression, like SAMPLE BY cityHash64(customer_id). The 0.1 reads 10% of granules.
For "what's the rough distribution?" questions, sampling is the difference between a 30-second query and a 3-second query. Reserved for ad-hoc work; production aggregations should hit materialized views.
FINAL is a code smell
SELECT ... FINAL forces merging at query time on ReplacingMergeTree or CollapsingMergeTree. It's correct but slow because it can't parallelize as well. For high-volume queries:
- Design the schema so aggregations tolerate duplicates.
- Run
OPTIMIZE TABLE ... FINALperiodically in background and read withoutFINAL(accepting brief inconsistency). - Use a different engine that doesn't require merge-time resolution.
Liberal use of FINAL is a sign the data model needs revisiting.
COUNT vs COUNT DISTINCT
count(): fast, counts rows.count(column): fast, counts non-null values.count(DISTINCT column): slow, requires deduplication.uniqExact(column): same as count distinct, slow, exact.uniq(column): approximate (HyperLogLog), much faster.uniqCombined(column): similar touniq, slightly different algorithm.uniqHLL12(column): pure HLL, slightly less accurate but faster.
For dashboards, use uniq unless you absolutely need exact counts. The error is typically under 1% on cardinalities above a thousand.
Avoid OR; use IN where possible
Like Postgres, OR across different columns can confuse the planner. IN for the same column is always cheaper:
For OR across columns, sometimes splitting into a UNION ALL helps. Verify with EXPLAIN indexes = 1 whether the primary key is used.
Subqueries vs JOIN
For "rows in table A where matching column is in a small set from table B," a subquery with IN is often faster than a join:
The right-hand subquery materializes and becomes a hash set. Compare to the JOIN equivalent and pick the faster.
Step 11: Insert patterns
Insert behavior is where many ClickHouse deployments first get into trouble.
Batch inserts, not row-by-row
The single most important insert rule: never insert one row at a time. ClickHouse creates one part per INSERT statement. A million single-row inserts creates a million parts. The merge thread can't keep up, and you'll see:
Too many parts (300). Merges are processing significantly slower than inserts.
Batch inserts at the application level:
- Buffer rows in memory.
- Insert in batches of at least 10,000 rows (preferably 100,000 or more).
- Flush on time or size threshold.
The standard guidance is "1 insert per second per table" at most, with the largest possible batch each time.
Async inserts
If you can't batch on the client side, async inserts let ClickHouse buffer them server-side:
The server buffers incoming inserts and flushes them in larger blocks. Trade-off: a brief delay between insert acknowledgment and data being queryable, and the small risk of data loss if the server crashes before the buffer flushes (mitigated by wait_for_async_insert = 1, which is safer but slower).
Async inserts have improved a lot in recent ClickHouse versions. They're now the default-recommended path for OLTP-style write patterns.
The Buffer table engine
The legacy alternative: a Buffer table is a wrapper that holds inserts in memory and flushes to a destination table on conditions:
Apps insert into events_buffer; the buffer flushes to events. Buffer tables have edge cases (data isn't queryable during merges in some configurations, schema changes are awkward), so async inserts are generally preferred today. Still useful for legacy setups.
Monitor part counts
A healthy MergeTree table has parts in the hundreds, not thousands. If a table has thousands of active parts, inserts are too small or too frequent. Tune the batch size.
Step 12: Configuration
A handful of settings move the needle.
max_threads
How many threads a single query uses. Default is the number of physical CPU cores. The trade-off:
- Higher: faster individual queries, more memory consumption, fewer concurrent queries possible.
- Lower: slower individual queries, more concurrent capacity.
For a few-big-users workload, raise it. For a many-small-queries workload, lower it. Set per-query if both patterns coexist:
max_memory_usage
Per-query memory cap. Default is 10GB. Big GROUP BY aggregations on high-cardinality columns blow through this:
If individual queries need more memory than the host has, you have a different problem.
Spilling to disk for GROUP BY
max_bytes_before_external_group_by and max_bytes_before_external_sort let aggregations and sorts spill to disk when they exceed memory:
The query gets slower but completes instead of failing. Set in profile settings, not per query, since you usually want this consistently.
optimize_read_in_order
Lets ClickHouse take advantage of ORDER BY matching the sorting key, avoiding a re-sort:
On by default in modern versions. Verify with EXPLAIN PIPELINE that the query stage isn't doing a redundant sort.
parts_to_throw_insert
This setting controls when ClickHouse refuses new inserts because there are too many active parts:
If you're hitting this regularly, bumping the threshold is a band-aid. The real fix is bigger batches.
Settings profiles
For production, define settings profiles in users.xml rather than SET-ing per session. A readonly profile, a dashboard profile with low max_threads, an etl profile with high max_memory_usage. Each user is bound to a profile, and queries inherit its limits.
Step 13: Common slow query patterns
Too many small parts
The defining ClickHouse anti-pattern. Symptom: insert errors, slow queries, the merge thread consistently behind. Cause: inserts that aren't batched. Fix: batch inserts of at least 10K rows, or enable async_insert.
Wrong ORDER BY for the access pattern
A table created with ORDER BY id doesn't help a query that filters by customer_id. EXPLAIN indexes = 1 shows full part/granule scanning. Fixes:
- Add a projection for the alternate access pattern.
- Rewrite the table with a better key. Painful but sometimes necessary.
Missing partition pruning
The query doesn't filter on the partition column, or filters on a function of it:
Always filter on raw partition columns directly. Check with EXPLAIN indexes = 1.
High-cardinality GROUP BY
A GROUP BY user_id on a billion-row table with 100M distinct users needs huge memory. Options:
- Pre-aggregate via a materialized view.
- Use
uniqCombinedinstead ofuniqExactfor cardinality. - Allow spill with
max_bytes_before_external_group_by. - Shard by the group key on a Distributed table so each shard's group set is smaller.
LowCardinality misuse
LowCardinality is a big win for under ~10,000 distinct values. It's a loss for high-cardinality columns: the dictionary grows huge, and operations on the column slow down. Check system.parts_columns for column sizes; a LowCardinality column that's larger than its raw equivalent is misconfigured.
Querying a Distributed table when a local one would do
On a sharded cluster, Distributed tables fan out to every shard. If your application knows which shard a query targets (because the data is sharded on a known key), query the local MergeTree directly. For aggregations where the data is pre-sharded by the group key, set distributed_group_by_no_merge.
Too many concurrent queries
ClickHouse threads per query are configurable, but the total thread budget is finite. A workload of many concurrent queries with max_threads = 32 each will fight for CPU. Lower max_threads for high-concurrency workloads, or queue queries at the application layer.
Joining two large tables
If you're joining two billion-row tables and the join can't be replaced with a dictionary or denormalization, the query is going to be slow. Options:
grace_hashalgorithm for spilling.partial_mergeif both sides are sorted by the join key.- Pre-aggregate one side via a materialized view to shrink it.
There's no shortcut; large joins are expensive in ClickHouse and the design encourages denormalization for a reason.
Mutations on hot tables
ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE are async mutations that rewrite parts. They're not transactional, they're not fast, and they don't lock. For high-volume tables, design schemas to avoid mutations. For occasional fixes, monitor system.mutations for completion:
Replication lag on ReplicatedMergeTree
If you're running a ReplicatedMergeTree cluster, check replication status:
absolute_delay is seconds behind the leader. Persistent lag points at network issues, slow merges, or undersized hardware on the replica.
Step 14: When tuning isn't enough
Sharding via Distributed tables
For data sizes that exceed one machine, shard. A Distributed table fronts shards stored on separate hosts. Pick a shard key that gives even distribution and that your queries can use for routing.
The trade-offs are real: cross-shard joins, scatter-gather queries, increased operational complexity. Don't shard prematurely; ClickHouse on one big box handles a lot.
Replication via ReplicatedMergeTree
For HA and read scaling, use ReplicatedMergeTree with ZooKeeper or ClickHouse Keeper coordinating replicas. Most production ClickHouse deployments are replicated.
ClickHouse Cloud and managed services
For teams that don't want to operate clusters, ClickHouse Cloud, Aiven, and Altinity offer managed ClickHouse. Worth considering for teams without dedicated database operations.
The tools worth installing
| Tool | What it does | Cost |
|---|---|---|
system.* tables | Built-in introspection (query_log, parts, mutations, metrics, etc.) | Free, built in |
clickhouse-client | Official CLI with rich query support | Free |
clickhouse-benchmark | Load testing | Free |
clickhouse-copier | Bulk copy between clusters | Free |
clickhouse-backup | Point-in-time backups | Free (Altinity) |
| Grafana ClickHouse data source | Dashboards and metrics | Free |
| ClickHouse Cloud | Managed service | Paid |
| Altinity tools | Operator, support, tooling for self-hosted | Paid |
| DBeaver | GUI client | Free |
| Tabix | Web UI for ClickHouse | Free |
For most teams the right starting set is: system.query_log enabled with retention, Grafana scraping system.metric_log or the Prometheus endpoint, clickhouse-backup running nightly, and the official clickhouse-client for ad-hoc work.
Quick checklist
When you encounter a slow ClickHouse query:
- Check system.query_log for the duration, rows read, memory used, and
ProfileEvents. - Run EXPLAIN indexes = 1 and look at parts/granules read vs total.
- Verify the query filters on the ORDER BY prefix for primary-key pruning. If not, the primary key doesn't fit the query.
- Check partition pruning by looking at the
Partitionsection inEXPLAIN indexes. Functions on partition columns disable it. - Check the SELECT list. Are you reading columns you don't need? Columnar databases hate
SELECT *. - Consider PREWHERE for a small selective filter feeding a wide projection.
- Look at system.parts. Thousands of active parts means inserts are too small.
- For joins, ask whether a dictionary or denormalization works instead.
- For repeated aggregations, build a materialized view.
- For "this query type doesn't fit the primary key," add a projection or rewrite the table.
Most slow ClickHouse queries come from a small set of structural issues: the primary key doesn't match the workload, partition pruning isn't kicking in, too many small parts, or a join that should have been denormalized. The system tables make diagnosis fast. The fixes are usually schema-level, not configuration-level. ClickHouse rewards careful design more than any other database in common use; spend the time at the table-creation step and you'll rarely need to revisit performance later.
Keep Reading
How to Fix Slow Redis Queries
A practitioner's guide to diagnosing and fixing slow Redis workloads. Why Redis slowness is fundamentally different from SQL, the big-key and hot-key problems, SLOWLOG and LATENCY, picking the right data structure, and the configuration that prevents outages.
How to Fix Slow MongoDB Queries
A practitioner's guide to diagnosing and fixing slow MongoDB queries. The diagnostic stack, reading explain output, indexing with the ESR rule, schema design that holds up at scale, aggregation pipeline optimization, and the tools worth knowing.
How to Fix Slow Postgres Queries
A practitioner's guide to diagnosing and fixing slow Postgres queries. The diagnostic stack, reading EXPLAIN like a power user, index design, the configuration knobs that change query plans, and the tools worth installing.