
DuckDB vs ClickHouse: Embedded vs Distributed Analytics

JayJay

DuckDB and ClickHouse are both columnar analytical databases, but they solve different problems. DuckDB is embedded. It runs in your process, no server required. ClickHouse is distributed. It runs on servers and handles massive scale. Choosing between them isn't about which is "better" but about what you're building.

The Fundamental Difference

DuckDB is "SQLite for analytics." It's an embedded database that runs inside your application. No server process, no network connections, no configuration. Import it as a library and start querying.

PYTHON
import duckdb

# Just works. No server, no setup.
result = duckdb.sql("""
    SELECT region, SUM(sales) as total
    FROM 'sales_data.parquet'
    GROUP BY region
    ORDER BY total DESC
""").fetchall()

ClickHouse is a distributed database system. It runs as a server (or cluster of servers) and handles concurrent queries from multiple clients at scale.

PYTHON
from clickhouse_driver import Client

client = Client('clickhouse-server')
result = client.execute("""
    SELECT region, SUM(sales) as total
    FROM sales_data
    GROUP BY region
    ORDER BY total DESC
""")

Use Case Fit

When DuckDB Shines

Data science and analysis workflows:

PYTHON
import duckdb
import pandas as pd

# Query Parquet files directly
df = duckdb.sql("""
    SELECT *
    FROM 'logs/*.parquet'
    WHERE event_date >= '2024-01-01'
""").df()

# Query pandas DataFrames with SQL
users_df = pd.read_csv('users.csv')
orders_df = pd.read_csv('orders.csv')

result = duckdb.sql("""
    SELECT u.name, COUNT(o.id) as order_count
    FROM users_df u
    JOIN orders_df o ON u.id = o.user_id
    GROUP BY u.name
""")

Local data processing:

DuckDB handles gigabytes of data on a laptop efficiently. For one-off analysis, data exploration, or ETL scripts, you don't need a server.

Embedded analytics in applications:

Build analytical features directly into your application without external dependencies:

PYTHON
# Desktop app with built-in analytics
class AnalyticsEngine:
    def __init__(self, data_path):
        self.conn = duckdb.connect(data_path)

    def get_daily_metrics(self, start_date, end_date):
        return self.conn.execute("""
            SELECT date, SUM(revenue), COUNT(DISTINCT user_id)
            FROM events
            WHERE date BETWEEN ? AND ?
            GROUP BY date
        """, [start_date, end_date]).fetchall()

When ClickHouse Shines

Real-time analytics at scale:

ClickHouse handles billions of rows with sub-second query latency:

SQL
-- Dashboard query on 10 billion rows
SELECT
    toStartOfHour(timestamp) as hour,
    count() as events,
    uniq(user_id) as unique_users,
    quantile(0.95)(latency_ms) as p95_latency
FROM events
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY hour
ORDER BY hour
-- Returns in ~100ms

Multi-user concurrent access:

When multiple users or services need to query simultaneously, ClickHouse's client-server architecture handles it:

PYTHON
from clickhouse_driver import Client

# Each dashboard, service, or user opens its own connection; the
# cluster schedules their queries concurrently, subject to quotas.
dashboard_a = Client('clickhouse-server')
dashboard_b = Client('clickhouse-server')

Production data infrastructure:

For always-on analytical systems (customer-facing dashboards, monitoring systems, real-time reporting), ClickHouse provides reliability, replication, and scalability.

Performance Comparison

Single-Machine Performance

On a single machine with data that fits in memory, DuckDB and ClickHouse perform similarly. Both use:

  • Columnar storage
  • Vectorized execution
  • Aggressive compression

DuckDB sometimes wins on small-to-medium datasets because there's no network overhead. ClickHouse sometimes wins on larger datasets due to more mature query optimization.

Scale Characteristics

DuckDB:

  • Single machine only
  • Handles 10s-100s of GB well
  • Performance degrades at TB+ scale
  • No replication or distribution

ClickHouse:

  • Scales horizontally (add servers)
  • Handles petabytes
  • Built-in replication for HA
  • Sharding for distributed queries

Concurrency

DuckDB:

  • Single writer, multiple readers
  • In-process queries only
  • Not designed for concurrent users

ClickHouse:

  • Many concurrent queries
  • Connection pooling
  • Resource management (quotas, priorities)

Query Language

Both support standard SQL with analytical extensions.

DuckDB follows PostgreSQL syntax closely and adds convenience features:

SQL
-- DuckDB: Read files directly in queries
SELECT * FROM read_parquet('data/*.parquet');
SELECT * FROM read_csv('data.csv', header=true);

-- List comprehensions
SELECT list_transform([1, 2, 3], x -> x * 2);

-- ASOF joins for time-series
SELECT * FROM trades ASOF JOIN quotes USING (symbol, timestamp);

ClickHouse has its own SQL dialect with powerful analytical functions:

SQL
-- ClickHouse: Specialized aggregations
SELECT
    quantilesTDigest(0.5, 0.9, 0.99)(latency) as percentiles,
    uniqExact(user_id) as exact_unique,
    topK(10)(event_type) as top_event_types
FROM events;

-- Standard window functions are supported too
SELECT
    user_id,
    sum(amount) OVER (PARTITION BY user_id ORDER BY date) as running_total
FROM orders;

Data Ingestion

DuckDB reads files directly. No ingestion step needed:

PYTHON
# Just query the files
duckdb.sql("SELECT * FROM 'data/*.parquet'")
duckdb.sql("SELECT * FROM 'data.csv'")
duckdb.sql("SELECT * FROM 's3://bucket/data/*.parquet'")  # via the httpfs extension

ClickHouse typically ingests data into tables:

SQL
-- Create table
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String
) ENGINE = MergeTree()
ORDER BY (timestamp);

-- Insert data
INSERT INTO events SELECT * FROM s3('https://bucket.s3.amazonaws.com/data/*.parquet', 'Parquet');

-- Or stream inserts
INSERT INTO events VALUES (...);

For iterative analysis where data changes frequently, DuckDB's file-based approach is more convenient. For production systems with streaming data, ClickHouse's insert model makes more sense.

Deployment

DuckDB:

BASH
pip install duckdb
# Done. No server, no configuration.

ClickHouse:

BASH
# Self-hosted
docker run -d clickhouse/clickhouse-server

# Or ClickHouse Cloud
# Create cluster in web console

ClickHouse requires infrastructure decisions: server sizing, replication, sharding strategy, monitoring. DuckDB requires none of this.

Cost Considerations

DuckDB:

  • Free and open source
  • Runs on existing hardware
  • No infrastructure costs
  • Cost scales with compute time

ClickHouse:

  • Free if self-hosted (+ infrastructure costs)
  • ClickHouse Cloud: ~$0.20/hour minimum
  • Scales with data volume and query load
  • Operational overhead has cost

For ad-hoc analysis and small-scale work, DuckDB is essentially free. For production systems serving many users, ClickHouse's infrastructure costs are justified by its capabilities.

Decision Framework

Choose DuckDB if:

  • You're doing data analysis on a laptop
  • Data fits on one machine (up to ~100GB comfortably)
  • You want SQL without infrastructure
  • Building embedded analytics into an application
  • Processing files (Parquet, CSV) without a database
  • Development and prototyping

Choose ClickHouse if:

  • You need to serve concurrent users
  • Data exceeds what one machine handles well
  • Real-time dashboards with sub-second latency
  • Production system requiring high availability
  • Streaming data ingestion at scale
  • Multi-tenant analytical SaaS

Interesting hybrid:

Some teams use both:

  • DuckDB for development and local analysis
  • ClickHouse for production deployment

Since both support SQL and Parquet, workflows can transfer between them.

The Bottom Line

DuckDB and ClickHouse aren't really competitors. They're complements.

DuckDB makes analytical SQL accessible everywhere. Embed it in Python scripts, Jupyter notebooks, desktop applications, or serverless functions. It's the pandas alternative for people who think in SQL.

ClickHouse makes analytical SQL scalable. Run it on a cluster, serve dashboards to thousands of users, ingest billions of events per day. It's infrastructure for analytical systems.

If you're asking "which should I use for my data science work?", start with DuckDB.

If you're asking "which should I use to power my analytics product?", start with ClickHouse.

The good news: learning one teaches you patterns that transfer to the other. Columnar databases, vectorized execution, and analytical SQL work the same way in both.
