How We Uncovered a Hidden ClickHouse Slowdown in Our Petabyte-Scale Billing System
At Cloudflare, ClickHouse powers a billing pipeline that handles hundreds of millions of dollars in revenue. Every day, millions of queries determine how much each customer owes. When this pipeline slowed dramatically after a migration, the usual suspects (I/O, memory, rows scanned) looked normal. The bottleneck turned out to be buried deep inside ClickHouse’s internals, and we had to write three patches to fix it.

The Architecture: A Unified Analytics Platform

To understand the problem, you first need to know our setup. We store over a hundred petabytes across dozens of ClickHouse clusters. In early 2022, we introduced “Ready-Analytics,” a system that lets internal teams stream data into a single massive table instead of designing custom schemas. Each dataset is identified by a namespace, and every record follows a standard schema with 20 float fields, 20 string fields, a timestamp, and an indexID.
Data sorting is critical for query performance. The primary key is (namespace, indexID, timestamp), so within a namespace rows are ordered by indexID; each team can effectively choose its own optimal sort order through the indexID values it writes. By December 2024, Ready-Analytics already held over 2 PiB of data and ingested millions of rows per second. Hundreds of applications relied on it.
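As a rough sketch of that layout, the table definition looks something like the following. Column names and types here are illustrative, and the 20 float and 20 string columns are abbreviated to two of each:

```sql
-- Illustrative sketch of the Ready-Analytics table described above.
-- Column names/types are hypothetical; the real schema has 20 Float64
-- and 20 String columns, abbreviated here.
CREATE TABLE ready_analytics
(
    namespace LowCardinality(String),  -- identifies the owning dataset
    indexID   UInt64,                  -- per-namespace sort dimension
    timestamp DateTime,
    float1    Float64,
    float2    Float64,                 -- ... through float20
    string1   String,
    string2   String                   -- ... through string20
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)         -- daily partitions; retention is covered below
ORDER BY (namespace, indexID, timestamp);
```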
The Problem: One Retention Policy to Rule Them All
Cloudflare has used ClickHouse for years, long before it supported native TTL features. We built our own retention system using partitioning. The Ready-Analytics table was partitioned by day, and a background job simply dropped partitions older than 31 days.
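In ClickHouse terms, the old scheme amounted to daily partitions plus a scheduled job issuing statements like the ones below. The actual job is internal tooling; these queries only illustrate its shape, using the hypothetical table from the earlier sketch:

```sql
-- Find daily partitions older than 31 days (assumes the date-based
-- partition key populates min_date/max_date in system.parts)...
SELECT DISTINCT partition
FROM system.parts
WHERE table = 'ready_analytics'
  AND active
  AND max_date < today() - 31;

-- ...and drop each one. Partition values are dates because the table is
-- partitioned by toDate(timestamp).
ALTER TABLE ready_analytics DROP PARTITION '2024-11-01';
```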
This one-size-fits-all approach severely limited adoption. Some teams needed to retain data for years due to legal or contractual obligations; others needed only a few days. Because of the 31-day cap, those use cases couldn’t use Ready-Analytics and had to fall back to a much more complex conventional setup. We needed a new system that allowed per-namespace retention.
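As a rough idea of the shape such a system can take, ClickHouse’s row-level TTL with DELETE WHERE clauses can express per-namespace retention. The sketch below is purely illustrative (namespace names and retention periods are invented), not a description of our actual implementation:

```sql
-- Purely illustrative: per-namespace retention expressed as row-level TTLs.
-- Namespace names and retention periods are invented for this example.
ALTER TABLE ready_analytics
    MODIFY TTL
        timestamp + INTERVAL 7 DAY  DELETE WHERE namespace = 'debug_logs',
        timestamp + INTERVAL 3 YEAR DELETE WHERE namespace = 'billing_events',
        timestamp + INTERVAL 31 DAY DELETE
            WHERE namespace NOT IN ('debug_logs', 'billing_events');
```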
The Hidden Bottleneck in ClickHouse’s Internals
When the billing aggregation jobs slowed down, we initially checked all common metrics: I/O wait, memory pressure, rows scanned, parts read. Everything looked normal. The slowdown was intermittent, making it especially frustrating. Eventually, we dug into ClickHouse’s internal query execution and found the culprit: an inefficient merging step during partition pruning.
With the old 31-day retention, every partition contained data from all namespaces. After we introduced per-namespace retention, some namespaces had data spanning months, while others had only days. The partition pruning logic had to examine many more partitions and granules than before, and a hidden merge operation inside the query engine became a bottleneck. It wasn’t visible in standard profiling because it was part of ClickHouse’s internal memory management, not a separate I/O or CPU spike.
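One place the change was visible, even though the usual resource metrics stayed flat, was in the number of partitions and parts a mixed-namespace query now had to consider. A quick check against system.parts (table name as in the earlier sketch) makes the growth obvious:

```sql
-- With the old 31-day policy the table spanned at most ~31 daily partitions;
-- with per-namespace retention it can span hundreds.
SELECT
    count(DISTINCT partition) AS partitions,
    count()                   AS active_parts,
    sum(rows)                 AS total_rows
FROM system.parts
WHERE table = 'ready_analytics' AND active;
```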

The Three Patches We Wrote
Patch 1: Optimizing Granule Merging
The first patch addressed how ClickHouse merges granules when scanning partitions for multi-namespace queries. By changing the merge strategy from a naïve concatenation to a lazy merge, we reduced the number of temporary data structures created per query.
Patch 2: Improved Partition Pruning
The second patch improved partition pruning to skip irrelevant namespaces earlier. We added a bloom filter index on the namespace column, which let ClickHouse discard entire partitions that didn’t contain the required namespace and cut down the number of granules that needed to be merged.
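In DDL terms, that kind of skipping index looks roughly like the following; the index name, false-positive rate, and granularity are illustrative values:

```sql
-- Add a bloom filter data-skipping index on namespace and build it for
-- existing parts. The false-positive rate and granularity are illustrative.
ALTER TABLE ready_analytics
    ADD INDEX namespace_bf namespace TYPE bloom_filter(0.01) GRANULARITY 4;

ALTER TABLE ready_analytics MATERIALIZE INDEX namespace_bf;
```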
Patch 3: Memory Pool Tuning
The third patch was a low-level change to ClickHouse’s memory allocator. We discovered that the merge operation was causing excessive fragmentation because it allocated many small blocks. By pre-allocating larger chunks for common granule sizes, we reduced overhead and improved cache locality.
Results and Lessons Learned
After deploying these patches, the billing pipeline returned to its expected speed. Query latency dropped by over 60% for mixed-namespace queries, and the intermittent slowdowns disappeared. More importantly, the per-namespace retention feature finally became usable, unlocking Ready-Analytics for dozens more internal teams.
This experience taught us that hidden bottlenecks often live where standard monitoring doesn’t look. ClickHouse is a powerful tool, but its internal algorithms sometimes need tuning for specific workloads—especially at petabyte scale. We contributed our patches back to the open-source community, hoping others can avoid similar pitfalls.