The Invisible Wall
Our system was hitting an invisible wall. Our RabbitMQ cluster had plenty of CPU and memory, yet message throughput had plateaued. Worse, our p99 latencies were spiking under load, causing downstream timeouts. The culprit? A single, high-traffic queue.
In RabbitMQ, a single queue is handled by a single process, effectively bound to one CPU core on one node. No matter how powerful the server, you can only push so many messages through one queue. Scaling up had stopped helping; we needed to scale out.
The Solution: Divide and Conquer with Sharding
Instead of one massive queue, we partitioned our data across multiple smaller queues, a technique known as sharding. This allowed us to parallelize message processing across multiple cores and even multiple nodes.
Here’s the architecture:
graph TD
subgraph Producers
P1(Producer 1)
P2(Producer 2)
end
subgraph RabbitMQ Cluster
X(Exchange);
subgraph Sharded Queues
Q1(Shard 1);
Q2(Shard 2);
Q3(Shard 3);
Q4(...);
end
end
subgraph Consumers
Q1 --> C1(Consumer Group 1);
Q2 --> C2(Consumer Group 2);
Q3 --> C3(Consumer Group 3);
Q4 --> C4(Consumer Group 4);
end
P1 --> X;
P2 --> X;
X -- routing_key ends in .1 --> Q1;
X -- routing_key ends in .2 --> Q2;
X -- routing_key ends in .3 --> Q3;
X -- routing_key ends in .N --> Q4;
How it works:
- Producers: Producers publish messages to a single exchange as usual. The magic is in the `routing_key`.
- Routing: We use a consistent hashing algorithm on an entity ID (e.g., `user_id`) to generate a shard number. This number is appended to the routing key (e.g., `event.created.5`). A producer sketch follows this list.
- Queues: We create a fixed number of queues (`event_queue_1`, `event_queue_2`, ...). The exchange routes the message to the correct queue based on the shard number in the routing key.
- Consumers: Each consumer (or a group of consumers for parallel processing) subscribes to a single shard. This ensures messages for a given entity are still processed in order, while the overall system processes messages in parallel.
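To make the routing concrete, here's a minimal producer sketch using the Python `pika` client. The `events` exchange name, the `event.created.*` key pattern, and the shard count are illustrative assumptions, and a simple stable hash taken modulo the shard count stands in for the consistent-hashing step described above:

```python
# Hypothetical producer: derive a shard from the entity ID and publish
# with the shard number appended to the routing key.
import hashlib

import pika

SHARD_COUNT = 4  # illustrative; must match the number of sharded queues

def shard_for(entity_id: str) -> int:
    # Stable hash so every event for the same entity lands on the same shard,
    # which is what preserves per-entity ordering.
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT + 1

def publish_event(channel, user_id: str, body: bytes) -> None:
    shard = shard_for(user_id)
    channel.basic_publish(
        exchange="events",
        routing_key=f"event.created.{shard}",  # e.g. "event.created.3"
        body=body,
        properties=pika.BasicProperties(delivery_mode=2),  # persistent
    )

if __name__ == "__main__":
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    publish_event(channel, user_id="42", body=b'{"event": "created"}')
    connection.close()
```

Because the shard is derived from `user_id` alone, every event for a given user keeps hitting the same queue, so per-user ordering survives the fan-out.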
Ensuring High Availability with Quorum Queues
Sharding solves the performance bottleneck, but what about reliability? For this, we use Quorum Queues, RabbitMQ's modern, Raft-based solution for data safety.
Unlike classic mirrored queues, Quorum Queues offer stronger data-safety guarantees and are RabbitMQ's recommended choice for highly available queues. One caveat: the queue type cannot be changed by a policy; it has to be chosen when the queue is declared, via the `x-queue-type` argument. So we declare every sharded queue as a quorum queue with an initial replication factor of 3:
# Quorum queues are selected at declaration time via the x-queue-type argument;
# x-quorum-initial-group-size sets the replication factor (here, 3 nodes).
# Repeated for each event_queue_N shard:
rabbitmqadmin declare queue name=event_queue_1 durable=true \
  arguments='{"x-queue-type": "quorum", "x-quorum-initial-group-size": 3}'
Fine-Tuning for Peak Performance
With the new architecture in place, we also tuned our consumers:
- Prefetch Count: We adjusted the consumer prefetch count to find the sweet spot between keeping the consumer busy and preventing it from holding too many unacknowledged messages (a consumer sketch follows this list).
- Concurrency: We scaled the number of consumer instances per shard to match the processing capacity needed for that shard's workload.
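For illustration, a per-shard consumer might look like the sketch below (again `pika`; the prefetch value of 50 is just an assumed starting point, and the right number depends on message size and per-message processing time):

```python
# Hypothetical consumer pinned to one shard, with a bounded prefetch window.
import pika

SHARD = 1  # each consumer instance (or group) handles exactly one shard

def handle(channel, method, properties, body):
    # ... process the message ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Prefetch bounds the unacknowledged messages this consumer holds at once:
# high enough to keep it busy, low enough that it doesn't hoard work.
channel.basic_qos(prefetch_count=50)

channel.basic_consume(queue=f"event_queue_{SHARD}", on_message_callback=handle)
channel.start_consuming()
```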
The Impact: Throughput Unlocked
- 2x Throughput: We immediately doubled our sustainable message throughput.
- Stable Latency: p95 and p99 latencies remained stable, even under peak load.
- No More Hotspots: The load was evenly distributed across the cluster, eliminating the single-queue bottleneck.
- Resilient by Design: With Quorum Queues, we could lose a node without losing data or availability.
By moving from a single-queue model to a sharded architecture, we built a message bus that can scale horizontally with our business needs.