Buraq: Engineering a Resilient Distributed Task Queue with Go and Redis Streams

The Problem with Off-the-Shelf Queues

Every distributed system eventually needs asynchronous job processing. You could reach for Celery, BullMQ, or Sidekiq — but each comes with trade-offs: Python's GIL bottlenecks, Node's single-threaded event loop, or Ruby's memory overhead. I wanted something purpose-built with Go's concurrency primitives and Redis Streams' consumer group guarantees.

Buraq is the result: a highly concurrent, resilient distributed task queue that processes tasks with automatic retries, dead-letter isolation, and full observability — all in ~500 lines of Go.

Architecture Overview

┌─────────────┐     XADD      ┌─────────────────────┐
│  Producer   │──────────────▶│   Redis Stream       │
│  (XADD)    │               │   buraq:tasks        │
└─────────────┘               └──────────┬──────────┘
                                         │ XREADGROUP
                                         ▼
                              ┌─────────────────────┐
                              │   Consumer Group     │
                              │   (N goroutines)     │
                              └──────────┬──────────┘
                                    ┌────┴────┐
                                    │ Success? │
                                    └────┬────┘
                                   Yes   │   No
                                   ▼     │    ▼
                              ┌──────┐   │ ┌──────────┐
                              │ XACK │   │ │ Retry ≤3 │
                              └──────┘   │ └────┬─────┘
                                         │   Exceeded
                                         │      ▼
                                         │ ┌──────────┐
                                         │ │  DLQ     │
                                         │ │ buraq:dlq│
                                         │ └──────────┘
                                         ▼
                              ┌─────────────────────┐
                              │  Prometheus Metrics  │
                              │  :2112/metrics       │
                              └─────────────────────┘

Core Design Decisions

1. Redis Streams over Pub/Sub

I chose Redis Streams (XADD, XREADGROUP, XACK) over Redis Pub/Sub for one critical reason: message durability. With Streams, messages persist in the log even if no consumer is online. Consumer Groups ensure each message is processed exactly once across a pool of workers — true competing consumer semantics.

2. Goroutine Worker Pool

The consumer spins up a configurable pool of goroutines, each running an independent fetch-process-acknowledge loop:

func (c *Consumer) Start(ctx context.Context) {
    for i := 0; i < c.poolSize; i++ {
        go c.worker(ctx, i)
    }
}

func (c *Consumer) worker(ctx context.Context, id int) {
    for {
        select {
        case <-ctx.Done():
            return
        default:
            msgs := c.readStream(ctx)
            for _, msg := range msgs {
                c.process(msg)
            }
        }
    }
}

Each goroutine is lightweight (~8KB stack), so spinning up 100+ concurrent workers costs almost nothing compared to OS thread-based alternatives.

3. Automatic Retry & Dead-Letter Queue

When a worker returns an error, Buraq automatically re-queues the task. Each task carries a retry counter in its metadata. After 3 failures, the task is moved to an isolation Dead-Letter Queue (buraq:dlq) for manual inspection — preventing "poison pill" messages from blocking the entire pipeline.

4. Graceful Shutdown

Buraq intercepts SIGINT and SIGTERM signals to gracefully drain in-flight tasks:

The fetch loop stops accepting new messages
Currently executing goroutines finish their work
All pending XACKs are flushed
No ghost tasks, no lost work

Observability: Prometheus Metrics

Buraq exposes a /metrics endpoint on port 2112 with Prometheus-compatible telemetry:

| Metric | Type | Description | |-------------------------------|-----------|--------------------------------------| | buraq_tasks_processed_total | Counter | Total tasks successfully processed | | buraq_tasks_failed_total | Counter | Total task failures (before retry) | | buraq_tasks_dlq_total | Counter | Tasks routed to Dead-Letter Queue | | buraq_task_duration_seconds | Histogram | Processing time per task |

Wire this into Grafana with the included docker-compose.yml for real-time dashboards showing throughput, failure rates, and processing latency.

The Visual Dashboard

The Enterprise Buraq distribution includes a dark-mode visual dashboard built with Next.js 15, Framer Motion, and Shadcn/UI. It provides real-time visualization of:

Task throughput over time
Active worker count and utilization
DLQ depth and failure patterns
Benchmark results with live charting

Project Structure

buraq/
├── task/        # Core Task struct + JSON serialization
├── producer/    # Reliable XADD job enqueuing
├── consumer/    # Worker pool with Consumer Groups
├── metrics/     # Prometheus metric collectors
├── docs/        # Architecture decisions & teaching guides
└── docker-compose.yml  # Redis + Prometheus + Grafana stack

Roadmap

Timeouts: Define expiration windows to auto-fail hung task executions
Cron Logic: Support delayed and recurring scheduled jobs

What I Learned

Redis Streams are production-ready — Consumer Groups provide exactly the semantics you need for a task queue
Go's concurrency is a superpower — 100 goroutines cost less than 1MB of memory
DLQ is non-negotiable — Without it, one bad message can block your entire pipeline
Graceful shutdown prevents data loss — Signal handling isn't optional in production systems

View the source code on GitHub →