Back to Blog
11 min read
Distributed Systems

Buraq: Engineering a Resilient Distributed Task Queue with Go and Redis Streams

A deep-dive into Buraq — a highly concurrent, self-healing task queue built on Go goroutines and Redis Streams with automatic retries, Dead-Letter Queue isolation, and Prometheus telemetry.

The Problem with Off-the-Shelf Queues

Every distributed system eventually needs asynchronous job processing. You could reach for Celery, BullMQ, or Sidekiq — but each comes with trade-offs: Python's GIL bottlenecks, Node's single-threaded event loop, or Ruby's memory overhead. I wanted something purpose-built with Go's concurrency primitives and Redis Streams' consumer group guarantees.

Buraq is the result: a highly concurrent, resilient distributed task queue that processes tasks with automatic retries, dead-letter isolation, and full observability — all in ~500 lines of Go.

Architecture Overview

┌─────────────┐     XADD      ┌─────────────────────┐
│  Producer   │──────────────▶│   Redis Stream       │
│  (XADD)    │               │   buraq:tasks        │
└─────────────┘               └──────────┬──────────┘
                                         │ XREADGROUP
                                         ▼
                              ┌─────────────────────┐
                              │   Consumer Group     │
                              │   (N goroutines)     │
                              └──────────┬──────────┘
                                    ┌────┴────┐
                                    │ Success? │
                                    └────┬────┘
                                   Yes   │   No
                                   ▼     │    ▼
                              ┌──────┐   │ ┌──────────┐
                              │ XACK │   │ │ Retry ≤3 │
                              └──────┘   │ └────┬─────┘
                                         │   Exceeded
                                         │      ▼
                                         │ ┌──────────┐
                                         │ │  DLQ     │
                                         │ │ buraq:dlq│
                                         │ └──────────┘
                                         ▼
                              ┌─────────────────────┐
                              │  Prometheus Metrics  │
                              │  :2112/metrics       │
                              └─────────────────────┘

Core Design Decisions

1. Redis Streams over Pub/Sub

I chose Redis Streams (XADD, XREADGROUP, XACK) over Redis Pub/Sub for one critical reason: message durability. With Streams, messages persist in the log even if no consumer is online. Consumer Groups ensure each message is processed exactly once across a pool of workers — true competing consumer semantics.

2. Goroutine Worker Pool

The consumer spins up a configurable pool of goroutines, each running an independent fetch-process-acknowledge loop:

func (c *Consumer) Start(ctx context.Context) {
    for i := 0; i < c.poolSize; i++ {
        go c.worker(ctx, i)
    }
}

func (c *Consumer) worker(ctx context.Context, id int) {
    for {
        select {
        case <-ctx.Done():
            return
        default:
            msgs := c.readStream(ctx)
            for _, msg := range msgs {
                c.process(msg)
            }
        }
    }
}

Each goroutine is lightweight (~8KB stack), so spinning up 100+ concurrent workers costs almost nothing compared to OS thread-based alternatives.

3. Automatic Retry & Dead-Letter Queue

When a worker returns an error, Buraq automatically re-queues the task. Each task carries a retry counter in its metadata. After 3 failures, the task is moved to an isolation Dead-Letter Queue (buraq:dlq) for manual inspection — preventing "poison pill" messages from blocking the entire pipeline.

4. Graceful Shutdown

Buraq intercepts SIGINT and SIGTERM signals to gracefully drain in-flight tasks:

  • The fetch loop stops accepting new messages
  • Currently executing goroutines finish their work
  • All pending XACKs are flushed
  • No ghost tasks, no lost work

Observability: Prometheus Metrics

Buraq exposes a /metrics endpoint on port 2112 with Prometheus-compatible telemetry:

| Metric | Type | Description | |-------------------------------|-----------|--------------------------------------| | buraq_tasks_processed_total | Counter | Total tasks successfully processed | | buraq_tasks_failed_total | Counter | Total task failures (before retry) | | buraq_tasks_dlq_total | Counter | Tasks routed to Dead-Letter Queue | | buraq_task_duration_seconds | Histogram | Processing time per task |

Wire this into Grafana with the included docker-compose.yml for real-time dashboards showing throughput, failure rates, and processing latency.

The Visual Dashboard

The Enterprise Buraq distribution includes a dark-mode visual dashboard built with Next.js 15, Framer Motion, and Shadcn/UI. It provides real-time visualization of:

  • Task throughput over time
  • Active worker count and utilization
  • DLQ depth and failure patterns
  • Benchmark results with live charting

Project Structure

buraq/
├── task/        # Core Task struct + JSON serialization
├── producer/    # Reliable XADD job enqueuing
├── consumer/    # Worker pool with Consumer Groups
├── metrics/     # Prometheus metric collectors
├── docs/        # Architecture decisions & teaching guides
└── docker-compose.yml  # Redis + Prometheus + Grafana stack

Roadmap

  • Timeouts: Define expiration windows to auto-fail hung task executions
  • Cron Logic: Support delayed and recurring scheduled jobs

What I Learned

  1. Redis Streams are production-ready — Consumer Groups provide exactly the semantics you need for a task queue
  2. Go's concurrency is a superpower — 100 goroutines cost less than 1MB of memory
  3. DLQ is non-negotiable — Without it, one bad message can block your entire pipeline
  4. Graceful shutdown prevents data loss — Signal handling isn't optional in production systems

View the source code on GitHub →