Buraq: Engineering a Resilient Distributed Task Queue with Go and Redis Streams
A deep-dive into Buraq — a highly concurrent, self-healing task queue built on Go goroutines and Redis Streams with automatic retries, Dead-Letter Queue isolation, and Prometheus telemetry.
The Problem with Off-the-Shelf Queues
Every distributed system eventually needs asynchronous job processing. You could reach for Celery, BullMQ, or Sidekiq — but each comes with trade-offs: Python's GIL bottlenecks, Node's single-threaded event loop, or Ruby's memory overhead. I wanted something purpose-built with Go's concurrency primitives and Redis Streams' consumer group guarantees.
Buraq is the result: a highly concurrent, resilient distributed task queue that processes tasks with automatic retries, dead-letter isolation, and full observability — all in ~500 lines of Go.
Architecture Overview
┌─────────────┐ XADD ┌─────────────────────┐
│ Producer │──────────────▶│ Redis Stream │
│ (XADD) │ │ buraq:tasks │
└─────────────┘ └──────────┬──────────┘
│ XREADGROUP
▼
┌─────────────────────┐
│ Consumer Group │
│ (N goroutines) │
└──────────┬──────────┘
┌────┴────┐
│ Success? │
└────┬────┘
Yes │ No
▼ │ ▼
┌──────┐ │ ┌──────────┐
│ XACK │ │ │ Retry ≤3 │
└──────┘ │ └────┬─────┘
│ Exceeded
│ ▼
│ ┌──────────┐
│ │ DLQ │
│ │ buraq:dlq│
│ └──────────┘
▼
┌─────────────────────┐
│ Prometheus Metrics │
│ :2112/metrics │
└─────────────────────┘
Core Design Decisions
1. Redis Streams over Pub/Sub
I chose Redis Streams (XADD, XREADGROUP, XACK) over Redis Pub/Sub for one critical reason: message durability. With Streams, messages persist in the log even if no consumer is online. Consumer Groups ensure each message is processed exactly once across a pool of workers — true competing consumer semantics.
2. Goroutine Worker Pool
The consumer spins up a configurable pool of goroutines, each running an independent fetch-process-acknowledge loop:
func (c *Consumer) Start(ctx context.Context) {
for i := 0; i < c.poolSize; i++ {
go c.worker(ctx, i)
}
}
func (c *Consumer) worker(ctx context.Context, id int) {
for {
select {
case <-ctx.Done():
return
default:
msgs := c.readStream(ctx)
for _, msg := range msgs {
c.process(msg)
}
}
}
}
Each goroutine is lightweight (~8KB stack), so spinning up 100+ concurrent workers costs almost nothing compared to OS thread-based alternatives.
3. Automatic Retry & Dead-Letter Queue
When a worker returns an error, Buraq automatically re-queues the task. Each task carries a retry counter in its metadata. After 3 failures, the task is moved to an isolation Dead-Letter Queue (buraq:dlq) for manual inspection — preventing "poison pill" messages from blocking the entire pipeline.
4. Graceful Shutdown
Buraq intercepts SIGINT and SIGTERM signals to gracefully drain in-flight tasks:
- The fetch loop stops accepting new messages
- Currently executing goroutines finish their work
- All pending
XACKs are flushed - No ghost tasks, no lost work
Observability: Prometheus Metrics
Buraq exposes a /metrics endpoint on port 2112 with Prometheus-compatible telemetry:
| Metric | Type | Description |
|-------------------------------|-----------|--------------------------------------|
| buraq_tasks_processed_total | Counter | Total tasks successfully processed |
| buraq_tasks_failed_total | Counter | Total task failures (before retry) |
| buraq_tasks_dlq_total | Counter | Tasks routed to Dead-Letter Queue |
| buraq_task_duration_seconds | Histogram | Processing time per task |
Wire this into Grafana with the included docker-compose.yml for real-time dashboards showing throughput, failure rates, and processing latency.
The Visual Dashboard
The Enterprise Buraq distribution includes a dark-mode visual dashboard built with Next.js 15, Framer Motion, and Shadcn/UI. It provides real-time visualization of:
- Task throughput over time
- Active worker count and utilization
- DLQ depth and failure patterns
- Benchmark results with live charting
Project Structure
buraq/
├── task/ # Core Task struct + JSON serialization
├── producer/ # Reliable XADD job enqueuing
├── consumer/ # Worker pool with Consumer Groups
├── metrics/ # Prometheus metric collectors
├── docs/ # Architecture decisions & teaching guides
└── docker-compose.yml # Redis + Prometheus + Grafana stack
Roadmap
- Timeouts: Define expiration windows to auto-fail hung task executions
- Cron Logic: Support delayed and recurring scheduled jobs
What I Learned
- Redis Streams are production-ready — Consumer Groups provide exactly the semantics you need for a task queue
- Go's concurrency is a superpower — 100 goroutines cost less than 1MB of memory
- DLQ is non-negotiable — Without it, one bad message can block your entire pipeline
- Graceful shutdown prevents data loss — Signal handling isn't optional in production systems