
Omni-Sync
Collaborative SRE Playbook Engine handling multi-user state synchronization with CRDTs and WebSockets. Embeds live Kubernetes telemetry and streaming logs into playbooks.
The Problem
During Sev-1 incidents, Site Reliability Engineers work from static wiki pages while jumping between five different terminal windows. Because no one shares the same view of the system, engineers duplicate each other's debugging and downtime is extended.
System Architecture
A Next.js frontend uses Tiptap and Yjs to provide collaborative editing backed by Conflict-free Replicated Data Types (CRDTs). The Go backend manages the active WebSocket hub subscriptions and pipes live Prometheus metrics and Fluentd logs directly into the editor's blocks.
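The hub's fan-out role can be illustrated with a minimal sketch. It uses only the Go standard library and stands in buffered channels for WebSocket connections. The `Event` type, the topic names, and the drop-when-full policy are assumptions made for illustration, not details taken from the Omni-Sync codebase.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical telemetry message (a metric sample or a log line).
type Event struct {
	Topic   string // illustrative, e.g. "metrics:checkout"
	Payload string
}

// Hub tracks which subscriber channels are interested in which topics.
type Hub struct {
	mu   sync.RWMutex
	subs map[string]map[chan Event]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[string]map[chan Event]struct{})}
}

// Subscribe registers a buffered channel for a topic. In a real server,
// one goroutine per WebSocket connection would drain this channel.
func (h *Hub) Subscribe(topic string, buf int) chan Event {
	ch := make(chan Event, buf)
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.subs[topic] == nil {
		h.subs[topic] = make(map[chan Event]struct{})
	}
	h.subs[topic][ch] = struct{}{}
	return ch
}

// Unsubscribe removes the channel and closes it. Closing happens under the
// write lock, so Publish can never send on a closed channel.
func (h *Hub) Unsubscribe(topic string, ch chan Event) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.subs[topic][ch]; ok {
		delete(h.subs[topic], ch)
		close(ch)
	}
}

// Publish fans an event out to every subscriber of its topic and returns
// how many received it. A subscriber whose buffer is full is skipped so
// that one slow client cannot stall the others.
func (h *Hub) Publish(ev Event) int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	delivered := 0
	for ch := range h.subs[ev.Topic] {
		select {
		case ch <- ev:
			delivered++
		default: // buffer full: drop this event for this subscriber
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	a := hub.Subscribe("metrics:checkout", 4)
	b := hub.Subscribe("metrics:checkout", 4)
	n := hub.Publish(Event{Topic: "metrics:checkout", Payload: "cpu=0.82"})
	fmt.Println(n, (<-a).Payload, (<-b).Payload)
}
```

Dropping events for a slow subscriber is one possible back-pressure policy. It suits telemetry, where a newer sample supersedes an older one, but it would not be appropriate for the CRDT document updates themselves, which must all arrive.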
System architecture diagram — coming soon
Technical Challenges & Trade-offs
The hardest problems were handling WebSocket reconnects and preventing CRDT state corruption during network partitions. The solution is an offline-first synchronization queue that holds operations while the connection is down and replays them to the Redis persistence layer upon reconnection.
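A minimal sketch of such a replay queue follows, written in Go to match the backend. Everything named here (`Op`, `Sink`, `OfflineQueue`, `memSink`) is hypothetical. The persistence layer is modeled as an interface with an in-memory stand-in rather than a real Redis client, and per-client sequence numbers are an assumed way of making replays idempotent.

```go
package main

import (
	"errors"
	"fmt"
)

// Op is a hypothetical CRDT update: opaque bytes plus a per-client sequence number.
type Op struct {
	Seq    uint64
	Update []byte
}

// Sink abstracts the persistence layer (Redis in the text, an interface here).
type Sink interface {
	Apply(op Op) error
}

// OfflineQueue buffers operations and replays them in order once the sink is
// reachable again. It is not safe for concurrent use; a real implementation
// would need a mutex or a single owning goroutine.
type OfflineQueue struct {
	pending []Op
	nextSeq uint64
	online  bool
	sink    Sink
}

func NewOfflineQueue(sink Sink) *OfflineQueue {
	return &OfflineQueue{sink: sink}
}

// Submit always enqueues first, then flushes if the queue believes it is
// online, so an operation is never lost when a send fails halfway.
func (q *OfflineQueue) Submit(update []byte) error {
	q.nextSeq++
	q.pending = append(q.pending, Op{Seq: q.nextSeq, Update: update})
	if q.online {
		return q.flush()
	}
	return nil
}

// flush sends pending operations oldest first. On the first failure it marks
// the queue offline and keeps the unsent operations for the next replay.
func (q *OfflineQueue) flush() error {
	for len(q.pending) > 0 {
		if err := q.sink.Apply(q.pending[0]); err != nil {
			q.online = false
			return err
		}
		q.pending = q.pending[1:]
	}
	return nil
}

// Reconnect marks the queue online and replays everything buffered so far.
func (q *OfflineQueue) Reconnect() error { q.online = true; return q.flush() }

// Disconnect marks the queue offline; later Submits are only buffered.
func (q *OfflineQueue) Disconnect() { q.online = false }

// Pending reports how many operations are waiting to be replayed.
func (q *OfflineQueue) Pending() int { return len(q.pending) }

// memSink is an in-memory stand-in for the persistence layer. It ignores any
// sequence number it has already applied, which makes replay idempotent.
type memSink struct {
	down    bool
	lastSeq uint64
	applied []Op
}

func (s *memSink) Apply(op Op) error {
	if s.down {
		return errors.New("sink unreachable")
	}
	if op.Seq <= s.lastSeq {
		return nil // duplicate from a retried replay: already applied
	}
	s.lastSeq = op.Seq
	s.applied = append(s.applied, op)
	return nil
}

func main() {
	sink := &memSink{}
	q := NewOfflineQueue(sink)
	q.Reconnect()
	q.Submit([]byte("a")) // sent immediately
	q.Disconnect()
	q.Submit([]byte("b")) // buffered
	q.Submit([]byte("c")) // buffered
	fmt.Println(q.Pending(), len(sink.applied))
	q.Reconnect() // replays b and c in order
	fmt.Println(q.Pending(), len(sink.applied))
}
```

Enqueueing before sending, and removing an operation only after the sink accepts it, means a connection drop in the middle of a replay can at worst cause a resend. The sequence check in the sink then turns that resend into a no-op.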
Business Impact & Metrics
Decreased Mean Time To Resolution (MTTR) by 40% by putting live telemetry securely in the same collaborative view as the incident playbooks.