Note: This project is not deployed in production due to bandwidth costs. A video demonstration is available below.
| Metric | Latency |
|---|---|
| P50 | 9µs |
| Mean | 20µs |
| P95 | 63µs |
| P99 | 195µs |
Throughput: 500-1000 msg/s across 4 exchanges × 9 symbols (36 simultaneous streams)
## Demonstration

Real-time interface: multi-exchange aggregation, order books, trade streams, and performance metrics.
## Low-Latency Optimizations

### 1. CPU Cache-Optimized Order Book
Classic implementations use a `BTreeMap<Price, Quantity>` for O(log n) operations. FlowRS uses a fixed-size `Vec` instead:
| BTreeMap (Classic) | Vec (FlowRS) |
|---|---|
| Pointer chasing | Contiguous memory block |
| Scattered heap allocations | Single allocation |
| Cache misses on every node | Prefetch-friendly |
| O(log n) but cache-cold | O(n) but L1/L2 cache-hot |
Why it wins: with 25 price levels (200 bytes), the entire book fits in L1 cache. CPUs iterate contiguous memory roughly 100x faster than they chase pointers, so the theoretical O(log n) advantage disappears when every node access is a cache miss.
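A minimal sketch of such a fixed-depth book side. The struct layout, field names, and depth handling here are illustrative assumptions, not the FlowRS source; the point is a single contiguous allocation scanned linearly:

```rust
const DEPTH: usize = 25;

#[derive(Clone, Copy, Default)]
struct Level {
    price: u64, // fixed-point, 1e8 scale
    qty: u64,
}

struct BookSide {
    levels: [Level; DEPTH], // one contiguous block, fits in L1 cache
    len: usize,
}

impl BookSide {
    fn new() -> Self {
        Self { levels: [Level::default(); DEPTH], len: 0 }
    }

    // O(n) linear scan, but over contiguous, prefetch-friendly memory.
    fn update(&mut self, price: u64, qty: u64) {
        for lvl in &mut self.levels[..self.len] {
            if lvl.price == price {
                lvl.qty = qty;
                return;
            }
        }
        if self.len < DEPTH {
            self.levels[self.len] = Level { price, qty };
            self.len += 1;
        }
    }

    fn best_bid(&self) -> Option<Level> {
        self.levels[..self.len].iter().copied().max_by_key(|l| l.price)
    }
}
```

At this size the whole scan touches a handful of cache lines, which is why the linear search beats a tree walk in practice.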
### 2. Fixed-Point Integer Arithmetic
All prices and quantities use `u64` with a 1e8 scale factor instead of `Decimal` or `f64`:
```rust
const SCALE_FACTOR: u64 = 100_000_000; // 8 decimals
// "97234.56" → 9_723_456_000_000

fn fast_parse_u64(s: &str) -> Option<u64> {
    // Pure integer math, zero allocation
    // Inline ASCII → digit conversion
}
```
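A runnable sketch of such a parser, assuming the 1e8 scale described above. This is one plausible implementation of the idea, not the FlowRS source:

```rust
const SCALE: u64 = 100_000_000; // 8 implied decimals

// Parse "97234.56" into a fixed-point u64 with pure integer math:
// no floats, no intermediate allocation.
fn fast_parse_u64(s: &str) -> Option<u64> {
    let (int_part, frac_part) = match s.split_once('.') {
        Some((i, f)) => (i, f),
        None => (s, ""),
    };
    if int_part.is_empty() || frac_part.len() > 8 {
        return None;
    }
    let mut value: u64 = 0;
    for b in int_part.bytes() {
        if !b.is_ascii_digit() {
            return None;
        }
        // Inline ASCII → digit conversion
        value = value.checked_mul(10)?.checked_add(u64::from(b - b'0'))?;
    }
    let mut frac: u64 = 0;
    for b in frac_part.bytes() {
        if !b.is_ascii_digit() {
            return None;
        }
        frac = frac * 10 + u64::from(b - b'0');
    }
    // Pad missing decimal places: "56" with 8 implied decimals → 56_000_000
    let pad = 10u64.pow((8 - frac_part.len()) as u32);
    value.checked_mul(SCALE)?.checked_add(frac * pad)
}
```

The checked arithmetic guards against overflow on malformed or oversized inputs while staying branch-light on the happy path.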
Advantages:
- No floating-point errors: exact decimal representation
- 5-10x faster than `Decimal`: native CPU operations
- 8 bytes vs 16 bytes: better cache utilization
- SIMD-friendly: vectorizable comparisons
### 3. Lock-Free Metrics Pipeline

Latency tracking uses a ring buffer with atomic operations; percentile calculation runs in a background task:
```rust
// Hot path: O(1) atomic write
fn record(&self, latency_us: u64) {
    let idx = self.write_index.fetch_add(1, Relaxed) & 0x7FF; // bitwise AND, not modulo
    self.samples[idx].store(latency_us, Relaxed);
}

// Background task (every 900ms): O(n) partial selection
fn update_percentiles(&self) {
    // select_nth_unstable() instead of a full sort
    // Pre-allocated scratch buffer
}
```
Techniques:
- `index & MASK` instead of `index % SIZE`: avoids expensive integer division
- `select_nth_unstable()`: O(n) partial selection vs O(n log n) full sort
- Pre-allocated buffers: zero allocation in the hot path
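The two paths can be sketched end to end with only `std` atomics. Buffer size, field names, and the scratch-buffer API are assumptions for illustration, not the FlowRS source:

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering::Relaxed};

const SIZE: usize = 2048; // power of two, so `& MASK` replaces `%`
const MASK: usize = SIZE - 1;

struct LatencyRecorder {
    write_index: AtomicUsize,
    samples: Vec<AtomicU64>,
}

impl LatencyRecorder {
    fn new() -> Self {
        Self {
            write_index: AtomicUsize::new(0),
            samples: (0..SIZE).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    // Hot path: O(1), no locks, no allocation.
    fn record(&self, latency_us: u64) {
        let idx = self.write_index.fetch_add(1, Relaxed) & MASK;
        self.samples[idx].store(latency_us, Relaxed);
    }

    // Background path: snapshot into a pre-allocated scratch buffer,
    // then O(n) partial selection instead of a full sort.
    fn percentile(&self, scratch: &mut Vec<u64>, p: f64) -> u64 {
        scratch.clear();
        scratch.extend(self.samples.iter().map(|s| s.load(Relaxed)));
        let k = ((scratch.len() - 1) as f64 * p) as usize;
        *scratch.select_nth_unstable(k).1
    }
}
```

With `Relaxed` ordering the recorder makes no cross-sample consistency guarantee; a percentile snapshot may mix old and new samples, which is acceptable for monitoring.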
### 4. Concurrency Model

`DashMap` (a sharded concurrent HashMap) instead of `RwLock<HashMap>`:
- 16 independent shards with fine-grained locking
- Multiple exchanges update different symbols without contention
Isolated Exchange Tasks:
- Each exchange runs in its own Tokio task
- Independent reconnection logic (one outage doesn't affect the others)
- Automatic state recovery on disconnection
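The sharding idea behind `DashMap` can be sketched with only `std`: keys hash to one of 16 independently locked shards, so writers touching different shards never contend. Types and method names here are illustrative, not the DashMap API:

```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 16; // power of two so masking picks a shard

struct ShardedMap {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedMap {
    fn new() -> Self {
        Self {
            shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Hash the key, then mask down to a shard index.
    fn shard_for(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) & (SHARDS - 1)
    }

    // Only one shard's lock is held, so updates to different
    // symbols on different shards proceed in parallel.
    fn insert(&self, key: &str, value: u64) {
        self.shards[self.shard_for(key)]
            .lock()
            .unwrap()
            .insert(key.to_string(), value);
    }

    fn get(&self, key: &str) -> Option<u64> {
        self.shards[self.shard_for(key)].lock().unwrap().get(key).copied()
    }
}
```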
## Optimization Impact

| Optimization | Before | After |
|---|---|---|
| Percentile calculation | In hot path (1.26ms P99) | Background task (195µs P99) |
| Price storage | `Decimal` | `u64`, 5-10x faster |
| Order book | `BTreeMap` | `Vec`, cache-friendly |
| Metrics buffer | `Mutex<Vec>` | Lock-free ring buffer |
## Architecture

```
┌────────────────────────────────────────────────┐
│                Exchange Manager                │
│            (Tokio task per exchange)           │
└────────────────────────────────────────────────┘
      │           │           │           │
      ▼           ▼           ▼           ▼
 ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
 │ Binance │ │  Bybit  │ │ Kraken  │ │Coinbase │
 └─────────┘ └─────────┘ └─────────┘ └─────────┘
      │           │           │           │
      └───────────┴─────┬─────┴───────────┘
                        ▼
           ┌────────────────────────┐
           │   OrderBook Manager    │
           │  (DashMap per symbol)  │
           └────────────────────────┘
                        │
                        ▼
           ┌────────────────────────┐
           │   Broadcast Channel    │
           └────────────────────────┘
                        │
                        ▼
           ┌────────────────────────┐
           │   WebSocket Clients    │
           └────────────────────────┘
```
## Plugin Architecture

Adding a new exchange requires ~200 lines:
```rust
pub struct NewExchangeConnector { symbols: Vec<String> }

impl NewExchangeConnector {
    pub fn build_subscription_url(&self) -> String { /* ... */ }
    pub fn parse_message(&self, raw: &str) -> Option<MarketMessage> { /* ... */ }
    pub fn get_subscription_messages(&self) -> Vec<String> { /* ... */ }
}
```
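To show how the three methods fit together, here is a filled-in connector for a made-up exchange. `MarketMessage`, the URL, and the `"SYMBOL|PRICE|QTY"` wire format are all invented for this sketch; the real trait surface lives in the FlowRS source:

```rust
// Hypothetical message type for illustration only.
#[derive(Debug, PartialEq)]
struct MarketMessage {
    symbol: String,
    price: u64, // fixed-point, 1e8 scale
    qty: u64,
}

struct NewExchangeConnector {
    symbols: Vec<String>,
}

impl NewExchangeConnector {
    fn build_subscription_url(&self) -> String {
        // Placeholder endpoint for a fictional exchange.
        "wss://stream.example-exchange.com/ws".to_string()
    }

    fn get_subscription_messages(&self) -> Vec<String> {
        self.symbols
            .iter()
            .map(|s| format!(r#"{{"op":"subscribe","channel":"trades","symbol":"{s}"}}"#))
            .collect()
    }

    // Assumed wire format "SYMBOL|PRICE|QTY"; a float parse is used
    // here for brevity, where FlowRS would use its integer fast path.
    fn parse_message(&self, raw: &str) -> Option<MarketMessage> {
        let mut parts = raw.split('|');
        let symbol = parts.next()?.to_string();
        let price: f64 = parts.next()?.parse().ok()?;
        let qty: f64 = parts.next()?.parse().ok()?;
        Some(MarketMessage {
            symbol,
            price: (price * 1e8) as u64,
            qty: (qty * 1e8) as u64,
        })
    }
}
```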
## Tech Stack

### Backend (Rust)
- Tokio async runtime
- tokio-tungstenite (WebSocket)
- DashMap (concurrent HashMap)
- serde / serde_json
- jemalloc allocator

### Frontend (Vue 3 + TypeScript)
- Vite build tool
- Composition API
- Native WebSocket
## Key Learnings

- Cache locality beats algorithmic complexity: O(n) with L1 cache > O(log n) with cache misses
- Integers beat decimals: fixed-point arithmetic is faster and avoids precision errors
- Move work out of the hot path: background tasks for expensive calculations
- Avoid allocations: pre-allocated buffers, `Option` instead of `Box<dyn Error>`
- Lock-free when possible: atomics for metrics, sharded maps for data