Lookout.
A distributed multi-camera video RAG system. Every camera embeds locally with Gemma 4 on Cactus; only vectors cross the network. Ask a question in plain English and get back the exact clips, cameras, and seconds the event happened.
Watching many cameras at once, without shipping a single raw frame to the cloud.
Surveillance, robotics, and smart-space deployments keep adding cameras. Operators keep asking the same kinds of questions: when did the package arrive? · who entered the lab between 3 and 5? · which camera last saw the red backpack?
Today they get two options. Ship every raw frame to a cloud vision model — expensive, privacy-hostile, bandwidth-bound, and useless the moment the network blinks. Or stay single-camera with a cheap NVR and lose every cross-feed narrative. Neither holds up the second privacy, latency, or offline operation matters.
Lookout treats every camera as a small edge computer. It captures locally, embeds locally with Gemma 4 on Cactus, and ships only lightweight vectors over a peer-to-peer QUIC link to a central leader. The leader fuses them into a single searchable index you can query in English. Raw clips stay on the follower that recorded them, and are only ever pulled when the operator hits play.
Three pieces. One job: turn cameras into a queryable memory.
A follower lives on every camera. A leader lives on one server. A UI lives in a browser. Everything between them is QUIC.
The pixel gate.
A process per camera. Captures 1080p @ 15fps and
16 kHz audio, chunks it into 5-second windows, runs
Gemma 4 on Cactus to produce one multimodal vector plus a caption,
and ships the result to the leader over iroh QUIC.
Raw MP4 sits in a rolling 24-hour local cache. Nothing else leaves.
The index and the answer.
A single Rust binary. Accepts follower connections on
cactus/ingest/v1. Persists vectors + metadata into
ChromaDB (with a DuckDB sightings sidecar). Serves an HTTP API
for the UI and hosts a DVR-style HLS endpoint for replay.
Query synthesis runs Gemma 4 on Cactus on the leader itself — no cloud call for the common path.
Ask. Watch. Cite.
React + Vite. Voice or text query, a live tile per connected follower, and a result view that renders ranked clips with camera, timestamp, LLM-synthesized answer, and inline jump-to-seconds citations.
The UI only knows about the leader. It never has to think about the P2P mesh underneath.
System anatomy — three processes, three continents, one mesh.
Followers dial the leader over iroh, an authenticated
QUIC transport with built-in NAT traversal. No port-forwarding, no VPN,
no CA to manage. Every peer is identified by an Ed25519
NodeId; the leader prints a dial ticket at startup and
that's the whole bootstrap. Bad networks fall back to a public relay
automatically.
Capture, chunk, embed, package, transport — all on the edge device.
Each follower runs a tight loop. The embedding step is the interesting one: Gemma 4 on Cactus produces a multimodal vector over a short window of frames in a few hundred milliseconds on Apple Silicon. The same model produces a one-sentence caption, which gets embedded separately so retrieval can fuse dense video similarity with sparse caption matches.
Capture
Webcam and mic feed into a 10-second sliding buffer. Video at
1080p @ 15fps, audio mono at 16 kHz.
Monotonic clock, UTC timestamps — cross-camera ordering is never
ambiguous.
Chunk
Buffer sliced into 5-second windows. Each chunk samples
K=4 evenly-spaced frames plus the matching audio
segment. This is the atomic unit for embedding, indexing, and
playback.
Embed (Gemma 4 on Cactus)
Frames and audio go through Gemma 4 via Cactus, entirely on-device. Vision-tower pool gives a video vector; audio pathway gives an audio vector. Concatenate, L2-normalize, done. In parallel, Gemma produces a one-sentence caption used as a sparse channel for hybrid retrieval.
Package
{ embedding, caption, camera_id, start_ts, end_ts, chunk_id }
is postcard-encoded into a length-prefixed frame. Raw clips stay
on disk under a 24-hour rolling cache, keyed by
chunk_id so the leader can fetch them later by
content hash.
Transport
One long-lived iroh QUIC stream pushes frames on the
cactus/ingest/v1 ALPN. TLS 1.3 throughout; each peer
is authenticated by its Ed25519 NodeId. If the
leader drops, the follower switches to a bounded on-disk ring
buffer and drains it oldest-first on reconnect.
Engineering notes
mmap + shared weights.
Gemma 4 on Cactus uses mmap and zero-copy weight loading. A 4B multimodal model runs comfortably on an M-series Mac mini with headroom for the capture pipeline on the same box.
Drop samples before chunks.
If inference falls behind real-time, the follower drops frame samples first, captions second, and entire chunks last — emitting a metric every time. The pipeline never silently loses fidelity.
At-least-once is cheap on disk.
When the iroh stream errors, chunks are written to a bounded ring
buffer on disk and drained oldest-first on reconnect, deduped by
chunk_id. Followers on flaky Wi-Fi or LTE recover
transparently.
Demo-in-two-minutes.
A --synthetic flag short-circuits Gemma and emits
random vectors on the same cadence, so the transport layer and
leader can be exercised end-to-end without spinning up the model
at all.
What actually bit us
- Mutex contention under load. Perception and query embedding shared the same Gemma model. We made perception fire-and-forget and gave user queries absolute priority — otherwise investigation latency ballooned into the tens of seconds.
-
Gemma's tool-call format. Cactus's JSON parser
counts braces; any
{inside a tool description truncated the prompt silently. Flat, brace-free descriptions unblocked the whole agent path. - Network is never flat. iroh handles NAT traversal and relay fallback, but the on-disk spool is what keeps a follower honest when the leader or Wi-Fi blinks. We tuned it to 4 GB by default.
Three ALPNs on one QUIC endpoint.
Every node-to-node link is a single iroh QUIC connection identified by
an ALPN string, carrying length-prefixed postcard-encoded
frames. The leader is the accepting side; followers dial it using a
ticket printed at startup. After a Hello, the stream is
long-lived and bidirectional.
EmbeddingChunk
frames with at-least-once + chunk_id acks.
GetConfig, SetConfig,
Ping, RotateKeys. Health-checked every
10 s.
Natural-language retrieval — five stages, under five seconds.
A query comes in as plain English. The leader embeds it with Gemma on
Cactus in text-only mode, runs filtered ANN search across the
per-modality collections in ChromaDB, fuses the results with
reciprocal rank fusion, and hands the top-K captions plus their
(camera, timestamp) metadata back to Gemma for synthesis.
Stages, in detail
Parse the question
Surveillance questions come with implicit filters —
"between 3 and 5pm", "on cam-front". Before
touching ChromaDB, the leader asks Gemma to rewrite the query
into a JSON envelope { time_start_ms, time_end_ms, camera_ids, top_k }.
Defaults: last 30 minutes, top_k = 20, cap at 50.
Harmony-style thinking markers
(<|channel|>, <|message|>) are
stripped before JSON extraction.
Embed the query
Same Gemma 4 that captions incoming chunks, invoked text-only. L2-normalize, reject non-finite and zero-norm vectors, pass through a 128-entry LRU keyed on the raw query string. During demos and iteration the cache turns repeat queries into a hash lookup.
Modality-aware ANN
Two ChromaDB collections — video-clips and
audio-clips — indexed under cosine distance. The
where filter is assembled from the parsed envelope;
each collection is queried with n_results = top_k * 2
to leave headroom for fusion. If ChromaDB is unreachable, the
leader degrades to a brute-force cosine scan over the in-memory
store. The API contract doesn't change.
{
"start_ts_ms": { "$lte": end_ms },
"end_ts_ms": { "$gte": start_ms },
"camera_id": { "$in": ["cam-front", "cam-lab"] }
}
Fuse · RRF + caption boost
Video and audio hit-lists are merged with Reciprocal Rank Fusion
(RRF_K = 60), scaled by
LEADER_RRF_WEIGHT (default 2.0). A
caption-overlap boost (capped at 0.15) keeps exact
phrases like "red backpack" from drowning in near-neighbors.
score = modality_aware_cosine(q, chunk)
+ rrf_weight * Σ 1 / (RRF_K + rank_in_modality)
+ caption_boost
Synthesize the answer
Top 10 chunks flatten into a numbered caption list and get
passed to Gemma with max_tokens = 256,
temperature = 0.2, and a terse system prompt
requiring citations in the form (#2, cam-lab).
Only captions go in — an earlier version re-attached JPEGs and
Gemma started describing the scene instead of answering the
question. Text-only was faster and more accurate.
Latency budget
Targets we held through the demo, on a single Mac mini against a week of 8-camera data.
The UI splits the request: /api/search returns hits only
(~100–500 ms) so the citation grid lights up immediately, then
/api/answer runs synthesis in the background. Perceived
latency is the cheap half.
What 24 hours taught us.
Shipping a distributed multimodal RAG system in a weekend forced decisions that would normally get over-engineered. Four we'll carry forward.
On-device multimodal is genuinely ready.
Gemma 4 on Cactus produces useful video and audio embeddings on consumer hardware, fast enough for soft-real-time ingest. The "send raw frames to a hosted vision model" era of video understanding is ending. The interesting systems push inference to the edge and only centralize vectors.
P2P transport is a superpower for enterprise demos.
Plugging four laptops into four different networks and watching them all connect to one leader with zero configuration — no port forwarding, no VPN, no firewall rules — was the single most compelling moment of the demo. Judges and B2B buyers latched onto it fastest.
The hard part is the retrieval, not the model.
Swapping embedding cadence or the synthetic backend was easy. Getting multi-camera, time-filtered, multi-modal retrieval to produce good top-K results took the most iteration. Window size, caption quality, and fusion weights all mattered more than any single knob on the model.
Feature flags are how you demo fast.
Having --synthetic, --no-camera, and a
cactus feature flag meant the system ran end-to-end
on any machine in the room within two minutes. That flexibility
was worth more during the hackathon than any single piece of the
pipeline.
Three tracks. One weekend.
Lookout was built in ~24 hours at the Cactus × DeepMind × YC hackathon. It took every prize it was eligible for.
Highest commercial viability for offline enterprise tools. Judged on real-world problem fit and privacy-first architecture — the second the "raw video never leaves the follower" slide went up, buyers leaned in.
For pushing the hardware/software stack — on-device multimodal embeddings, custom P2P wire protocol, cross-language leader/follower, and an agent that runs fully offline when it has to.
Winner takes all — correctness of the MVP, quality of the demo, and venture-scale potential across the whole judging rubric.