24-hour build · Cactus × DeepMind × YC hackathon · April 2026

Lookout.

A distributed multi-camera video RAG system. Every camera embeds locally with Gemma 4 on Cactus; only vectors cross the network. Ask a question in plain English and get back the exact clips, cameras, and seconds where the event happened.

Timeline
24 hours · April 2026
Stack
Rust · iroh · Cactus · Gemma 4 · React
Focus
On-device AI · Multimodal RAG · P2P
Team
4 builders · 1 weekend
4+ cams
concurrent followers
<5 s p95
query → answer
~4.5 KB
on the wire per clip
0 bytes
raw-video egress default
01 · Motivation

Watching many cameras at once, without shipping a single raw frame to the cloud.

Surveillance, robotics, and smart-space deployments keep adding cameras. Operators keep asking the same kinds of questions: when did the package arrive? · who entered the lab between 3 and 5? · which camera last saw the red backpack?

Today they get two options. Ship every raw frame to a cloud vision model: expensive, privacy-hostile, bandwidth-bound, and useless the moment the network blinks. Or stay single-camera with a cheap NVR and lose every cross-feed narrative. Neither holds up once privacy, latency, or offline operation matters.

Lookout treats every camera as a small edge computer. It captures locally, embeds locally with Gemma 4 on Cactus, and ships only lightweight vectors over a peer-to-peer QUIC link to a central leader. The leader fuses them into a single searchable index you can query in English. Raw clips stay on the follower that recorded them, and are only ever pulled when the operator hits play.

02 · Overview

Three pieces. One job: turn cameras into a queryable memory.

A follower lives on every camera. A leader lives on one server. A UI lives in a browser. Everything between them is QUIC.

Follower · edge

The pixel gate.

A process per camera. Captures 1080p @ 15fps and 16 kHz audio, chunks it into 5-second windows, runs Gemma 4 on Cactus to produce one multimodal vector plus a caption, and ships the result to the leader over iroh QUIC.

Raw MP4 sits in a rolling 24-hour local cache. Nothing else leaves.

Leader · server

The index and the answer.

A single Rust binary. Accepts follower connections on cactus/ingest/v1. Persists vectors + metadata into ChromaDB (with a DuckDB sightings sidecar). Serves an HTTP API for the UI and hosts a DVR-style HLS endpoint for replay.

Query synthesis runs Gemma 4 on Cactus on the leader itself — no cloud call for the common path.

UI · browser

Ask. Watch. Cite.

React + Vite. Voice or text query, a live tile per connected follower, and a result view that renders ranked clips with camera, timestamp, LLM-synthesized answer, and inline jump-to-seconds citations.

The UI only knows about the leader. It never has to think about the P2P mesh underneath.

03 · Architecture

System anatomy — three processes, three continents, one mesh.

Followers dial the leader over iroh, an authenticated QUIC transport with built-in NAT traversal. No port-forwarding, no VPN, no CA to manage. Every peer is identified by an Ed25519 NodeId; the leader prints a dial ticket at startup and that's the whole bootstrap. Bad networks fall back to a public relay automatically.

04 · Follower pipeline

Capture, chunk, embed, package, transport — all on the edge device.

Each follower runs a tight loop. The embedding step is the interesting one: Gemma 4 on Cactus produces a multimodal vector over a short window of frames in a few hundred milliseconds on Apple Silicon. The same model produces a one-sentence caption, which gets embedded separately so retrieval can fuse dense video similarity with sparse caption matches.

A follower capturing and embedding a live webcam feed in real time. Only the resulting vectors leave the device.

Capture

Webcam and mic feed into a 10-second sliding buffer. Video at 1080p @ 15fps, audio mono at 16 kHz. Capture uses a monotonic clock and UTC timestamps, so cross-camera ordering stays unambiguous.

Chunk

Buffer sliced into 5-second windows. Each chunk samples K=4 evenly-spaced frames plus the matching audio segment. This is the atomic unit for embedding, indexing, and playback.
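
The sampling step can be sketched as follows. This is a minimal illustration, assuming "evenly spaced" means the midpoint of each of K equal segments of the window; the function name and the midpoint rule are hypothetical and not taken from the Lookout source.

```rust
/// Frame rate and window length quoted in the text.
pub const FPS: usize = 15;
pub const WINDOW_SECS: usize = 5;
/// Number of frames sampled per chunk (K in the text).
pub const K: usize = 4;

/// Returns `k` evenly spaced frame indices within a window of `n_frames`
/// frames, using the midpoint of each of `k` equal segments.
/// Hypothetical helper: the real follower may space its samples differently.
pub fn sample_frame_indices(n_frames: usize, k: usize) -> Vec<usize> {
    if n_frames == 0 || k == 0 {
        return Vec::new();
    }
    // Never ask for more samples than there are frames.
    let k = k.min(n_frames);
    (0..k).map(|i| (2 * i + 1) * n_frames / (2 * k)).collect()
}
```

With the numbers above, a window holds 15 × 5 = 75 frames, and `sample_frame_indices(FPS * WINDOW_SECS, K)` selects frames 9, 28, 46, and 65.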

Embed (Gemma 4 on Cactus)

Frames and audio go through Gemma 4 via Cactus, entirely on-device. Vision-tower pool gives a video vector; audio pathway gives an audio vector. Concatenate, L2-normalize, done. In parallel, Gemma produces a one-sentence caption used as a sparse channel for hybrid retrieval.
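
The "concatenate, L2-normalize" step is small enough to show in full. The sketch below assumes both modality vectors arrive as plain `f32` slices; the function name is hypothetical, and the model calls that produce the vectors are out of scope.

```rust
/// Concatenates a video vector and an audio vector, then scales the result
/// to unit L2 norm. Returns `None` when the norm is zero or not finite,
/// since such a vector cannot be normalized.
pub fn fuse_modalities(video: &[f32], audio: &[f32]) -> Option<Vec<f32>> {
    let mut fused: Vec<f32> = video.iter().chain(audio.iter()).copied().collect();
    let norm = fused.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 || !norm.is_finite() {
        return None;
    }
    for x in fused.iter_mut() {
        *x /= norm;
    }
    Some(fused)
}
```

Because the output has unit length, a plain dot product between two fused vectors equals their cosine similarity.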

Package

{ embedding, caption, camera_id, start_ts, end_ts, chunk_id } is postcard-encoded into a length-prefixed frame. Raw clips stay on disk under a 24-hour rolling cache, keyed by chunk_id so the leader can fetch them later by content hash.
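
Length-prefixed framing can be sketched with the standard library alone. The payload here is treated as opaque bytes (in Lookout it would be the postcard-encoded record). The 4-byte big-endian prefix and the 1 MiB cap are assumptions made for this sketch, since the text does not state the prefix width or a size limit.

```rust
use std::io::{self, Read, Write};

/// Upper bound on a single frame. The 1 MiB value is an assumption chosen
/// for this sketch, not a documented Lookout limit.
pub const MAX_FRAME_LEN: usize = 1 << 20;

/// Writes `payload` preceded by its length as a 4-byte big-endian integer.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)
}

/// Reads one length-prefixed frame and returns its payload bytes.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    // Reject oversized lengths before allocating, so a corrupt prefix
    // cannot trigger a huge allocation.
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Both functions are generic over `Read` and `Write`, so the same code can be exercised against an in-memory buffer in tests and against a stream in production.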

Transport

One long-lived iroh QUIC stream pushes frames on the cactus/ingest/v1 ALPN. TLS 1.3 throughout; each peer is authenticated by its Ed25519 NodeId. If the leader drops, the follower switches to a bounded on-disk ring buffer and drains it oldest-first on reconnect.

Engineering notes

Zero-copy embedding

mmap + shared weights.

Gemma 4 on Cactus uses mmap and zero-copy weight loading. A 4B multimodal model runs comfortably on an M-series Mac mini with headroom for the capture pipeline on the same box.

Graceful degradation

Drop samples before chunks.

If inference falls behind real-time, the follower drops frame samples first, captions second, and entire chunks last — emitting a metric every time. The pipeline never silently loses fidelity.
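
The degradation order can be written as a small decision function. This is a sketch under stated assumptions: the backlog thresholds (1, 2, 3 or more waiting chunks), the reduced frame count, and all type and field names are illustrative, not the follower's actual values.

```rust
/// Counters the follower would export; field names are hypothetical.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct DegradeMetrics {
    pub frames_dropped: u64,
    pub captions_skipped: u64,
    pub chunks_dropped: u64,
}

/// What the follower will do for the next chunk.
#[derive(Debug, PartialEq, Eq)]
pub struct ChunkPlan {
    pub frames: usize,
    pub caption: bool,
    pub embed: bool,
}

const FULL_FRAMES: usize = 4;
const REDUCED_FRAMES: usize = 2;

/// Picks the cheapest plan that lets inference catch up, given how many
/// chunks are waiting. Frame samples go first, captions second, whole
/// chunks last, and every step is recorded in `metrics`.
pub fn plan_chunk(backlog_chunks: usize, metrics: &mut DegradeMetrics) -> ChunkPlan {
    let dropped = (FULL_FRAMES - REDUCED_FRAMES) as u64;
    match backlog_chunks {
        0 => ChunkPlan { frames: FULL_FRAMES, caption: true, embed: true },
        1 => {
            metrics.frames_dropped += dropped;
            ChunkPlan { frames: REDUCED_FRAMES, caption: true, embed: true }
        }
        2 => {
            metrics.frames_dropped += dropped;
            metrics.captions_skipped += 1;
            ChunkPlan { frames: REDUCED_FRAMES, caption: false, embed: true }
        }
        _ => {
            metrics.chunks_dropped += 1;
            ChunkPlan { frames: 0, caption: false, embed: false }
        }
    }
}
```

Keeping the policy in one pure function makes the "emit a metric every time" rule easy to check in a unit test.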

Disk spool on disconnect

At-least-once is cheap on disk.

When the iroh stream errors, chunks are written to a bounded ring buffer on disk and drained oldest-first on reconnect, deduped by chunk_id. Followers on flaky Wi-Fi or LTE recover transparently.
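
The spool's behavior can be modeled in memory. The sketch below swaps the on-disk ring buffer for a `VecDeque`, assumes `chunk_id` is a `u64`, and assumes that a full spool evicts its oldest frame to make room. All three are simplifications for illustration.

```rust
use std::collections::{HashSet, VecDeque};

/// In-memory stand-in for the follower's bounded on-disk spool.
pub struct Spool {
    cap_bytes: usize,
    used_bytes: usize,
    frames: VecDeque<(u64, Vec<u8>)>,
    evicted: u64,
}

impl Spool {
    pub fn new(cap_bytes: usize) -> Self {
        Spool { cap_bytes, used_bytes: 0, frames: VecDeque::new(), evicted: 0 }
    }

    /// Appends a frame, evicting the oldest frames until it fits.
    pub fn push(&mut self, chunk_id: u64, frame: Vec<u8>) {
        // A frame larger than the whole spool can never fit.
        if frame.len() > self.cap_bytes {
            self.evicted += 1;
            return;
        }
        while self.used_bytes + frame.len() > self.cap_bytes {
            match self.frames.pop_front() {
                Some((_, old)) => {
                    self.used_bytes -= old.len();
                    self.evicted += 1;
                }
                None => break,
            }
        }
        self.used_bytes += frame.len();
        self.frames.push_back((chunk_id, frame));
    }

    /// Empties the spool in arrival order, oldest first.
    pub fn drain_oldest_first(&mut self) -> Vec<(u64, Vec<u8>)> {
        self.used_bytes = 0;
        self.frames.drain(..).collect()
    }

    /// Number of frames lost to the size bound so far.
    pub fn evicted(&self) -> u64 {
        self.evicted
    }
}

/// Receiver-side filter: at-least-once delivery means a chunk can arrive
/// twice, so only the first copy of each chunk_id is accepted.
#[derive(Default)]
pub struct Dedup {
    seen: HashSet<u64>,
}

impl Dedup {
    pub fn accept(&mut self, chunk_id: u64) -> bool {
        self.seen.insert(chunk_id)
    }
}
```

A production version would also bound the dedup set, for example by forgetting chunk IDs older than the follower's 24-hour cache.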

Synthetic mode

Demo-in-two-minutes.

A --synthetic flag short-circuits Gemma and emits random vectors on the same cadence, so the transport layer and leader can be exercised end-to-end without spinning up the model at all.

What actually bit us

  • Mutex contention under load. Perception and query embedding shared the same Gemma model. We made perception fire-and-forget and gave user queries absolute priority — otherwise investigation latency ballooned into the tens of seconds.
  • Gemma's tool-call format. Cactus's JSON parser counts braces; any { inside a tool description truncated the prompt silently. Flat, brace-free descriptions unblocked the whole agent path.
  • Network is never flat. iroh handles NAT traversal and relay fallback, but the on-disk spool is what keeps a follower honest when the leader or Wi-Fi blinks. We tuned it to 4 GB by default.
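
The first fix in the list above (queries ahead of perception on a shared model) can be sketched as a two-lane queue. This is a single-threaded illustration with hypothetical names; a real leader would put it behind a mutex and condition variable or a channel, and the perception cap is an assumed parameter.

```rust
use std::collections::VecDeque;

/// A unit of work for the shared model.
#[derive(Debug, PartialEq, Eq)]
pub enum Job {
    Query(String),
    Perception(u64),
}

/// Two-lane queue: user queries always run before perception work.
pub struct ModelQueue {
    queries: VecDeque<String>,
    perception: VecDeque<u64>,
    perception_cap: usize,
    dropped: u64,
}

impl ModelQueue {
    pub fn new(perception_cap: usize) -> Self {
        ModelQueue {
            queries: VecDeque::new(),
            perception: VecDeque::new(),
            perception_cap,
            dropped: 0,
        }
    }

    pub fn submit_query(&mut self, text: &str) {
        self.queries.push_back(text.to_string());
    }

    /// Fire-and-forget: when the perception lane is full the job is
    /// dropped and counted, and the caller never waits.
    pub fn submit_perception(&mut self, chunk_id: u64) -> bool {
        if self.perception.len() >= self.perception_cap {
            self.dropped += 1;
            return false;
        }
        self.perception.push_back(chunk_id);
        true
    }

    /// Next job for the model worker, queries first.
    pub fn next_job(&mut self) -> Option<Job> {
        if let Some(q) = self.queries.pop_front() {
            return Some(Job::Query(q));
        }
        self.perception.pop_front().map(Job::Perception)
    }

    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```
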
05 · Wire protocol

Three ALPNs on one QUIC endpoint.

Every node-to-node link is a single iroh QUIC connection identified by an ALPN string, carrying length-prefixed postcard-encoded frames. The leader is the accepting side; followers dial it using a ticket printed at startup. After a Hello, the stream is long-lived and bidirectional.

Ingest · cactus/ingest/v1
Long-lived bidi stream. Postcard EmbeddingChunk frames with at-least-once + chunk_id acks.

Control · cactus/control/v1
RPC-style. GetConfig, SetConfig, Ping, RotateKeys. Health-checked every 10 s.

Clip fetch · cactus/clip/v1
On-demand. Backed by iroh-blobs for resumable, content-addressed (BLAKE3) transfer of the original MP4.

Identity · Ed25519 NodeId
Every peer is a public key. Mutual auth, pinned allowlist, no CA.

NAT traversal · pkarr + relay fallback
Direct P2P when possible, public relay when the network is hostile. Self-hosted relay is a config flag.

Resilience · On-disk ring buffer
4 GB default spool per follower. Draining is oldest-first, deduped on reconnect.
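
Routing by ALPN on the accepting side can be sketched as a plain lookup. The three ALPN strings come from the list above; the enum and function names are hypothetical, and the iroh calls that would hand over the negotiated ALPN are deliberately left out.

```rust
pub const ALPN_INGEST: &[u8] = b"cactus/ingest/v1";
pub const ALPN_CONTROL: &[u8] = b"cactus/control/v1";
pub const ALPN_CLIP: &[u8] = b"cactus/clip/v1";

/// The three protocols multiplexed on the leader's endpoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    Ingest,
    Control,
    ClipFetch,
}

/// Maps a negotiated ALPN to its handler. Unknown ALPNs return `None`
/// so the caller can close the connection.
pub fn protocol_for_alpn(alpn: &[u8]) -> Option<Protocol> {
    if alpn == ALPN_INGEST {
        Some(Protocol::Ingest)
    } else if alpn == ALPN_CONTROL {
        Some(Protocol::Control)
    } else if alpn == ALPN_CLIP {
        Some(Protocol::ClipFetch)
    } else {
        None
    }
}
```

Putting the version in the ALPN string means a future `v2` would be a distinct ALPN that this lookup rejects until a handler is added.
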
06 · Query pipeline

Natural-language retrieval — five stages, under five seconds.

A query comes in as plain English. The leader embeds it with Gemma on Cactus in text-only mode, runs filtered ANN search across the per-modality collections in ChromaDB, fuses the results with reciprocal rank fusion, and hands the top-K captions plus their (camera, timestamp) metadata back to Gemma for synthesis.

A natural-language question across every connected camera. Answer, citations, and clip playback in one pass.

Stages, in detail

Parse the question

Surveillance questions come with implicit filters — "between 3 and 5pm", "on cam-front". Before touching ChromaDB, the leader asks Gemma to rewrite the query into a JSON envelope { time_start_ms, time_end_ms, camera_ids, top_k }. Defaults: last 30 minutes, top_k = 20, cap at 50. Harmony-style thinking markers (<|channel|>, <|message|>) are stripped before JSON extraction.
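
The cleanup around the model's output can be sketched without a JSON library. The code below strips `<|...|>` markers, isolates the outermost JSON object, and applies the defaults quoted above. JSON decoding itself is omitted, so `RawEnvelope` stands in for whatever the decoder returns. All names are hypothetical, and anchoring a missing start time to 30 minutes before the end time is an assumption.

```rust
pub const DEFAULT_WINDOW_MS: i64 = 30 * 60 * 1000;
pub const DEFAULT_TOP_K: usize = 20;
pub const MAX_TOP_K: usize = 50;

/// Removes every `<|...|>` marker span from the model output.
pub fn strip_markers(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find("<|") {
        match rest[start..].find("|>") {
            Some(end_rel) => {
                out.push_str(&rest[..start]);
                rest = &rest[start + end_rel + 2..];
            }
            // Unterminated marker: keep the remainder as-is.
            None => break,
        }
    }
    out.push_str(rest);
    out
}

/// Returns the span from the first `{` to the last `}`, if any.
pub fn extract_json_object(text: &str) -> Option<&str> {
    let start = text.find('{')?;
    let end = text.rfind('}')?;
    if end < start {
        return None;
    }
    Some(&text[start..=end])
}

/// Fields as decoded from the model's JSON; any of them may be missing.
#[derive(Debug, Default)]
pub struct RawEnvelope {
    pub time_start_ms: Option<i64>,
    pub time_end_ms: Option<i64>,
    pub camera_ids: Vec<String>,
    pub top_k: Option<usize>,
}

/// Fully populated envelope handed to retrieval.
#[derive(Debug, PartialEq)]
pub struct Envelope {
    pub time_start_ms: i64,
    pub time_end_ms: i64,
    pub camera_ids: Vec<String>,
    pub top_k: usize,
}

/// Fills in the defaults: last 30 minutes, top_k = 20, capped at 50.
pub fn apply_defaults(raw: RawEnvelope, now_ms: i64) -> Envelope {
    let time_end_ms = raw.time_end_ms.unwrap_or(now_ms);
    let time_start_ms = raw.time_start_ms.unwrap_or(time_end_ms - DEFAULT_WINDOW_MS);
    let top_k = raw.top_k.unwrap_or(DEFAULT_TOP_K).clamp(1, MAX_TOP_K);
    Envelope { time_start_ms, time_end_ms, camera_ids: raw.camera_ids, top_k }
}
```

Clamping after defaulting means a model that hallucinates `top_k = 500` still produces a bounded search.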

Embed the query

Same Gemma 4 that captions incoming chunks, invoked text-only. L2-normalize, reject non-finite and zero-norm vectors, pass through a 128-entry LRU keyed on the raw query string. During demos and iteration the cache turns repeat queries into a hash lookup.
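
A small LRU keyed on the raw query string, with the validity checks described above, might look like the sketch below. It uses only the standard library and a linear scan to refresh recency, which is fine at 128 entries. The type and method names are hypothetical, and the `embed` closure stands in for the Gemma call.

```rust
use std::collections::{HashMap, VecDeque};

/// L2-normalizes `v`, rejecting non-finite components and zero norm.
pub fn normalize_checked(v: &[f32]) -> Option<Vec<f32>> {
    if v.iter().any(|x| !x.is_finite()) {
        return None;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 || !norm.is_finite() {
        return None;
    }
    Some(v.iter().map(|x| x / norm).collect())
}

/// Least-recently-used cache from raw query string to unit vector.
pub struct QueryCache {
    cap: usize,
    map: HashMap<String, Vec<f32>>,
    order: VecDeque<String>, // front = least recently used
}

impl QueryCache {
    pub fn new(cap: usize) -> Self {
        QueryCache { cap: cap.max(1), map: HashMap::new(), order: VecDeque::new() }
    }

    /// Returns the cached vector, or calls `embed`, validates the result,
    /// and caches it. Invalid embeddings are never stored.
    pub fn get_or_embed<F>(&mut self, query: &str, embed: F) -> Option<Vec<f32>>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        if let Some(hit) = self.map.get(query).cloned() {
            self.touch(query);
            return Some(hit);
        }
        let vector = normalize_checked(&embed(query))?;
        if self.map.len() >= self.cap {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(query.to_string(), vector.clone());
        self.order.push_back(query.to_string());
        Some(vector)
    }

    /// Marks `query` as most recently used.
    fn touch(&mut self, query: &str) {
        if let Some(pos) = self.order.iter().position(|k| k.as_str() == query) {
            if let Some(key) = self.order.remove(pos) {
                self.order.push_back(key);
            }
        }
    }
}
```

Constructing it with `QueryCache::new(128)` matches the capacity quoted above.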

Modality-aware ANN

Two ChromaDB collections — video-clips and audio-clips — indexed under cosine distance. The where filter is assembled from the parsed envelope; each collection is queried with n_results = top_k * 2 to leave headroom for fusion. If ChromaDB is unreachable, the leader degrades to a brute-force cosine scan over the in-memory store. The API contract doesn't change.

{
  "start_ts_ms": { "$lte": end_ms },
  "end_ts_ms":   { "$gte": start_ms },
  "camera_id":   { "$in": ["cam-front", "cam-lab"] }
}
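
The filter above is an interval-overlap test plus a camera allowlist, and the brute-force fallback has to apply the same predicate. The sketch below assumes embeddings are already L2-normalized (so a dot product is the cosine similarity) and that an empty camera list means "all cameras". The struct and function names are hypothetical.

```rust
/// One indexed chunk in the in-memory store.
pub struct Chunk {
    pub camera_id: String,
    pub start_ts_ms: i64,
    pub end_ts_ms: i64,
    pub embedding: Vec<f32>,
}

/// Same predicate as the where filter: the chunk overlaps the time range
/// and its camera is in the allowlist.
pub fn matches_filter(chunk: &Chunk, start_ms: i64, end_ms: i64, camera_ids: &[&str]) -> bool {
    let overlaps = chunk.start_ts_ms <= end_ms && chunk.end_ts_ms >= start_ms;
    let camera_ok =
        camera_ids.is_empty() || camera_ids.iter().any(|c| *c == chunk.camera_id.as_str());
    overlaps && camera_ok
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

/// Exhaustive cosine scan. Returns (index into `store`, similarity)
/// pairs, best first, truncated to `n_results`.
pub fn brute_force_search(
    store: &[Chunk],
    query: &[f32],
    start_ms: i64,
    end_ms: i64,
    camera_ids: &[&str],
    n_results: usize,
) -> Vec<(usize, f32)> {
    let mut hits: Vec<(usize, f32)> = store
        .iter()
        .enumerate()
        .filter(|(_, c)| matches_filter(c, start_ms, end_ms, camera_ids))
        .map(|(i, c)| (i, dot(query, &c.embedding)))
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(n_results);
    hits
}
```

Filtering before scoring keeps the scan proportional to the number of chunks inside the requested window rather than the whole store.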

Fuse · RRF + caption boost

Video and audio hit-lists are merged with Reciprocal Rank Fusion (RRF_K = 60), scaled by LEADER_RRF_WEIGHT (default 2.0). A caption-overlap boost (capped at 0.15) keeps exact phrases like "red backpack" from drowning in near-neighbors.

score = modality_aware_cosine(q, chunk)
      + rrf_weight * Σ 1 / (RRF_K + rank_in_modality)
      + caption_boost
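
The scoring formula above leaves a few details open, so the sketch below fills them with labeled assumptions: ranks are 1-based, "modality-aware cosine" is taken as the best cosine a chunk achieved in either list, and the caption boost is the fraction of query terms found in the caption, scaled to the 0.15 cap. Function names and the `u64` chunk ID are hypothetical.

```rust
use std::collections::HashMap;

pub const RRF_K: f32 = 60.0;
pub const RRF_WEIGHT: f32 = 2.0; // LEADER_RRF_WEIGHT default
pub const CAPTION_BOOST_CAP: f32 = 0.15;

/// Fraction of query terms that appear in the caption, scaled to the cap.
/// The exact overlap measure is an assumption.
pub fn caption_boost(query: &str, caption: &str) -> f32 {
    let caption_lc = caption.to_lowercase();
    let terms: Vec<String> = query.split_whitespace().map(|t| t.to_lowercase()).collect();
    if terms.is_empty() {
        return 0.0;
    }
    let hits = terms.iter().filter(|t| caption_lc.contains(t.as_str())).count();
    CAPTION_BOOST_CAP * hits as f32 / terms.len() as f32
}

/// Fuses two ranked hit-lists of (chunk_id, cosine), each sorted best
/// first, into one list of (chunk_id, score), best first.
pub fn fuse(
    video: &[(u64, f32)],
    audio: &[(u64, f32)],
    captions: &HashMap<u64, String>,
    query: &str,
) -> Vec<(u64, f32)> {
    let mut cosine: HashMap<u64, f32> = HashMap::new();
    let mut rrf: HashMap<u64, f32> = HashMap::new();
    for list in [video, audio].iter() {
        for (rank0, (id, cos)) in list.iter().enumerate() {
            let best = cosine.entry(*id).or_insert(f32::MIN);
            *best = (*best).max(*cos);
            // Reciprocal rank term, with 1-based rank within this modality.
            *rrf.entry(*id).or_insert(0.0) += 1.0 / (RRF_K + (rank0 + 1) as f32);
        }
    }
    let mut scored: Vec<(u64, f32)> = cosine
        .iter()
        .map(|(id, cos)| {
            let boost = captions.get(id).map(|c| caption_boost(query, c)).unwrap_or(0.0);
            (*id, *cos + RRF_WEIGHT * rrf[id] + boost)
        })
        .collect();
    // Highest score first; ties broken by chunk_id for a stable order.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    scored
}
```

A chunk that appears in both lists collects two reciprocal-rank terms, which is what lets agreement between modalities outrank a single strong hit.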

Synthesize the answer

Top 10 chunks flatten into a numbered caption list and get passed to Gemma with max_tokens = 256, temperature = 0.2, and a terse system prompt requiring citations in the form (#2, cam-lab). Only captions go in — an earlier version re-attached JPEGs and Gemma started describing the scene instead of answering the question. Text-only was faster and more accurate.
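
Flattening hits into a numbered caption list can be sketched as below. The line layout, struct names, and the idea of carrying the sampling parameters in a struct are assumptions; only the values (10 chunks, max_tokens = 256, temperature = 0.2) come from the text, and the system prompt itself is not reproduced.

```rust
/// One retrieved chunk, reduced to what synthesis needs.
pub struct Hit {
    pub camera_id: String,
    pub start_ts_ms: i64,
    pub caption: String,
}

/// Sampling parameters quoted in the text.
pub struct SynthParams {
    pub max_tokens: u32,
    pub temperature: f32,
}

pub const SYNTH_PARAMS: SynthParams = SynthParams { max_tokens: 256, temperature: 0.2 };
pub const MAX_CONTEXT_CHUNKS: usize = 10;

/// Renders the top hits as a numbered, text-only context block. The
/// numbers are what the model cites back, for example (#2, cam-lab).
pub fn build_context(hits: &[Hit]) -> String {
    hits.iter()
        .take(MAX_CONTEXT_CHUNKS)
        .enumerate()
        .map(|(i, h)| format!("#{} {} @ {} ms: {}", i + 1, h.camera_id, h.start_ts_ms, h.caption))
        .collect::<Vec<String>>()
        .join("\n")
}
```

Because the numbering is assigned at render time, the UI can map a citation such as `#2` straight back to the second element of the same hit list.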

Latency budget

Targets we held through the demo, on a single Mac mini against a week of 8-camera data.

POST /api/query · p95 under 5 s
parse + embed · ~200 ms
ANN + filter · ~300 ms
RRF fuse · ~100 ms
Gemma synth · ~3.8 s
total · p95 · ~4.9 s

The UI splits the request: /api/search returns hits only (~100–500 ms) so the citation grid lights up immediately, then /api/answer runs synthesis in the background. Perceived latency is the cheap half.

07 · Key decisions

What 24 hours taught us.

Shipping a distributed multimodal RAG system in a weekend forced decisions that would normally get over-engineered. Four we'll carry forward.

I · On-device multimodal is genuinely ready.

Gemma 4 on Cactus produces useful video and audio embeddings on consumer hardware, fast enough for soft-real-time ingest. The "send raw frames to a hosted vision model" era of video understanding is ending. The interesting systems push inference to the edge and only centralize vectors.

II · P2P transport is a superpower for enterprise demos.

Plugging four laptops into four different networks and watching them all connect to one leader with zero configuration — no port forwarding, no VPN, no firewall rules — was the single most compelling moment of the demo. Judges and B2B buyers latched onto it fastest.

III · The hard part is the retrieval, not the model.

Swapping embedding cadence or the synthetic backend was easy. Getting multi-camera, time-filtered, multimodal retrieval to produce good top-K results took the most iteration. Window size, caption quality, and fusion weights all mattered more than any single knob on the model.

IV · Feature flags are how you demo fast.

Having --synthetic, --no-camera, and a cactus feature flag meant the system ran end-to-end on any machine in the room within two minutes. That flexibility was worth more during the hackathon than any single piece of the pipeline.

08 · Recognition

Three tracks. One weekend.

Lookout was built in ~24 hours at the Cactus × DeepMind × YC hackathon. It took every prize it was eligible for.

Track · B2B
Best On-Device Enterprise Agent

Highest commercial viability for offline enterprise tools. Judged on real-world problem fit and privacy-first architecture — the second the "raw video never leaves the follower" slide went up, buyers leaned in.

Track · Technical
Deepest Technical Integration

For pushing the hardware/software stack — on-device multimodal embeddings, custom P2P wire protocol, cross-language leader/follower, and an agent that runs fully offline when it has to.

Overall · Grand Prize
Best Overall Project

Winner takes all — correctness of the MVP, quality of the demo, and venture-scale potential across the whole judging rubric.

Grand-prize benefit
Guaranteed Y Combinator interview
Awarded to the winning team. Plus GCP credits across all three tracks.
S26 track