screen-capture-plugin
screen-capture-plugin is the operator's macOS desktop screen-capture system: a Swift capture daemon producing per-window HEIC bundles, plus a Python harness that batches frames and summarizes them through a pluggable provider (currently Gemini via AI Studio, formerly a Codex shell-out). It is the desktop sibling of Phonecapture inside the egocentric capture stack, and the operator's own implementation, deliberately distinct from Screenpipe. As of April 24, 2026, the system runs in production at --fps 0.05, --scale 3072, --min-edge 1280, OCR off, HEIC q=0.8, with topology-then-pixel as the perception architecture and per-window batches as the canonical evidence shape.
Origin
screen-capture-plugin was scaffolded inside the demo sprint on April 22, 2026, in a worktree whose header read "Demo in 2 hours". The repo lives at /Users/paulhan/dev/screen-capture-plugin-public. The opening 03:00 framing on April 23 ("great we were able to show the demo and it helped a lot ... now we need to quickly prototype the whole system and expand this demo") set the pace for the day's eight-of-nine-hour push that turned the worktree into a working capture pipeline. Three architectural commitments anchored that day: per-window batches (not full-screen frames) as the canonical unit, Apple Vision feature-print dedup (not dHash), and Gemini via AI Studio (not Vertex, after Vertex returned 404 across 14 model names × 4 regions).
Role in 1Context
screen-capture-plugin produces the desktop capture stream that 1Context's memory pipeline ingests. It is one of three capture surfaces (alongside Phonecapture for Android-via-HDMI and the Mac side of Puter for screen / audio / UI events), distinguished by being the operator's own implementation with specific architectural commitments rather than a fork of an existing recorder. Where Screenpipe is the inherited upstream comparand and Puter is the productized desktop recorder, screen-capture-plugin is the experimentation surface where capture-pipeline architecture is tested before any of those decisions cross into the productized layers.
The system embodies the broader 1Context disposition of "cheap inputs, expensive processing": cheap topology stream at 5–20 Hz infers attention without invasive permissions (no Accessibility, no event taps, no keystroke logging), then sparse semantic-pixel and Gemini calls fire only on what topology flags as meaningful.
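The "cheap inputs, expensive processing" gate can be sketched as a small predicate: the topology stream ticks at 5–20 Hz, and pixel capture (with its downstream Gemini cost) fires only when the cheap observation changes. The `WindowTopology` record and field names below are hypothetical stand-ins, not the daemon's actual types.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WindowTopology:
    """Cheap per-tick observation (hypothetical shape): which window is
    frontmost and where it sits. No Accessibility, no event taps needed."""
    window_id: int
    title: str
    bounds: Tuple[int, int, int, int]  # (x, y, w, h)


def should_fire_pixel_capture(prev: Optional[WindowTopology],
                              cur: WindowTopology) -> bool:
    # Expensive pixel capture and Gemini calls fire only on what topology
    # flags as meaningful: first sight of a window, a focus change, or a
    # geometry change. Identical consecutive ticks cost nothing downstream.
    if prev is None:
        return True
    return cur.window_id != prev.window_id or cur.bounds != prev.bounds
```

A title-only change deliberately does not fire here; whether title churn counts as "meaningful" is exactly the kind of threshold the topology trial is meant to settle.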
History
April 23 was the load-bearing day. Hour 02 introduced harness/providers.py as a two-backend abstraction (CodexProvider / GeminiProvider); hour 03 integrated Gemini 3.1 Flash-Lite Preview via an AI Studio API key (commit 3f6402d; 84 frames in 46s wall-clock versus ~144s serial, a 3.1× speedup); hour 04 rewrote harness/dedupe.py around VNGenerateImageFeaturePrintRequest; hour 05 raised the default resolution from 1080 to 1920 longest-edge after Paul caught that captures were coming out at 1080×555 because the scale knob was misnamed (commit 590dc02; 4 files, +651 −112).
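The two-backend abstraction and the batching speedup fit in a few lines. This is a minimal sketch, not the real harness/providers.py interface: the `Provider` base class, the stub `summarize` body, and the worker count are all assumptions; a real GeminiProvider would ship HEIC bytes to the AI Studio API.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import List


class Provider(ABC):
    """Pluggable summarization backend (sketch of the two-backend idea)."""

    @abstractmethod
    def summarize(self, frames: List[bytes]) -> str:
        ...


class GeminiProvider(Provider):
    def summarize(self, frames: List[bytes]) -> str:
        # Stub: the real backend would POST the frame batch to AI Studio.
        return f"summary of {len(frames)} frames"


def summarize_batches(provider: Provider,
                      batches: List[List[bytes]],
                      workers: int = 4) -> List[str]:
    # Fanning per-window batches out concurrently, instead of walking them
    # serially, is the shape of the ~144s -> 46s wall-clock improvement.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(provider.summarize, batches))
```

Swapping the old Codex shell-out back in would mean one more `Provider` subclass, with no change to the batching code.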
Hour 06 added scripts/debug_frame.py (per-stage dumps: raw HEIC, decoded PNG, gallery render, model-input bytes) and reframed the resolution question. Hour 07 landed the integer-ratio resolution policy in swift/screen-capture/screen_capture.swift (backingScaleFor, max_edge=3072, min_edge=1280); the original "blurry capture" complaint that triggered the resolution work turned out to be a 5K-display compositor artifact. Hour 08 chose Pipeline C (straight Gemini from HEIC) after diagnosing Apple Vision OCR's recall ceiling on dense GUI text (commit bae9662). Hour 09 rewrote harness/prompts/desktop_summary.md as example-driven, after rejecting a typed-schema v2 (commit 0660607; 3 files, +528 −279).
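One plausible reading of the integer-ratio policy, sketched here in Python rather than the Swift original: downscale the native backing-store pixels by 1/n for the smallest integer n that brings the longest edge under max_edge, and refuse any n that would push the shortest edge below min_edge. The function name mirrors backingScaleFor, but this body is a reconstruction from the knobs named in the text, not the actual implementation.

```python
def backing_scale_for(native_w: int, native_h: int,
                      max_edge: int = 3072, min_edge: int = 1280) -> float:
    """Hypothetical integer-ratio scale policy (assumed semantics).

    Integer ratios keep resampling a clean pixel-grid divide, which avoids
    the softness that fractional scale factors introduce on Retina output.
    """
    long_edge = max(native_w, native_h)
    short_edge = min(native_w, native_h)
    n = 1
    # Step to the next integer divisor only while the long edge still
    # overshoots max_edge AND the short edge would stay above min_edge.
    while long_edge / n > max_edge and short_edge / (n + 1) >= min_edge:
        n += 1
    return 1.0 / n
```

Under these assumed semantics a 5K panel (5120×2880) lands on a 1/2 ratio (2560×1440), while a display whose short edge cannot survive halving stays at native resolution even if its long edge exceeds max_edge.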
April 24 closed with a fresh Codex session validating Chrome capture end-to-end, including the validate_chrome_gemini_loop.py harness scoring three 2940×1658 HEICs of Chrome fixture windows with perfect fixture scores plus the librarian codeword. The hour-05 perception-architecture decision on April 24 (no Accessibility, no event taps, no raw keystrokes; per-window pixels as canonical evidence; M-series Apple Silicon as the target; 1 fps as a hard upper bound) pinned the topology-then-pixel shape as production discipline rather than as an early prototype choice.
Current State
As of April 24, 2026, the system runs in production with the post-04-23 defaults: --fps 0.05, --scale 3072, --min-edge 1280, OCR off, HEIC q=0.8. The .capture-perception-5min topology trial (5Hz topology / 0.2Hz pixel) was running at end of day on April 24; its results were still uninspected. Per-window batches are the canonical evidence unit. Pipeline C (straight Gemini from HEIC) is the production summarization path; Pipeline B (Apple OCR transcript → Gemini) is documented but rejected.
Relationship to Other Subjects
screen-capture-plugin sits in the capture layer of 1Context alongside Phonecapture (Android-via-HDMI sibling) and Puter (productized desktop recorder; different lineage and different product contract). It depends on Apple Vision (VNGenerateImageFeaturePrintRequest for dedup) and Gemini (semantic summarization via AI Studio). It is deliberately distinct from Screenpipe, which it does not fork or inherit from. Adjacent comparands include the open-source screencapturekit project and Apple's SCShareableContent APIs, which the Swift daemon wraps.
Open Questions
The SCShareableContent.windows ordering gotcha (the array is neither z-order nor focus order) broke the first ranked-scheduler pass; it was patched but never fully characterized, so future ranking changes need to re-test against the same Chrome/YouTube failure case. The 5Hz topology / 0.2Hz pixel trial result was uninspected at end of day on April 24. If Apple Vision's OCR ceiling lifts in a future macOS, or non-demo cost pressure forces a Pipeline B revival, the reversal should be tracked here. The <FOR LIBRARIAN>-requested screencapture-system sub-article that Paul asked for at 05:56 on April 23 has not yet been written; only concept-page proposals exist downstream.
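Given that the array order cannot be trusted, one defensive shape for any future ranking pass is to sort on explicit window properties instead. A minimal Python sketch, with dict keys (`on_screen`, `layer`, `width`, `height`) standing in for SCWindow's isOnScreen, windowLayer, and frame; the ranking criteria here are illustrative, not the patched scheduler's actual policy.

```python
from typing import Dict, List


def rank_windows(windows: List[Dict]) -> List[Dict]:
    """Rank candidate windows without relying on array order.

    Illustrative policy: on-screen windows first, normal-layer (0) windows
    before overlays and panels, then larger windows before smaller ones.
    """
    def key(w: Dict):
        area = w["width"] * w["height"]
        # Python sorts tuples lexicographically; False sorts before True,
        # so `not on_screen` puts visible windows first.
        return (not w["on_screen"], abs(w["layer"]), -area)

    return sorted(windows, key=key)
```

Pinning the criteria in one comparator also gives the re-test against the Chrome/YouTube failure case a single place to assert on.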