Apple Vision
Apple Vision is Apple's on-device computer-vision API surface (the Vision framework), distinct from the Vision Pro headset. Inside 1Context it matters as the Apple-Silicon-native alternative to sending image bytes to a hosted vision model. As of April 24, 2026, the operator's documented stance is mixed: Vision's VNGenerateImageFeaturePrintRequest is adopted for image deduplication in screen-capture-plugin, but its OCR surface (RecognizeDocumentsRequest, VNRecognizeTextRequest) was tested against dense GUI text, found insufficient, and bypassed by routing the desktop-capture pipeline through Gemini instead.
Origin
Apple Vision entered 1Context's working set on April 23, 2026, when Paul asked "is there an apple vision plugin that can do this without a real hash" during a screen-capture-plugin dedup pass. The dHash dedup in harness/dedupe.py was replaced with VNGenerateImageFeaturePrintRequest plus L2 distance the same hour. At 05:56 UTC Paul addressed a <FOR LIBRARIAN> block explicitly naming the Vision request types and asking they be recorded in a screencapture-system sub-article — one operator-pinned piece of evidence sufficient on its own to propose the page.
Role in 1Context
Apple Vision sits in the perception layer of screen-capture-plugin alongside Gemini. The split is empirical, not architectural: Vision handles cheap image-similarity work on-device (VNGenerateImageFeaturePrintRequest-based dedup ships in harness/dedupe.py), while semantic extraction goes to Gemini because Vision's OCR ceiling is too low for desktop GUI density. The shape fits Paul's broader engineering philosophy of "cheap inputs, expensive processing": when a local API can carry the cheap inputs, use it; when the expensive processing requires sending pixels off-device, send pixels.
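The dedup primitive above can be sketched minimally; this assumes nothing about the actual structure of harness/dedupe.py. Vision turns each frame into a feature print, and a frame is kept only when its L2 distance to the last kept frame clears a threshold (the 0.5 default here is illustrative, not the plugin's value). The pure-Swift dedup logic stands alone; the Vision calls compile only where the framework is available.

```swift
#if canImport(Vision)
import Vision
import CoreGraphics

// Vision side (macOS): one feature print per frame, compared with
// Vision's own distance metric.
func featurePrint(for image: CGImage) throws -> VNFeaturePrintObservation? {
    let request = VNGenerateImageFeaturePrintRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

func visionDistance(_ a: VNFeaturePrintObservation,
                    _ b: VNFeaturePrintObservation) throws -> Float {
    var d: Float = 0
    try a.computeDistance(&d, to: b)
    return d
}
#endif

// Plain L2 distance, standing in for Vision's metric on raw vectors.
func l2(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(0) { $0 + ($1.0 - $1.1) * ($1.0 - $1.1) }.squareRoot()
}

// Keep a frame only if it is farther than `threshold` from the last
// frame that was *kept*, not merely the previous frame in the stream.
func dedup(_ frames: [[Float]], threshold: Float = 0.5) -> [Int] {
    guard var last = frames.first else { return [] }
    var kept = [0]
    for (i, f) in frames.enumerated().dropFirst() where l2(f, last) > threshold {
        kept.append(i)
        last = f
    }
    return kept
}
```

Comparing against the last kept frame (rather than the immediately previous one) is what prevents a slow drift of near-identical frames from all surviving dedup.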
History
The April 23 OCR investigation (commit bae9662) is the load-bearing event. RecognizeDocumentsRequest (WWDC25, macOS 26 structural OCR) was integrated in swift/screen-capture/screen_capture.swift for .ocr.text.json sidecar output; a first build broke on "DocumentObservation.Container.DataDetectorMatch has no member 'matched'" and was simplified to ship clean. VNRecognizeTextRequest was added as fallback. Both were tested against real captures: RecognizeDocumentsRequest returned 1,871 chars from a frame visibly containing several thousand, and VNRecognizeTextRequest reached only 2,645. "Apple OCR plateaus here regardless of contrast tricks" was the 07:58 read. Pipeline B (small thumbnail + Apple OCR transcript → Gemini) was rejected at 08:27 in favor of Pipeline C (straight Gemini from HEIC). The dedup adoption stuck; the OCR adoption did not.
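The VNRecognizeTextRequest fallback tested above looks roughly like the sketch below. This illustrates the Vision API at accurate recognition level, not the actual screen_capture.swift code; disabling language correction is an assumption about GUI strings, and the `transcript` helper is illustrative.

```swift
#if canImport(Vision)
import Vision
import CoreGraphics

// Fallback OCR path: accurate-level recognition, top candidate per
// observed text line. A sketch of the API, not the plugin's code.
func recognizeText(in image: CGImage) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = false   // GUI strings, not prose (assumption)
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    let lines = (request.results ?? []).compactMap { $0.topCandidates(1).first?.string }
    return transcript(lines)
}
#endif

// Platform-independent: join recognized lines into one transcript string,
// the kind of output whose character count was measured on April 23.
func transcript(_ lines: [String]) -> String {
    lines.joined(separator: "\n")
}
```

The character counts in the tests above would be measured on exactly this kind of joined transcript, which is why line-merge behavior matters when comparing recall across OCR engines.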
VNGenerateAttentionBasedSaliencyImageRequest and VNCalculateImageAestheticsScoresRequest were researched the same day but not adopted.
Current State
As of April 24, 2026, VNGenerateImageFeaturePrintRequest is the adopted dedup primitive in screen-capture-plugin. Vision OCR is rejected for the dense-GUI-text case as of macOS 26 — the operator's documented stance is that on-device OCR recall is insufficient for screen captures of typical work surfaces, and the Pipeline C HEIC-direct-to-Gemini route is the production default (--fps 0.05, --scale 3072, --min-edge 1280, OCR off). If Pipeline B comes back — non-demo cost pressure, OCR improvements in a future macOS — that reversal should be tracked here.
Relationship to Other Subjects
Apple Vision is paired with screen-capture-plugin (where the dedup shipped and the OCR was rejected) and Gemini (the alternative that won on the OCR axis). It processes HEIC natively, the same format Gemini accepts directly. It is deliberately distinct from "Apple Vision Pro" (the headset), which has not appeared in 1Context's working set.
Open Questions
The 07:59 thread Paul left open is unresolved: "any other OCR tools native to apple silicon (so not CPU that uses so much) that actually will work with these desktop activity things." The non-CPU constraint matters because the production capture loop runs continuously; CPU-bound OCR is a battery and thermal cost the deployment can't easily absorb. Whether VNGenerateAttentionBasedSaliencyImageRequest or VNCalculateImageAestheticsScoresRequest could augment the topology-inference layer (5–20 Hz topology, sparse semantic-pixel calls) without another full pipeline pivot has not been tested.
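If the saliency question is ever tested, the probe could be as small as the sketch below: one attention-based saliency request per frame, its salient-object boxes filtered by area before they reach the topology layer. Function names and the area floor are illustrative, and whether this is cheap enough for the 5–20 Hz loop has not been measured.

```swift
#if canImport(Vision)
import Vision
import CoreGraphics

// One saliency request per frame; normalized bounding boxes are far
// cheaper to hand to a topology layer than the full pixel heat map.
func salientRegions(in image: CGImage) throws -> [CGRect] {
    let request = VNGenerateAttentionBasedSaliencyImageRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    guard let obs = request.results?.first as? VNSaliencyImageObservation else {
        return []
    }
    return (obs.salientObjects ?? []).map(\.boundingBox)
}
#endif

import Foundation

// Platform-independent: drop regions too small to matter for topology.
// The 1% area floor is an illustrative default, not a tested value.
func significant(_ boxes: [CGRect], minArea: CGFloat = 0.01) -> [CGRect] {
    boxes.filter { $0.width * $0.height >= minArea }
}
```

Because the boxes are normalized to the unit square, the area filter is resolution-independent, which matters if capture scale settings like --scale 3072 change later.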