---
title: "Apple Vision"
slug: apple-vision
section: reference
access: public
summary: "Apple Vision is Apple's on-device computer-vision API surface (the Vision framework), distinct from the Vision Pro headset. Inside 1Context it matters as the Apple-Silicon-native alternative to sending image bytes to a hosted vision model. As of April 24, 2026, the operator's doc…"
status: published
asset_base: /assets
home_href: /
toc_enabled: true
talk_enabled: false
agent_view_enabled: true
copy_buttons_enabled: true
footer_enabled: true
last_updated: 2026-04-29
categories: [Tools, Engineering]
subject-type: tool
last-reinforced: 2026-04-29
fading-since: null
archived: false
---

## Apple Vision

Apple Vision is Apple's on-device computer-vision API surface (the [Vision framework](https://developer.apple.com/documentation/vision)), distinct from the Vision Pro headset. Inside [1Context](/1context) it matters as the Apple-Silicon-native alternative to sending image bytes to a hosted vision model. As of April 24, 2026, the operator's documented stance is mixed: Vision's `VNGenerateImageFeaturePrintRequest` is adopted for image deduplication in the [screen-capture-plugin](/screen-capture-plugin), but its OCR surface (`RecognizeDocumentsRequest`, `VNRecognizeTextRequest`) was tested against dense GUI text and rejected as insufficient, so the desktop-capture pipeline runs through [Gemini](https://ai.google.dev/) instead.

### Origin

Apple Vision entered 1Context's working set on April 23, 2026, when Paul asked *"is there an apple vision plugin that can do this without a real hash"* during a screen-capture-plugin dedup pass. The dHash dedup in `harness/dedupe.py` was replaced with `VNGenerateImageFeaturePrintRequest` plus L2 distance within the hour. At 05:56 UTC Paul addressed a `<FOR LIBRARIAN>` block explicitly naming the Vision request types and asking that they be recorded in a screencapture-system sub-article, a single operator-pinned piece of evidence sufficient on its own to propose the page.

### Role in 1Context

Apple Vision sits in the perception layer of [screen-capture-plugin](/screen-capture-plugin) alongside [Gemini](https://ai.google.dev/). The split is empirical, not architectural: Vision handles cheap image-similarity work on-device (`VNGenerateImageFeaturePrintRequest`-based dedup ships in `harness/dedupe.py`), while semantic extraction goes to Gemini because Vision's OCR ceiling is too low for desktop GUI density. The shape fits Paul's broader engineering philosophy of "cheap inputs, expensive processing": when a local API can carry the cheap inputs, use it; when the expensive processing requires sending pixels off-device, send pixels.
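
A minimal sketch of that dedup primitive, in Swift rather than the shipped `harness/dedupe.py` (where the production version lives); file names, the helper, and the threshold value are illustrative, and `computeDistance` exposes the metric the notes above call L2 distance:

```swift
import Vision
import Foundation

// Compute a Vision feature print for one image file on disk.
// Hypothetical helper; not the code shipped in harness/dedupe.py.
func featurePrint(for url: URL) throws -> VNFeaturePrintObservation? {
    let request = VNGenerateImageFeaturePrintRequest()
    let handler = VNImageRequestHandler(url: url)
    try handler.perform([request])
    return request.results?.first
}

// Treat two frames as duplicates when their feature prints are closer
// than a threshold. 0.5 is an illustrative value, not the production one.
func isDuplicate(_ a: URL, _ b: URL, threshold: Float = 0.5) throws -> Bool {
    guard let printA = try featurePrint(for: a),
          let printB = try featurePrint(for: b) else { return false }
    var distance: Float = 0
    try printA.computeDistance(&distance, to: printB)
    return distance < threshold
}
```

The shape of the API is why it beat dHash here: the feature print is a learned embedding computed on-device, so near-duplicate frames that differ by a cursor blink or a clock tick land within the threshold without any hand-tuned hash.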

### History

The April 23 OCR investigation (commit `bae9662`) is the load-bearing event. `RecognizeDocumentsRequest` (WWDC25, macOS 26 structural OCR) was integrated in `swift/screen-capture/screen_capture.swift` for `.ocr.text.json` sidecar output; the first build broke on `DocumentObservation.Container.DataDetectorMatch has no member 'matched'` and was simplified until it shipped clean. `VNRecognizeTextRequest` was added as a fallback. Both were tested against real captures: `RecognizeDocumentsRequest` returned 1,871 chars from a frame visibly containing several thousand, and the `VNRecognizeTextRequest` fallback reached only 2,645, still well short of what was on screen. *"Apple OCR plateaus here regardless of contrast tricks"* was the 07:58 read. Pipeline B (small thumbnail + Apple OCR transcript → Gemini) was rejected at 08:27 in favor of Pipeline C (straight Gemini from HEIC). The dedup adoption stuck; the OCR adoption did not.
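
For reference, the rejected fallback path looked roughly like this; a minimal sketch assuming a single frame on disk, not the exact code from `swift/screen-capture/screen_capture.swift`:

```swift
import Vision
import Foundation

// Run the VNRecognizeTextRequest fallback over one capture and return the
// flattened transcript, the quantity the char counts above were measured on.
func fallbackTranscript(for url: URL) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate     // dense GUI text: favor accuracy
    request.usesLanguageCorrection = false   // GUI strings are not prose
    let handler = VNImageRequestHandler(url: url)
    try handler.perform([request])
    return (request.results ?? [])
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")
}

// The plateau was read off the transcript length:
// let chars = try fallbackTranscript(for: frameURL).count
```

Even with language correction off and the accurate recognition level, the transcript length is where the plateau showed up, which is what made the 07:58 read a measurement rather than an impression.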

`VNGenerateAttentionBasedSaliencyImageRequest` and `VNCalculateImageAestheticsScoresRequest` were researched the same day but not adopted.

### Current State

As of April 24, 2026, `VNGenerateImageFeaturePrintRequest` is the adopted dedup primitive in `screen-capture-plugin`. Vision OCR is rejected for the dense-GUI-text case as of macOS 26 — the operator's documented stance is that on-device OCR recall is insufficient for screen captures of typical work surfaces, and the Pipeline C HEIC-direct-to-[Gemini](https://ai.google.dev/) route is the production default (`--fps 0.05`, `--scale 3072`, `--min-edge 1280`, OCR off). If Pipeline B comes back — non-demo cost pressure, OCR improvements in a future macOS — that reversal should be tracked here.

### Relationship to Other Subjects

Apple Vision is paired with [screen-capture-plugin](/screen-capture-plugin) (where the dedup shipped and the OCR was rejected) and [Gemini](https://ai.google.dev/) (the alternative that won on the OCR axis). It processes HEIC natively, the same format Gemini accepts directly. It is deliberately distinct from "Apple Vision Pro" (the headset), which has not appeared in 1Context's working set.

### Open Questions

The 07:59 thread Paul left open is unresolved: *"any other OCR tools native to apple silicon (so not CPU that uses so much) that actually will work with these desktop activity things."* The non-CPU constraint matters because the production capture loop runs continuously; CPU-bound OCR is a battery and thermal cost the deployment can't easily absorb. Whether `VNGenerateAttentionBasedSaliencyImageRequest` or `VNCalculateImageAestheticsScoresRequest` could augment the topology-inference layer (5–20 Hz topology, sparse semantic-pixel calls) without another full pipeline pivot has not been tested.
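
If that augmentation is ever probed, the saliency call itself is cheap to sketch. A hypothetical Swift probe, assuming the topology layer could consume normalized bounding boxes; this is an untested idea from the open question above, not a shipped path:

```swift
import Vision
import Foundation

// Ask Vision where attention concentrates in a frame and return the
// salient regions as normalized bounding boxes (origin at bottom-left).
// Untested sketch; whether these boxes help topology inference is open.
func salientRegions(for url: URL) throws -> [CGRect] {
    let request = VNGenerateAttentionBasedSaliencyImageRequest()
    let handler = VNImageRequestHandler(url: url)
    try handler.perform([request])
    guard let observation = request.results?.first else { return [] }
    return (observation.salientObjects ?? []).map { $0.boundingBox }
}
```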
