Archeglyph
Guide · review · ocr · vlm · how-to

Reviewing a noisy scan

A walkthrough of the review screen on a low-quality scan: what to look for, how to read the confidence tint, and when to re-run a region — or the whole page — with a VLM instead.

By Dipankar

On this page
  1. Before you open the review screen
  2. The review screen, at a glance
  3. Reading the signals
  4. The keyboard rhythm
  5. When to re-run a region
  6. When to re-run the whole document
  7. When to give up and re-scan
  8. A suggested workflow on a tough batch

Eventually every digital humanities pipeline meets the scan it cannot quite read. Paper that was already foxed before 1950, microfilm that was printed hot, a colonial-era plate whose register drifted during capture — these documents are the reason a reviewer seat exists at all. This guide walks through how Archeglyph’s review screen handles a bad scan and how to decide, region by region, whether to accept, edit, or re-run.

Before you open the review screen

Open the dataset’s Settings page and check the extraction engine. If the dataset was extracted with Tesseract and you are about to triage a batch of scans you know are noisy, you have two options: leave the default and fix regions individually on the review screen (cheap but slow), or switch the default to a VLM for the whole dataset (expensive but systematic). This guide assumes the first — you’re keeping the default and fixing the worst offenders on a per-region basis.

The review screen, at a glance

When you open a document, the review screen splits into two columns:

  • Left: the scan. The source image with layout regions overlaid as bounding boxes. Hover any box and the corresponding text card on the right scrolls into view and tints. Click a box to activate it.
  • Right: the cards. One card per region, in reading order, with the extracted text in a textarea, the provenance badge below, and an Accept button. Regions the extractor flagged low-confidence render in a warn-orange tint; high-confidence regions stay muted ink.

On a clean scan almost every card is muted; you skim, accept, move on. On a noisy scan the column of warn-orange tints is what you will notice first.

Reading the signals

Three signals together tell you whether a card needs work:

  1. Card tint. Warn-orange = the extractor’s own confidence score dropped below 65%. Muted ink = the extractor thinks it got this one.
  2. Region shape on the image. Layout regions that overlap, clip through a fold, or run at an angle are a layout-assessment failure, not an extraction failure — re-running the text engine won’t help.
  3. The text itself. Look for the failure patterns: run-together words, characters replaced with punctuation (d1e instead of die), lines that start mid-word because the layout pass missed a break.

A region with all three signals lit (orange tint, odd bbox, garbled text) is almost certainly a candidate for a document-level re-run with a VLM — the odd bbox means the layout pass failed too, and a region-level re-run cannot fix that. A region with only one signal lit (say, orange tint but reasonable-looking text) is usually fixable inline.
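The triage logic above can be sketched as a small decision function. This is illustrative only — the field names, thresholds, and return strings are assumptions for the sketch, not part of Archeglyph's API:

```python
# Hypothetical triage sketch; none of these names are Archeglyph API.
from dataclasses import dataclass


@dataclass
class Region:
    confidence: float    # extractor's own score, 0.0-1.0
    bbox_suspect: bool   # overlaps, clips a fold, or runs at an angle
    text_garbled: bool   # run-together words, punctuation-for-letters, etc.


def triage(region: Region) -> str:
    signals = sum([
        region.confidence < 0.65,   # warn-orange tint
        region.bbox_suspect,        # layout-assessment failure
        region.text_garbled,        # extraction failure patterns
    ])
    if signals == 3:
        return "document-level VLM re-run"   # layout failed too
    if signals == 2:
        return "region re-run"
    if signals == 1:
        return "fix inline"
    return "accept"
```

The two-signal case is a judgment call in practice; the sketch just picks the middle option.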

The keyboard rhythm

The review screen is designed to be operated from the keyboard. The essential four:

  • j / k — move between regions.
  • e — edit the focused region’s text (focuses the textarea).
  • Enter — accept the focused region.
  • r — open the region re-run popover.

On a noisy scan, the rhythm becomes: j j j, stop on an orange card, press e, fix the text, press Enter, continue. After a few pages you stop thinking about the keys.

When to re-run a region

Press r on a focused region. The popover offers two tabs (OCR, VLM) and a short list of available engines. The rules of thumb:

  • The text is garbled but the bbox is right → re-run with a better OCR engine first. If the dataset’s default is Tesseract and you have a cloud VLM configured, try the VLM anyway; on small regions the cost is negligible.
  • The region is a caption, a figure label, or a stamp → VLMs read these better than Tesseract in almost all cases. Re-run with a VLM and accept the result.
  • The region is a column of a table → neither engine is reliable on table cells in M0. Re-running does not help; correct inline or mark the region for a later pass.

Each re-run produces a new row in the region’s history with its own provenance badge. The previous row is not lost — the row-history disclosure on the left edge of the card shows every attempt, and you can swap back if the re-run was worse.
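Conceptually, the row history behaves like an append-only list with a movable "active" pointer. A minimal sketch of that model — the class and method names are invented for illustration, not Archeglyph's data model:

```python
# Illustrative model of a region's attempt history; not Archeglyph's schema.
from dataclasses import dataclass, field


@dataclass
class Attempt:
    engine: str   # provenance badge, e.g. "tesseract" or a VLM name
    text: str


@dataclass
class RegionHistory:
    attempts: list[Attempt] = field(default_factory=list)
    active: int = -1

    def rerun(self, engine: str, text: str) -> None:
        # A re-run appends a new row; earlier rows are never lost.
        self.attempts.append(Attempt(engine, text))
        self.active = len(self.attempts) - 1

    def swap_back(self, index: int) -> None:
        # Reactivate an earlier attempt if the re-run came back worse.
        self.active = index

    def current(self) -> Attempt:
        return self.attempts[self.active]
```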

When to re-run the whole document

If more than roughly a third of a document’s regions are orange, a per-region approach will cost more reviewer time than a single document-level re-run. Open the right-pane “Re-run full document from…” control, pick the extraction stage, and choose a VLM override. This replaces the extraction outputs for all regions at once and leaves the layout assessment intact (unless you also pick the assess stage).

Rule of thumb: document-level re-runs are worth it when you expect to accept most of the new output. If you already know three-quarters of the page will need manual edits either way, save the cloud call and fix inline.
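The two rules of thumb combine into a simple decision. The thresholds come from the guide's prose; the function itself is a hypothetical helper, not an Archeglyph API:

```python
# Hypothetical helper for the one-third rule of thumb.
def rerun_strategy(orange_regions: int, total_regions: int,
                   expect_accept_most: bool) -> str:
    ratio = orange_regions / total_regions
    if ratio > 1 / 3 and expect_accept_most:
        return "document-level re-run"
    if ratio > 1 / 3:
        # Most output will need manual edits anyway: skip the cloud call.
        return "fix inline"
    return "per-region"
```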

When to give up and re-scan

There is a scan quality below which no pipeline will help you. If the layout pass produces overlapping bboxes that slice through columns, if regions disappear entirely on certain pages, if the VLM comes back with plausible-looking prose that does not match the image — the document is below threshold. Flag it with a review note (the textarea supports a [[rescan]] tag that surfaces on the dataset’s documents table) and move on. Archeglyph does not pretend that a better model will rescue a photograph of a ruined page.
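The `[[rescan]]` tag syntax is from the guide; how the documents table actually surfaces it is not shown, so the scanning code below is a minimal illustrative sketch:

```python
# Minimal sketch of collecting [[rescan]]-tagged documents from review notes.
import re

RESCAN_TAG = re.compile(r"\[\[rescan\]\]")


def flagged_for_rescan(notes: dict[str, str]) -> list[str]:
    """Return document ids whose review note carries a [[rescan]] tag."""
    return [doc_id for doc_id, note in notes.items()
            if RESCAN_TAG.search(note)]
```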

A suggested workflow on a tough batch

  1. Open the first document. j through every region without editing. Note how many orange cards you see per page.
  2. If the ratio is low (< 15%), fix regions inline as you go.
  3. If the ratio is high (> 30%), exit to the dataset level and re-run extraction on the whole batch with a VLM override. Come back to review fresh.
  4. For regions where the new extraction is still wrong, accept the edit inline rather than re-running a third time. At that point, you are the arbiter.

The review screen is designed around the assumption that a researcher’s time is the most expensive thing in the pipeline. Use it for judgement, not for data entry.