Archeglyph
· embeddings · semantic-search · practical

Choosing an embedding model for digital humanities

A practical comparison of MiniLM-L6-v2 and BGE-small-en-v1.5 for DH corpora: what each optimises for, when the extra dimensions earn their keep, and how to decide without running a benchmark you cannot reproduce.

By Dipankar

On this page
  1. What “embedding model” means in Archeglyph
  2. What MiniLM-L6-v2 is
  3. What BGE-small-en-v1.5 is
  4. The heuristic we give researchers
  5. Operational notes
  6. The honest caveat

A researcher opening Archeglyph for the first time sees two options under Embedding model: all-MiniLM-L6-v2 and bge-small-en-v1.5. Neither label is self-explanatory; neither choice is obviously wrong. This article is the long-form version of the hover tooltip, for the researcher who wants to make the choice with their eyes open.

We are not going to cite benchmarks. The published MTEB numbers are useful as an orientation for engineers, but re-running them against a 1901 Ottoman-Greek newspaper or a set of colonial-era expedition plates is not something any of us has the budget to do honestly. What we can offer is a description of what each model optimises for, the operational consequences of picking one, and the heuristics we use when the researcher asks us.

What “embedding model” means in Archeglyph

An embedding model turns a chunk of text — here, a passage of a few sentences drawn from an extracted region — into a fixed-length numeric vector. Vectors whose cosine similarity is high are, in theory, about similar things. Archeglyph uses those vectors for two jobs:

  1. Semantic search. The researcher types a query; the product embeds it and ranks chunks by cosine similarity.
  2. Clustering. Chunks that land near each other in vector space form a candidate cluster; the theme-writing LLM is given the top-TF-IDF terms from that cluster and asked for a 4-6 word title.

Both uses depend on the vector space being coherent for the kind of text in the dataset. That is the axis on which these two models differ in practice.
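Both jobs reduce to the same primitive: rank vectors by cosine similarity to a query vector. A minimal sketch in plain numpy, with toy four-dimensional vectors standing in for real 384-dimensional embeddings (the `top_k` helper and the array values are illustrative, not Archeglyph's API):

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k=2):
    """Rank chunk vectors by cosine similarity to a query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                    # cosine similarity per chunk
    order = np.argsort(-sims)[:k]   # indices of the k most similar chunks
    return order, sims[order]

# Toy "embeddings"; real vectors would be 384-dimensional.
chunks = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
idx, scores = top_k(query, chunks)
# idx → [0, 1]: the two chunks most aligned with the query
```

Real pipelines normalise vectors once at index time; the point here is only that search and clustering both sit on top of this one similarity computation.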

What MiniLM-L6-v2 is

all-MiniLM-L6-v2 is a 384-dimensional sentence-embedding model built on a six-layer MiniLM (itself distilled from a larger transformer), fine-tuned on a broad mix of general English Q&A and paraphrase pairs. It is small, fast, and has been the default “try this first” open embedding for several years. For Archeglyph it has three practical virtues:

  • Low footprint. A 384-dimensional vector compresses well in zvec; a dataset of a million chunks fits in memory on a single modest server.
  • Fast embedding. On CPU it will out-throughput most alternatives. On a machine without a GPU, this is the difference between waiting an hour for a dataset to embed and waiting a shift.
  • Long production history. Its failure modes are well documented; when a cluster looks odd with MiniLM, there is usually a named reason.
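The footprint bullet is easy to sanity-check with back-of-envelope arithmetic. The figures below are for uncompressed float32 vectors, not zvec's actual on-disk size:

```python
chunks = 1_000_000
dims = 384
bytes_per_float = 4                     # float32
raw_bytes = chunks * dims * bytes_per_float
gib = raw_bytes / 2**30
print(f"{raw_bytes:,} bytes ≈ {gib:.2f} GiB uncompressed")
# ≈ 1.43 GiB — comfortably in RAM on a modest server, before any compression
```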

What it is not especially good at: domain-shifted English, archaic spellings, multilingual content, and sentences where the interesting signal is a small number of proper nouns (place names, ship names, officer names). In those regimes it will still produce a vector, but the vector will often cluster on surface features (sentence length, function-word mix) rather than what the researcher cares about.

What BGE-small-en-v1.5 is

bge-small-en-v1.5 is a 384-dimensional model from the BGE family, trained with an explicit instruction-tuning objective on retrieval pairs. It is the same size as MiniLM and embeds at roughly comparable cost. The interesting differences show up qualitatively:

  • Retrieval-shaped training. BGE was trained to make query-document pairs close and negatives far; MiniLM was trained more broadly. For Archeglyph’s two use cases (search, then cluster-as-a-form-of-search), that objective is on-target.
  • Better handling of named entities. In internal dogfooding on a 1900s newspaper corpus, BGE’s top-k search results for a proper-noun query ("wharves of Galata") more consistently surface the narrative contexts around that phrase rather than other sentences of similar shape. We do not have a publishable benchmark for this; we mention it as an intuition to keep.
  • Instruction prefix. BGE expects a short prefix on query embeddings (e.g. "Represent this sentence for retrieval: "). Archeglyph applies this automatically — if you switch to BGE, the query side of the pipeline is handled. You do not need to think about it.
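For readers wiring BGE up outside Archeglyph, the asymmetry in that last bullet matters: the prefix goes on queries only, never on the documents being indexed. A sketch of the convention, using the prefix string as quoted above (confirm the exact wording against the model card; the function name is ours):

```python
BGE_QUERY_PREFIX = "Represent this sentence for retrieval: "  # check the model card for the exact string

def prepare_for_embedding(text: str, is_query: bool) -> str:
    """Queries get the BGE instruction prefix; documents are embedded bare."""
    return BGE_QUERY_PREFIX + text if is_query else text

q = prepare_for_embedding("wharves of Galata", is_query=True)
d = prepare_for_embedding("The wharves of Galata were crowded that spring.", is_query=False)
# q starts with the prefix; d passes through unchanged
```

Mixing the two up, prefixing documents or forgetting the query prefix, quietly degrades retrieval quality without throwing any error, which is why Archeglyph handles it for you.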

What it is not: multilingual. bge-small-en-v1.5 is English-tuned. For Ottoman-Turkish, Italian, French, Greek, or Arabic sources, neither of these two models is ideal; the researcher should pick whichever they judge less bad and plan for the cross-language failure modes. A future Archeglyph release will surface BGE-M3 or the multilingual E5 family as third and fourth options for exactly this reason.

The heuristic we give researchers

A decision tree that does not require benchmarks:

  • English-only corpus, CPU-bound infrastructure, dataset > 500k chunks → start with MiniLM. The embedding pass is cheap and the search quality is “good enough” for the first exploratory read.
  • English-dominant corpus, quality matters more than throughput, GPU available → start with BGE. The improvement is perceptible in the top-10 search results on the kinds of queries DH researchers actually type.
  • Mixed-language or heavily archaic corpus → either, with the awareness that whichever you pick, you are going to see cross-language leakage. Consider using Archeglyph’s cluster view as the primary reading surface rather than search, because clustering is slightly more forgiving of a noisy vector space than pinpoint retrieval.
  • Actively comparing models → embed the dataset twice. Archeglyph’s snapshots carry the embedding model id per chunk, so a dataset can live in the workspace with two embedding spaces and the provenance badge will keep them straight. This is the honest way to compare on your corpus; it is also the only way that yields a defensible answer.
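The decision tree above can be written down as a small function. It is purely illustrative, encoding the same four branches; names like `pick_model` are ours, not Archeglyph's:

```python
def pick_model(english_only: bool, gpu_available: bool, n_chunks: int,
               comparing: bool = False) -> str:
    """Encode the heuristic: MiniLM for big CPU-bound English corpora,
    BGE when quality beats throughput, both when actively comparing."""
    if comparing:
        return "both (embed twice; the provenance badge keeps the spaces straight)"
    if not english_only:
        return "either (expect cross-language leakage; lean on the cluster view)"
    if not gpu_available and n_chunks > 500_000:
        return "all-MiniLM-L6-v2"
    if gpu_available:
        return "bge-small-en-v1.5"
    return "all-MiniLM-L6-v2"  # cheap first pass; re-embed with BGE if search disappoints
```

The fall-through to MiniLM on the last line is deliberate: a cheap first embedding pass costs little, and the fourth branch exists precisely because the dataset can carry both spaces at once.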

Operational notes

  • Switching embedding model on a live dataset re-embeds all chunks and rebuilds the index. The settings page surfaces this as a rebuild step with an estimated time before you confirm. A researcher should expect minutes per thousand chunks on CPU, seconds on a modern GPU.
  • The search result UI discloses the embedding model on hover. If you switched models mid-study, this is how you will notice that a given result came from the old space.
  • Clustering is not invariant across models. A dataset clustered under MiniLM and then re-clustered under BGE will not produce the same clusters, or even the same number of clusters; treat them as two separate analytic frames, not two views of one truth.
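The rebuild estimate in the first bullet can be roughed out ahead of time. The rates in this sketch are placeholders chosen to match the "minutes per thousand chunks on CPU" rule of thumb, not measured Archeglyph numbers:

```python
def estimated_rebuild_minutes(n_chunks: int, gpu: bool) -> float:
    """Back-of-envelope re-embed time. Illustrative rates only:
    assume ~2 min per 1,000 chunks on CPU, ~2 s per 1,000 on GPU."""
    per_thousand_minutes = (2 / 60) if gpu else 2.0
    return n_chunks / 1000 * per_thousand_minutes

# A 250k-chunk dataset: roughly a working day on CPU, minutes on GPU.
cpu_minutes = estimated_rebuild_minutes(250_000, gpu=False)   # 500.0
gpu_minutes = estimated_rebuild_minutes(250_000, gpu=True)    # ≈ 8.3
```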

The honest caveat

Picking an embedding model is one of several places in a DH pipeline where the default should usually be try one, read, try the other, read again. We have shipped two defaults because shipping zero is not useful and shipping ten is paralysing. The right reading of this article is not “one of these is better” but “these are the two we shipped, here is how they differ, and here is how Archeglyph helps you tell the difference on your own corpus.” The scholarship is still yours.