Mining Novel Data from Large Unlabeled Corpus

System Design · ML System Design · Onsite · Phone · Machine Learning Engineer · Reported Dec 2025 · Medium Frequency

Problem Statement

This is an ML system design question for Machine Learning Engineers. Here's all that we know so far:

Mining novel data from a large corpus of unlabeled data

How to mine images containing objects of interest


Reference solution

#38 Mining Novel Data from a Large Unlabeled Corpus — Solution

✦ AI-Generated Solution · ML System Design · Comprehensive

Open-ended ML problem: from a huge unlabeled corpus (here, images), efficiently surface novel, rare, or interesting examples (concretely, "mine images containing objects of interest") to grow a training set without labeling everything. This is a data engine / active-learning design.


1. Frame the Problem

Labeling is the bottleneck, not data. The goal is a data engine: cheaply score billions of unlabeled images and route only the most valuable ones to human labeling, then loop. "Valuable" = likely to contain the object of interest and novel relative to what the model already knows (high information gain).

Clarifying questions to state up front: Do we have a seed set or a weak detector already? Do we have a few example images of the object (query-by-example) or just a class name (text query)? Is mining online or batch? What is the labeling budget? These choices change the design; the rest of this solution assumes a small seed model plus a few exemplars.

2. Architecture (a closed loop)

Figure: data mining pipeline

Embed everything once, then iterate: dedup → score for novelty/relevance → mine hard examples → human-label a small batch → retrain → re-score. Each loop makes the next mining round sharper.

3. Pipeline Stages

1. Embedding & indexing. Run every image through a strong self-supervised / foundation encoder (CLIP, DINOv2). Store embeddings in an ANN index (FAISS). CLIP also enables text→image search ("a photo of <object>") and image→image search from exemplars.
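
As a minimal sketch of the retrieval side of this stage, assuming embeddings have already been computed, the block below uses brute-force NumPy cosine search as a stand-in for an ANN index such as FAISS. The function names (`l2_normalize`, `top_k_cosine`) and the toy 3-dimensional vectors are illustrative assumptions, not part of the original solution.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def top_k_cosine(index_vecs: np.ndarray, query_vecs: np.ndarray, k: int):
    """Return (scores, ids) of the k most similar index rows for each query row."""
    sims = l2_normalize(query_vecs) @ l2_normalize(index_vecs).T
    ids = np.argsort(-sims, axis=1)[:, :k]          # highest similarity first
    scores = np.take_along_axis(sims, ids, axis=1)
    return scores, ids

# Toy stand-in for encoder output: 5 images, 3-dim embeddings (assumed values).
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.7, 0.7, 0.0]])
# The query could be an exemplar image embedding or a text embedding.
query = np.array([[1.0, 0.05, 0.0]])
scores, ids = top_k_cosine(corpus, query, k=2)
```

Brute force is quadratic in practice and only reasonable for small pools; at corpus scale the same normalized vectors would go into an approximate index instead.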

2. Near-duplicate removal. Web-scale corpora are full of duplicates and near-duplicates, which waste labeling budget and bias the set. Cluster by embedding cosine similarity (or LSH) and keep one representative per cluster.
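
One simple way to keep representatives, sketched here under assumptions, is a greedy pass that drops any image whose cosine similarity to an already-kept image exceeds a threshold. The `dedup_greedy` name, the 0.95 threshold, and the toy vectors are hypothetical; a production system would likely use an ANN or LSH lookup rather than comparing against every kept item.

```python
import numpy as np

def dedup_greedy(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Keep an image only if it is not too similar to any image kept so far."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    kept: list[int] = []
    for i in range(len(unit)):
        # Cosine similarity of image i against all current representatives.
        if not kept or np.max(unit[kept] @ unit[i]) < threshold:
            kept.append(i)
    return kept

# Toy data: rows 0/1 are near-duplicates, as are rows 2/3 (assumed values).
toy = np.array([[1.0, 0.0],
                [0.99, 0.05],
                [0.0, 1.0],
                [0.02, 1.0]])
representatives = dedup_greedy(toy, threshold=0.95)
```

The result depends on scan order: the first member of each near-duplicate group becomes its representative.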

3. Candidate generation (mining the object of interest). Several complementary signals:

  • Query-by-example / text query: kNN in embedding space around exemplar images or the CLIP text embedding of the class → high-recall candidate pool.
  • Weak detector scoring: run the current (seed) detector; keep medium-confidence hits (likely contains the object but model is unsure).
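
The two signals above can be merged into one candidate pool. The sketch below assumes retrieval hits arrive as a set of image IDs and the seed detector yields one best score per image; the 0.3 to 0.7 band, the function names, and the IDs are illustrative assumptions.

```python
def medium_confidence_ids(detector_scores: dict[str, float],
                          low: float = 0.3, high: float = 0.7) -> set[str]:
    """IDs whose best detection score falls inside the uncertain band [low, high]."""
    return {img_id for img_id, s in detector_scores.items() if low <= s <= high}

def candidate_pool(knn_hits: set[str], detector_scores: dict[str, float],
                   low: float = 0.3, high: float = 0.7) -> set[str]:
    """Union of retrieval hits and medium-confidence detector hits."""
    return knn_hits | medium_confidence_ids(detector_scores, low, high)

# Hypothetical inputs for illustration only.
knn_hits = {"img_001", "img_007"}
detector_scores = {"img_001": 0.92, "img_002": 0.55,
                   "img_003": 0.05, "img_004": 0.31}
pool = candidate_pool(knn_hits, detector_scores)
```

Taking the union favors recall; the scoring stage that follows is where the pool gets narrowed.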

4. Novelty + uncertainty scoring (what makes a candidate valuable).

  • Uncertainty / active learning: prefer examples where the model is least confident (entropy, margin, or ensemble/MC-dropout disagreement) — high information gain.
  • Novelty / OOD: prefer examples far from existing labeled data in embedding space (low density / large kNN distance) — these expand coverage.
  • Diversity: select a batch that is mutually diverse (core-set / k-center, or cluster-then-sample) so you don't label 1,000 near-identical hard cases.
  • Combine into a score: value = α·relevance + β·uncertainty + γ·novelty, then diversity-aware batch selection.
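
A minimal sketch of the combined score, assuming uncertainty is measured as normalized predictive entropy and novelty as one minus the cosine similarity to the nearest labeled embedding. Both choices, the helper names, and the equal default weights are assumptions made for illustration; the weights α, β, γ would need tuning in practice.

```python
import numpy as np

def _unit(x: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalization."""
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def normalized_entropy(probs: np.ndarray) -> np.ndarray:
    """Predictive entropy per row, divided by log(num_classes) so it lies in [0, 1]."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1) / np.log(probs.shape[1])

def novelty(candidates: np.ndarray, labeled: np.ndarray) -> np.ndarray:
    """1 minus cosine similarity to the nearest labeled embedding."""
    sims = _unit(candidates) @ _unit(labeled).T
    return 1.0 - sims.max(axis=1)

def value_score(relevance, uncertainty, novelty_, alpha=1.0, beta=1.0, gamma=1.0):
    """value = alpha * relevance + beta * uncertainty + gamma * novelty."""
    return alpha * relevance + beta * uncertainty + gamma * novelty_

# Toy inputs (assumed): two candidates, two classes, one labeled embedding.
probs = np.array([[0.5, 0.5], [0.99, 0.01]])
cand_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
labeled_emb = np.array([[1.0, 0.0]])
relevance = np.array([0.8, 0.8])

unc = normalized_entropy(probs)
nov = novelty(cand_emb, labeled_emb)
value = value_score(relevance, unc, nov)
```

In this toy case the first candidate scores high on uncertainty but zero on novelty, while the second is the reverse, which is why the terms are combined rather than used alone.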

5. Human-in-the-loop labeling. Send the top batch to annotators (with model pre-labels to speed them up); capture labels + corrections.

6. Retrain & loop. Add the newly labeled data, retrain or fine-tune the detector, and re-score the pool. Repeat. This is the classic active-learning data engine, a common way to bootstrap large detection datasets.
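
The six stages can be tied together in a loop like the one sketched below. Everything here is hypothetical scaffolding: `score`, `label`, and `retrain` are caller-supplied callbacks standing in for the value score, the annotation step, and detector training, and the toy callbacks at the bottom only exist to make the sketch runnable.

```python
from typing import Callable

def run_data_engine(pool: set, labeled: dict, score: Callable, label: Callable,
                    retrain: Callable, rounds: int, batch_size: int):
    """Score the pool, label the top batch, retrain, and repeat."""
    pool, labeled = set(pool), dict(labeled)   # work on copies
    model = retrain(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Rank the remaining pool with the current model's value score.
        ranked = sorted(pool, key=lambda item: score(model, item), reverse=True)
        for item in ranked[:batch_size]:
            labeled[item] = label(item)        # human-in-the-loop step
            pool.discard(item)
        model = retrain(labeled)               # re-scoring happens next round
    return model, labeled, pool

# Toy callbacks (assumptions): items are integers, larger means more valuable.
def toy_score(model, item):
    return item

def toy_label(item):
    return item > 5

def toy_retrain(labeled):
    return len(labeled)   # the "model" is just the number of labels seen

model, labeled, remaining = run_data_engine(
    set(range(1, 11)), {}, toy_score, toy_label, toy_retrain,
    rounds=2, batch_size=3)
```

A real loop would replace the plain top-k slice with diversity-aware batch selection.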

4. Why these choices

  • Embeddings as the substrate: one expensive pass enables cheap dedup, retrieval, novelty, and clustering — everything downstream is vector math.
  • Uncertainty × novelty: uncertainty finds decision-boundary cases; novelty finds unseen modes. Using both avoids the failure mode of repeatedly mining the same confusing-but-known cluster.
  • Diversity-aware batches: prevents redundant labeling spend.
  • Human-in-the-loop: keeps precision high; model pre-labels cut annotation cost.
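
To make the diversity-aware batch point concrete, here is a sketch of greedy k-center (core-set style) selection, assuming Euclidean distance in embedding space and a non-empty labeled set. The function name and toy coordinates are illustrative assumptions.

```python
import numpy as np

def k_center_greedy(candidates: np.ndarray, labeled: np.ndarray,
                    batch_size: int) -> list[int]:
    """Repeatedly pick the candidate farthest from everything already covered."""
    # Distance from each candidate to its nearest already-labeled point.
    diffs = candidates[:, None, :] - labeled[None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    chosen: list[int] = []
    for _ in range(min(batch_size, len(candidates))):
        idx = int(np.argmax(min_dist))
        if min_dist[idx] == 0.0:
            break  # every remaining candidate duplicates a covered point
        chosen.append(idx)
        # The new pick covers its neighbourhood, so shrink distances accordingly.
        dist_to_new = np.linalg.norm(candidates - candidates[idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen

# Toy data (assumed): rows 1 and 2 are near-identical hard cases.
labeled_pts = np.array([[0.0, 0.0]])
cands = np.array([[0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 4.0]])
batch = k_center_greedy(cands, labeled_pts, batch_size=2)
```

Only one of the two near-identical candidates makes it into the batch; the second slot goes to a different region of the space.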

5. Scale & Systems Concerns

  • Billions of images: distributed embedding (Spark/Ray + GPU workers), sharded FAISS (IVF-PQ for memory), batch scoring jobs.
  • Cost: embed once and cache; only the cheap scores are recomputed each loop, and only a tiny fraction of the corpus is ever labeled.
  • Storage: object store for images, vector index for embeddings, metadata DB for labels/scores/provenance.
  • Pseudo-labeling / self-training: for very high-confidence detections, optionally auto-label to augment (with a confidence threshold) — but keep humans for the uncertain frontier.
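
A back-of-the-envelope calculation shows why product quantization matters at this scale. The numbers below are assumptions chosen for illustration: one billion images, 512-dimensional float32 embeddings, and 64-byte PQ codes. The estimate covers vector payload only and ignores index overhead such as coarse-quantizer centroids and per-vector IDs.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for uncompressed float vectors."""
    return num_vectors * dim * bytes_per_float

def pq_index_bytes(num_vectors: int, code_bytes: int) -> int:
    """Storage for PQ codes only (overhead for centroids and IDs is ignored)."""
    return num_vectors * code_bytes

n, d = 1_000_000_000, 512          # assumed corpus size and embedding width
flat = flat_index_bytes(n, d)      # 2_048_000_000_000 bytes, about 2 TB
pq = pq_index_bytes(n, 64)         # 64_000_000_000 bytes, 64 GB
ratio = flat / pq                  # 32x smaller
```

Under these assumptions the compressed codes fit across a handful of sharded machines, whereas the raw vectors would not fit in memory on typical hardware.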

6. Evaluation

  • Mining precision: fraction of mined candidates that truly contain the object (human-audited sample).
  • Downstream lift: detector mAP improvement per 1,000 labels added — the metric that actually matters (label efficiency).
  • Coverage/diversity: number of distinct visual clusters/modes represented after mining.
  • Dedup rate and novelty hit-rate over loops (should find rarer cases as the loop matures).
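
The first three metrics reduce to simple arithmetic, sketched below. The function names and all numbers are made up for illustration and are not results from any real system.

```python
def mining_precision(audit_labels: list[bool]) -> float:
    """Fraction of audited mined candidates that truly contain the object."""
    return sum(audit_labels) / len(audit_labels)

def map_gain_per_1k(map_before: float, map_after: float, labels_added: int) -> float:
    """Detector mAP improvement normalized to 1,000 added labels."""
    return (map_after - map_before) / labels_added * 1000

def cluster_coverage(cluster_ids: list[int]) -> int:
    """Number of distinct visual clusters represented in the mined set."""
    return len(set(cluster_ids))

# Hypothetical audit: 42 of 50 sampled candidates contained the object.
precision = mining_precision([True] * 42 + [False] * 8)
# Hypothetical round: mAP went from 0.412 to 0.436 after 3,000 new labels.
gain = map_gain_per_1k(0.412, 0.436, labels_added=3000)
coverage = cluster_coverage([3, 3, 7, 1, 7, 9])
```

Tracking the gain per 1,000 labels across rounds makes it easier to see when mining stops paying for itself.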

7. Summary

Stage | Choice
Representation | Foundation encoder (CLIP/DINOv2) embeddings + FAISS
Dedup | Embedding-cosine / LSH clustering
Candidate gen | Text/exemplar kNN + weak-detector medium-confidence
Value score | α·relevance + β·uncertainty + γ·novelty
Batch selection | Diversity-aware (core-set / cluster-sample)
Loop | Human label → retrain → re-score (active learning)
Metric | Label-efficiency: mAP gain per 1k labels