This is an ML system design question for Machine Learning Engineers. Here's all that we know so far:
Mining novel data from a large corpus of unlabeled data
How to mine images with objects of interest
✦ AI-Generated Solution · ML System Design · Comprehensive
Open-ended ML problem: from a huge unlabeled corpus (here, images), efficiently surface novel, rare, or interesting examples (concretely, "mine images containing objects of interest") to grow a training set without labeling everything. This is a data engine / active-learning design.
Labeling is the bottleneck, not data. The goal is a data engine: cheaply score billions of unlabeled images and route only the most valuable ones to human labeling, then loop. "Valuable" = likely to contain the object of interest and novel relative to what the model already knows (high information gain).
Clarifying questions to state up front: Do we already have a seed set or a weak detector? Do we have a few example images of the object (query-by-example) or just a class name (text query)? Is mining online or batch? What is the labeling budget? These choices change the design; what follows assumes a small seed model plus a few exemplars.

Embed everything once, then iterate: dedup → score for novelty/relevance → mine hard examples → human-label a small batch → retrain → re-score. Each loop makes the next mining round sharper.
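The loop just outlined can be sketched as a small driver function. This is a minimal illustration, not a reference implementation: `run_data_engine`, `score_fn`, `label_fn`, and `retrain_fn` are hypothetical names, and the sketch assumes embedding and dedup have already been done once, so the pool holds only image IDs.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def run_data_engine(
    pool_ids: Sequence[Hashable],
    score_fn: Callable[[List[Hashable]], List[float]],
    label_fn: Callable[[List[Hashable]], Dict[Hashable, str]],
    retrain_fn: Callable[[Dict[Hashable, str]], None],
    batch_size: int,
    rounds: int,
) -> Dict[Hashable, str]:
    """Score -> select top batch -> label -> retrain, repeated `rounds` times.

    All three callables are placeholders (assumptions) for the real scoring
    model, annotation tool, and training job.
    """
    labeled: Dict[Hashable, str] = {}
    unlabeled = list(pool_ids)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Re-score the remaining pool with whatever the current model is.
        scores = score_fn(unlabeled)
        ranked = sorted(zip(scores, unlabeled), key=lambda pair: pair[0], reverse=True)
        batch = [image_id for _, image_id in ranked[:batch_size]]
        # Human labeling step: returns {image_id: label} for the batch.
        labeled.update(label_fn(batch))
        chosen = set(batch)
        unlabeled = [i for i in unlabeled if i not in chosen]
        # Retrain on everything labeled so far, so the next round's scores change.
        retrain_fn(labeled)
    return labeled
```

Passing the three stages in as callables keeps the loop itself trivial to test with stubs before any real model or annotation tool is wired in.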
1. Embedding & indexing. Run every image through a strong self-supervised / foundation encoder (CLIP, DINOv2). Store embeddings in an ANN index (FAISS). CLIP also enables text→image search ("a photo of <object>") and image→image search from exemplars.
2. Near-duplicate removal. Web-scale corpora are full of duplicates and near-duplicates, which waste labeling budget and bias the set. Cluster by embedding cosine similarity (or LSH) and keep one representative per cluster.
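3. Candidate generation (mining the object of interest). Several complementary signals: text-query and exemplar kNN retrieval over the embedding index, plus medium-confidence detections from the weak detector.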
4. Novelty + uncertainty scoring (what makes a candidate valuable).
value = α·relevance + β·uncertainty + γ·novelty, then diversity-aware batch selection.
5. Human-in-the-loop labeling. Send the top batch to annotators (with model pre-labels to speed them up); capture labels + corrections.
6. Retrain & loop. Add newly labeled data, retrain/fine-tune the detector, re-score the pool. Repeat: this is the classic active-learning data engine, a common way to bootstrap large detection datasets.
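Steps 2 to 4 can be sketched on top of precomputed embeddings. This is a hedged illustration under several assumptions: brute-force NumPy stands in for an ANN index such as FAISS, the thresholds (0.95, 0.9) are illustrative values, the definitions of `uncertainty` and `novelty` are one plausible choice rather than the only one, and `select_diverse` is a simple greedy stand-in for core-set or cluster sampling. All function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def dedup(emb, threshold=0.95):
    """Greedy near-duplicate removal on unit-norm embeddings.

    An image is kept only if its cosine similarity to every image kept so
    far is below `threshold` (an illustrative value, not from the source).
    """
    kept = []
    for i in range(len(emb)):
        if not kept or np.max(emb[kept] @ emb[i]) < threshold:
            kept.append(i)
    return np.array(kept, dtype=int)


def relevance(pool, queries):
    """Max cosine similarity of each pool image to any query embedding
    (exemplar images, or a text prompt embedded by the same encoder)."""
    return (pool @ queries.T).max(axis=1)


def uncertainty(confidence):
    """Assumed definition: 1 at detector confidence 0.5, 0 at confidence 0 or 1."""
    return 1.0 - np.abs(2.0 * np.asarray(confidence, dtype=np.float64) - 1.0)


def novelty(pool, labeled):
    """Assumed definition: 1 minus max cosine similarity to the labeled set."""
    if len(labeled) == 0:
        return np.ones(len(pool))
    return 1.0 - (pool @ labeled.T).max(axis=1)


def value_score(rel, unc, nov, alpha=1.0, beta=1.0, gamma=1.0):
    """value = alpha*relevance + beta*uncertainty + gamma*novelty."""
    return alpha * rel + beta * unc + gamma * nov


def select_diverse(emb, scores, k, max_sim=0.9):
    """Walk candidates from highest to lowest score and skip any whose cosine
    similarity to an already-picked image is `max_sim` or higher."""
    picked = []
    for i in np.argsort(-np.asarray(scores)):
        if len(picked) == k:
            break
        if not picked or np.max(emb[picked] @ emb[i]) < max_sim:
            picked.append(int(i))
    return picked
```

The weights α, β, γ are left as plain parameters because the source does not fix them; in practice they would be tuned against the label-efficiency metric.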
| Stage | Choice |
|---|---|
| Representation | Foundation encoder (CLIP/DINOv2) embeddings + FAISS |
| Dedup | Embedding-cosine / LSH clustering |
| Candidate gen | Text/exemplar kNN + weak-detector medium-confidence |
| Value score | α·relevance + β·uncertainty + γ·novelty |
| Batch selection | Diversity-aware (core-set / cluster-sample) |
| Loop | Human label → retrain → re-score (active learning) |
| Metric | Label-efficiency: mAP gain per 1k labels |
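
The label-efficiency metric in the last row can be computed per labeling round. The sketch below assumes mAP is reported as a fraction in [0, 1] and that each round records how many labels were added; `label_efficiency` and `efficiency_by_round` are hypothetical helper names.

```python
def label_efficiency(map_before, map_after, labels_added):
    """mAP gain per 1,000 newly added labels."""
    if labels_added <= 0:
        raise ValueError("labels_added must be positive")
    return (map_after - map_before) / labels_added * 1000.0


def efficiency_by_round(history):
    """`history` is a list of (cumulative_labels, mAP) pairs, one per round,
    in chronological order. Returns one efficiency value per transition."""
    return [
        label_efficiency(m0, m1, n1 - n0)
        for (n0, m0), (n1, m1) in zip(history, history[1:])
    ]
```

Tracking this per round shows when mining has hit diminishing returns: once the gain per 1k labels flattens, the remaining budget is probably better spent on a different class or signal.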