Deduplication

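Identifying and removing duplicate or near-duplicate entries from a dataset. In LLM training, deduplication reduces the model's tendency to memorize repeated content and improves training efficiency by avoiding compute spent on redundant examples. Techniques include MinHash, which estimates the Jaccard similarity between documents, and fuzzy matching.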
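
As a rough illustration of near-duplicate detection with MinHash, here is a minimal sketch using only the Python standard library. The function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `deduplicate`), the 3-word shingle size, the 128 hash functions, and the 0.8 threshold are all illustrative assumptions, not values taken from any particular pipeline.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    """64-bit hash of a string, salted with the seed to simulate independent hash functions."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=128):
    """For each seeded hash function, keep the minimum hash value over all items."""
    return [
        min((_hash(item, seed) for item in items), default=2**64 - 1)
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Greedy pass: keep a document only if it is not too similar to any kept document."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "The quick brown fox jumps over the lazy dog near the river bank today!",
    "Completely different text about training language models on large corpora",
]
unique_docs = deduplicate(docs)  # the second document is dropped as a duplicate of the first
```

This greedy pass compares every document against every kept document, so its cost grows quadratically with the number of documents. It is meant to show the idea only; large corpora usually need an index such as locality-sensitive hashing to avoid all-pairs comparison.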

Related terms

Data Cleaning, Corpus, Training Data