Paraphrase Data

In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, we showed that paraphrase data together with MultipleNegativesRankingLoss is a powerful combination for learning sentence embedding models. See NLI > MultipleNegativesRankingLoss for more information on this loss function.
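To give a feel for why paraphrase pairs suit this loss, here is a minimal pure-Python sketch of the idea behind it: for each anchor, its own paraphrase is treated as the correct "class" and every other paraphrase in the batch as a negative. It is an illustration, not the library's implementation. The function names, the toy vectors, and the similarity scale of 20 are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    All other positives in the batch act as negatives for anchors[i].
    `scale` is an assumed temperature-like multiplier on the similarities.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)
```

With toy 2-d "embeddings", a batch whose pairs are aligned gets a loss near zero, while a batch whose pairs are swapped gets a large loss. This is the signal that pulls paraphrases together during training.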

The training.py script loads various datasets from the Dataset Overview. We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between the datasets, i.e., a batch consists only of examples from a single dataset.

As the datasets differ considerably in size, we perform round-robin sampling so that training uses the same number of batches from each dataset.
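The sampling scheme described above can be sketched roughly as follows. This is a simplified illustration and not the code in training.py. The function name, the in-memory dictionary of datasets, and the num_rounds parameter are hypothetical.

```python
import random


def round_robin_batches(datasets, batch_size, num_rounds, seed=0):
    """Yield (dataset_name, batch) pairs, visiting the datasets in turn.

    `datasets` maps a name to a list of examples. Every dataset contributes
    exactly one batch per round, regardless of how large it is.
    """
    rng = random.Random(seed)
    for _ in range(num_rounds):
        for name, examples in datasets.items():
            # A batch is drawn from a single dataset; examples are not mixed.
            size = min(batch_size, len(examples))
            yield name, rng.sample(examples, size)


# Toy usage: a tiny and a large dataset yield the same number of batches.
toy_datasets = {
    "small": [("s%d" % i, "s%d paraphrase" % i) for i in range(5)],
    "large": [("l%d" % i, "l%d paraphrase" % i) for i in range(1000)],
}
for name, batch in round_robin_batches(toy_datasets, batch_size=4, num_rounds=2):
    print(name, len(batch))
```

Drawing each batch without replacement keeps duplicate pairs out of a batch, which matters when the other examples in the batch serve as negatives.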

Pre-Trained Models

Have a look at pre-trained models to view all models that were trained on these paraphrase datasets.

paraphrase-MiniLM-L12-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
paraphrase-distilroberta-base-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
paraphrase-distilroberta-base-v1 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: paraphrase-distilroberta-base-v1, Student: xlm-r-base)