Paraphrase Data

This page is a work in progress and will be extended in the future.

In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, we showed that paraphrase datasets combined with MultipleNegativesRankingLoss are a powerful way to learn sentence embedding models.

For more information on how the loss can be used, see NLI - MultipleNegativesRankingLoss.
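
To make the idea behind the loss concrete, here is a minimal pure-Python sketch of an in-batch negatives objective: for each anchor, its own positive should score higher than the positives of all other examples in the batch. This is a simplified illustration, not the library's implementation. The function names and the toy 2-d vectors are made up for this example, and the cosine similarity with a scale of 20 is assumed here as a typical setting.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positives[j] acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores)
        )
        total += log_sum_exp - scores[i]
    return total / len(anchors)


# Toy 2-d "embeddings" (hypothetical values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is close to the wrong anchor

loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_swapped = in_batch_negatives_loss(anchors, swapped)
```

In this toy setup the loss is close to zero when each positive lies near its own anchor and becomes large when the pairs are swapped. Larger batches provide more in-batch negatives per anchor.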

In this folder, we collect different datasets and scripts to train using paraphrase data.

Datasets

A list of paraphrase datasets suitable for training is available at sbert.net/datasets/paraphrases.

Name | Source | #Sentence-Pairs | STSb-dev
AllNLI.tsv.gz | SNLI + MultiNLI | 277,230 | 86.54
sentence-compression.tsv.gz | sentence-compression | 180,000 | 84.36
SimpleWiki.tsv.gz | SimpleWiki | 102,225 | 84.26
altlex.tsv.gz | altlex | 112,696 | 83.34
msmarco-triplets.tsv.gz | MS MARCO Passages | 5,028,051 | 83.12
quora_duplicates.tsv.gz | Quora | 103,663 | 82.55
coco_captions-with-guid.tsv.gz | COCO | 828,395 | 82.25
flickr30k_captions-with-guid.tsv.gz | Flickr 30k | 317,695 | 82.04
yahoo_answers_title_question.tsv.gz | Yahoo Answers Dataset | 659,896 | 81.19
S2ORC_citation_pairs.tsv.gz | Semantic Scholar Open Research Corpus | 52,603,982 | 81.02
yahoo_answers_title_answer.tsv.gz | Yahoo Answers Dataset | 1,198,260 | 80.25
stackexchange_duplicate_questions.tsv.gz | Stackexchange | 169,438 | 80.37
yahoo_answers_question_answer.tsv.gz | Yahoo Answers Dataset | 681,164 | 79.88
wiki-atomic-edits.tsv.gz | wiki-atomic-edits | 22,980,185 | 79.58
wiki-split.tsv.gz | wiki-split | 929,944 | 76.59

See the respective linked source website for the dataset license.

Each dataset has one sample per line, with the individual sentences separated by a tab (\t). Some datasets (such as AllNLI) have three sentences per line: an anchor, a positive, and a hard negative.
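
A file in this format can be read with the Python standard library alone. The sketch below assumes UTF-8 encoded text. The helper names parse_line and read_examples are hypothetical and are not taken from the training script.

```python
import gzip


def parse_line(line):
    """Split one tab-separated line into its sentences."""
    return line.rstrip("\n").split("\t")


def read_examples(path):
    """Yield one list of sentences per non-empty line of a .tsv.gz file.

    Most files yield pairs; files with hard negatives (such as AllNLI)
    yield [anchor, positive, hard_negative].
    """
    with gzip.open(path, "rt", encoding="utf8") as f:
        for line in f:
            if line.strip():
                yield parse_line(line)
```

For example, list(read_examples("AllNLI.tsv.gz")) would load the whole file into memory, assuming it has been downloaded to the current directory. For the larger datasets, iterating over the generator avoids holding every line at once.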

For each dataset, we measure performance on the STSb development set after 2k training steps with a distilroberta-base model and a batch size of 256.

Note: We find that STSb is a suboptimal dataset for evaluating the quality of sentence embedding models. It consists mainly of rather simple sentences, it does not require any domain-specific knowledge, and the included sentences are of rather high quality compared to noisy, user-written content. Please do not infer from the above numbers how the approaches will perform on your domain-specific dataset.

Training

See training.py for the training script.

The training script can load one or more files. We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between datasets, i.e., a batch consists only of examples from a single dataset.
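
The batching scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the code in training.py: the function name, its parameters, and the toy data are hypothetical, and the dataset for each batch is picked uniformly at random unless weights are supplied.

```python
import random


def single_dataset_batches(datasets, batch_size, num_batches, weights=None, seed=0):
    """Yield (dataset_name, batch) pairs where each batch is drawn from
    exactly one dataset.

    datasets: dict mapping a dataset name to a list of examples.
    weights:  optional per-dataset sampling weights, in the same order as
              the dict keys (uniform if None).
    """
    rng = random.Random(seed)
    names = list(datasets)
    for _ in range(num_batches):
        # First pick the dataset, then sample the batch from it alone.
        name = rng.choices(names, weights=weights, k=1)[0]
        examples = datasets[name]
        batch = rng.sample(examples, min(batch_size, len(examples)))
        yield name, batch


# Toy data (hypothetical), standing in for the loaded files.
toy_datasets = {
    "quora_duplicates": [("q1", "q1'"), ("q2", "q2'"), ("q3", "q3'")],
    "SimpleWiki": [("s1", "s1'"), ("s2", "s2'"), ("s3", "s3'"), ("s4", "s4'")],
}
toy_batches = list(single_dataset_batches(toy_datasets, batch_size=2, num_batches=5))
```

Keeping each batch within one dataset means the in-batch negatives for an example always come from the same source as the example itself.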

As the datasets differ considerably in size, we perform temperature-controlled sampling from the datasets: smaller datasets are up-sampled, while larger datasets are down-sampled. This allows effective training with a mix of very large and smaller datasets.
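
The exact sampling formula is not given on this page. One common formulation, assumed here purely for illustration, draws dataset i with probability proportional to n_i^(1/T), where n_i is the dataset size and T is the temperature. T = 1 gives size-proportional sampling, and larger T moves the distribution toward uniform. The function name and the default temperature below are hypothetical.

```python
def temperature_sampling_probs(sizes, temperature=2.0):
    """Return per-dataset sampling probabilities proportional to
    size ** (1 / temperature).

    sizes: dict mapping a dataset name to its number of examples.
    """
    weights = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Sizes taken from the dataset table on this page.
sizes = {"AllNLI": 277_230, "S2ORC_citation_pairs": 52_603_982}

proportional = temperature_sampling_probs(sizes, temperature=1.0)
smoothed = temperature_sampling_probs(sizes, temperature=2.0)
```

With these two datasets, size-proportional sampling would pick AllNLI for well under 1% of the batches, while the smoothed distribution picks it noticeably more often, at the expense of S2ORC_citation_pairs.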

Pre-Trained Models

Have a look at pre-trained models to view all models that were trained on these paraphrase datasets.

  • paraphrase-MiniLM-L12-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits

  • paraphrase-distilroberta-base-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits

  • paraphrase-distilroberta-base-v1 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split

  • paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: paraphrase-distilroberta-base-v1, Student: xlm-r-base)

Work in Progress

Training with this data is currently a work in progress. Planned additions:

  • More datasets: Are you aware of more suitable training datasets? Let me know: info@nils-reimers.de

  • Optimized batching: Currently, each batch is drawn from only one dataset. Future work might also include batches that are sampled across datasets.

  • Optimized loss function: Currently, the same MultipleNegativesRankingLoss parameters are used for all datasets. Future work includes testing whether the datasets benefit from individual loss functions.

  • Pre-trained models: Once all datasets are collected, we will train and release respective models.