This page is currently a work in progress and will be extended in the future.
In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, we showed that paraphrase datasets together with MultipleNegativesRankingLoss are a powerful combination for learning sentence embedding models.
For more information on how the loss can be used, see NLI - MultipleNegativesRankingLoss.
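To make the idea behind the loss concrete, here is a minimal pure-Python sketch of an in-batch-negatives ranking loss: for each anchor, its own positive is the target and the positives of the other examples in the batch serve as negatives. The function names, the use of cosine similarity, and `scale=20.0` are assumptions made for illustration; this is not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over the batch: for anchor i, positive i is the
    correct class and every other positive in the batch is a negative.
    `scale` multiplies the similarities (illustrative value, an assumption)."""
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)

# Toy 2-d embeddings (hypothetical): anchors match positives index by index
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is closest to its own positive (`aligned`) and large when it is closer to another example's positive (`swapped`), which is why larger batches supply more negatives per anchor.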
In this folder, we collect different datasets and scripts to train using paraphrase data.
A list of paraphrase datasets suitable for training is available at sbert.net/datasets/paraphrases.
| Dataset | Source | Samples | STSb dev score |
|---|---|---|---|
| AllNLI.tsv.gz | SNLI + MultiNLI | 277,230 | 86.54 |
| msmarco-triplets.tsv.gz | MS MARCO Passages | 5,028,051 | 83.12 |
| yahoo_answers_title_question.tsv.gz | Yahoo Answers Dataset | 659,896 | 81.19 |
| S2ORC_citation_pairs.tsv.gz | Semantic Scholar Open Research Corpus | 52,603,982 | 81.02 |
| yahoo_answers_title_answer.tsv.gz | Yahoo Answers Dataset | 1,198,260 | 80.25 |
| yahoo_answers_question_answer.tsv.gz | Yahoo Answers Dataset | 681,164 | 79.88 |
See the respective linked source website for the dataset license.
All datasets have one sample per line, with the individual sentences separated by a tab (\t). Some datasets (like AllNLI) have three sentences per line: an anchor, a positive, and a hard negative.
For each dataset, we measure the performance on the STSb development dataset after 2k training steps with a distilroberta-base model and a batch size of 256.
Note: We find that STSb is a suboptimal dataset for evaluating the quality of sentence embedding models. It consists mainly of rather simple sentences, it does not require any domain-specific knowledge, and the included sentences are of rather high quality compared to noisy, user-written content. Please do not infer from the above numbers how the approaches will perform on your domain-specific dataset.
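The file format described above can be read with the standard library alone. The following is a minimal sketch, not the project's loader: `read_examples` is a hypothetical helper, and the sentences are made-up stand-ins compressed in memory so the example does not depend on a downloaded file.

```python
import gzip
import io

def read_examples(fileobj):
    """Yield one list of sentences per non-empty line of a tab-separated text stream."""
    for line in fileobj:
        line = line.rstrip("\r\n")
        if line:
            yield line.split("\t")

# Hypothetical triplet line (anchor, positive, hard negative), gzip-compressed in memory
raw = "A man is eating.\tSomeone eats food.\tA man is sleeping.\n".encode("utf8")
with gzip.open(io.BytesIO(gzip.compress(raw)), "rt", encoding="utf8") as f:
    triplets = list(read_examples(f))

anchor, positive, hard_negative = triplets[0]

# Hypothetical pair line (two paraphrases), read from a plain text stream
pairs = list(read_examples(io.StringIO("How do I bake bread?\tWhat is the way to bake bread?\n")))
```

For a real dataset, passing a file path such as `AllNLI.tsv.gz` to `gzip.open(path, "rt", encoding="utf8")` yields the same kind of text stream.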
See training.py for the training script.
The training script can load one or multiple files. We construct batches by sampling examples from the respective dataset. So far, examples are not mixed between the datasets, i.e., a batch consists only of examples from a single dataset.
As the datasets differ considerably in size, we perform temperature-controlled sampling from the datasets: smaller datasets are up-sampled, while larger datasets are down-sampled. This allows effective training with a mix of very large and smaller datasets.
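The exact sampling formula is not given here, so the following is only a sketch of one common form of temperature-based sampling, in which dataset i is chosen with probability proportional to n_i ** (1 / T). The function names and the temperature value of 4.0 are assumptions for illustration; see training.py for what the script actually does.

```python
import random

def sampling_probabilities(sizes, temperature):
    """Map dataset sizes to probabilities proportional to size ** (1 / temperature).
    temperature = 1 gives size-proportional sampling; larger values flatten it."""
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

def sample_batch(datasets, probs, batch_size, rng):
    """Pick ONE dataset according to probs, then draw the whole batch from it,
    so examples from different datasets are never mixed in a batch."""
    names = list(datasets)
    name = rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
    data = datasets[name]
    return name, rng.sample(data, min(batch_size, len(data)))

# Sizes taken from the table above
sizes = {"AllNLI": 277_230, "S2ORC_citation_pairs": 52_603_982}
proportional = sampling_probabilities(sizes, temperature=1.0)
smoothed = sampling_probabilities(sizes, temperature=4.0)  # illustrative temperature

# Toy datasets (hypothetical) to show a single-dataset batch
toy = {"small": ["s1", "s2", "s3"], "large": [f"l{i}" for i in range(100)]}
toy_probs = sampling_probabilities({k: len(v) for k, v in toy.items()}, temperature=4.0)
name, batch = sample_batch(toy, toy_probs, batch_size=2, rng=random.Random(0))
```

Compared with size-proportional sampling, the smoothed probabilities give the smaller dataset a larger share and the larger dataset a smaller one, which is the up-sampling and down-sampling effect described above.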
Have a look at pre-trained models to view all models that were trained on these paraphrase datasets.
paraphrase-MiniLM-L12-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
paraphrase-distilroberta-base-v2 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, msmarco-triplets, quora_duplicates, coco_captions, flickr30k_captions, yahoo_answers_title_question, S2ORC_citation_pairs, stackexchange_duplicate_questions, wiki-atomic-edits
paraphrase-distilroberta-base-v1 - Trained on the following datasets: AllNLI, sentence-compression, SimpleWiki, altlex, quora_duplicates, wiki-atomic-edits, wiki-split
paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages. (Teacher: paraphrase-distilroberta-base-v1, Student: xlm-r-base)
Work in Progress
Training with this data is currently a work in progress. The following additions are planned:
More datasets: Are you aware of more suitable training datasets? Let me know: email@example.com
Optimized batching: Currently, each batch is drawn from a single dataset. Future work might also include batches that are sampled across datasets.
Optimized loss function: Currently, the same MultipleNegativesRankingLoss parameters are used for all datasets. Future work includes testing whether the datasets benefit from individual loss functions.
Pre-trained models: Once all datasets are collected, we will train and release respective models.