Datasets¶

sentence_transformers.datasets contains classes to organize your training input examples.

ParallelSentencesDataset¶

ParallelSentencesDataset is used for multilingual training. For details, see multilingual training.

class sentence_transformers.datasets.ParallelSentencesDataset(student_model: SentenceTransformer, teacher_model: SentenceTransformer, batch_size: int = 8, use_embedding_cache: bool = True)¶

This dataset reader can be used to read in parallel sentences, i.e., it reads a file of tab-separated sentences in which each line holds the same sentence in different languages. For example, the file can look like this (EN DE ES):

    hello world	hallo welt	hola mundo
    second sentence	zweiter satz	segunda oración

The sentence in the first column will be mapped to a sentence embedding using the given embedder (the teacher model). For example, the embedder can be a monolingual sentence embedding method for English. The sentences in the other languages will also be mapped to this English sentence embedding.

When getting a sample from the dataset, we get one sentence together with the corresponding sentence embedding for this sentence.

teacher_model can be any class that implements an encode function. The encode function gets a list of sentences and returns a list of sentence embeddings.
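The mapping described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: ToyTeacher and build_samples are hypothetical names, and the two-number "embedding" is a stand-in for a real model output. The only property taken from the text is that the teacher exposes an encode function that accepts a list of sentences and returns one embedding per sentence.

```python
class ToyTeacher:
    """Hypothetical stand-in for a teacher model: any object with an encode() method."""

    def encode(self, sentences):
        # Fake 2-dimensional "embedding" per sentence: [character count, word count].
        return [[float(len(s)), float(len(s.split()))] for s in sentences]


def build_samples(lines, teacher):
    """Pair every sentence in a line with the teacher embedding of the first column."""
    samples = []
    for line in lines:
        columns = [c.strip() for c in line.split("\t") if c.strip()]
        if not columns:
            continue  # skip empty lines
        target = teacher.encode([columns[0]])[0]
        for sentence in columns:
            samples.append((sentence, target))
    return samples


lines = [
    "hello world\thallo welt\thola mundo",
    "second sentence\tzweiter satz\tsegunda oración",
]
samples = build_samples(lines, ToyTeacher())
for sentence, embedding in samples:
    print(sentence, embedding)
```

Each translation ends up with the same target vector as its first-column sentence, which is what lets a student model learn to place all languages at the teacher's embedding.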

Parallel sentences dataset reader to train a student model given a teacher model.

Parameters
  • student_model – Student sentence embedding model that should be trained

  • teacher_model – Teacher model that provides the sentence embeddings for the first column in the dataset file

SentenceLabelDataset¶

SentenceLabelDataset can be used if you have labeled sentences and want to train with triplet loss.

class sentence_transformers.datasets.SentenceLabelDataset(examples: List[sentence_transformers.readers.InputExample.InputExample], samples_per_label: int = 2, with_replacement: bool = False)¶

This dataset can be used for some specific Triplet Losses like BATCH_HARD_TRIPLET_LOSS which requires multiple examples with the same label in a batch.

It draws n consecutive, random and unique samples from one label at a time. This is repeated for each label.

Labels with fewer than n unique samples are ignored. This also applies to drawing without replacement: once fewer than n samples remain for a label, it is skipped.

This class does NOT check whether there are more labels than fit into a batch, or whether the batch size is divisible by the number of samples drawn per label.
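The drawing scheme for the default case (no replacement) can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: draw_by_label is a hypothetical helper, examples are plain (text, label) tuples instead of InputExamples, and texts are assumed to be unique within a label.

```python
import random
from collections import defaultdict


def draw_by_label(examples, samples_per_label=2, seed=0):
    """Yield (text, label) pairs so that each run of `samples_per_label` items shares a label."""
    rng = random.Random(seed)
    remaining = defaultdict(list)
    for text, label in examples:
        remaining[label].append(text)

    progress = True
    while progress:
        progress = False
        for label, texts in remaining.items():
            # Skip labels that no longer have enough unseen samples.
            if len(texts) < samples_per_label:
                continue
            for text in rng.sample(texts, samples_per_label):
                texts.remove(text)  # without replacement: never draw it again
                yield text, label
            progress = True


examples = [
    ("a1", "a"), ("a2", "a"), ("a3", "a"), ("a4", "a"),
    ("b1", "b"), ("b2", "b"), ("b3", "b"),
    ("c1", "c"),
]
drawn = list(draw_by_label(examples, samples_per_label=2))
print(drawn)
```

With these toy inputs, label "a" contributes four samples, label "b" two (its third sample is left over), and label "c" none, because it never has two samples available.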

Creates a LabelSampler for a SentenceLabelDataset.

Parameters
  • examples – a list with InputExamples

  • samples_per_label – the number of consecutive, random and unique samples drawn per label. Batch size should be a multiple of samples_per_label

  • with_replacement – if this is False, then each sample is drawn at most once (depending on the total number of samples per label). If this is True, then one sample can be drawn in multiple draws, but still not multiple times in the same drawing.

DenoisingAutoEncoderDataset¶

DenoisingAutoEncoderDataset is used for unsupervised training with the TSDAE method.

class sentence_transformers.datasets.DenoisingAutoEncoderDataset(sentences: List[str], noise_fn=<function DenoisingAutoEncoderDataset.<lambda>>)¶

The DenoisingAutoEncoderDataset returns InputExamples in the format texts=[noise_fn(sentence), sentence]. It is used in combination with the DenoisingAutoEncoderLoss, where a decoder tries to reconstruct the sentence without noise.

Parameters
  • sentences – A list of sentences

  • noise_fn – A noise function: Given a string, it returns a string with noise, e.g. deleted words
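As an illustration of what a noise function might look like, here is a minimal sketch that deletes words at random. It is an assumption for demonstration purposes, not the library's default noise function: delete_words, its del_ratio value, and make_pairs are hypothetical, and tokenization is a plain whitespace split.

```python
import random


def delete_words(sentence, del_ratio=0.6, rng=None):
    """Return the sentence with roughly `del_ratio` of its words removed."""
    if rng is None:
        rng = random.Random()
    words = sentence.split()
    if not words:
        return sentence
    kept = [w for w in words if rng.random() > del_ratio]
    if not kept:
        # Always keep at least one word so the noisy text is never empty.
        kept = [rng.choice(words)]
    return " ".join(kept)


def make_pairs(sentences, noise_fn):
    """Mimic the documented format: [noise_fn(sentence), sentence]."""
    return [[noise_fn(s), s] for s in sentences]


sentences = ["the quick brown fox jumps over the lazy dog"]
rng = random.Random(0)
pairs = make_pairs(sentences, lambda s: delete_words(s, rng=rng))
print(pairs[0])
```

A function with this shape (string in, string out) could be passed as the noise_fn argument shown in the class signature.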

NoDuplicatesDataLoader¶

NoDuplicatesDataLoader can be used together with MultipleNegativesRankingLoss to ensure that no duplicates are within the same batch.

class sentence_transformers.datasets.NoDuplicatesDataLoader(train_examples, batch_size)¶

A special data loader to be used with MultipleNegativesRankingLoss. The data loader ensures that there are no duplicate sentences within the same batch.
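The batching constraint can be sketched in plain Python. This is a simplified illustration and not the loader's actual implementation: no_duplicate_batches is a hypothetical helper, examples are plain tuples of texts rather than InputExamples, and leftover examples that cannot fill a complete batch are simply dropped.

```python
def no_duplicate_batches(examples, batch_size):
    """Greedily build full batches in which no text occurs more than once."""
    batches = []
    pending = list(examples)
    while len(pending) >= batch_size:
        batch, seen, leftover = [], set(), []
        for texts in pending:
            is_duplicate = any(t in seen for t in texts)
            if len(batch) < batch_size and not is_duplicate:
                batch.append(texts)
                seen.update(texts)
            else:
                leftover.append(texts)  # retry in a later batch
        if len(batch) < batch_size:
            break  # the remaining examples cannot form a full batch
        batches.append(batch)
        pending = leftover
    return batches


examples = [
    ("q1", "a1"), ("q1", "a2"), ("q2", "a3"),
    ("q3", "a1"), ("q4", "a4"), ("q5", "a5"),
]
batches = no_duplicate_batches(examples, batch_size=2)
for batch in batches:
    print(batch)
```

Duplicates matter for MultipleNegativesRankingLoss because the other examples in a batch serve as negatives; if the same sentence appeared twice, a true positive would be treated as a negative.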