sentence_transformers.datasets contains classes to organize your training input examples.
ParallelSentencesDataset is used for multilingual training. For details, see multilingual training.
This dataset reader reads parallel sentences, i.e., a file of tab-separated sentences in which each line holds the same sentence in different languages. For example, the file can look like this (EN DE ES):
hello world	hallo welt	hola mundo
second sentence	zweiter satz	segunda oración
The sentence in the first column is mapped to a sentence embedding using the given embedder. For example, the embedder may be a monolingual sentence embedding method for English. The sentences in the other languages are then mapped to this same English sentence embedding.
When getting a sample from the dataset, we get one sentence together with the corresponding target sentence embedding.
teacher_model can be any class that implements an encode function. The encode function takes a list of sentences and returns a list of sentence embeddings.
Parallel sentences dataset reader to train a student model given a teacher model.
student_model – Student sentence embedding model that should be trained
teacher_model – Teacher model that provides the sentence embeddings for the first column in the dataset file
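The mapping described above can be illustrated with a small sketch. This is not the library's implementation: ToyTeacher and build_training_pairs are hypothetical names, and the "embedding" is a made-up two-number vector chosen only so the example runs without downloading a model. It shows the idea that every sentence in a row is paired with the teacher embedding of the first column.

```python
from typing import List, Sequence, Tuple


class ToyTeacher:
    """Hypothetical stand-in for a teacher model: any object with encode()."""

    def encode(self, sentences: Sequence[str]) -> List[List[float]]:
        # Illustrative "embedding" only: [character count, word count].
        return [[float(len(s)), float(len(s.split()))] for s in sentences]


def build_training_pairs(tsv_text: str, teacher) -> List[Tuple[str, List[float]]]:
    """Pair each sentence in a row with the teacher embedding of column 1."""
    pairs = []
    for line in tsv_text.splitlines():
        sentences = [cell.strip() for cell in line.split("\t") if cell.strip()]
        if not sentences:
            continue  # skip blank lines
        # The first column is the sentence the teacher model understands.
        target = teacher.encode([sentences[0]])[0]
        # Every language version (including the first) gets the same target.
        for sentence in sentences:
            pairs.append((sentence, target))
    return pairs


data = (
    "hello world\thallo welt\thola mundo\n"
    "second sentence\tzweiter satz\tsegunda oración\n"
)
pairs = build_training_pairs(data, ToyTeacher())
for sentence, embedding in pairs:
    print(sentence, embedding)
```

In real training the student model would be optimized so that its embedding of each sentence moves toward the paired target vector.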
SentenceLabelDataset can be used if you have labeled sentences and want to train with triplet loss.
This dataset can be used for specific triplet losses such as BATCH_HARD_TRIPLET_LOSS, which require multiple examples with the same label in a batch.
It draws n consecutive, random and unique samples from one label at a time. This is repeated for each label.
Labels with fewer than n unique samples are ignored. This also applies to drawing without replacement: once fewer than n samples remain for a label, it is skipped.
This DOES NOT check whether there are more labels than fit in a batch or whether the batch size is divisible by the number of samples drawn per label.
Creates a LabelSampler for a SentenceLabelDataset.
examples – a list with InputExamples
samples_per_label – the number of consecutive, random and unique samples drawn per label. Batch size should be a multiple of samples_per_label
with_replacement – if this is False, then each sample is drawn at most once (depending on the total number of samples per label). If this is True, then one sample can be drawn in multiple draws, but still not multiple times in the same drawing.
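The per-label drawing scheme can be sketched in plain Python as follows. This is a simplified illustration, not the library's code: draw_groups and num_passes are hypothetical names, and the sketch assumes the usual meaning of with_replacement (True lets a sample reappear in a later draw, False removes it from the pool once drawn).

```python
import random
from collections import defaultdict
from typing import List, Tuple


def draw_groups(
    examples: List[Tuple[str, str]],
    samples_per_label: int,
    with_replacement: bool = False,
    num_passes: int = 1,
    seed: int = 0,
) -> List[Tuple[str, List[str]]]:
    """Draw groups of unique samples, one label at a time."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for text, label in examples:
        if text not in pools[label]:  # keep only unique samples per label
            pools[label].append(text)

    groups = []
    for _ in range(num_passes):
        for label, pool in pools.items():
            if len(pool) < samples_per_label:
                continue  # too few samples (left) for this label: skip it
            # random.sample never returns the same element twice in one draw.
            chosen = rng.sample(pool, samples_per_label)
            groups.append((label, chosen))
            if not with_replacement:
                for text in chosen:
                    pool.remove(text)
    return groups


examples = [
    ("a1", "a"), ("a2", "a"), ("a3", "a"), ("a4", "a"),
    ("b1", "b"), ("b2", "b"),
    ("c1", "c"),  # only one sample, so label "c" is always ignored
]
without = draw_groups(examples, samples_per_label=2, num_passes=2)
with_repl = draw_groups(examples, samples_per_label=2, with_replacement=True, num_passes=2)
print([label for label, _ in without])
print([label for label, _ in with_repl])
```

Because each group contributes samples_per_label consecutive items, a batch size that is a multiple of samples_per_label keeps every group inside a single batch.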
DenoisingAutoEncoderDataset is used for unsupervised training with the TSDAE method.
The DenoisingAutoEncoderDataset returns InputExamples in the format texts=[noise_fn(sentence), sentence]. It is used in combination with the DenoisingAutoEncoderLoss: here, a decoder tries to reconstruct the sentence without noise.
sentences – A list of sentences
noise_fn – A noise function: Given a string, it returns a string with noise, e.g. deleted words
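As an illustration of what a noise function might look like, here is a minimal sketch that deletes words at random and builds the [noisy, original] pairs. The names delete_words and make_pairs are hypothetical, the whitespace tokenization is a simplification, and the deletion ratio of 0.6 is an assumption for this example rather than a statement about the library's default behavior.

```python
import random
from typing import Callable, List, Optional


def delete_words(sentence: str, del_ratio: float = 0.6,
                 rng: Optional[random.Random] = None) -> str:
    """Return the sentence with each word dropped with probability del_ratio."""
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence
    kept = [w for w in words if rng.random() > del_ratio]
    if not kept:
        # Never return an empty string: keep one randomly chosen word.
        kept = [rng.choice(words)]
    return " ".join(kept)


def make_pairs(sentences: List[str], noise_fn: Callable[[str], str]) -> List[List[str]]:
    """Build [noise_fn(sentence), sentence] pairs, mirroring the texts format."""
    return [[noise_fn(s), s] for s in sentences]


sentences = [
    "the quick brown fox jumps over the lazy dog",
    "denoising auto encoders reconstruct corrupted input",
]
rng = random.Random(0)
pairs = make_pairs(sentences, lambda s: delete_words(s, rng=rng))
for noisy, original in pairs:
    print(repr(noisy), "<-", repr(original))
```

The decoder is then trained to recover the second element of each pair from an encoding of the first.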
NoDuplicatesDataLoader can be used together with MultipleNegativesRankingLoss to ensure that no duplicates are within the same batch.
A special data loader to be used with MultipleNegativesRankingLoss. The data loader ensures that there are no duplicate sentences within the same batch.
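The batching constraint can be sketched as a greedy procedure: an example joins the current batch only if none of its texts already appear there, and is otherwise deferred to a later batch. This is a simplified illustration under stated assumptions, not the library's loader: no_duplicate_batches is a hypothetical name, examples are plain tuples of strings, and the final incomplete batch is kept here, which a real loader may handle differently.

```python
from typing import List, Sequence, Tuple


def no_duplicate_batches(
    examples: Sequence[Tuple[str, ...]], batch_size: int
) -> List[List[Tuple[str, ...]]]:
    """Greedily group examples so that no text repeats within a batch."""
    batches = []
    remaining = list(examples)
    while remaining:
        batch, seen, deferred = [], set(), []
        for texts in remaining:
            fits = len(batch) < batch_size
            if fits and not any(t in seen for t in texts):
                batch.append(texts)
                seen.update(texts)
            else:
                deferred.append(texts)  # try again in a later batch
        batches.append(batch)
        remaining = deferred
    return batches


examples = [("a", "b"), ("a", "c"), ("d", "e"), ("b", "f")]
batches = no_duplicate_batches(examples, batch_size=2)
for batch in batches:
    print(batch)
```

This matters for MultipleNegativesRankingLoss because the other examples in a batch serve as negatives, so a duplicated sentence would be treated as a negative for its own positive pair.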