Loss Overview

Loss Table

Loss functions play a critical role in the performance of your fine-tuned model. Sadly, there is no “one size fits all” loss function. Ideally, this table should help narrow down your choice of loss function(s) by matching them to your data formats.

Note

You can often convert one training data format into another, allowing more loss functions to be viable for your scenario. For example, (input_A, input_B) pairs with class labels can be converted into (anchor, positive, negative) triplets by sampling inputs with the same or different classes.

Legend: Loss functions marked with ★ are commonly recommended default choices.

Inputs	Labels	Appropriate Loss Functions
`single inputs`	`class`	`BatchAllTripletLoss` `BatchHardSoftMarginTripletLoss` `BatchHardTripletLoss` `BatchSemiHardTripletLoss`
`single inputs`	`none`	`ContrastiveTensionLoss` `DenoisingAutoEncoderLoss`
`(anchor, anchor) pairs`	`none`	`ContrastiveTensionLossInBatchNegatives`
`(damaged_input, original_input) pairs`	`none`	`DenoisingAutoEncoderLoss`
`(input_A, input_B) pairs`	`class`	`SoftmaxLoss`
`(anchor, positive) pairs`	`none`	`MultipleNegativesRankingLoss` ★ `CachedMultipleNegativesRankingLoss` ★ `MegaBatchMarginLoss` `GISTEmbedLoss` `CachedGISTEmbedLoss`
`(anchor, positive/negative) pairs`	`1 if positive, 0 if negative`	`ContrastiveLoss` `OnlineContrastiveLoss`
`(input_A, input_B) pairs`	`float similarity score between 0 and 1`	`CoSENTLoss` `AnglELoss` `CosineSimilarityLoss`
`(anchor, positive, negative) triplets`	`none`	`MultipleNegativesRankingLoss` ★ `CachedMultipleNegativesRankingLoss` ★ `TripletLoss` `CachedGISTEmbedLoss` `GISTEmbedLoss`
`(anchor, positive, negative_1, ..., negative_n)`	`none`	`MultipleNegativesRankingLoss` ★ `CachedMultipleNegativesRankingLoss` ★ `CachedGISTEmbedLoss`

Loss modifiers

These loss functions can be seen as loss modifiers: they work on top of standard loss functions, but apply those loss functions in different ways to try and instil useful properties into the trained embedding model.

For example, models trained with MatryoshkaLoss produce embeddings whose size can be truncated without notable losses in performance (see the 🪆 Introduction to Matryoshka Embedding Models blogpost for a full walkthrough), and models trained with AdaptiveLayerLoss still perform well when you remove model layers for faster inference.

Inputs	Labels	Appropriate Loss Functions
`any`	`any`	`MatryoshkaLoss` `AdaptiveLayerLoss` `Matryoshka2dLoss`

Regularization

These losses are designed to regularize the embedding space during training, encouraging certain properties in the learned embeddings. They can often be applied to any dataset configuration.

Inputs	Labels	Appropriate Loss Functions
`any`	`none`	`GlobalOrthogonalRegularizationLoss`

Distillation

These loss functions are specifically designed to be used when distilling the knowledge from one model into another. For example, when finetuning a small model to behave more like a larger & stronger one, or when finetuning a model to become multi-lingual.

Inputs	Labels	Appropriate Loss Functions
`input`	`teacher embeddings`	`EmbedDistillLoss` `MSELoss`
`(input_1, input_2, ..., input_N)`	`teacher embeddings` (broadcast across columns)	`EmbedDistillLoss` `MSELoss`
`(input_1, input_2, ..., input_N)`	`per-column teacher embeddings` (one per input column)	`EmbedDistillLoss` `MSELoss`
`(query, document_one, document_two)`	`gold_sim(query, document_one) - gold_sim(query, document_two)`	`MarginMSELoss`
`(query, positive, negative_1, ..., negative_n)`	`[gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n]`	`MarginMSELoss`
`(query, positive, negative)`	`[gold_sim(query, positive), gold_sim(query, negative)]`	`DistillKLDivLoss` `MarginMSELoss`
`(query, positive, negative_1, ..., negative_n)`	`[gold_sim(query, positive), gold_sim(query, negative_i)...]`	`DistillKLDivLoss` `MarginMSELoss`

Commonly used Loss Functions

In practice, not all loss functions get used equally often. The most common scenarios are:

(anchor, positive) pairs without any labels: MultipleNegativesRankingLoss (a.k.a. InfoNCE or in-batch negatives loss) is commonly used to train the top performing embedding models. This data is often relatively cheap to obtain, and the models are generally very performant. CachedMultipleNegativesRankingLoss is often used to increase the batch size, resulting in superior performance.
(input_A, input_B) pairs with a float similarity score: CosineSimilarityLoss is traditionally used a lot, though more recently CoSENTLoss and AnglELoss are used as drop-in replacements with superior performance.

Custom Loss Functions

Advanced users can create and train with their own loss functions. Custom loss functions only have a few requirements:

They must be a subclass of torch.nn.Module.
They must have model as the first argument in the constructor.
They must implement a forward method that accepts sentence_features and labels. The former is a list of tokenized batches, one element for each column. These tokenized batches can be fed directly to the model being trained to produce embeddings. The latter is an optional tensor of labels. The method must return a single loss value or a dictionary of loss components (component names to loss values) that will be summed to produce the final loss value. When returning a dictionary, the individual components will be logged separately in addition to the summed loss, allowing you to monitor the individual components of the loss.

To get full support with the automatic model card generation, you may also wish to implement:

a get_config_dict method that returns a dictionary of loss parameters.
a citation property so your work gets cited in all models that train with the loss.

Consider inspecting existing loss functions to get a feel for how loss functions are commonly implemented.