Usage

Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

  1. Calculates fixed-size vector representations (embeddings) for texts, images, audio, video, or combinations thereof (depending on the model).

  2. Embedding calculation is often efficient, and embedding similarity calculation is very fast.

  3. Applicable for a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.

  4. Often used as a first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.
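Point 2 above holds because comparing embeddings reduces to simple vector arithmetic, typically cosine similarity. A minimal sketch with toy 3-dimensional vectors (not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
u = [0.1, 0.3, 0.5]
v = [0.2, 0.4, 0.4]

print(round(cosine_similarity(u, v), 4))
# 0.9578
```

For normalized embeddings the norms are 1, so similarity is just a dot product, which is why comparing even millions of embeddings is fast.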

Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
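Each row of the similarity matrix ranks all sentences against one sentence, so small-scale semantic search falls out directly. A sketch using the printed values, hardcoded here for illustration:

```python
# Similarity row for the first sentence, taken from the example output above
row = [1.0000, 0.6660, 0.1046]
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# Rank the other sentences by similarity to sentence 0 (skip index 0, the sentence itself)
ranked = sorted(range(1, len(row)), key=lambda i: row[i], reverse=True)
print(sentences[ranked[0]])
# It's so sunny outside!
```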

Some Sentence Transformer models support inputs beyond text, such as images, audio, or video. You can check which modalities a model supports using the modalities property and the supports() method. The encode() method accepts different input formats depending on the modality:

Tip

Multimodal models require additional dependencies. Install them with e.g. pip install -U "sentence-transformers[image]" for image support. See Installation for all options.

  • Text: strings.

  • Image: PIL images, file paths, URLs, or numpy/torch arrays.

  • Audio: file paths, numpy/torch arrays, dicts with "array" and "sampling_rate" keys, or (if torchcodec is installed) torchcodec.AudioDecoder instances.

  • Video: file paths, numpy/torch arrays, dicts with "array" and "video_metadata" keys, or (if torchcodec is installed) torchcodec.VideoDecoder instances.

  • Multimodal dicts: a dict mapping modality names to values, e.g. {"text": ..., "audio": ...}. The keys must be "text", "image", "audio", or "video".

  • Chat messages: a list of dicts with "role" and "content" keys, for multimodal models that rely on a chat template to combine text and non-text inputs.
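To make the accepted shapes concrete, here is an illustrative sketch of the input structures (no model call is made; the file name, URL, and values are placeholders, and which formats actually work depends on the model's supported modalities):

```python
import numpy as np

# Plain text input
text_input = "A bee on a pink flower"

# Image input: a URL or local path (a PIL image or array would also work)
image_input = "https://example.com/bee.jpg"

# Audio input as a dict with "array" and "sampling_rate" keys
audio_input = {"array": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000}

# Multimodal dict combining modalities; keys are restricted to these names
multimodal_input = {"text": "A buzzing sound", "audio": audio_input["array"]}
assert set(multimodal_input) <= {"text", "image", "audio", "video"}

# Chat-message input with "role" and "content" keys
chat_input = [{"role": "user", "content": "Describe this image."}]
```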

The following example loads a multimodal model and computes similarities between text and image embeddings:

from sentence_transformers import SentenceTransformer

# 1. Load a model that supports both text and images
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# 2. Encode images from URLs
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# 3. Encode text queries (one matching + one hard negative per image)
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

# 4. Compute cross-modal similarities
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
# tensor([[0.5115, 0.1078],
#         [0.1999, 0.1108],
#         [0.1255, 0.6749],
#         [0.1283, 0.2704]])

For retrieval tasks, encode_query() and encode_document() are the recommended methods. Many embedding models use different prompts or instructions for queries vs. documents, and these methods handle that automatically:

  • encode_query() uses the model’s "query" prompt (if available) and sets task="query".

  • encode_document() uses the first available prompt from "document", "passage", or "corpus", and sets task="document".

These methods accept all the same input types as encode() (text, images, URLs, multimodal dicts, etc.) and pass through all the same parameters. For models without specialized query/document prompts, they behave identically to encode().

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# Encode text queries with the query prompt
query_embeddings = model.encode_query([
    "Find me a photo of a vehicle parked near a building",
    "Show me an image of a pollinating insect",
])

# Encode document screenshots with the document prompt
doc_embeddings = model.encode_document([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.3907, 0.1490],
#         [0.1235, 0.4872]])
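Reading the retrieval result off such a query-document similarity matrix is a per-row argmax. A sketch with the values above hardcoded for illustration:

```python
# Query-document similarity matrix from the example output above
similarities = [
    [0.3907, 0.1490],
    [0.1235, 0.4872],
]
documents = ["car.jpg", "bee.jpg"]

# Pick the highest-scoring document for each query
best_docs = [documents[max(range(len(row)), key=row.__getitem__)] for row in similarities]
print(best_docs)
# ['car.jpg', 'bee.jpg']
```

The first query ("vehicle parked near a building") matches the car image, and the second ("pollinating insect") matches the bee image, as expected.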