Usage
Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

- Calculates a fixed-size vector representation (embedding) given texts, images, audio, video, or combinations thereof (depending on the model).
- Calculating embeddings is often efficient, and calculating embedding similarities is very fast.
- Applicable to a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
- Often used as the first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model re-ranks the top-k results from the bi-encoder.
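The two-step retrieval pattern can be sketched without downloading any models. In this minimal illustration, toy numpy vectors stand in for bi-encoder embeddings, and a stand-in scoring function (purely hypothetical, not part of the library) plays the role of the cross-encoder reranker:

```python
import numpy as np

# Toy corpus embeddings standing in for real bi-encoder output
corpus_emb = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.8]])
query_emb = np.array([0.85, 0.2])

# Step 1: bi-encoder retrieval -- rank the whole corpus by cosine similarity
scores = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb)
)
top_k = [int(i) for i in np.argsort(-scores)[:2]]  # keep only the 2 best candidates

# Step 2: reranking -- a cross-encoder would rescore only these top-k
# query/document pairs; this stand-in scorer is purely illustrative
def rerank_score(doc_id: int) -> float:
    return 1.0 if doc_id == 1 else 0.0

reranked = sorted(top_k, key=rerank_score, reverse=True)
print(reranked)  # [1, 0]
```

The point of the pattern is cost: the cheap bi-encoder scores the entire corpus, while the expensive (but more accurate) reranker only sees the handful of candidates that survive stage one.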
Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:
```python
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
```
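By default, `model.similarity()` computes cosine similarity, which is why the diagonal above is exactly 1.0 (every embedding is identical to itself). The same computation can be reproduced by hand; a sketch with toy 2-D vectors standing in for real embeddings:

```python
import numpy as np

# Toy 2-D embeddings standing in for model.encode() output
embeddings = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
])

# Cosine similarity: L2-normalize each row, then take all pairwise dot products
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = normalized @ normalized.T
print(similarities)
# The diagonal is 1.0: each embedding compared against itself
```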
Some Sentence Transformer models support inputs beyond text, such as images, audio, or video. You can check which modalities a model supports using the `modalities` property and the `supports()` method. The `encode()` method accepts different input formats depending on the modality:
Tip
Multimodal models require additional dependencies. Install them with e.g. `pip install -U "sentence-transformers[image]"` for image support. See Installation for all options.
- Text: strings.
- Image: PIL images, file paths, URLs, or numpy/torch arrays.
- Audio: file paths, numpy/torch arrays, dicts with `"array"` and `"sampling_rate"` keys, or (if `torchcodec` is installed) `torchcodec.AudioDecoder` instances.
- Video: file paths, numpy/torch arrays, dicts with `"array"` and `"video_metadata"` keys, or (if `torchcodec` is installed) `torchcodec.VideoDecoder` instances.
- Multimodal dicts: a dict mapping modality names to values, e.g. `{"text": ..., "audio": ...}`. The keys must be `"text"`, `"image"`, `"audio"`, or `"video"`.
- Chat messages: a list of dicts with `"role"` and `"content"` keys, for multimodal models that use an uncommon chat template to combine text and non-text inputs.
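These input shapes are easiest to see as Python literals. A sketch of each form (the values below are placeholders; only the structure matters):

```python
import numpy as np

# Text: a plain string
text_input = "A bee on a pink flower"

# Image: e.g. a file path or URL (a PIL image or array also works)
image_input = "path/to/image.jpg"  # placeholder path

# Audio: a dict with the raw waveform and its sampling rate
audio_input = {
    "array": np.zeros(16000, dtype=np.float32),  # 1 second of silence at 16 kHz
    "sampling_rate": 16000,
}

# Multimodal dict: combine several modalities into a single input
multimodal_input = {"text": text_input, "audio": audio_input}

# Chat messages: role/content dicts for chat-template models
messages = [{"role": "user", "content": "Describe this image."}]
```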
The following example loads a multimodal model and computes similarities between text and image embeddings:
```python
from sentence_transformers import SentenceTransformer

# 1. Load a model that supports both text and images
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# 2. Encode images from URLs
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# 3. Encode text queries (one matching + one hard negative per image)
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

# 4. Compute cross-modal similarities
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
# tensor([[0.5115, 0.1078],
#         [0.1999, 0.1108],
#         [0.1255, 0.6749],
#         [0.1283, 0.2704]])
```
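Reading the matrix: rows are texts and columns are images, so the best-matching image for each text is the argmax over each row. Applying plain numpy to the printed values above:

```python
import numpy as np

# The similarity matrix printed above: rows = texts, columns = images
similarities = np.array([
    [0.5115, 0.1078],  # "A green car parked ..."   (matches the car image)
    [0.1999, 0.1108],  # "A red car driving ..."    (hard negative)
    [0.1255, 0.6749],  # "A bee on a pink flower"   (matches the bee image)
    [0.1283, 0.2704],  # "A wasp on a wooden table" (hard negative)
])

# Best-matching image per text: argmax over each row
best_image = similarities.argmax(axis=1)
print(best_image.tolist())  # [0, 0, 1, 1]

# The true captions score well above their hard negatives
assert similarities[0, 0] > similarities[1, 0]
assert similarities[2, 1] > similarities[3, 1]
```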
For retrieval tasks, `encode_query()` and `encode_document()` are the recommended methods. Many embedding models use different prompts or instructions for queries vs. documents, and these methods handle that automatically:

- `encode_query()` uses the model's `"query"` prompt (if available) and sets `task="query"`.
- `encode_document()` uses the first available prompt from `"document"`, `"passage"`, or `"corpus"`, and sets `task="document"`.

These methods accept all the same input types as `encode()` (text, images, URLs, multimodal dicts, etc.) and pass through all the same parameters. For models without specialized query/document prompts, they behave identically to `encode()`.
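Conceptually, a prompt is just a string prepended to the input text before encoding (the methods additionally pass the `task` value to models that use it). A simplified sketch of that idea; the prompt strings here are hypothetical, since real models ship their own in their configuration:

```python
# Hypothetical prompt strings; real models define their own
prompts = {
    "query": "Represent this query for retrieval: ",
    "document": "Represent this document for retrieval: ",
}

def apply_prompt(texts: list[str], prompt_name: str) -> list[str]:
    """Prepend the named prompt to each text (no-op if the prompt is missing)."""
    prompt = prompts.get(prompt_name, "")
    return [prompt + text for text in texts]

print(apply_prompt(["pollinating insect"], "query"))
# ['Represent this query for retrieval: pollinating insect']
```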
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# Encode text queries with the query prompt
query_embeddings = model.encode_query([
    "Find me a photo of a vehicle parked near a building",
    "Show me an image of a pollinating insect",
])

# Encode document screenshots with the document prompt
doc_embeddings = model.encode_document([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.3907, 0.1490],
#         [0.1235, 0.4872]])
```