SparseEncoder

class sentence_transformers.sparse_encoder.model.SparseEncoder(model_name_or_path: str | None = None, *, modules: list[Module] | None = None, device: str | None = None, prompts: dict[str, str] | None = None, default_prompt_name: str | None = None, cache_folder: str | None = None, trust_remote_code: bool = False, revision: str | None = None, local_files_only: bool = False, token: bool | str | None = None, model_kwargs: dict[str, Any] | None = None, processor_kwargs: dict[str, Any] | None = None, config_kwargs: dict[str, Any] | None = None, model_card_data: SparseEncoderModelCardData | None = None, backend: Literal['torch', 'onnx', 'openvino'] = 'torch', similarity_fn_name: str | SimilarityFunction | None = None, max_active_dims: int | None = None)[source]

Loads or creates a SparseEncoder model that can be used to map text to sparse embeddings.

Parameters:
  • model_name_or_path (str, optional) – If a filepath on disk, loads the model from that path. Otherwise, tries to download a pre-trained SparseEncoder model. If that fails, tries to construct a model from the Hugging Face Hub with that name. Defaults to None.

  • modules (list[nn.Module], optional) – A list of torch modules that are called sequentially. Can be used to create custom SparseEncoder models from scratch. Defaults to None.

  • device (str, optional) – Device (like "cuda", "cpu", "mps", "npu") that should be used for computation. If None, checks if a GPU can be used. Defaults to None.

  • prompts (dict[str, str], optional) – A dictionary with prompts for the model. The key is the prompt name, the value is the prompt text. The prompt text will be prepended before any text to encode. For example: {"query": "query: ", "passage": "passage: "}. If a model has saved prompts, you can override them by passing your own, or pass {"query": "", "document": ""} to disable them. Defaults to None.

  • default_prompt_name (str, optional) – The name of the prompt that should be used by default. If not set, no prompt will be applied. Defaults to None.

  • cache_folder (str, optional) – Path to store models. Can also be set by the SENTENCE_TRANSFORMERS_HOME environment variable. Defaults to None.

  • trust_remote_code (bool, optional) – Whether to allow for custom models defined on the Hub in their own modeling files. Only set to True for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine. Defaults to False.

  • revision (str, optional) – The specific model version to use. It can be a branch name, a tag name, or a commit id, for a stored model on Hugging Face. Defaults to None.

  • local_files_only (bool, optional) – Whether to only look at local files (i.e., do not try to download the model). Defaults to False.

  • token (bool or str, optional) – Hugging Face authentication token to download private models. Defaults to None.

  • model_kwargs (dict[str, Any], optional) –

    Keyword arguments passed to the underlying Hugging Face Transformers model via AutoModel.from_pretrained. Particularly useful options include:

    • torch_dtype: Override the default torch.dtype and load the model under a specific dtype. Can be torch.float16, torch.bfloat16, torch.float32, or "auto" to use the dtype from the model’s config.json.

    • attn_implementation: The attention implementation to use. For example "eager", "sdpa", or "flash_attention_2". If you pip install kernels, then "flash_attention_2" should work without having to install flash_attn. It is frequently the fastest option. Defaults to "sdpa" when available (torch>=2.1.1).

    • device_map: Device map for model parallelism, e.g. "auto".

    • provider: For backend="onnx", the ONNX execution provider (e.g. "CUDAExecutionProvider").

    • file_name: For backend="onnx" or "openvino", the filename to load (e.g. for optimized or quantized models).

    • export: For backend="onnx" or "openvino", whether to export the model to the backend format. Also set automatically if the exported file doesn’t exist.

    See the PreTrainedModel.from_pretrained documentation for more details. Defaults to None.

  • processor_kwargs (dict[str, Any], optional) – Keyword arguments passed to the Hugging Face Transformers processor/tokenizer via AutoProcessor.from_pretrained. See the AutoTokenizer.from_pretrained documentation for more details. Defaults to None.

  • config_kwargs (dict[str, Any], optional) – Keyword arguments passed to the Hugging Face Transformers config via AutoConfig.from_pretrained. See the AutoConfig.from_pretrained documentation for more details. Defaults to None.

  • model_card_data (SparseEncoderModelCardData, optional) – A model card data object that contains information about the model. Used to generate a model card when saving the model. If not set, a default model card data object is created. Defaults to None.

  • backend (str, optional) – The backend to use for inference. Can be "torch" (default), "onnx", or "openvino". Defaults to "torch".

  • similarity_fn_name (str or SimilarityFunction, optional) – The name of the similarity function to use. Valid options are "cosine", "dot", "euclidean", and "manhattan". If not set, it is automatically set to "cosine" when similarity or similarity_pairwise are first accessed. Defaults to None.

  • max_active_dims (int, optional) – The maximum number of active (non-zero) dimensions in the output of the model. None means no limit, which can be slow or memory-intensive if your model wasn’t (yet) finetuned to high sparsity. Defaults to None.

Example

from sentence_transformers import SparseEncoder

# Load a pre-trained SparseEncoder model
model = SparseEncoder('naver/splade-cocondenser-ensembledistil')

# Encode some texts
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)

# Get the similarity scores between all sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[   35.629,     9.154,     0.098],
#         [    9.154,    27.478,     0.019],
#         [    0.098,     0.019,    29.553]])


active_adapters() list[str][source]

If you are not familiar with adapters and PEFT methods, we invite you to read more about them on the PEFT official documentation: https://huggingface.co/docs/peft

Gets the current active adapters of the model. In case of multi-adapter inference (combining multiple adapters for inference) returns the list of all active adapters so that users can deal with them accordingly.

With older PEFT versions (which do not support multi-adapter inference), module.active_adapter will return a single string.

add_adapter(*args, **kwargs) None[source]

Adds a fresh adapter to the current model for training purposes. If no adapter name is passed, a default name is assigned to the adapter, following the PEFT library convention (PEFT uses “default” as the default adapter name).

Requires peft as a backend to load the adapter weights and the underlying model to be compatible with PEFT.

bfloat16() T

Casts all floating point parameters and buffers to bfloat16 datatype.

Note

This method modifies the module in-place.

Returns:

self

Return type:

Module

compile(*args, **kwargs)

Compile this Module’s forward using torch.compile().

This Module’s __call__ method is compiled and all arguments are passed as-is to torch.compile().

See torch.compile() for details on the arguments for this function.

cpu() T

Moves all model parameters and buffers to the CPU.

Note

This method modifies the module in-place.

Returns:

self

Return type:

Module

cuda(device: int | device | None = None) T

Moves all model parameters and buffers to the GPU.

This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.

Note

This method modifies the module in-place.

Parameters:

device (int, optional) – if specified, all parameters will be copied to that device

Returns:

self

Return type:

Module

decode(embeddings: Tensor, top_k: int | None = None) list[tuple[str, float]] | list[list[tuple[str, float]]][source]

Decode a sparse embedding into (token, weight) pairs sorted by descending weight.

Parameters:
  • embeddings (torch.Tensor) – Sparse embedding tensor of shape (vocab_size,) for a single embedding or (batch_size, vocab_size) for a batch.

  • top_k (int, optional) – Maximum number of top-weighted tokens to return per sample. If None, all non-zero tokens are returned. Must be positive. Defaults to None.

Returns:

If the input is 1D, a list of (token, weight) tuples. If the input is 2D, a list (one per sample) of lists of (token, weight) tuples.

Return type:

list[tuple[str, float]] | list[list[tuple[str, float]]]

delete_adapter(*args, **kwargs) None[source]

If you are not familiar with adapters and PEFT methods, we invite you to read more about them on the PEFT official documentation: https://huggingface.co/docs/peft

Delete an adapter’s LoRA layers from the underlying model.

property device: device

Get torch.device from module, assuming that the whole module has one device. In case there are no PyTorch parameters, fall back to CPU.

disable_adapters() None[source]

Disable all adapters that are attached to the model. This leads to inferring with the base model only.

double() T

Casts all floating point parameters and buffers to double datatype.

Note

This method modifies the module in-place.

Returns:

self

Return type:

Module

property dtype: dtype | None

The dtype of the module (assuming that all the module parameters have the same dtype).

Type:

torch.dtype

enable_adapters() None[source]

Enable adapters that are attached to the model. The model will use the adapters returned by self.active_adapters().

encode(inputs: list[str] | str, prompt_name: str | None = None, prompt: str | None = None, batch_size: int = 32, show_progress_bar: bool | None = None, convert_to_tensor: bool = True, convert_to_sparse_tensor: bool = True, save_to_cpu: bool = False, device: str | device | list[str | device] | None = None, max_active_dims: int | None = None, pool: dict[Literal['input', 'output', 'processes'], Any] | None = None, chunk_size: int | None = None, **kwargs: Any) list[Tensor] | Tensor[source]

Computes sparse sentence embeddings.

Tip

If you are unsure whether you should use encode(), encode_query(), or encode_document(), your best bet is to use encode_query() and encode_document() for Information Retrieval tasks with clear query and document/passage distinction, and use encode() for all other tasks.

Note that encode() is the most general method and can be used for any task, including Information Retrieval, and that if the model was not trained with predefined prompts and/or task types, then all three methods will return identical embeddings.

Tip

Adjusting batch_size can significantly improve processing speed. The optimal value depends on your hardware, model size, precision, and input length. Benchmark a few batch sizes on a small subset of your data to find the best value.

Parameters:
  • inputs (Union[str, List[str]]) – The texts to embed.

  • prompt_name (str, optional) – The name of the prompt to use for encoding. Must be a key in the prompts dictionary, which is either set in the constructor or loaded from the model configuration. For example, if prompt_name is “query” and the prompts dictionary is {“query”: “query: “, …}, then the sentence “What is the capital of France?” will be encoded as “query: What is the capital of France?” because the sentence is appended to the prompt. If prompt is also set, this argument is ignored. Defaults to None.

  • prompt (str, optional) – The prompt to use for encoding. For example, if the prompt is “query: “, then the sentence “What is the capital of France?” will be encoded as “query: What is the capital of France?” because the sentence is appended to the prompt. If prompt is set, prompt_name is ignored. Defaults to None.

  • batch_size (int, optional) – The batch size used for the computation. Defaults to 32.

  • show_progress_bar (bool, optional) – Whether to output a progress bar when encoding. Defaults to None, in which case the progress bar will be shown if the logger’s effective level is INFO or DEBUG.

  • convert_to_tensor (bool, optional) – Whether the output should be a single stacked tensor (True) or a list of individual tensors (False). Sparse tensors may be challenging to slice, so this allows you to output lists of tensors instead. Defaults to True.

  • convert_to_sparse_tensor (bool, optional) – Whether the output should be in the format of a sparse (COO) tensor. Defaults to True.

  • save_to_cpu (bool, optional) – Whether the output should be moved to the CPU or stay on the device it was computed on. Defaults to False.

  • device (str, torch.device, list, or None, optional) –

    Device(s) to use for computation. Can be:

    • A single device string (e.g., “cuda:0”, “cpu”) for single-process encoding

    • A list of device strings (e.g., [“cuda:0”, “cuda:1”], [“cpu”, “cpu”, “cpu”, “cpu”]) to distribute encoding across multiple processes

    • None to auto-detect available device for single-process encoding

    If a list is provided, multi-process encoding will be used. Defaults to None.

  • max_active_dims (int, optional) – The maximum number of active (non-zero) dimensions in the output of the model. None means the value from the model’s config is used; if that is also None, there is no limit on the number of active dimensions, which can be slow or memory-intensive if your model wasn’t (yet) finetuned to high sparsity. Defaults to None.

  • pool (dict, optional) – A pool created by start_multi_process_pool() for multi-process encoding. If provided, the encoding will be distributed across multiple processes. This is recommended for large datasets and when multiple GPUs are available. Defaults to None.

  • chunk_size (int, optional) – Size of chunks for multi-process encoding. Only used with multiprocessing, i.e. when pool is not None or device is a list. If None, a sensible default is calculated. Defaults to None.

Returns:

By default, a 2d torch sparse tensor with shape [num_inputs, output_dimension] is returned. If only one string input is provided, then the output is a 1d tensor with shape [output_dimension]. If convert_to_tensor is False, a list of individual tensors is returned instead.

Return type:

Union[list[Tensor], Tensor]

Example

from sentence_transformers import SparseEncoder

# Load a pre-trained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Encode some texts
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 30522)
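The max_active_dims cap amounts to keeping only the largest values per embedding and zeroing the rest. A minimal torch sketch of that top-k truncation (illustrative of the idea, not the library's internal code):

```python
import torch

def limit_active_dims(dense: torch.Tensor, max_active_dims: int) -> torch.Tensor:
    """Zero out all but the `max_active_dims` largest values in each row."""
    values, indices = torch.topk(dense, k=max_active_dims, dim=-1)
    out = torch.zeros_like(dense)
    out.scatter_(-1, indices, values)  # write the kept values back at their positions
    return out

row = torch.tensor([[0.1, 3.0, 0.0, 2.0, 0.5]])
limited = limit_active_dims(row, max_active_dims=2)
print(limited)                       # only the two largest values (3.0 and 2.0) survive
print((limited != 0).sum().item())   # 2
```

Lower values make the embeddings cheaper to store and search at the cost of some information.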
encode_document(inputs: list[str] | str, prompt_name: str | None = None, prompt: str | None = None, batch_size: int = 32, show_progress_bar: bool | None = None, convert_to_tensor: bool = True, convert_to_sparse_tensor: bool = True, save_to_cpu: bool = False, device: str | device | list[str | device] | None = None, max_active_dims: int | None = None, pool: dict[Literal['input', 'output', 'processes'], Any] | None = None, chunk_size: int | None = None, **kwargs: Any) list[Tensor] | Tensor[source]

Computes embeddings specifically optimized for document/passage representation.

This method is a specialized version of encode() that differs in exactly two ways:

  1. If no prompt_name or prompt is provided, it uses the first available prompt from the following candidates: "document", "passage", "corpus" (checked in that order).

  2. It sets the task to “document”. If the model has a Router module, it will use the “document” task type to route the input through the appropriate submodules.

Tip

Adjusting batch_size can significantly improve processing speed. The optimal value depends on your hardware, model size, precision, and input length. Benchmark a few batch sizes on a small subset of your data to find the best value.

All other parameters are identical to encode(). See encode() for the full parameter documentation.

Example

from sentence_transformers import SparseEncoder

# Load a pre-trained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Encode some texts
sentences = [
    "This research paper discusses the effects of climate change on marine life.",
    "The article explores the history of artificial intelligence development.",
    "This document contains technical specifications for the new product line.",
]
embeddings = model.encode_document(sentences)
print(embeddings.shape)
# (3, 30522)
encode_query(inputs: list[str] | str, prompt_name: str | None = None, prompt: str | None = None, batch_size: int = 32, show_progress_bar: bool | None = None, convert_to_tensor: bool = True, convert_to_sparse_tensor: bool = True, save_to_cpu: bool = False, device: str | device | list[str | device] | None = None, max_active_dims: int | None = None, pool: dict[Literal['input', 'output', 'processes'], Any] | None = None, chunk_size: int | None = None, **kwargs: Any) list[Tensor] | Tensor[source]

Computes embeddings specifically optimized for query representation.

This method is a specialized version of encode() that differs in exactly two ways:

  1. If no prompt_name or prompt is provided, it uses a predefined “query” prompt, if available in the model’s prompts dictionary.

  2. It sets the task to “query”. If the model has a Router module, it will use the “query” task type to route the input through the appropriate submodules.

Tip

Adjusting batch_size can significantly improve processing speed. The optimal value depends on your hardware, model size, precision, and input length. Benchmark a few batch sizes on a small subset of your data to find the best value.

All other parameters are identical to encode(). See encode() for the full parameter documentation.

Example

from sentence_transformers import SparseEncoder

# Load a pre-trained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Encode some texts
queries = [
    "What are the effects of climate change?",
    "History of artificial intelligence",
    "Technical specifications product XYZ",
]
embeddings = model.encode_query(queries)
print(embeddings.shape)
# (3, 30522)
eval() T

Sets the module in evaluation mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

This is equivalent to self.train(False).

See Locally disabling gradient computation for a comparison between .eval() and several similar mechanisms that may be confused with it.

Returns:

self

Return type:

Module

evaluate(evaluator: BaseEvaluator, output_path: str | None = None) dict[str, float] | float[source]

Evaluate the model based on an evaluator.

Parameters:
  • evaluator (BaseEvaluator) – The evaluator used to evaluate the model.

  • output_path (str, optional) – The path where the evaluator can write the results. Defaults to None.

Returns:

The evaluation results.

float() T

Casts all floating point parameters and buffers to float datatype.

Note

This method modifies the module in-place.

Returns:

self

Return type:

Module

get_adapter_state_dict(*args, **kwargs) dict[source]

If you are not familiar with adapters and PEFT methods, we invite you to read more about them on the PEFT official documentation: https://huggingface.co/docs/peft

Gets the adapter state dict, which should only contain the weight tensors of the adapter specified by adapter_name. If no adapter_name is passed, the active adapter is used.

get_backend() Literal['torch', 'onnx', 'openvino'][source]

Return the backend used for inference, which can be one of “torch”, “onnx”, or “openvino”.

Returns:

The backend used for inference.

Return type:

str

get_embedding_dimension() int | None[source]

Returns the number of dimensions in the output of SparseEncoder.encode().

Unlike SentenceTransformer, sparse encoders do not support truncate_dim, so this returns the raw output dimension from the last module in the pipeline.

Returns:

The number of dimensions in the output of encode. If it’s not known, it’s None.

Return type:

int or None

get_max_seq_length() int | None[source]

Deprecated: use the max_seq_length property instead.

Returns the maximal sequence length that the first module of the model accepts. Longer inputs will be truncated.

Returns:

The maximal sequence length that the model accepts, or None if it is not defined.

Return type:

Optional[int]

get_model_kwargs() list[str][source]

Get the keyword arguments specific to this model for inference methods like encode or predict.

Example

>>> from sentence_transformers import SentenceTransformer, SparseEncoder
>>> SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").get_model_kwargs()
[]
>>> SentenceTransformer("jinaai/jina-embeddings-v4", trust_remote_code=True).get_model_kwargs()
['task', 'truncate_dim']
>>> SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill").get_model_kwargs()
['task']
Returns:

A list of keyword arguments for the forward pass.

Return type:

list[str]

gradient_checkpointing_enable(gradient_checkpointing_kwargs: dict[str, Any] | None = None) None[source]

Enable gradient checkpointing for the model.

half() T

Casts all floating point parameters and buffers to half datatype.

Note

This method modifies the module in-place.

Returns:

self

Return type:

Module

static intersection(embeddings_1: Tensor, embeddings_2: Tensor) Tensor[source]

Compute the intersection of two sparse embeddings via element-wise multiplication.

For each dimension, the result retains the minimum contribution from both embeddings, keeping only dimensions where both inputs are positive (i.e., shared active dimensions). This is useful for token-level matching and interpretability when combined with decode().

Parameters:
  • embeddings_1 (torch.Tensor) – First embedding tensor of shape (vocab_size,).

  • embeddings_2 (torch.Tensor) – Second embedding tensor of shape (vocab_size,) or (batch_size, vocab_size).

Returns:

Sparse intersection tensor with the same shape as embeddings_2.

Return type:

torch.Tensor

Example

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
query_emb = model.encode_query("What is AI?")
doc_emb = model.encode_document("Artificial intelligence is a branch of computer science.")
shared = model.intersection(query_emb, doc_emb)
print(model.decode(shared, top_k=5))
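The shared-dimension behavior can be illustrated with toy dense tensors. A sketch assuming the element-wise multiplication described above (for non-negative sparse embeddings, multiplication keeps exactly the dimensions active in both inputs; this is an illustration, not necessarily the library's exact implementation):

```python
import torch

# Two toy embeddings as dense rows; only dimension 1 is active in both.
query_emb = torch.tensor([0.0, 2.0, 1.5, 0.0])
doc_emb = torch.tensor([0.0, 3.0, 0.0, 0.8])

shared = query_emb * doc_emb  # element-wise product keeps only shared active dims
print(shared)                                          # tensor([0., 6., 0., 0.])
print((shared != 0).nonzero(as_tuple=True)[0].tolist())  # [1]
```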
is_singular_input(inputs: Any) bool[source]

Check if the input represents a single example or a batch of examples.

Parameters:

inputs – The input to check.

Returns:

True if the input is a single example, False if it is a batch.

Return type:

bool

load_adapter(*args, **kwargs) None[source]

Load adapter weights from a file or a remote Hub folder. If you are not familiar with adapters and PEFT methods, we invite you to read more about them in the PEFT official documentation: https://huggingface.co/docs/peft

Requires peft as a backend to load the adapter weights and the underlying model to be compatible with PEFT.

property max_seq_length: int

Returns the maximal input sequence length for the model. Longer inputs will be truncated.

Returns:

The maximal input sequence length.

Return type:

int

Example

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
print(model.max_seq_length)
# => 512
property modalities: list[Literal['text', 'image', 'audio', 'video', 'message'] | tuple[Literal['text', 'image', 'audio', 'video'], ...]]

Return the list of modalities supported by this model, e.g. ["text"] or ["text", "image", "message"].

model_card_data_class[source]

alias of SparseEncoderModelCardData

preprocess(inputs: list[str | Image | ndarray | Tensor | AudioDict | None | VideoDict | MessageDict | list[MessageDict] | dict[Literal['text', 'image', 'audio', 'video'], str | Image | ndarray | Tensor | AudioDict | None | VideoDict] | tuple[str | Image | ndarray | Tensor | AudioDict | None | VideoDict | dict[Literal['text', 'image', 'audio', 'video'], str | Image | ndarray | Tensor | AudioDict | None | VideoDict], str | Image | ndarray | Tensor | AudioDict | None | VideoDict | dict[Literal['text', 'image', 'audio', 'video'], str | Image | ndarray | Tensor | AudioDict | None | VideoDict]] | list[str | Image | ndarray | Tensor | AudioDict | None | VideoDict | dict[Literal['text', 'image', 'audio', 'video'], str | Image | ndarray | Tensor | AudioDict | None | VideoDict]]], prompt: str | None = None, **kwargs) dict[str, Tensor | Any][source]

Preprocesses the inputs for the model.

Parameters:
  • inputs (list[SingleInput | PairInput]) – A list of inputs to be preprocessed. Each input can be a string, dict, tuple, PIL Image, numpy array, torch Tensor, or other supported modality. If a single input is provided, it must be wrapped in a list.

  • prompt (str, optional) – A prompt string to prepend to text inputs. Defaults to None. If the model supports the message modality, the prompt will be added as a system message to the input messages instead of being prepended to text.

Returns:

A dictionary of tensors with the preprocessed inputs.

Return type:

dict[str, Tensor | Any]

property processor: Any

Property to get the processor that is used by this model.

push_to_hub(repo_id: str, token: str | None = None, private: bool | None = None, safe_serialization: bool = True, commit_message: str | None = None, local_model_path: str | None = None, exist_ok: bool = False, replace_model_card: bool = False, train_datasets: list[str] | None = None, revision: str | None = None, create_pr: bool = False) str[source]

Uploads all elements of this model to a Hugging Face Hub repository, creating it if it doesn’t exist.

Parameters:
  • repo_id (str) – Repository name for your model in the Hub, including the user or organization.

  • token (str, optional) – An authentication token (see https://huggingface.co/settings/token).

  • private (bool, optional) – Set to True to host a private model.

  • safe_serialization (bool, optional) – If True, save the model using safetensors. If False, save the model the traditional PyTorch way.

  • commit_message (str, optional) – Message to commit while pushing.

  • local_model_path (str, optional) – Path of the model locally. If set, this file path will be uploaded. Otherwise, the current model will be uploaded.

  • exist_ok (bool, optional) – If True, saving to an existing repository is OK. If False, saving is only possible to a new repository.

  • replace_model_card (bool, optional) – If True, replace an existing model card on the Hub with the automatically created model card. If False (default), keep the existing model card if one exists in the repository.

  • train_datasets (list[str], optional) – Datasets used to train the model. If set, the datasets will be added to the model card on the Hub.

  • revision (str, optional) – Branch to push the uploaded files to.

  • create_pr (bool, optional) – If True, create a pull request instead of pushing directly to the main branch.

Returns:

The url of the commit of your model in the repository on the Hugging Face Hub.

Return type:

str

save_pretrained(path: str, model_name: str | None = None, create_model_card: bool = True, train_datasets: list[str] | None = None, safe_serialization: bool = True) None[source]

Saves a model and its configuration files to a directory, so that it can be loaded again.

Parameters:
  • path (str) – Path on disk where the model will be saved.

  • model_name (str, optional) – Optional model name.

  • create_model_card (bool, optional) – If True, create a README.md with basic information about this model.

  • train_datasets (List[str], optional) – Optional list with the names of the datasets used to train the model.

  • safe_serialization (bool, optional) – If True, save the model using safetensors. If False, save the model the traditional (but unsafe) PyTorch way.

set_adapter(*args, **kwargs) None[source]

Sets a specific adapter by forcing the model to use that adapter and disabling the other adapters.
set_pooling_include_prompt(include_prompt: bool) None[source]

Sets the include_prompt attribute in the pooling layer in the model, if there is one.

This is useful for models where the prompt should be excluded from the pooling strategy, e.g. CSR models with a Pooling layer.

property similarity: Callable[[Tensor | ndarray[Any, dtype[float32]], Tensor | ndarray[Any, dtype[float32]]], Tensor]

Compute the similarity between two collections of embeddings. The output will be a matrix with the similarity scores between all embeddings from the first parameter and all embeddings from the second parameter. This differs from similarity_pairwise which computes the similarity between each pair of embeddings. This method supports only embeddings with fp32 precision and does not accommodate quantized embeddings.

Parameters:
  • embeddings1 (Union[Tensor, ndarray]) – [num_embeddings_1, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.

  • embeddings2 (Union[Tensor, ndarray]) – [num_embeddings_2, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.

Returns:

A [num_embeddings_1, num_embeddings_2]-shaped torch tensor with similarity scores.

Return type:

Tensor

Example

>>> model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
>>> sentences = [
...     "The weather is so nice!",
...     "It's so sunny outside.",
...     "He's driving to the movie theater.",
...     "She's going to the cinema.",
... ]
>>> embeddings = model.encode(sentences)
>>> model.similarity(embeddings, embeddings)
tensor([[   30.953,    12.871,     0.000,     0.011],
        [   12.871,    27.505,     0.580,     0.578],
        [    0.000,     0.580,    36.068,    15.301],
        [    0.011,     0.578,    15.301,    39.466]])
>>> model.similarity_fn_name
"dot"
>>> model.similarity_fn_name = "cosine"
>>> model.similarity(embeddings, embeddings)
tensor([[    1.000,     0.441,     0.000,     0.000],
        [    0.441,     1.000,     0.018,     0.018],
        [    0.000,     0.018,     1.000,     0.406],
        [    0.000,     0.018,     0.406,     1.000]])
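The "dot" scores above are plain inner products over the sparse vectors. A minimal pure-Python sketch of the same computation with toy vectors (no model required; the values are illustrative, not from a real encoder):

```python
# Toy sparse embeddings (most entries zero); with similarity_fn_name "dot",
# the score matrix is every pairwise inner product.
embeddings = [
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

scores = [[dot(a, b) for b in embeddings] for a in embeddings]
print(scores)
# => [[5.0, 6.0, 0.0], [6.0, 9.0, 0.0], [0.0, 0.0, 16.0]]
```

Note how vectors with disjoint active dimensions score exactly zero, mirroring the 0.000 entries in the tensor above.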
property similarity_fn_name: Literal['cosine', 'dot', 'euclidean', 'manhattan']

Return the name of the similarity function used by SparseEncoder.similarity() and SparseEncoder.similarity_pairwise().

Returns:

The name of the similarity function.

Defaults to “dot” when first accessed if not explicitly set.

Return type:

Literal[“cosine”, “dot”, “euclidean”, “manhattan”]

Example

>>> model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
>>> model.similarity_fn_name
'dot'
property similarity_pairwise: Callable[[Tensor | ndarray[Any, dtype[float32]], Tensor | ndarray[Any, dtype[float32]]], Tensor]

Compute the similarity between two collections of embeddings. The output will be a vector with the similarity scores between each pair of embeddings. This method supports only embeddings with fp32 precision and does not accommodate quantized embeddings.

Parameters:
  • embeddings1 (Union[Tensor, ndarray]) – [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.

  • embeddings2 (Union[Tensor, ndarray]) – [num_embeddings, embedding_dim] or [embedding_dim]-shaped numpy array or torch tensor.

Returns:

A [num_embeddings]-shaped torch tensor with pairwise similarity scores.

Return type:

Tensor

Example

>>> model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
>>> sentences = [
...     "The weather is so nice!",
...     "It's so sunny outside.",
...     "He's driving to the movie theater.",
...     "She's going to the cinema.",
... ]
>>> embeddings = model.encode(sentences, convert_to_sparse_tensor=False)
>>> model.similarity_pairwise(embeddings[::2], embeddings[1::2])
tensor([12.871, 15.301])
>>> model.similarity_fn_name
"dot"
>>> model.similarity_fn_name = "cosine"
>>> model.similarity_pairwise(embeddings[::2], embeddings[1::2])
tensor([0.441, 0.406])
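Unlike similarity, similarity_pairwise scores the i-th embedding of the first collection only against the i-th embedding of the second, i.e. the diagonal of the full score matrix. A toy sketch with the "dot" function (illustrative values, no model required):

```python
# Pairwise "dot" similarity: score each row of a against the matching row of b
a = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
b = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

pairwise = [sum(x * y for x, y in zip(ra, rb)) for ra, rb in zip(a, b)]
print(pairwise)
# => [4.0, 3.0]
```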
static sparsity(embeddings: Tensor) dict[str, float][source]

Calculate sparsity statistics for the given embeddings, including the mean number of active (non-zero) dimensions and the mean sparsity ratio.

For a single embedding (1D), the values are for that embedding directly. For a batch of embeddings (2D), they are averaged across the batch.

Parameters:

embeddings (torch.Tensor) – The embeddings to analyze. Must be a 1D or 2D tensor.

Returns:

Dictionary with "active_dims" (mean number of active dimensions) and "sparsity_ratio" (mean sparsity ratio).

Return type:

dict[str, float]

Example

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
embeddings = model.encode(["The weather is so nice!", "It's so sunny outside."])
stats = model.sparsity(embeddings)
print(stats)
# => {'active_dims': 44.0, 'sparsity_ratio': 0.9985584020614624}
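Both statistics can be reproduced by hand. A sketch with a toy batch as plain lists, assuming active_dims counts non-zero entries per embedding and sparsity_ratio is one minus the active fraction:

```python
# Toy batch of two sparse embeddings with 5 dimensions each
embeddings = [
    [0.0, 1.2, 0.0, 0.0, 3.4],  # 2 active (non-zero) dimensions
    [0.5, 0.0, 0.0, 0.0, 0.0],  # 1 active dimension
]

dim = len(embeddings[0])
active_per_row = [sum(1 for v in row if v != 0.0) for row in embeddings]
active_dims = sum(active_per_row) / len(embeddings)  # mean: (2 + 1) / 2 = 1.5
sparsity_ratio = 1.0 - active_dims / dim             # 1 - 1.5/5 = 0.7
print({"active_dims": active_dims, "sparsity_ratio": sparsity_ratio})
```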
property splade_pooling_chunk_size: int | None

Returns the chunk size of the SpladePooling module, if present.

This chunk size is along the sequence length dimension (i.e., number of tokens per chunk). If None, processes the entire sequence at once. Using smaller chunks reduces memory usage but may lower training and inference speed. Default is None.

This property is only meaningful for SPLADE-architecture models. For CSR-architecture models (Transformer + Pooling + SparseAutoEncoder), it returns None.

Returns:

The chunk size, or None if SpladePooling is not found or chunk_size is not set.

Return type:

int or None
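Chunking is possible because SPLADE's default max pooling over the sequence dimension can be computed per chunk and then combined without changing the result. A toy sketch with plain lists (hypothetical shapes, not the library's internals):

```python
# 4 token logit rows x 3 vocab dims; SpladePooling's default strategy is max
logits = [
    [0.1, 0.9, 0.0],
    [0.7, 0.2, 0.0],
    [0.3, 0.8, 0.5],
    [0.5, 0.4, 0.0],
]
chunk_size = 2

def max_pool(rows):
    # Column-wise maximum over the sequence (token) dimension
    return [max(col) for col in zip(*rows)]

full = max_pool(logits)
chunks = [logits[i : i + chunk_size] for i in range(0, len(logits), chunk_size)]
chunked = max_pool([max_pool(chunk) for chunk in chunks])
assert full == chunked  # only chunk_size rows need be materialized at once
print(full)
# => [0.7, 0.9, 0.5]
```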

start_multi_process_pool(target_devices: list[str] | None = None) dict[Literal['input', 'output', 'processes'], Any][source]

Starts a multi-process pool to infer with several independent processes.

This method is recommended if you want to encode on multiple GPUs or CPUs. It is advised to start only one process per GPU. This method works together with encode and stop_multi_process_pool.

Parameters:

target_devices (List[str], optional) – PyTorch target devices, e.g. [“cuda:0”, “cuda:1”, …], [“npu:0”, “npu:1”, …], or [“cpu”, “cpu”, “cpu”, “cpu”]. If target_devices is None and CUDA/NPU is available, then all available CUDA/NPU devices will be used. If target_devices is None and CUDA/NPU is not available, then 4 CPU devices will be used.

Returns:

A dictionary with the target processes, an input queue, and an output queue.

Return type:

Dict[str, Any]

static stop_multi_process_pool(pool: dict[Literal['input', 'output', 'processes'], Any]) None[source]

Stops all processes started with start_multi_process_pool.

Parameters:

pool (Dict[str, object]) – A dictionary containing the input queue, output queue, and process list.

Returns:

None

supports(modality: Literal['text', 'image', 'audio', 'video', 'message'] | tuple[Literal['text', 'image', 'audio', 'video'], ...]) bool[source]

Check if the model supports the given modality.

A modality is supported if:

  1. It is directly listed in modalities (including tuple modalities that are explicitly listed), or

  2. It is a tuple of modalities (e.g. ("image", "text")) where each part is individually supported and the model also supports "message" format, which is used to combine multiple modalities into a single input.

Parameters:

modality – A single modality string (e.g. "text", "image") or a tuple of modality strings (e.g. ("image", "text")).

Returns:

Whether the model supports the given modality.

Return type:

bool

Example:

>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
>>> model.supports("text")
True
>>> model.supports("image")
False
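The two-part rule can be sketched as a hypothetical re-implementation (the function and set below are illustrative, not the library's actual internals):

```python
def supports(supported: set, modality) -> bool:
    # Rule 1: the modality (a string, or a tuple listed explicitly) is
    # directly present in the supported set
    if modality in supported:
        return True
    # Rule 2: a tuple is supported when every part is individually
    # supported and the model also handles the combined "message" format
    if isinstance(modality, tuple):
        return "message" in supported and all(m in supported for m in modality)
    return False

print(supports({"text"}, "text"))                                  # => True
print(supports({"text", "image"}, ("image", "text")))              # => False
print(supports({"text", "image", "message"}, ("image", "text")))   # => True
```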
to(*args, **kwargs)

Moves and/or casts the parameters and buffers.

This can be called as

to(device=None, dtype=None, non_blocking=False)
to(dtype, non_blocking=False)
to(tensor, non_blocking=False)
to(memory_format=torch.channels_last)

Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices.

See below for examples.

Note

This method modifies the module in-place.

Parameters:
  • device (torch.device) – the desired device of the parameters and buffers in this module

  • dtype (torch.dtype) – the desired floating point or complex dtype of the parameters and buffers in this module

  • tensor (torch.Tensor) – Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module

  • memory_format (torch.memory_format) – the desired memory format for 4D parameters and buffers in this module (keyword only argument)

Returns:

self

Return type:

Module

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)

>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
tokenize(texts: list[str] | list[dict] | list[tuple[str, str]], **kwargs) dict[str, Tensor][source]

Deprecated: `tokenize` is deprecated. Use preprocess instead.

property tokenizer: Any

Property to get the tokenizer that is used by this model.

train(mode: bool = True) T

Sets the module in training mode.

This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters:

mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.

Returns:

self

Return type:

Module

property transformers_model: PreTrainedModel | None

Property to get the underlying transformers PreTrainedModel instance, if it exists. Note that it’s possible for a model to have multiple underlying transformers models, but this property will return the first one it finds in the module hierarchy.

Returns:

The underlying transformers model or None if not found.

Return type:

PreTrainedModel or None

Example

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-v3")

# You can now access the underlying transformers model
transformers_model = model.transformers_model
print(type(transformers_model))
# => <class 'transformers.models.bert.modeling_bert.BertForMaskedLM'>

SparseEncoderModelCardData

class sentence_transformers.sparse_encoder.model_card.SparseEncoderModelCardData(language: str | list[str] | None = <factory>, license: str | None = None, model_name: str | None = None, model_id: str | None = None, train_datasets: list[dict[str, str]] = <factory>, eval_datasets: list[dict[str, str]] = <factory>, task_name: str | None = None, tags: list[str] = <factory>, local_files_only: bool = False, generate_widget_examples: bool = True)[source]

A dataclass storing data used in the model card.

Parameters:
  • language (Optional[Union[str, List[str]]]) – The model language, either a string or a list, e.g. “en” or [“en”, “de”, “nl”]

  • license (Optional[str]) – The license of the model, e.g. “apache-2.0”, “mit”, or “cc-by-nc-sa-4.0”

  • model_name (Optional[str]) – The pretty name of the model, e.g. “SparseEncoder based on answerdotai/ModernBERT-base”.

  • model_id (Optional[str]) – The model ID when pushing the model to the Hub, e.g. “tomaarsen/se-mpnet-base-ms-marco”.

  • train_datasets (List[Dict[str, str]]) – A list of the names and/or Hugging Face dataset IDs of the training datasets, e.g. [{“name”: “SNLI”, “id”: “stanfordnlp/snli”}, {“name”: “MultiNLI”, “id”: “nyu-mll/multi_nli”}, {“name”: “STSB”}]

  • eval_datasets (List[Dict[str, str]]) – A list of the names and/or Hugging Face dataset IDs of the evaluation datasets, e.g. [{“name”: “SNLI”, “id”: “stanfordnlp/snli”}, {“id”: “mteb/stsbenchmark-sts”}]

  • task_name (str) – The human-readable task the model is trained on, e.g. “semantic search and sparse retrieval”.

  • tags (Optional[List[str]]) – A list of tags for the model, e.g. [“sentence-transformers”, “sparse-encoder”].

  • local_files_only (bool) – If True, don’t attempt to find dataset or base model information on the Hub. Defaults to False.

  • generate_widget_examples (bool) – If True, generate widget examples from the evaluation or training dataset, and compute their similarities. Defaults to True.

Tip

Install codecarbon to automatically track carbon emissions and include them in your model cards.

Example:

>>> model = SparseEncoder(
...     "microsoft/mpnet-base",
...     model_card_data=SparseEncoderModelCardData(
...         model_id="tomaarsen/se-mpnet-base-allnli",
...         train_datasets=[{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}],
...         eval_datasets=[{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}],
...         license="apache-2.0",
...         language="en",
...     ),
... )
pipeline_tag: str = None
task_name: str | None = None