quantization

sentence_transformers.util.quantization defines helper functions for embedding quantization and for semantic search over quantized embeddings.

Note

Embedding Quantization differs from model quantization. The former shrinks the size of embeddings such that semantic search/retrieval is faster and requires less memory and disk space. The latter refers to lowering the precision of the model weights to speed up inference. This page only shows documentation for the former.

sentence_transformers.util.quantization.quantize_embeddings(embeddings: Tensor | ndarray | list[Tensor] | list[ndarray], precision: Literal['float32', 'int8', 'uint8', 'binary', 'ubinary'], ranges: ndarray | None = None, calibration_embeddings: ndarray | None = None) → ndarray | list[ndarray]

Quantizes embeddings to a lower precision. This can be used to reduce the memory footprint and increase the speed of similarity search. The supported precisions are “float32”, “int8”, “uint8”, “binary”, and “ubinary”.
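To make the memory savings concrete, the sketch below compares bytes per vector for a hypothetical 1024-dimensional embedding. It uses NumPy only and assumes that the binary precisions keep one sign bit per dimension, packed eight per byte. It illustrates the idea and is not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical float32 embeddings: 3 vectors with 1024 dimensions each.
embeddings = rng.normal(size=(3, 1024)).astype(np.float32)

# Assumption: one sign bit per dimension, packed eight per byte (unsigned layout).
ubinary = np.packbits(embeddings > 0, axis=-1)
# Signed variant of the same bytes, shifted into the int8 range.
binary = (ubinary.astype(np.int16) - 128).astype(np.int8)

bytes_per_vector = {
    "float32": embeddings[0].nbytes,
    "int8": embeddings.shape[1] * np.dtype(np.int8).itemsize,
    "uint8": embeddings.shape[1] * np.dtype(np.uint8).itemsize,
    "binary": binary[0].nbytes,
    "ubinary": ubinary[0].nbytes,
}

for precision, size in bytes_per_vector.items():
    print(f"{precision:>8}: {size} bytes per vector")
```

Under these assumptions, the 8-bit precisions are 4x smaller than float32 and the binary precisions are 32x smaller.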

Parameters:

embeddings – Unquantized (e.g. float) embeddings to quantize to the given precision.
precision – The precision to convert to. Options are “float32”, “int8”, “uint8”, “binary”, “ubinary”.
ranges (Optional[np.ndarray]) – Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges refer to the minimum and maximum values for each dimension, so it is a 2D array with shape (2, embedding_dim). Default is None, which means that the ranges will be calculated from the calibration embeddings.
calibration_embeddings (Optional[np.ndarray]) – Embeddings used for calibration during quantization. This is only used for int8 quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum values for each dimension. Default is None, which means that the ranges will be calculated from the embeddings passed in (for example, the query embeddings). This is not recommended.

Returns:

Quantized embeddings with the specified precision. For a list of multi-vector matrices (variable-length (num_tokens, dim) arrays), returns a list of quantized matrices with shared per-dimension buckets.
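The role of ranges and calibration_embeddings can be sketched with NumPy alone. The helper names compute_ranges and scalar_quantize_int8 are hypothetical, and the rounding and clipping choices are assumptions made for this illustration; the library's own bucket computation may differ in detail.

```python
import numpy as np

def compute_ranges(calibration: np.ndarray) -> np.ndarray:
    """Per-dimension minimum and maximum, stacked into shape (2, embedding_dim)."""
    return np.vstack((calibration.min(axis=0), calibration.max(axis=0)))

def scalar_quantize_int8(embeddings: np.ndarray, ranges: np.ndarray) -> np.ndarray:
    """Map each dimension linearly onto 256 buckets, shifted into the int8 range."""
    starts = ranges[0]
    steps = (ranges[1] - ranges[0]) / 255
    steps = np.where(steps == 0, 1.0, steps)  # avoid dividing by zero on constant dimensions
    # Assumption: round to the nearest bucket and clip values outside the calibrated range.
    buckets = np.clip(np.rint((embeddings - starts) / steps), 0, 255)
    return (buckets - 128).astype(np.int8)

rng = np.random.default_rng(0)
calibration = rng.normal(size=(1000, 8)).astype(np.float32)  # hypothetical calibration set
embeddings = rng.normal(size=(4, 8)).astype(np.float32)      # hypothetical embeddings to quantize

ranges = compute_ranges(calibration)
quantized = scalar_quantize_int8(embeddings, ranges)
print(ranges.shape, quantized.dtype, quantized.shape)
```

With the library itself, the corresponding call would pass calibration_embeddings (or precomputed ranges) to quantize_embeddings with precision="int8", as in the signature above. A larger calibration set gives ranges that clip fewer unseen values.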

sentence_transformers.util.quantization.semantic_search_faiss(query_embeddings: np.ndarray, corpus_embeddings: np.ndarray | None = None, corpus_index: faiss.Index | None = None, corpus_precision: Literal['float32', 'uint8', 'ubinary'] = 'float32', top_k: int = 10, ranges: np.ndarray | None = None, calibration_embeddings: np.ndarray | None = None, rescore: bool = True, rescore_multiplier: int = 2, exact: bool = True, output_index: bool = False) → tuple[list[list[dict[str, int | float]]], float, faiss.Index]

Performs semantic search using the FAISS library.

Rescoring will be performed if:
1. rescore is True
2. The query embeddings are not quantized
3. The corpus is quantized, i.e. the corpus precision is not float32

Only if all of these conditions hold does the search retrieve top_k * rescore_multiplier samples and then rescore them to keep only top_k.
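The oversample-then-rescore idea can be sketched without FAISS. In this NumPy-only illustration, the helper search_with_rescoring is hypothetical, the first pass ranks by Hamming distance on packed sign bits, and the rescoring step (a dot product between the float query and the ±1 expansion of each candidate) is an assumption chosen for clarity rather than a description of the library's internals.

```python
import numpy as np

def search_with_rescoring(query, corpus_bits, top_k=2, rescore_multiplier=2):
    """query: float vector (dim,); corpus_bits: packed uint8 sign bits (n, dim // 8)."""
    # First pass: cheap Hamming-distance ranking on the quantized representations.
    query_bits = np.packbits(query > 0)
    differing = np.unpackbits(np.bitwise_xor(corpus_bits, query_bits), axis=1)
    hamming = differing.sum(axis=1)
    n_candidates = min(top_k * rescore_multiplier, len(corpus_bits))
    candidates = np.argsort(hamming, kind="stable")[:n_candidates]

    # Second pass: rescore the oversampled candidates with the unquantized query.
    signs = np.unpackbits(corpus_bits[candidates], axis=1).astype(np.float32) * 2 - 1
    scores = signs @ query
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [{"corpus_id": int(candidates[i]), "score": float(scores[i])} for i in order]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(50, 64)).astype(np.float32)  # hypothetical float corpus
corpus_bits = np.packbits(corpus > 0, axis=1)          # binary-quantized corpus
query = corpus[5]                                      # unquantized query

results = search_with_rescoring(query, corpus_bits)
print(results)
```

Because the query stays in float precision, the second pass can reorder candidates that the coarse Hamming ranking could not separate, which is why an unquantized query is a precondition for rescoring.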

Parameters:

query_embeddings – Embeddings of the query sentences. Ideally not quantized to allow for rescoring.
corpus_embeddings – Embeddings of the corpus sentences. Either corpus_embeddings or corpus_index should be used, not both. The embeddings can be quantized to “uint8” or “ubinary” for more efficient search.
corpus_index – FAISS index for the corpus sentences. Either corpus_embeddings or corpus_index should be used, not both.
corpus_precision – Precision of the corpus embeddings. The options are “float32”, “uint8”, or “ubinary”, matching the signature above. Default is “float32”.
top_k – Number of top results to retrieve. Default is 10.
ranges – Ranges for quantization of embeddings. This is only used for uint8 quantization, where the ranges refer to the minimum and maximum values for each dimension, so it is a 2D array with shape (2, embedding_dim). Default is None, which means that the ranges will be calculated from the calibration embeddings.
calibration_embeddings – Embeddings used for calibration during quantization. This is only used for uint8 quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum values for each dimension. Default is None, which means that the ranges will be calculated from the query embeddings. This is not recommended.
rescore – Whether to perform rescoring. Note that rescoring is only applied if the query embeddings are not quantized and the corpus is quantized, i.e. the corpus precision is not “float32”. Default is True.
rescore_multiplier – Oversampling factor for rescoring. The search retrieves top_k * rescore_multiplier samples and then rescores them to keep only top_k. Default is 2.
exact – Whether to use exact search or approximate search. Default is True.
output_index – Whether to output the FAISS index used for the search. Default is False.

Returns:

A tuple containing a list of search results and the time taken for the search. If output_index is True, the tuple will also contain the FAISS index used for the search.

The list of search results is in the format: [[{“corpus_id”: int, “score”: float}, …], …]. The time taken for the search is a float value.

Raises:

ValueError – If both corpus_embeddings and corpus_index are provided or if neither is provided.
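The documented result format, one list of {"corpus_id", "score"} dictionaries per query, can be mapped back to the corpus texts as in the sketch below. The corpus, the scores, and the helper attach_texts are hypothetical stand-ins written by hand; no search is actually run here.

```python
# Hypothetical corpus and a hand-written result list in the documented format.
corpus = [
    "A cat sits on the mat.",
    "Stocks fell sharply today.",
    "Dogs love long walks.",
]
results = [
    [{"corpus_id": 2, "score": 0.81}, {"corpus_id": 0, "score": 0.64}],  # hits for query 0
    [{"corpus_id": 1, "score": 0.92}],                                   # hits for query 1
]

def attach_texts(results, corpus):
    """Return, per query, a list of (text, score) pairs in the order given."""
    return [[(corpus[hit["corpus_id"]], hit["score"]) for hit in hits] for hits in results]

ranked = attach_texts(results, corpus)
for query_id, hits in enumerate(ranked):
    for text, score in hits:
        print(f"query {query_id}: {score:.2f}  {text}")
```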

sentence_transformers.util.quantization.semantic_search_usearch(query_embeddings: np.ndarray, corpus_embeddings: np.ndarray | None = None, corpus_index: usearch.index.Index | None = None, corpus_precision: Literal['float32', 'int8', 'binary'] = 'float32', top_k: int = 10, ranges: np.ndarray | None = None, calibration_embeddings: np.ndarray | None = None, rescore: bool = True, rescore_multiplier: int = 2, exact: bool = True, output_index: bool = False) → tuple[list[list[dict[str, int | float]]], float, usearch.index.Index]

Performs semantic search using the usearch library.

Rescoring will be performed if:
1. rescore is True
2. The query embeddings are not quantized
3. The corpus is quantized, i.e. the corpus precision is not float32

Only if all of these conditions hold does the search retrieve top_k * rescore_multiplier samples and then rescore them to keep only top_k.

Parameters:

query_embeddings – Embeddings of the query sentences. Ideally not quantized to allow for rescoring.
corpus_embeddings – Embeddings of the corpus sentences. Either corpus_embeddings or corpus_index should be used, not both. The embeddings can be quantized to “int8” or “binary” for more efficient search.
corpus_index – usearch index for the corpus sentences. Either corpus_embeddings or corpus_index should be used, not both.
corpus_precision – Precision of the corpus embeddings. The options are “float32”, “int8”, “ubinary” or “binary”. Default is “float32”.
top_k – Number of top results to retrieve. Default is 10.
ranges – Ranges for quantization of embeddings. This is only used for int8 quantization, where the ranges refer to the minimum and maximum values for each dimension, so it is a 2D array with shape (2, embedding_dim). Default is None, which means that the ranges will be calculated from the calibration embeddings.
calibration_embeddings – Embeddings used for calibration during quantization. This is only used for int8 quantization, where the calibration embeddings can be used to compute ranges, i.e. the minimum and maximum values for each dimension. Default is None, which means that the ranges will be calculated from the query embeddings. This is not recommended.
rescore – Whether to perform rescoring. Note that rescoring is only applied if the query embeddings are not quantized and the corpus is quantized, i.e. the corpus precision is not “float32”. Default is True.
rescore_multiplier – Oversampling factor for rescoring. The search retrieves top_k * rescore_multiplier samples and then rescores them to keep only top_k. Default is 2.
exact – Whether to use exact search or approximate search. Default is True.
output_index – Whether to output the usearch index used for the search. Default is False.

Returns:

A tuple containing a list of search results and the time taken for the search. If output_index is True, the tuple will also contain the usearch index used for the search.

The list of search results is in the format: [[{“corpus_id”: int, “score”: float}, …], …]. The time taken for the search is a float value.

Raises:

ValueError – If both corpus_embeddings and corpus_index are provided or if neither is provided.
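Since the returned tuple is described as gaining a third element only when output_index is True, calling code may want to unpack it defensively. The helper unpack_search_output below is a hypothetical convenience, and the tuples fed to it are hand-written stand-ins for what either search function is documented to return.

```python
def unpack_search_output(output):
    """Split a (results, search_time) or (results, search_time, index) tuple."""
    results, search_time, *rest = output
    index = rest[0] if rest else None  # None when no index was returned
    return results, search_time, index

# Stand-in outputs in the documented shape; the "index" is a placeholder object.
without_index = ([[{"corpus_id": 0, "score": 0.9}]], 0.0021)
with_index = ([[{"corpus_id": 0, "score": 0.9}]], 0.0021, object())

results, search_time, index = unpack_search_output(without_index)
print(len(results), search_time, index is None)

results, search_time, index = unpack_search_output(with_index)
print(len(results), search_time, index is None)
```

Keeping the returned index and passing it back as corpus_index on later calls avoids rebuilding it for every query batch.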