util

sentence_transformers.util defines different helpful functions to work with text embeddings.

sentence_transformers.util.community_detection(embeddings, threshold=0.75, min_community_size=10, batch_size=1024, show_progress_bar=False) → List[List[int]]

Fast community detection. Finds all communities in the embeddings, i.e. groups of embeddings that are close to each other (closer than threshold). Only communities larger than min_community_size are returned, ordered by decreasing size. The first element of each list is the central point of that community.
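
The idea can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper names (cosine, detect_communities) are hypothetical, the sketch treats the size limit as "at least min_community_size", and it simply drops any candidate community that overlaps an already accepted one. Both choices are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def detect_communities(embeddings, threshold=0.75, min_community_size=2):
    """Hypothetical, simplified community detection (illustration only)."""
    n = len(embeddings)
    candidates = []
    for i in range(n):
        # Neighbours of point i whose similarity reaches the threshold.
        others = [j for j in range(n)
                  if j != i and cosine(embeddings[i], embeddings[j]) >= threshold]
        # Closest neighbours first; the central point i always leads the list.
        others.sort(key=lambda j: -cosine(embeddings[i], embeddings[j]))
        members = [i] + others
        if len(members) >= min_community_size:  # assumption: "at least"
            candidates.append(members)

    # Largest candidates first, then greedily keep non-overlapping ones.
    candidates.sort(key=len, reverse=True)
    assigned = set()
    communities = []
    for members in candidates:
        if not any(j in assigned for j in members):
            communities.append(members)
            assigned.update(members)
    return communities


points = [
    [1.0, 0.0], [0.99, 0.1], [0.98, 0.15],  # one tight group
    [0.0, 1.0], [0.1, 0.99],                # a second group
    [-1.0, -1.0],                           # an outlier
]
communities = detect_communities(points, threshold=0.75, min_community_size=2)
print(communities)
```

The outlier belongs to no community because no other point is within the threshold, so it is absent from the result.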

sentence_transformers.util.cos_sim(a: torch.Tensor, b: torch.Tensor) → torch.Tensor

Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j.

Returns

Matrix with res[i][j] = cos_sim(a[i], b[j])
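
To make the shape of the result concrete, here is a minimal pure-Python sketch of the same computation on nested lists. The helper name cos_sim_matrix is hypothetical and the code only mirrors the definition above; it is not how the library computes the result on tensors.

```python
import math


def cos_sim_matrix(a, b):
    """Return res with res[i][j] = cosine similarity of a[i] and b[j].

    Assumes every vector is non-zero; a zero vector would divide by zero.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    res = []
    for u in a:
        row = []
        for v in b:
            dot = sum(x * y for x, y in zip(u, v))
            row.append(dot / (norm(u) * norm(v)))
        res.append(row)
    return res


a = [[1.0, 0.0], [1.0, 1.0]]                # 2 vectors
b = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]   # 3 vectors
scores = cos_sim_matrix(a, b)               # 2 x 3 matrix
print(scores)
```

With two vectors in a and three in b, the result has two rows and three columns, and vector length does not affect the score: [1, 0] and [2, 0] have similarity 1.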

sentence_transformers.util.dot_score(a: torch.Tensor, b: torch.Tensor) → torch.Tensor

Computes the dot-product dot_prod(a[i], b[j]) for all i and j.

Returns

Matrix with res[i][j] = dot_prod(a[i], b[j])
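
Unlike cosine similarity, the dot product depends on vector length. The following pure-Python sketch (helper names dot_matrix and normalize are hypothetical, not part of the library) shows the all-pairs dot product and how it coincides with cosine similarity once the vectors are scaled to unit length.

```python
import math


def dot_matrix(a, b):
    """Return res with res[i][j] = dot product of a[i] and b[j]."""
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


a = [[3.0, 4.0]]
b = [[4.0, 3.0], [0.0, 2.0]]

raw = dot_matrix(a, b)
# On unit-length vectors the dot product equals the cosine similarity.
unit = dot_matrix([normalize(v) for v in a], [normalize(v) for v in b])
print(raw, unit)
```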

sentence_transformers.util.http_get(url, path) → None

Downloads a URL to a given path on disk.

sentence_transformers.util.paraphrase_mining(model, sentences: List[str], show_progress_bar: bool = False, batch_size: int = 32, *args, **kwargs) → List[List[Union[float, int]]]

Given a list of sentences / texts, this function performs paraphrase mining. It compares all sentences against all other sentences and returns a list with the pairs that have the highest cosine similarity score.

Parameters
  • model – SentenceTransformer model for embedding computation

  • sentences – A list of strings (texts or sentences)

  • show_progress_bar – Whether to display a progress bar

  • batch_size – Number of texts that are encoded simultaneously by the model

  • query_chunk_size – Number of sentences for which the most similar pairs are searched at the same time. Decrease to lower the memory footprint (increases run-time).

  • corpus_chunk_size – Number of other sentences a sentence is compared against at the same time. Decrease to lower the memory footprint (increases run-time).

  • max_pairs – Maximal number of text pairs returned.

  • top_k – For each sentence, we retrieve up to top_k other sentences

  • score_function – Function for computing scores. By default, cosine similarity.

Returns

Returns a list of triplets with the format [score, id1, id2]
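
The output format can be illustrated without a model. The sketch below works on precomputed toy vectors instead of sentences; the helper names (cosine, mine_pairs) and the vectors are hypothetical. It compares every pair exhaustively and ignores the chunking and top_k parameters described above, so it shows only the shape and ordering of the result.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def mine_pairs(embeddings, max_pairs=3):
    """Score every unordered pair and keep the max_pairs best ones."""
    pairs = [
        [cosine(embeddings[i], embeddings[j]), i, j]
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)  # highest score first
    return pairs[:max_pairs]


# Toy stand-ins for sentence embeddings: 0 and 1 are near-duplicates, as are 2 and 3.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.2]]
top = mine_pairs(vectors, max_pairs=3)
for score, id1, id2 in top:
    print(f"{score:.4f}\t{id1}\t{id2}")
```

Each triplet carries the score first, followed by the indices of the two items, so the indices can be used to look the original sentences up in the input list.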

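sentence_transformers.util.semantic_search(query_embeddings, corpus_embeddings, query_chunk_size, corpus_chunk_size, top_k, score_function)

This function performs a cosine similarity search between a list of query embeddings and a list of corpus embeddings. It can be used for information retrieval / semantic search on corpora of up to about 1 million entries.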

Parameters
  • query_embeddings – A 2 dimensional tensor with the query embeddings.

  • corpus_embeddings – A 2 dimensional tensor with the corpus embeddings.

  • query_chunk_size – Process 100 queries simultaneously. Increasing that value increases the speed, but requires more memory.

  • corpus_chunk_size – Scans the corpus 100k entries at a time. Increasing that value increases the speed, but requires more memory.

  • top_k – Retrieve top k matching entries.

  • score_function – Function for computing scores. By default, cosine similarity.

Returns

Returns a list with one entry for each query. Each entry is a list of dictionaries with the keys ‘corpus_id’ and ‘score’, sorted by decreasing cosine similarity scores.
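
The return structure is easiest to see on a toy example. The following pure-Python sketch uses hypothetical helper names (cosine, search) and scans the whole corpus per query without any chunking, so it is an illustration of the result format under those assumptions, not the library's implementation.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def search(query_embeddings, corpus_embeddings, top_k=2):
    """Return, per query, the top_k corpus hits sorted by decreasing score."""
    results = []
    for q in query_embeddings:
        scored = [
            {"corpus_id": idx, "score": cosine(q, doc)}
            for idx, doc in enumerate(corpus_embeddings)
        ]
        # heapq.nlargest returns the hits in descending order of the key.
        results.append(heapq.nlargest(top_k, scored, key=lambda hit: hit["score"]))
    return results


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1], [0.0, 2.0]]
hits = search(queries, corpus, top_k=2)
for query_id, query_hits in enumerate(hits):
    for hit in query_hits:
        print(query_id, hit["corpus_id"], round(hit["score"], 3))
```

The outer list is indexed by query, and each corpus_id is a position in the corpus embeddings, which makes it straightforward to map hits back to the original documents.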

sentence_transformers.util.truncate_embeddings(embeddings: numpy.ndarray, truncate_dim: Optional[int]) → numpy.ndarray
sentence_transformers.util.truncate_embeddings(embeddings: torch.Tensor, truncate_dim: Optional[int]) → torch.Tensor

Parameters
  • embeddings – Embeddings to truncate.

  • truncate_dim – The dimension to truncate sentence embeddings to. None does no truncation.

Returns

Truncated embeddings.
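
As an illustration of the described behaviour, truncation can be sketched as keeping the first truncate_dim values of every embedding. The helper name truncate_rows is hypothetical and operates on nested lists rather than arrays or tensors.

```python
def truncate_rows(embeddings, truncate_dim):
    """Keep the first truncate_dim values of each row; None keeps everything."""
    # Slicing with [:None] returns the full row, which matches "no truncation".
    return [row[:truncate_dim] for row in embeddings]


embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
short = truncate_rows(embeddings, 2)
same = truncate_rows(embeddings, None)
print(short)
print(same)
```

Note that truncated vectors are in general no longer unit length, so they may need to be re-normalized before being used with a dot-product score.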