Embeddings

Sentence Transformers

A Python library providing pre-trained models that map sentences to dense vector embeddings for semantic search and similarity tasks.

8,000+ models on the Hub · 768-dim typical output · runs locally, no API key needed

SECTION 01

What embeddings actually are

Imagine turning a sentence into GPS coordinates — but in 768-dimensional space instead of 2D. Sentences that mean similar things end up geographically close. "The dog barked loudly" and "The canine made a noise" would be nearby; "photosynthesis" would be far away from both.

That geometric closeness is what makes semantic search possible. Instead of matching keywords, you convert both the query and every document into coordinates, then find the nearest neighbours.
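The nearest-neighbour idea can be sketched with plain NumPy, using made-up 4-dimensional vectors in place of real 768-dimensional embeddings:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vectors' lengths. Ranges from -1 (opposite) to 1 (identical direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" — in practice these come from a model.
dog    = np.array([0.9, 0.8, 0.1, 0.0])
canine = np.array([0.8, 0.9, 0.2, 0.1])
plants = np.array([0.0, 0.1, 0.9, 0.9])

print(cosine_sim(dog, canine))   # close in meaning → high similarity
print(cosine_sim(dog, plants))   # unrelated → low similarity
```

The same comparison works unchanged on real model output; only the vectors get longer.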

SECTION 02

Sentence Transformers in plain English

Sentence Transformers is a Python library that wraps pre-trained transformer models (BERT, RoBERTa, etc.) and fine-tunes them with a contrastive objective: sentences with the same meaning should produce similar vectors; dissimilar sentences should produce distant ones.

The key difference from raw transformer output: a standard BERT model outputs one vector per token. Sentence Transformers add a pooling layer (usually mean-pooling of token embeddings) to produce a single fixed-size vector for the entire sentence. This makes comparison fast and cheap.
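Mean-pooling itself is simple. A NumPy sketch, with a hypothetical attention mask (1 = real token, 0 = padding) so padding tokens don't drag down the average:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, dim) — one vector per token from the transformer.
    # attention_mask:   (seq_len,)    — 1 for real tokens, 0 for padding.
    mask = attention_mask[:, None]                  # (seq_len, 1), broadcasts over dim
    summed = (token_embeddings * mask).sum(axis=0)  # sum only the real tokens
    count = mask.sum()                              # number of real tokens
    return summed / count                           # (dim,) sentence vector

tokens = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [0.0, 0.0]])   # last row is padding
mask = np.array([1.0, 1.0, 0.0])
print(mean_pool(tokens, mask))    # average of the two real token vectors
```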

SECTION 03

Installation and first embedding

pip install sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # 22M params, fast

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the capital of France?",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# Cosine similarity between first two vs first and third
from sentence_transformers import util
sim_12 = util.cos_sim(embeddings[0], embeddings[1]).item()
sim_13 = util.cos_sim(embeddings[0], embeddings[2]).item()
print(f"Password vs Credentials: {sim_12:.3f}")   # high ~0.82
print(f"Password vs France:      {sim_13:.3f}")   # low  ~0.12

SECTION 04

Choosing the right model

Model                                   Dims  Speed      Quality       Best for
all-MiniLM-L6-v2                        384   Very fast  Good          Prototyping, low-latency apps
all-mpnet-base-v2                       768   Medium     Excellent     General semantic search
multi-qa-MiniLM-L6-cos-v1               384   Fast       Good          Q&A retrieval specifically
paraphrase-multilingual-MiniLM-L12-v2   384   Fast       Good          50+ language support
BAAI/bge-large-en-v1.5                  1024  Slow       State-of-art  Production RAG, accuracy-first

Start with all-MiniLM-L6-v2 for speed; upgrade to all-mpnet-base-v2 or BGE when quality matters.

SECTION 05

Batch encoding and similarity search

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer("all-mpnet-base-v2")

# Encode a corpus (your documents)
corpus = [
    "Refund requests must be submitted within 30 days of purchase.",
    "We offer free shipping on orders over $50.",
    "Password reset links expire after 24 hours.",
    "Our support team is available Monday through Friday.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Encode a query
query = "How long do I have to return an item?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Find top-3 most similar documents
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
for hit in hits[0]:
    print(f"Score {hit['score']:.3f}: {corpus[hit['corpus_id']]}")

For large corpora (>100k docs), pre-compute and save embeddings to disk:

import numpy as np
np.save("corpus_embeddings.npy", corpus_embeddings.cpu().numpy())
# Load later: embeddings = torch.tensor(np.load("corpus_embeddings.npy"))

SECTION 06

Fine-tuning for your domain

Out-of-the-box models work well for general English. For specialised domains (medical, legal, code), fine-tuning on domain pairs improves recall significantly:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Training pairs: (query, relevant_doc)
train_examples = [
    InputExample(texts=["patient presents with fever", "38.5°C temperature, chills reported"]),
    InputExample(texts=["myocardial infarction treatment", "aspirin and PCI for STEMI patients"]),
    # ... thousands more pairs from your domain
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./my-domain-model"
)

MultipleNegativesRankingLoss treats all other sentences in the batch as negatives — a cheap, effective contrastive objective that needs no explicit negative pairs.

SECTION 07

Gotchas

Max sequence length. Most models truncate at 256–512 tokens. Long documents need chunking before encoding — encode paragraphs, not whole pages.
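Chunking before encoding can be as simple as an overlapping word-window splitter. A sketch — real pipelines often split on sentence boundaries or use a tokenizer-aware chunker, and the 200-word window and 50-word overlap here are arbitrary choices:

```python
def chunk_words(text, chunk_size=200, overlap=50):
    # Split text into overlapping word windows so no chunk exceeds the
    # model's max sequence length after tokenisation. Overlap keeps a
    # sentence straddling a boundary visible in at least one chunk.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(450))
chunks = chunk_words(doc)
print([len(c.split()) for c in chunks])   # [200, 200, 150]
```

Each chunk is then encoded as its own document; at query time, a hit on any chunk points back to the parent document.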

Normalise before dot product. Cosine similarity requires unit vectors. Use model.encode(..., normalize_embeddings=True) and then use dot product instead of cosine — it's faster and equivalent.
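The equivalence is easy to verify with NumPy (toy vectors standing in for embeddings):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity the long way
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalise to unit length first, then a plain dot product
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
dot = np.dot(a_hat, b_hat)

print(np.isclose(cos, dot))   # True — identical up to floating point
```

With normalized embeddings the per-comparison cost drops to a single dot product, which is why vector indexes prefer pre-normalized input.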

GPU batch sizing. The default batch_size=32 may OOM on small GPUs. Lower it with model.encode(texts, batch_size=16), and pass show_progress_bar=True to monitor long encoding runs.

Asymmetric tasks. For Q&A, the query and the passage are stylistically different. Use a bi-encoder model trained for asymmetric retrieval (like multi-qa-* models) rather than symmetric paraphrase models.

Production deployment and performance optimization

Sentence Transformers inference throughput scales dramatically with GPU batching. A single encode() call with a list of texts processes all texts in parallel on GPU, achieving 10–50x higher throughput than encoding texts individually. The optimal batch size depends on available GPU memory and text length — typical values of 32–256 provide good throughput without causing out-of-memory errors for standard 512-token models. For CPU inference on quantized models, using ONNX Runtime instead of PyTorch can reduce inference latency by 2–4x with minimal quality degradation.

Caching embeddings for frequently repeated texts is a simple optimization with large impact for applications where the same documents are embedded repeatedly. Storing embeddings in a Redis cache or SQLite database keyed by text hash eliminates redundant model calls for repeated content. For RAG systems that re-embed the same document corpus on each application restart, persisting embeddings to disk and loading from cache reduces cold-start time from minutes to seconds and eliminates the compute cost of repeated corpus embedding.
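A minimal sketch of the hash-keyed cache idea using SQLite from the standard library. The table name and the stub embed function are made up for illustration — in practice embed would call model.encode:

```python
import hashlib
import sqlite3
import numpy as np

db = sqlite3.connect(":memory:")  # use a file path for persistence across restarts
db.execute("CREATE TABLE IF NOT EXISTS emb_cache (key TEXT PRIMARY KEY, vec BLOB)")

def embed(text):
    # Stand-in for model.encode(text): a deterministic fake embedding.
    rng = np.random.default_rng(len(text))
    return rng.random(8).astype(np.float32)

def cached_embed(text):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute("SELECT vec FROM emb_cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return np.frombuffer(row[0], dtype=np.float32)   # cache hit: no model call
    vec = embed(text)
    db.execute("INSERT INTO emb_cache VALUES (?, ?)", (key, vec.tobytes()))
    return vec

v1 = cached_embed("refund policy")   # computed and stored
v2 = cached_embed("refund policy")   # served from cache
print(np.array_equal(v1, v2))        # True
```

Keying on the text hash rather than the text itself keeps keys fixed-size and avoids storing the raw content twice.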

The semantic search workflow with Sentence Transformers involves three stages that must be engineered carefully for production quality. First, document encoding: all corpus documents are encoded once and stored in a vector index. Second, query encoding: the user query is encoded with the same model at query time. Third, similarity search: cosine similarity between the query embedding and all document embeddings identifies the most relevant documents. The biggest correctness pitfall is encoding queries and documents with different models or different preprocessing — normalization, pooling strategy, and instruction prefixes must be identical for query and document embeddings to be comparable.

Sentence Transformers' cross-encoder classes provide reranking capability that complements the bi-encoder retrieval models. A cross-encoder takes a query-document pair as input and outputs a single relevance score, allowing the model to consider the interaction between query and document tokens directly. Cross-encoders are far slower than bi-encoder similarity lookups — every candidate pair requires a full transformer forward pass rather than a dot product — but consistently achieve higher ranking quality because they can model fine-grained relevance signals that are lost when query and document are encoded independently. The standard two-stage pipeline encodes a large candidate set with the bi-encoder, then reranks with a cross-encoder to produce the final top-k results.
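The two-stage shape of the pipeline, sketched with stand-in scoring functions. In the real version, stage one would be model.encode plus util.cos_sim and stage two a CrossEncoder's predict; the toy scorers here just count shared words, purely to show the control flow:

```python
def bi_score(query, doc):
    # Stage-one stand-in: cheap, independent representations (word sets).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def cross_score(query, doc):
    # Stage-two stand-in: "sees" query and doc together, so it can be stricter.
    return bi_score(query, doc) + (1.0 if "refund" in doc.lower() else 0.0)

corpus = [
    "Refund requests must be submitted within 30 days.",
    "Free shipping on orders over $50.",
    "Password reset links expire after 24 hours.",
]
query = "refund within 30 days"

# Stage 1: cheap retrieval narrows the corpus to a small candidate set
candidates = sorted(corpus, key=lambda d: bi_score(query, d), reverse=True)[:2]
# Stage 2: the expensive scorer runs only on those candidates
reranked = sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)
print(reranked[0])
```

The point of the structure: the expensive scorer touches only the handful of survivors from stage one, so total latency stays close to the bi-encoder's.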

Sentence Transformers supports diverse training approaches for domain adaptation beyond the standard contrastive fine-tuning. Knowledge distillation from a larger, slower cross-encoder teacher to a smaller bi-encoder student produces bi-encoder models with near-cross-encoder quality at bi-encoder inference speed. Augmented SBERT, which uses a cross-encoder to label synthetic query-passage pairs for bi-encoder fine-tuning, enables bootstrapping high-quality training data from unlabeled domain text. These advanced training approaches are accessible through the Sentence Transformers training API and documentation, making state-of-the-art embedding fine-tuning achievable without implementing training algorithms from scratch.

Sentence Transformers' SentenceTransformer.encode() method with convert_to_tensor=True returns PyTorch tensors rather than NumPy arrays, enabling direct use in GPU-accelerated similarity computations. The util.semantic_search() function computes cosine similarity between query and corpus embeddings on GPU and returns ranked results, providing a complete semantic search implementation in two lines after encoding. For applications requiring maximum throughput, replacing util.semantic_search() with a FAISS index lookup reduces retrieval latency from O(n) cosine similarity computation to O(log n) ANN search, which is essential for corpus sizes above 100K documents.
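The O(n) brute-force search that FAISS replaces is worth seeing once: with normalised embeddings it is one matrix-vector product plus a partial sort. Toy random vectors stand in for real embeddings here, and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs, top_k = 64, 1000, 3

corpus = rng.standard_normal((n_docs, dim)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # unit vectors

# A query that is a slightly perturbed copy of document 42
query = corpus[42] + 0.01 * rng.standard_normal(dim).astype(np.float32)
query /= np.linalg.norm(query)

scores = corpus @ query                        # n dot products — the O(n) part
top = np.argpartition(-scores, top_k)[:top_k]  # partial sort: top-k, unordered
top = top[np.argsort(-scores[top])]            # order just those k by score
print(top)                                     # document 42 should rank first
```

An ANN index trades a little recall for replacing the full `corpus @ query` scan with a graph or cluster traversal, which is what makes 100K+ document corpora responsive.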

Sentence Transformers' multi-task learning support enables training models that produce embeddings optimized simultaneously for multiple tasks — retrieval, reranking, and classification — using a joint training objective. Models like E5-mistral and GTE-Qwen demonstrate that large-scale multi-task embedding training produces models that generalize broadly across evaluation benchmarks, suggesting that task diversity in training data is at least as important as dataset size for embedding model quality. For practitioners, this means that domain fine-tuning on a mix of tasks (retrieval, duplicate detection, classification) from the target domain produces better all-around embeddings than fine-tuning exclusively on retrieval pairs.

Sentence Transformers' community model hub on Hugging Face hosts thousands of fine-tuned embedding models covering diverse domains and languages. Before investing in custom fine-tuning, searching the hub for models fine-tuned on data from the same domain — biomedical, legal, code, financial — often identifies existing models that outperform the general-purpose models with no additional training. Model cards on the hub include MTEB benchmark scores and dataset descriptions that enable quick comparison of candidate models, making the hub a valuable first resource for embedding model selection before defaulting to general-purpose models.
