Embeddings

BGE Embeddings

BAAI General Embeddings: a family of open-source embedding models from the Beijing Academy of Artificial Intelligence (BAAI) that consistently ranks at the top of the MTEB leaderboard.

- MTEB #1 across multiple benchmarks
- Local: runs on your GPU
- FlagEmbedding: the official library


SECTION 01

Why BGE stands out

The best embedding model is usually the one at the top of the MTEB leaderboard — a comprehensive benchmark across retrieval, clustering, classification, and similarity tasks. BGE models from BAAI have held the top spots for English retrieval since mid-2023, matching or beating OpenAI's models while running entirely locally.

For production RAG where you want the best retrieval quality without per-token API costs, BGE is the first model family to evaluate.

SECTION 02

The BGE model family

Model                  | Dims | Size | Notes
BAAI/bge-small-en-v1.5 | 384  | 33M  | Fastest; good for high-throughput
BAAI/bge-base-en-v1.5  | 768  | 109M | Good balance of speed and quality
BAAI/bge-large-en-v1.5 | 1024 | 335M | Best quality in the v1.5 family
BAAI/bge-m3            | 1024 | 570M | Multilingual; dense + sparse + ColBERT

bge-large-en-v1.5 is the standard choice for English RAG. bge-m3 is the choice for multilingual corpora, or when you want a single model that supports dense, sparse, and late-interaction retrieval simultaneously.

SECTION 03

Basic usage with FlagEmbedding

pip install FlagEmbedding
from FlagEmbedding import FlagModel
import numpy as np

model = FlagModel(
    "BAAI/bge-large-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    use_fp16=True    # halves memory, minimal quality loss
)

# Encode passages (no prefix needed)
passages = [
    "The Eiffel Tower was completed in 1889.",
    "Python was created by Guido van Rossum.",
    "Photosynthesis converts light energy into chemical energy.",
]
p_embeddings = model.encode(passages)

# Encode query (adds instruction prefix automatically)
query = "When was the Eiffel Tower built?"
q_embedding = model.encode_queries([query])

# Cosine similarity
scores = q_embedding @ p_embeddings.T
print(scores)  # [[0.91, 0.12, 0.08]] — correct document ranked first

SECTION 04

BGE with LangChain/LlamaIndex

# LangChain
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
    query_instruction="Represent this sentence for searching relevant passages: "
)

query_result = embeddings.embed_query("What is photosynthesis?")
doc_result = embeddings.embed_documents(["Photosynthesis uses sunlight to make glucose."])

# LlamaIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5",
    query_instruction="Represent this sentence for searching relevant passages: "
)

SECTION 05

Instruction prefix trick

BGE models are trained with an instruction prefix for queries — this signals to the model that the input is a question rather than a passage, improving retrieval accuracy. Crucially, only the query gets the prefix; passages are encoded without it:

QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

# ✓ Correct
query_emb = model.encode([QUERY_INSTRUCTION + "How do transformers work?"])
passage_emb = model.encode(["Transformers use self-attention to process sequences."])

# ✗ Wrong — adding instruction to passages hurts performance
passage_emb_wrong = model.encode([QUERY_INSTRUCTION + "Transformers use self-attention..."])

When using FlagModel, the query_instruction_for_retrieval parameter in the constructor handles this automatically — use model.encode_queries() for queries and model.encode() for passages.

SECTION 06

BGE-reranker

BAAI also provides cross-encoder rerankers that dramatically improve precision when used as a second-stage after initial retrieval:

from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# After initial retrieval of top-20, rerank to top-5
query = "How do I reset my password?"
candidates = [
    "Password reset instructions are in Account Settings.",
    "Our office is open 9-5 Monday to Friday.",
    "To reset your password, click Forgot Password on the login page.",
    "We accept Visa, Mastercard, and PayPal.",
    "If you forgot your password, visit the login page and click Reset.",
]

# Reranker takes (query, passage) pairs
pairs = [[query, c] for c in candidates]
scores = reranker.compute_score(pairs)

ranked = sorted(zip(scores, candidates), reverse=True)
for score, passage in ranked[:3]:
    print(f"{score:.3f}: {passage[:60]}")

SECTION 07

Gotchas

Memory on large models. bge-large with FP32 needs ~1.3GB VRAM. Use use_fp16=True to halve this. For CPU-only inference, expect ~3× the time of a small model.

Chunk size matters more than model choice. A 512-token chunk with a mid-tier model often outperforms a 4096-token chunk with the best model. Sentence-level or 256-token window chunks tend to work well for Q&A retrieval.

Always normalise before dot product. Use normalize_embeddings=True and then dot product (faster than cosine_similarity with an extra sqrt). Unnormalised vectors give misleading similarity scores.
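
To see why, here is a small numpy sketch: after L2 normalisation, a plain dot product gives exactly the cosine similarity (toy 2-d vectors stand in for real embeddings):

```python
import numpy as np

# Toy vectors standing in for raw (unnormalised) embeddings
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Cosine similarity the long way
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalise once, then a plain dot product gives the same number,
# without recomputing norms at query time
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
dot = a_n @ b_n   # equals cosine
```

With normalize_embeddings=True the model output is already in the normalised form, so the cheap dot product is all you need.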

BGE-M3 for multilingual. If your corpus has mixed languages, bge-m3 supports 100+ languages and can do dense + sparse retrieval simultaneously — effectively replacing separate BM25 and dense retrieval pipelines.
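
As a sketch of how the two outputs might be combined, a weighted sum over per-document scores is a common choice; the fusion weight and score values below are illustrative, not BGE defaults:

```python
# Hypothetical fusion of per-document dense and sparse scores as a single
# weighted sum. The 0.7 dense / 0.3 sparse split is an illustrative choice.
def fuse_scores(dense, sparse, w_dense=0.7):
    return {doc: w_dense * dense.get(doc, 0.0) + (1 - w_dense) * sparse.get(doc, 0.0)
            for doc in set(dense) | set(sparse)}

dense = {"doc1": 0.82, "doc2": 0.41, "doc3": 0.77}   # vector similarity
sparse = {"doc1": 0.10, "doc2": 0.90}                 # doc3: no lexical overlap

fused = fuse_scores(dense, sparse)
best = max(fused, key=fused.get)   # doc1: strong on both signals
```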

SECTION 08

BGE fine-tuning and production deployment

BGE models are the most commonly fine-tuned open-source embedding models because FlagEmbedding provides a complete fine-tuning pipeline with hard negative mining, contrastive training loss, and evaluation scripts. Domain-specific BGE fine-tuning on 5,000–50,000 query-passage pairs from the target domain consistently outperforms the base BGE model on in-domain retrieval by 5–15 nDCG points. The fine-tuning pipeline supports both full model training and LoRA-based efficient fine-tuning, with LoRA being sufficient for most domain adaptation tasks at a fraction of the compute cost.
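
Hard negative mining, the core of that pipeline, can be sketched in a few lines; the scores and helper below are hypothetical, not FlagEmbedding's actual implementation:

```python
# Illustrative hard-negative mining: for each training query, the
# highest-scoring corpus passages that are NOT the labelled positive make
# the hardest (most informative) negatives for contrastive training.
def mine_hard_negatives(scores, positive_id, k=2):
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc for doc in ranked if doc != positive_id][:k]

# Retrieval scores for one query against a toy corpus
scores = {"p1": 0.91, "p2": 0.88, "p3": 0.40, "p4": 0.85}
negatives = mine_hard_negatives(scores, positive_id="p1")
# p2 and p4 score nearly as high as the positive, so they are kept
```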

BGE-M3 represents the most versatile embedding model in the BGE family, combining three retrieval approaches in a single model: dense retrieval (standard vector similarity), sparse retrieval (learned term weights similar to SPLADE), and multi-vector retrieval (ColBERT-style late interaction). This multi-functionality allows BGE-M3 to serve as a complete retrieval backbone without requiring separate models for each retrieval strategy, reducing infrastructure complexity for systems that need hybrid sparse-dense or late-interaction retrieval.
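
The late-interaction component scores a query against a document with MaxSim: each query token takes its best-matching document token, and the per-token maxima are summed. A toy numpy sketch, with random vectors in place of real token embeddings:

```python
import numpy as np

# MaxSim late interaction (ColBERT-style): for each query token, take the
# maximum similarity over all document tokens, then sum over query tokens.
def maxsim(query_toks, doc_toks):
    sim = query_toks @ doc_toks.T          # (n_q, n_d) token-level similarities
    return float(sim.max(axis=1).sum())    # best doc token per query token

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))    # 4 query tokens, toy 8-dim vectors
d = rng.standard_normal((10, 8))   # 10 document tokens
score = maxsim(q, d)
```

Because each query token keeps only its single best match, adding tokens to the document can never lower the score, which is why late interaction is robust to long documents with irrelevant spans.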

BGE model selection should start with the MTEB leaderboard filtered to the task categories relevant to the application. BGE-M3 leads on retrieval tasks requiring multilingual or multi-representation support; bge-large-en-v1.5 is competitive on English-only retrieval benchmarks while being faster and less memory-hungry than M3; bge-small-en-v1.5 has the smallest latency footprint for latency-sensitive applications that can tolerate roughly 5 to 8 nDCG points of degradation versus the larger models. Before committing to a model, run a quick A/B evaluation on a domain-representative query set: it takes more effort than comparing MTEB scores but is far more predictive of in-domain performance.
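
A minimal A/B harness needs little more than recall@k over a labelled query set. Everything below (queries, documents, rankings) is hypothetical placeholder data:

```python
# Recall@k: fraction of queries whose known relevant document appears in the
# top-k retrieved results. Run once per candidate model, compare the numbers.
def recall_at_k(results, gold, k=5):
    hits = sum(1 for q, docs in results.items() if gold[q] in docs[:k])
    return hits / len(gold)

gold = {"q1": "d3", "q2": "d7"}                     # relevant doc per query
model_a = {"q1": ["d3", "d1"], "q2": ["d2", "d7"]}  # ranked retrievals, model A
model_b = {"q1": ["d9", "d3"], "q2": ["d4", "d1"]}  # ranked retrievals, model B

r_a = recall_at_k(model_a, gold, k=2)   # both queries hit
r_b = recall_at_k(model_b, gold, k=2)   # only q1 hits
```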

BGE's instruction prefix is one of the most impactful and underused features of the model family. Prepending the trained query instruction ("Represent this sentence for searching relevant passages: ") to queries, while leaving passages unprefixed, tells the model the input is a question rather than a passage and activates the matching learned representations. The improvement from using the correct prefix over no prefix can be 3–8 percentage points on nDCG@10, comparable to the gain from switching to a larger model. That makes getting the prefix right the highest-ROI optimization for BGE-based retrieval systems before investing in fine-tuning.

Embedding model versioning requires careful management in production systems. Switching models or versions (e.g., from bge-large-en-v1.5 to bge-m3) changes the embedding space, making embeddings from different models incompatible for similarity search. A vector database built with v1.5 embeddings must be fully re-embedded before switching to M3; the two models cannot share the same index. Version pinning in requirements files and explicit re-embedding pipelines triggered by model changes are essential operational practices to prevent silent quality degradation from mixing embeddings produced by different models.
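
One way to enforce this operationally is to store the producing model's id alongside the index and refuse mismatched queries. The metadata layout and helper below are illustrative, not any particular vector database's API:

```python
# Guard a vector index with the model id that produced its embeddings.
INDEX_META = {"embedding_model": "BAAI/bge-large-en-v1.5", "dims": 1024}

def check_model(meta, model_name):
    """Fail fast instead of silently mixing incompatible embedding spaces."""
    if meta["embedding_model"] != model_name:
        raise RuntimeError(
            f"Index built with {meta['embedding_model']}; "
            f"re-embed before querying with {model_name}"
        )

check_model(INDEX_META, "BAAI/bge-large-en-v1.5")   # ok, same model
try:
    check_model(INDEX_META, "BAAI/bge-m3")          # incompatible space
except RuntimeError as e:
    mismatch = str(e)
```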

BGE-M3's sparse retrieval capability produces SPLADE-style term weights: learned importance scores over vocabulary terms that go beyond simple term frequency, capturing how semantically relevant each term is to the document's meaning. Because they are ordinary term weights, they can be stored in a standard inverted index and combined with, or substituted for, BM25 scoring. On the BEIR benchmark, hybrid retrieval combining BGE-M3's dense and sparse outputs scores higher than either component alone, making BGE-M3 a single-model solution for teams who want hybrid retrieval without deploying separate BM25 and dense retrieval pipelines.
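
The sparse score itself is just a dot product over the terms the query and document share; a toy sketch with made-up learned weights:

```python
# Illustrative SPLADE/BGE-M3-style sparse scoring: sum of query-weight times
# document-weight over shared vocabulary terms (weights here are invented).
def sparse_score(query_w, doc_w):
    return sum(w * doc_w[t] for t, w in query_w.items() if t in doc_w)

query_w = {"password": 1.8, "reset": 1.5}              # learned query weights
doc_w = {"password": 1.2, "reset": 0.9, "page": 0.3}   # learned doc weights

score = sparse_score(query_w, doc_w)   # 1.8*1.2 + 1.5*0.9 = 3.51
```

Unlike BM25, the weights are produced by the model, so a term can score highly even when its raw frequency in the document is low.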

BGE model quantization reduces inference latency by 2–4x with minimal quality degradation. Quantizing BGE-large to INT8 using ONNX Runtime's quantization tools produces models that run at near-FP32 quality with 4x lower memory usage and 2–3x faster inference on CPU. For edge deployment scenarios where GPU is not available, the quantized BGE-small-en-v1.5 model achieves throughput of 200–500 embeddings per second on modern CPUs, sufficient for real-time query embedding in low-traffic applications without the operational cost of GPU instances.
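
Assuming the model has already been exported to ONNX (e.g. with Hugging Face optimum), dynamic INT8 quantization is a single call; the file paths below are placeholders:

```python
# Dynamic INT8 quantization of an exported BGE ONNX model.
# Both paths are hypothetical; the .onnx file must exist beforehand.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="bge-large-en-v1.5.onnx",        # exported FP32 model
    model_output="bge-large-en-v1.5.int8.onnx",  # INT8-weight output
    weight_type=QuantType.QInt8,
)
```

The quantized model loads into a standard onnxruntime InferenceSession in place of the FP32 original.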

BGE's integration with the FAISS library enables billion-scale retrieval from BGE embeddings using GPU-accelerated approximate nearest neighbor search. The standard workflow embeds all documents with BGE, builds a FAISS IVF-PQ index for compressed approximate search, and queries the index at retrieval time. The FAISS nlist parameter (number of coarse clusters) and nprobe parameter (clusters searched at query time) control the recall-latency tradeoff. For BGE-large embeddings at 1024 dimensions, an IVF4096,PQ128 index typically achieves 95% recall at 10ms query latency for corpora of 10M+ documents.

BGE embedding models follow a consistent naming convention that encodes key properties: bge-{size}-{lang}-v{version}. The size variants (small, base, large) trade accuracy for speed: bge-small produces 384-dimensional embeddings from 33M parameters, ideal for edge deployment, while bge-large produces 1024-dimensional embeddings from 335M parameters for maximum retrieval quality. Language variants (en, zh, multilingual) indicate the primary training language, with multilingual variants requiring more parameters to cover diverse scripts. Version numbers track training-data and methodology improvements, with later versions consistently outperforming earlier ones on MTEB benchmarks.
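
A tiny illustrative parser for names that follow the size-lang-version pattern (not an official utility, and note that models like bge-m3 deliberately break the pattern):

```python
# Parse "org/bge-{size}-{lang}-v{version}" model ids into their parts.
# Only handles the full four-part pattern; bge-m3 etc. would not fit.
def parse_bge_name(model_name):
    parts = model_name.split("/")[-1].split("-")   # drop the "BAAI/" org prefix
    return {"family": parts[0], "size": parts[1],
            "lang": parts[2], "version": parts[3]}

info = parse_bge_name("BAAI/bge-large-en-v1.5")
# {'family': 'bge', 'size': 'large', 'lang': 'en', 'version': 'v1.5'}
```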