Sentence Transformers (SBERT) with PyTorch: Similarity and Semantic Search
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Sentence Transformers score semantic similarity by embedding sentences into fixed-length vectors and comparing them with cosine similarity.
Briefing
Sentence Transformers (SBERT) turn sentences into fixed-length embeddings and then use cosine similarity to score semantic closeness—making it practical to run semantic search and rank results by meaning rather than keywords. The core idea is simple: encode each sentence into a vector, compare vectors with cosine similarity, and treat the resulting score as a measure of how related two texts are. That approach matters because it avoids the slowness of running full BERT-style models for every pair during search, enabling fast similarity scoring across a corpus.
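A minimal sketch of that encode-then-compare loop, using the sentence-transformers library's util.cos_sim helper (the checkpoint name and example sentences here are illustrative, not taken from the transcript):

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained sentence-embedding model (checkpoint choice is an
# assumption; any Sentence Transformers model follows the same pattern).
model = SentenceTransformer("all-mpnet-base-v2")

# Encode each sentence into a fixed-length vector.
emb_a = model.encode("Bitcoin surged past its previous high.", convert_to_tensor=True)
emb_b = model.encode("Cryptocurrency prices are climbing fast.", convert_to_tensor=True)

# Cosine similarity lies in [-1, 1]; higher means more semantically related.
print(util.cos_sim(emb_a, emb_b).item())
```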
The transcript traces this capability back to SBERT’s Siamese-network training approach. Instead of processing one sentence at a time, Siamese setups feed two inputs through the same neural network, then apply a scoring head (often with a sigmoid) to output a similarity score in a bounded range. This design is what allows SBERT-style models to reuse precomputed embeddings: once sentence vectors are generated for the corpus, queries can be embedded once and compared efficiently against all stored vectors.
A key performance motivation comes from the original work on Siamese BERT embeddings: pairwise scoring can be dramatically accelerated. The transcript cites the paper's headline result, where finding the most similar pair in a collection of about 10,000 sentences dropped from roughly 65 hours of pairwise BERT inference to about five seconds with precomputed embeddings, illustrating why embedding-based similarity is attractive for real-world semantic search.
For the hands-on portion, the workflow starts by installing Sentence Transformers and loading a strong pretrained model from the library's leaderboard: mpnet-base-v2, built on Microsoft's MPNet architecture. The model's maximum sequence length is checked (384 tokens), and it is then used to embed a small corpus of sentences. Each sentence becomes a 768-dimensional vector, and cosine similarity compares a query embedding against the corpus embeddings.
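A sketch of that setup (on the model hub the checkpoint is published as all-mpnet-base-v2; the corpus sentences below are stand-ins for the ones used in the video):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.max_seq_length)  # 384: longer inputs are truncated

# Stand-in corpus mixing the two topics from the video.
corpus = [
    "Bitcoin hit a new all-time high this week.",
    "Ethereum transaction fees remain volatile.",
    "A strong deadlift starts with a tight setup.",
    "Powerlifting meets test the squat, bench press, and deadlift.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
print(corpus_embeddings.shape)  # (4, 768): one 768-dimensional vector per sentence
```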
The example queries show how semantic search behaves in practice. A cryptocurrency-related query ranks crypto sentences near the top, while powerlifting-related sentences land at the bottom. Another query about deadlifting similarly surfaces powerlifting content and pushes unrelated cryptocurrency material down—demonstrating that similarity is driven by meaning and context, not surface wording.
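The ranking step can be reproduced with the library's util.semantic_search, which compares one query embedding against all precomputed corpus embeddings (the query text is an assumption; model and corpus_embeddings come from the sketch above):

```python
from sentence_transformers import util

# Embed the query once, then rank the precomputed corpus vectors.
query_embedding = model.encode("What is driving crypto prices?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=4)

# semantic_search returns one hit list per query; each hit is a dict
# holding a corpus_id (index into the corpus) and a similarity score.
for hit in hits[0]:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```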
Finally, the transcript shows how to fine-tune an existing model to match a custom similarity notion. Using a small labeled training setup, it constructs an InputExample with two sentences and a target similarity score (e.g., 0.9). Training is run with model.fit for multiple epochs using a cosine-similarity-based loss. After saving and reloading the trained model, the similarity score between the chosen sentence pair increases to align with the target label, confirming that the embedding space can be adjusted for a specific task.
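A condensed sketch of that fine-tuning loop, assuming a single labeled pair with target 0.9 (a real run would use many pairs; the epoch count and save path are illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

model = SentenceTransformer("all-mpnet-base-v2")

# One labeled pair with a target cosine similarity of 0.9.
train_examples = [
    InputExample(
        texts=["I love powerlifting.", "The deadlift is my favorite lift."],
        label=0.9,
    )
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.CosineSimilarityLoss(model)  # pushes cos_sim toward the label

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    show_progress_bar=False,
)
model.save("fine-tuned-model")

# Reload and check that the pair's similarity moved toward 0.9.
tuned = SentenceTransformer("fine-tuned-model")
emb = tuned.encode(
    ["I love powerlifting.", "The deadlift is my favorite lift."],
    convert_to_tensor=True,
)
print(util.cos_sim(emb[0], emb[1]).item())
```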
Overall, the takeaway is that SBERT-style models provide a fast, embedding-first route to semantic similarity and search, and they can be tuned to better reflect domain-specific judgments of what “similar” should mean.
Cornell Notes
Sentence Transformers (SBERT) convert sentences into 768-dimensional embeddings and use cosine similarity to score semantic closeness. The method relies on Siamese networks: the same model processes two sentences, and a scoring head produces a bounded similarity score. This embedding approach enables efficient semantic search because corpus embeddings can be precomputed, then compared to a query embedding without rerunning a full pairwise transformer each time. The transcript demonstrates semantic search with the pretrained model mpnet-base-v2, including ranking results for cryptocurrency and powerlifting queries. It also shows fine-tuning by training on labeled sentence pairs (e.g., target similarity 0.9) and verifying that the similarity score increases after training.
Why do Siamese networks make sentence similarity search faster than running BERT-style models for every pair?
How does cosine similarity fit into SBERT-style semantic search?
What does the mpnet-base-v2 model contribute in the example workflow?
What does the semantic_search output structure represent?
How does fine-tuning change similarity scores in the transcript’s example?
Review Questions
- In an embedding-based semantic search pipeline, which computations can be precomputed for the corpus, and which computations must be done per query?
- How would you expect the ranking to change if you replaced cosine similarity with a different similarity metric (e.g., dot product) while keeping embeddings the same?
- What kinds of labeled examples would you need to fine-tune SBERT effectively for a domain-specific definition of “semantic similarity”?
Key Points
1. Sentence Transformers score semantic similarity by embedding sentences into fixed-length vectors and comparing them with cosine similarity.
2. Siamese-network training enables an embedding-first workflow where corpus embeddings can be precomputed for fast search.
3. The transcript demonstrates semantic search using mpnet-base-v2, embedding each sentence into a 768-dimensional vector.
4. semantic_search ranks corpus candidates by similarity score and returns each hit as a corpus_id/score pair.
5. Fine-tuning can align similarity scores with custom labels by training on labeled sentence pairs using a cosine-similarity-based loss.
6. After fine-tuning, re-encoding with the saved model can noticeably increase similarity for the targeted sentence pair.