Hybrid Search Fundamentals
Duration: 5 min
This module covers hybrid search, which combines lexical (keyword-based) and semantic (embedding-based) search to return more accurate and relevant results than either approach typically achieves alone. Understanding hybrid search is important for building retrieval-augmented generation (RAG) systems that must handle complex queries, where both exact terms and overall meaning matter.
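One common way to combine the two signals is a weighted sum of normalized scores. The sketch below is illustrative only: the function names `min_max` and `hybrid_scores`, the weight `alpha`, and the score values are assumptions made up for this example, not part of any particular library.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(lexical, semantic, alpha=0.5):
    """Blend two score lists; alpha is the weight given to the semantic side."""
    lex_n, sem_n = min_max(lexical), min_max(semantic)
    return [alpha * s + (1 - alpha) * l for l, s in zip(lex_n, sem_n)]


# Hypothetical scores for three documents (invented for illustration).
lexical = [2.0, 0.5, 0.0]      # e.g. keyword-match scores
semantic = [0.25, 0.75, 0.5]   # e.g. embedding cosine similarities

combined = hybrid_scores(lexical, semantic, alpha=0.5)

# Document indices ordered from most to least relevant.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(combined)
print(ranking)
```

Raising `alpha` toward 1 favors semantic similarity, while lowering it favors keyword matches. Normalizing first matters because lexical and semantic scores usually live on different scales.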
Vector Databases and Embeddings
Vector databases store data points as vectors in a multi-dimensional space, allowing for efficient similarity searches. Embeddings are vector representations of words, phrases, or documents that capture semantic meaning. By converting text into embeddings, we can perform semantically rich searches that go beyond keyword matching.
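To make "similarity search" concrete, here is a minimal, dependency-free sketch of a brute-force top-k lookup over a tiny in-memory store. The store contents, the keys `doc_a` to `doc_c`, and the helper names `cosine` and `top_k` are hypothetical. Production vector databases typically replace this linear scan with an approximate index so that lookups stay fast on large collections.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(key, cosine(query, vec)) for key, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors, invented for illustration.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
query = [1.0, 0.1, 0.0]

results = top_k(query, store, k=2)
for key, score in results:
    print(f"{key}: {score:.3f}")
```

The query points almost entirely along the first axis, so `doc_a` ranks first and `doc_c`, which shares that direction only partly, ranks second.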
import numpy as np

# Example embeddings for words (toy 3-dimensional vectors)
embeddings = {
    'cat': np.array([0.1, 0.2, 0.3]),
    'dog': np.array([0.3, 0.2, 0.1]),
    'animal': np.array([0.2, 0.25, 0.25]),
}

# Function to compute cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compute similarity between 'cat' and 'animal'
similarity = cosine_similarity(embeddings['cat'], embeddings['animal'])
print(f'Cosine similarity between cat and animal: {similarity:.4f}')

Output:
Cosine similarity between cat and animal: 0.9540

Chunking and Reranking
Chunking involves breaking down large documents into smaller, manageable pieces called chunks. This allows for more granular and context-aware searches. Reranking is the process of reordering search results based on relevance, often using a combination of lexical and semantic signals to improve the quality of the top results.
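As a rough sketch of chunking, the function below groups whole sentences into chunks and repeats the last sentence of each chunk at the start of the next, so context carries across boundaries. The name `chunk_sentences`, the `max_chars` and `overlap` parameters, and the regex-based sentence splitter are assumptions for illustration. Real pipelines often chunk by tokens and use a more robust sentence splitter.

```python
import re


def chunk_sentences(text, max_chars=60, overlap=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of a chunk are repeated at the start of
    the next one. A chunk may exceed max_chars if a sentence is long.
    """
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = ' '.join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(' '.join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(' '.join(current))
    return chunks


# Sample text; the last sentence is invented for illustration.
text = (
    'The cat sat on the mat. The dog barked at the cat. '
    'The animal ran quickly. Then it rained.'
)
chunks = chunk_sentences(text, max_chars=60, overlap=1)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

Splitting on sentence boundaries rather than at a fixed character offset is one simple way to keep each chunk readable on its own.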
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example documents
documents = [
    'The cat sat on the mat.',
    'The dog barked at the cat.',
    'The animal ran quickly.',
]

# Query
query = 'The cat and the dog.'

# Vectorize documents and query with TF-IDF (a lexical signal)
vectorizer = TfidfVectorizer()
vectorized_docs = vectorizer.fit_transform(documents)
vectorized_query = vectorizer.transform([query])

# Compute similarities
similarities = cosine_similarity(vectorized_query, vectorized_docs).flatten()

# Rerank documents based on similarity, highest first
ranked_docs = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)

# Print results
for doc, score in ranked_docs:
    print(f'Document: {doc}, Similarity: {score:.3f}')

💡 Tip: When implementing chunking, ensure that the chunks are semantically coherent to maintain the context and meaning of the original document.
❓ What is the primary purpose of using embeddings in a vector database?
❓ What is the goal of reranking in a hybrid search system?
Key Concepts
| Concept | Description |
|---|---|
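| Vector | Numeric representation (embedding) of text; items with similar meaning map to nearby vectors |
| Keyword | Lexical matching on the terms a query and a document share, as in TF-IDF |
| Combination | Merging lexical and semantic signals so results reflect both exact terms and meaning |
| Ranking | Ordering, and reordering (reranking), results by relevance so the best matches appear first |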
Check Your Understanding
❓ What is the main purpose of hybrid search?
❓ Which of these is a key characteristic of hybrid search?