Hybrid Search Fundamentals

Duration: 5 min

This module introduces hybrid search, which combines lexical (keyword-based) search with semantic (embedding-based) search so that results match both the exact terms and the meaning of a query. Understanding hybrid search helps when building retrieval-augmented generation (RAG) systems that must handle complex queries.

Vector Databases and Embeddings

Vector databases store data points as vectors in a multi-dimensional space, which allows efficient similarity search. Embeddings are vector representations of words, phrases, or documents that capture semantic meaning. Once text is converted into embeddings, a search can match on meaning rather than on exact keywords. Similarity between two embeddings is commonly measured with cosine similarity, as the following example shows with hand-picked three-dimensional vectors.

import numpy as np

# Example embeddings for words
embeddings = {
    'cat': np.array([0.1, 0.2, 0.3]),
    'dog': np.array([0.3, 0.2, 0.1]),
    'animal': np.array([0.2, 0.25, 0.25])
}

# Function to compute cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compute similarity between 'cat' and 'animal'
similarity = cosine_similarity(embeddings['cat'], embeddings['animal'])
print(f'Cosine similarity between cat and animal: {similarity}')

Output (rounded to four decimal places):

Cosine similarity between cat and animal: 0.9540
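
The core operation of a vector database is finding the stored vectors closest to a query vector. The following is a minimal brute-force sketch of that idea, not how any particular database implements it; production systems typically use approximate indexes instead of scanning every row. The names `labels`, `vectors`, `top_k`, and the query values are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical toy "index": one row per stored item (values reuse the example above)
labels = ['cat', 'dog', 'animal']
vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.3, 0.2, 0.1],
    [0.2, 0.25, 0.25],
])

def top_k(query_vec, vectors, labels, k=2):
    """Return the k stored items most similar to query_vec by cosine similarity."""
    # Normalize rows so that a plain dot product equals cosine similarity
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query_vec / np.linalg.norm(query_vec)
    scores = unit_vectors @ unit_query
    # argsort is ascending, so reverse it and keep the first k indices
    best = np.argsort(scores)[::-1][:k]
    return [(labels[i], float(scores[i])) for i in best]

# Hypothetical query embedding, chosen to lie close to 'cat'
query = np.array([0.12, 0.2, 0.28])
results = top_k(query, vectors, labels, k=2)
for label, score in results:
    print(f'{label}: {score:.4f}')
```

With these toy values, 'cat' ranks first and 'animal' second, because the query vector points in nearly the same direction as the 'cat' vector.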

Chunking and Reranking

Chunking breaks large documents into smaller pieces, called chunks, so that each piece can be indexed and retrieved on its own. This makes searches more granular and context-aware. Reranking reorders an initial set of search results by relevance, often combining lexical and semantic signals to improve the quality of the top results. The example below reranks three documents using a lexical signal only: cosine similarity between TF-IDF vectors.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example documents
documents = [
    'The cat sat on the mat.',
    'The dog barked at the cat.',
    'The animal ran quickly.'
]

# Query
query = 'The cat and the dog.'

# Vectorize documents and query
vectorizer = TfidfVectorizer()
vectorized_docs = vectorizer.fit_transform(documents)
vectorized_query = vectorizer.transform([query])

# Compute similarities
similarities = cosine_similarity(vectorized_query, vectorized_docs).flatten()

# Rerank documents based on similarity
ranked_docs = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)

# Print results
for doc, score in ranked_docs:
    print(f'Document: {doc}, Similarity: {score}')
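
The example above reranks with a single lexical signal. One common way to combine it with a semantic signal is reciprocal rank fusion (RRF), sketched below under stated assumptions: the two input rankings are hypothetical (they are written by hand, not computed), document ids 0 to 2 stand for the three example documents, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings for the three example documents (ids 0, 1, 2)
lexical_ranking = [1, 0, 2]   # e.g. ordered by TF-IDF similarity
semantic_ranking = [2, 1, 0]  # e.g. ordered by embedding similarity

fused_ranking = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
for doc_id, score in fused_ranking:
    print(f'Document {doc_id}: fused score {score:.4f}')
```

RRF uses only rank positions, so the lexical and semantic scores do not need to be on the same scale. Here document 1 comes out on top because it ranks first or second in both lists.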

💡 Tip: When implementing chunking, ensure that the chunks are semantically coherent to maintain the context and meaning of the original document.
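
One simple way to keep chunks coherent is to split only at sentence boundaries. The sketch below groups whole sentences into chunks under a word budget. The function name `chunk_by_sentences`, the `max_words` parameter, and the regex-based sentence split are illustrative assumptions; real pipelines often use a proper sentence tokenizer and add overlap between chunks.

```python
import re

def chunk_by_sentences(text, max_words=20):
    """Group whole sentences into chunks of at most max_words words.

    A sentence longer than max_words becomes its own chunk rather than being cut.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks

text = (
    'The cat sat on the mat. The dog barked at the cat. '
    'The animal ran quickly. Then everything was quiet again.'
)
chunks = chunk_by_sentences(text, max_words=12)
for i, chunk in enumerate(chunks):
    print(f'Chunk {i}: {chunk}')
```

With a budget of 12 words, the four sentences fall into two chunks of two sentences each, and no sentence is split in the middle.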

❓ What is the primary purpose of using embeddings in a vector database?

To store data points as integers
To capture semantic meaning of text
To perform keyword matching
To encrypt data

❓ What is the goal of reranking in a hybrid search system?

To increase the number of search results
To reorder results based on relevance
To filter out irrelevant documents
To improve the speed of the search

Key Concepts

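Concept	Description
Vector	Numeric representation (embedding) of text, used for semantic similarity search
Keyword	Lexical signal based on matching the exact terms of a query
Combination	Hybrid search merges lexical and semantic signals
Ranking	Ordering, and reordering (reranking), of results by relevance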

Check Your Understanding

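❓ What is the main purpose of hybrid search?

To replace keyword search with embeddings
To combine lexical and semantic search for more relevant results
To encrypt documents before indexing
To increase the number of search results

❓ Which of these is a key characteristic of hybrid search?

It relies on keyword matching only
It relies on vector similarity only
It combines lexical and semantic signals
It returns results without ranking them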
