Hybrid search is a powerful technique that integrates multiple search algorithms to enhance the accuracy and relevance of search results. Combining the strengths of full-text and vector search capabilities, hybrid search delivers a more effective and comprehensive user search experience.
In version 1.6.2 of MyScaleDB, this hybrid search feature was introduced. Let’s explore how it was implemented and what makes it a game-changer in the search landscape.
# Combining Vector Search and Full-Text Search
In a hybrid search system within a vector database, two distinct types of searches are typically employed: vector search and full-text search.
- Vector Search: This approach finds results based on semantic similarity, often leveraging machine learning models that grasp the meaning and context of words and phrases. It is particularly useful for capturing nuanced or conceptually related results across documents.
- Full-text Search: This method directly matches keywords or phrases from the text. It is more traditional and highly effective when the exact wording of a query is essential.
Each of these search types has its strengths and limitations. Full-text search excels at basic keyword retrieval and text matching, making it ideal for queries where specific terms are crucial. On the other hand, vector search shines in cross-document semantic matching and understanding deeper meanings, though it may be less efficient with short text queries.
Hybrid search combines the capabilities of both vector and full-text searches, achieving a “best-of-both-worlds” approach. This integration allows users to benefit from precise keyword matching while also capturing the broader semantic context, resulting in a more comprehensive and effective search experience.
# Fusion Algorithms
Each search method—vector and full-text—produces a set of results, each accompanied by a relevance score. These scores reflect how closely a particular result matches the search query based on the specific method used.
A fusion algorithm effectively merges these results. The algorithm adjusts and normalizes the scores from both the vector and full-text searches, making them comparable and allowing for a seamless combination. This process ensures that the final results presented to the user integrate the strengths of both search methods, delivering results that are both semantically relevant and textually precise.
In MyScaleDB, hybrid search leverages the BM25 score from text searches (referred to as “lex” for lexical) and the distance metric from vector searches (referred to as “sem” for semantic). To achieve this integration, MyScaleDB currently supports two fusion algorithms: Relative Score Fusion (RSF) and Reciprocal Rank Fusion (RRF).
# Relative Score Fusion (RSF)
Relative Score Fusion (RSF) is a method used in hybrid search systems to effectively combine the results from vector and full-text searches. The process involves two key steps: normalization of scores and weighted sum calculation.
Normalization of Scores:
- Normalization: RSF begins by normalizing the scores from both vector and full-text searches. This involves converting the raw scores into a common scale, typically ranging from 0 to 1.
- Highest and Lowest Scores: The highest score in each search type (vector and text) is scaled to 1, representing the most relevant result within that search. Conversely, the lowest score is scaled to 0, representing the least relevant result.
- Proportional Adjustment: All other scores are proportionally adjusted within this 0-1 range based on their relative position between the highest and lowest scores.
Weighted Sum:
- Weight Assignment: After normalization, each score is multiplied by a specific weight that reflects the importance assigned to that particular search type (vector or full-text).
- Final Score Calculation: The final score for each result is determined by summing these weighted, normalized scores, resulting in a balanced and comprehensive ranking that integrates the strengths of both search methods.
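The normalization formula used is min-max scaling, applied separately to each search type:

$$
\text{score}_{\text{normalized}} = \frac{\text{score} - \text{score}_{\min}}{\text{score}_{\max} - \text{score}_{\min}}
$$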
The purpose of RSF is to create a unified ranking that integrates both semantic relevance from vector search and textual accuracy from full-text search. By normalizing the scores and applying appropriate weights, RSF ensures that the final results are balanced, combining the strengths of both search methods to deliver a more comprehensive and precise outcome for the user.
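The two RSF steps described above (min-max normalization, then a weighted sum) can be sketched in a few lines of Python. This is an illustrative sketch, not MyScaleDB's internal implementation: the function names, the `lex_weight` parameter, and the toy document IDs and scores are all hypothetical. It assumes both inputs are "higher is better" scores (a distance would need to be converted to a similarity first) and that a document missing from one result set contributes 0 for that search type.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]: the highest becomes 1, the lowest becomes 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: if every score is identical, treat all results as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def relative_score_fusion(
    lex: Dict[str, float], sem: Dict[str, float], lex_weight: float = 0.5
) -> Dict[str, float]:
    """Weighted sum of normalized lexical and semantic scores (hypothetical helper)."""
    lex_norm = min_max_normalize(lex)
    sem_norm = min_max_normalize(sem)
    fused = {}
    for doc_id in set(lex_norm) | set(sem_norm):
        # Assumption: a document absent from one result set scores 0 there.
        fused[doc_id] = (
            lex_weight * lex_norm.get(doc_id, 0.0)
            + (1.0 - lex_weight) * sem_norm.get(doc_id, 0.0)
        )
    return fused


# Toy data: BM25-like scores and similarity-like scores, both higher-is-better.
lex_scores = {"a": 12.0, "b": 8.0, "c": 4.0}
sem_scores = {"a": 0.5, "b": 0.9, "d": 0.1}

fused = relative_score_fusion(lex_scores, sem_scores, lex_weight=0.4)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking[:2])
```

With these toy inputs, document `b` ranks first: it is the best semantic match and a mid-ranked lexical match, and the semantic side carries the larger weight.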
# Reciprocal Rank Fusion (RRF)
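Reciprocal Rank Fusion (RRF) is an alternative method used in hybrid search systems to combine results from different search methods, such as vector and full-text searches. Unlike RSF, RRF does not require score normalization. Instead, it scores each result based on its position in each result set using the following formula:

$$
\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}
$$

Here, $R$ is the collection of result sets, $\text{rank}_r(d)$ is the position of result $d$ in result set $r$ (starting from 1), and $k$ is a tunable constant that adjusts the importance of lower-ranked results: a larger $k$ narrows the gap between high and low ranks.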
The key advantage of RRF is its simplicity. By bypassing the need for score normalization, RRF focuses directly on rank positions, making it easier to merge results from different search methods, particularly when the scoring scales of those methods are not directly comparable.
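As a rough illustration of rank-based fusion, the sketch below sums `1 / (k + rank)` over every result set in which a document appears. The function name and the toy rankings are hypothetical, and `k=60` is only a commonly used value, not necessarily MyScaleDB's default. Documents missing from a result set are assumed to contribute nothing for that set.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse ranked lists by summing 1 / (k + rank), with ranks starting at 1."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


# Toy rankings, best result first. Raw scores are not needed at all.
lex_ranking = ["a", "b", "c"]
sem_ranking = ["b", "d", "a"]

rrf_scores = reciprocal_rank_fusion([lex_ranking, sem_ranking], k=60)
rrf_order = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
print(rrf_order)
```

Documents that appear in both lists (`a` and `b` here) accumulate two terms and therefore rise above documents found by only one search method.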
# How to Perform Hybrid Search in MyScaleDB
Let’s walk through how to perform a hybrid search in MyScale using an example. The SQL queries shown below are based on a Wikipedia dataset that was pre-imported into the MyScale cluster.
First, create a table with a full-text search (FTS) index on the text column 'body' and an MSTG vector index using Cosine distance on the vector column 'body_vector'.
CREATE TABLE wiki_abstract
(
`id` UInt64,
`body` String,
`title` String,
`url` String,
`body_vector` Array(Float32),
VECTOR INDEX body_vec_idx body_vector TYPE MSTG('metric_type=Cosine'),
INDEX body_idx body TYPE fts('{"body":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}') GRANULARITY 1,
CONSTRAINT check_length CHECK length(body_vector) = 1024
) ENGINE = MergeTree ORDER BY id;
In this SQL CREATE TABLE statement, we’ve defined an FTS index on 'body' for BM25 full-text search and an MSTG vector index on 'body_vector' for cosine-distance search over the 1024-dimensional embeddings generated from the text.
Next, we can execute a hybrid search that utilizes both the text and vector columns. Here’s how it works:
SELECT id, body, HybridSearch('fusion_type=RSF', 'fusion_weight=0.4')(body_vector, body,
HuggingFaceEmbedText('Who won the Polar Medal'), 'Who won the Polar Medal') AS score
FROM wiki_abstract
ORDER BY score DESC
LIMIT 5;
This hybrid search query performs a vector search using the distance() function on the body_vector column to find the objects most similar to the query vector. At the same time, a full-text search is conducted using the TextSearch() function on the body column, ranking results by their BM25 score for the query terms. The two result sets are then combined using the selected fusion algorithm, RSF in this case, and the top candidates are returned.
In this example, we applied a weight of 0.4 for the full-text search and 0.6 for vector search. You can adjust these weights to experiment and find the optimal balance for your data and search requirements.
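To get a feel for how the weight shifts the ranking, the sketch below sweeps a lexical weight over already-normalized toy scores and records which document comes out on top. It is independent of MyScaleDB: the `fuse` helper, the `lex_weight` name, and the scores are hypothetical, and the semantic weight is assumed to be `1 - lex_weight`.

```python
from typing import Dict

# Toy normalized scores in [0, 1]: "a" is the best keyword match,
# "b" is the best semantic match.
lex_norm = {"a": 1.0, "b": 0.5}
sem_norm = {"a": 0.5, "b": 1.0}


def fuse(lex: Dict[str, float], sem: Dict[str, float], lex_weight: float) -> Dict[str, float]:
    """Weighted sum of normalized scores; the semantic weight is 1 - lex_weight."""
    doc_ids = set(lex) | set(sem)
    return {
        d: lex_weight * lex.get(d, 0.0) + (1.0 - lex_weight) * sem.get(d, 0.0)
        for d in doc_ids
    }


top_by_weight = {}
for lex_weight in (0.2, 0.4, 0.6, 0.8):
    fused_scores = fuse(lex_norm, sem_norm, lex_weight)
    top_by_weight[lex_weight] = max(fused_scores, key=fused_scores.get)

print(top_by_weight)
```

In this toy setup the top result flips from the semantic favorite to the keyword favorite once the lexical weight passes 0.5, which is the kind of behavior to look for when tuning the weight on real queries.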
# When to Use Hybrid Search?
Hybrid search queries are especially useful for search systems that aim to harness the power of semantic vector search while also depending on the precision of exact keyword matches in full-text search. For instance, a search query like “Who won the Polar Medal” would yield more accurate and relevant results with a hybrid search than with a standard keyword-based full-text search or even a purely semantic vector search.
In this scenario:
- The top 5 results from the vector search contain 3 correct answers.
- The top 5 results from the full-text search contain 2 correct answers.
- As demonstrated in the previous section, the top 5 results from the hybrid search contain 4 correct answers.
By combining vector and full-text search, hybrid search significantly enhances search precision, delivering higher search accuracy and more relevant results.
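The counts above correspond to precision@5 values of 0.6, 0.4, and 0.8. The sketch below shows how such a figure can be computed. The `precision_at_k` helper and all document IDs are hypothetical placeholders, arranged only so that the hit counts match the 3, 2, and 4 reported above; they are not the actual query results.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Placeholder IDs: "r*" stands for a correct answer, "x*" for an incorrect one.
relevant = {"r1", "r2", "r3", "r4"}
vector_top5 = ["r1", "x1", "r2", "r3", "x2"]   # 3 correct
text_top5 = ["r1", "x3", "x4", "r4", "x5"]     # 2 correct
hybrid_top5 = ["r1", "r2", "r3", "r4", "x1"]   # 4 correct

p_vector = precision_at_k(vector_top5, relevant, k=5)
p_text = precision_at_k(text_top5, relevant, k=5)
p_hybrid = precision_at_k(hybrid_top5, relevant, k=5)
print(p_vector, p_text, p_hybrid)
```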
# Conclusion
Hybrid search is important because it improves the precision and relevance of search results by combining the directness of keyword search with the contextual depth of semantic search. This dual approach ensures users receive not only exact matches but also contextually relevant results, even if different terminology is used. Hybrid search also excels in handling diverse queries, providing comprehensive and adaptive results that align with user intent. Its ability to optimize performance and adapt to various applications makes it a key technology in modern information retrieval.
This article has explained the concept of hybrid search and its implementation in MyScale. If you have any questions about how to use hybrid search in MyScale, check out the MyScale doc or join our Discord for help.