Vector search looks for similar vectors or data points in a dataset based on their vector representations. Unlike proprietary vector databases such as Pinecone, Milvus, Qdrant, and Weaviate, MyScaleDB is built on the open-source SQL-compatible, ClickHouse database.
Structured Query Language (SQL) is an effective tool for managing relational databases. MyScaleDB combines the power of SQL and vectors to provide a powerful approach to tackling complex AI related questions. User can execute traditional SQL and vector queries on structured data and vector embeddings (data) to address complex queries and analyze high-dimensional data in a unified and efficient manner.
In this blog, we first discuss some of the most popular advanced SQL techniques. Then, we provide examples by considering how MyScaleDB joins complex SQL queries with vector search in a single query to perform vector-related searches and optimizes query execution together with ClickHouse, retrieves data faster and more efficiently.
# Advanced SQL Techniques for Complex Queries
Simple SQL queries are commands that perform straightforward data retrieval, usually from only one table at a time. Complex SQL queries go beyond standard requests by retrieving data from several tables and limiting the result set with multiple conditions.
A complex query could include feartures such as:
- Common Table Expressions
- Subqueries
- Joining to many tables, and using different join types
# Common Table Expressions
A Common Table Expression (CTE) is a name you give a subquery within your main query. The main reason to do this is to simplify your query, making it easier to read and debug. It can sometimes improve performance, which is another benefit, but it's mostly about readability and simplification.
Consider a scenario in which you want to determine the average age of customers who have bought a particular product. You have a table of customer data including their name, age, and the products they have purchased.
Here's an example query to perform this calculation using a CTE:
WITH product_customers AS (
SELECT name, age
FROM customer_data
WHERE product = 'widget'
)
SELECT AVG(age) AS avg_age
FROM product_customers;
This query uses a CTE, which is a temporary named result set (subquery) that can be referenced within a single query.
The CTE is named product_customers
. It's created using a SELECT
statement that retrieves the name and age columns from the customer_data
table for customers who have purchased the product 'widget'.
Moving the subquery to the top of the query and giving it a name will mean it's easier to understand what the query does. If your subquery selects an sample embedding vector, you could name your subquery someting like target_vector_embed
. When you refer to this in the main query, you'll see this name and will know what it refers to.
It's also helpful if you have a long query and need the same logic in several places. You can define it at the top of the query and refer to it multiple times throughout your main query.
So, consider using CTEs whenever you have a subquery as a way to improve the readability of your query.
# Subqueries
A subquery is a simple SQL command embedded within another query. By nesting queries, you can set up larger restrictions on the data included in the result set.
Subqueries can be used in several places within a query, but it's easiest to start with the FROM
statement. Here's an example of a basic subquery:
SELECT sub.*
FROM (
SELECT *
FROM table
WHERE conditions
) sub
WHERE sub.column_1 = 'MyScaleDB';
Let's break down what happens when you run the above query:
First, the database runs the "inner query" - the part between the parentheses. If you run this independently, it produces a result set just like any other query. Once the inner query runs, the outer query will run using the results from the inner query as its underlying table:
SELECT sub.*
FROM (
<< restuls from inner query go here >>
) sub
WHERE sub.column_1 = 'MyScaleDB';
Subqueries are required to have names, which are added after parentheses the same way you would add an alias to a regular table. This query uses the name sub
.
# Using Subqueries in Conditional Logic
You can use subqueries in conditional logic (in conjunction with WHERE
, JOIN
/ON
, or CASE
). The following query returns all the entries from the same date as the specified entry in the dataset:
SELECT *
FROM table
WHERE Date = (SELECT Date
FROM table
WHERE id='00001');
This query works because the result of the subquery is only one cell. Most conditional logic will work with subqueries containing one-cell results. However, IN
is the only type of conditional logic that will work when the inner query contains multiple results:
SELECT *
FROM table
WHERE Date IN (SELECT Date
FROM table
ORDER BY Date
LIMIT 5);
Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN
clause) rather than as a table.
# Joining Tables
Join produces a new table by combining columns from one or multiple tables by using values common to each. Different types of Joins are as follows:
INNER JOIN
: Only matching records are returned.LEFT JOIN
: Returns all records from the left table and the matching records from the right table.RIGHT JOIN
: Returns all records from the right table and the matching records from the left table.FULL JOIN
: Returns all records from both tables when there is a match in either left or right table.CROSS JOIN
: Produces the cartesian product of whole tables, as "join keys" are not specified.
# Using Complex SQL and Vector Queries in MyScaleDB
SQL vector database MyScaleDB includes several features that help you with complex SQL and vector queries. Let's look at some examples highlighting what MyScaleDB's complex queries are capable of.
# Common Table Expressions
MyScaleDB supports CTE and substitutes the code defined in the WITH
clause for the rest of SELECT
query. Named subqueries can be included in the current and child query context anywhere table objects are allowed.
Vector search is a search method that represents data as vectors. It is commonly used in applications such as image search, video search, and text search. MyScaleDB uses the distance()
function to perform vector searches. It calculates the distance between a specified vector and all vector data in a specified column, and returns the top candidates.
In some cases, if the specified vector is obtained from another table or the dimension of the specified vector is large and it is inconvenient to represent, you can use CTE or subquery.
Assume you have a vector table named photo
that stores metadata information linked to the photo library's images, with id
, photo_id
and photo_embed
for feature vector.
The following example treats the result of selection in CTE as a target vector to execute vector search:
WITH target_photo_embed AS (
SELECT photo_embed
FROM photos
LIMIT 1)
SELECT id, photo_id, distance(photo_embed, target_photo_embed) as dist
FROM photos
ORDER BY dist
LIMIT 10;
# Joins and Subqueries
The support for join is limited, and it is recommended to use subquery as a workaround. In MyScaleDB, the vector search is based on the vector index on a table with a vector column. Although the distance()
function appears in the SELECT
clause, its value is calculated during vector search on the table, not after join. The join result may not be the expected result.
THe following are possible workarounds:
You can use the
distance()...WHERE...ORDER BY...LIMIT
query pattern in subqueries that utilize vector indexes and get expected results on vector tables.You can also use subqueries in
WHERE
clause to rewrite the join.
Assume you have another table photo_meta
that stores information about the photo library's images, with photo_id
, photo_author
, year
and title
. The following example retrieves relevant photos taken in 2023 from a collection of images:
SELECT t1.photo_id, distance(t1.photo_embed,[0.0269, 0.0316,...]) as dist
FROM photos t1
JOIN photo_meta t2 ON t1.photo_id = t2.photo_id
WHERE t2.year = 2023
ORDER BY dist
LIMIT 5;
Here is what happens when you run the above query:
- First, MyScaleDB executes vector search on table
photos
to get the required columnphoto_id
and value ofdistance()
function for the top five relevant records:
SELECT photo_id, distance(photo_embed,[0.0269, 0.0316,...]) as dist
FROM photos
ORDER BY dist
LIMIT 5;
- Then, the
join
runs using the results from the vector table as its underlying table:
SELECT t1.photo_id, t1.dist
FROM (<< results of vector table go here >>) t1
JOIN photo_meta t2 ON t1.photo_id = t2.photo_id
WHERE t2.year = 2023;
Because the vector search does't consider the year photos were taken, the result may be incorrect. To get the correct result, we need to rewrite the join query by using a subquery:
SELECT t1.photo_id, t1.dist
FROM (
SELECT photo_id, distance(photo_embed,[0.0269, 0.0316,...]) as dist
FROM photos
WHERE photo_id IN (
SELECT t1.photo_id
FROM photos t1 JOIN photo_meta t2 ON t1.photo_id = t2.photo_id
WHERE t2.year = 2023)
ORDER BY dist
LIMIT 5
) t1
ORDER BY dist
LIMIT 5;
# Improve Data Analysis
Advanced SQL techniques like CTEs, subqueries, and joins can help you perform complex data analyses and manipulations with greater precision and efficiency. MyScaleDB combines the power of SQL and vectors to provide a powerful approach to tackling complex AI-related questions. With MyScaleDB You can efficiently execute traditional SQL and vector queries on structured data and vector data to address complex queries and analyze high-dimensional data in a unified and efficient manner.
If you are interested in learning more, please follow us on X (Twitter) (opens new window) or join our Discord (opens new window) community. Let's build the future of data and AI together!
This article is originally published on The New Stack. (opens new window)