In today's digital age, professionals across all industries must stay updated with upcoming events, conferences, and workshops. However, efficiently finding events that align with one's interests amidst the vast ocean of online information presents a significant challenge.
This blog introduces an innovative solution to this challenge: a comprehensive application designed to scrape event data from Facebook (opens new window) and analyze the scraped data using MyScale (opens new window). While MyScale is commonly associated with the RAG tech stack or used as a vector database, its capabilities extend beyond these realms. We will utilize it for data analysis, leveraging its vector search functionality to analyze events that are semantically similar, thus providing better results and insights.
You may notice that Grok AI (opens new window) utilized the Qdrant vector database as the search engine to retrieve real-time information from X (formerly known as Twitter) data. You can also assess the power of vector databases in this way with MyScale by integrating MyScale with other platforms like Apify to enhance daily life tasks through the development of simple personalized applications.
So in this blog, let’s develop an application that takes only the name of a city as input and scrapes all related events from Facebook. Subsequently, we will conduct data analysis and semantic search using the advanced SQL vector capabilities of MyScale.
# Tools and Technologies
We’ll use several tools, including Apify, MyScale (opens new window), and OpenAI, to develop this useful application.
- Apify: A popular web scraping (opens new window) and automation platform that significantly streamlines the process of data collection. It provides the capability to scrape data and subsequently feed it to LLMs. This allows us to train LLMs on real-time data and develop applications.
- MyScale: MyScale is a SQL vector database that we use to store and process both structured and unstructured data in an optimized way.
- OpenAI: We will use the model
text-embedding-3-small
from OpenAI (opens new window) to get the embeddings of the text and then save those embeddings in MyScale for data analysis and semantic search.
# How to Set up MyScale and Apify
To start setting up MyScale and Apify, you'll need to create a new directory and a Python (opens new window) file. You can do this by opening your terminal or command line and entering the following commands:
mkdir MyScale
cd MyScale
touch main.ipynb
Let's install the packages. Copy the command below, and paste it into your terminal. These packages will provide the tools and libraries we need to develop our application.
pip install openai apify-client clickhouse-connect pandas numpy
This should install all the dependencies in your system. To confirm that everything is installed properly, you can enter the following command in your terminal.
pip freeze | egrep '(openai|apify-client|clickhouse-connect|pandas|numpy)'
This should include all the installed dependencies with their versions. If you spot any missing dependencies, you may need to re-run the installation command for that specific package. Now, we're ready to write our code after the installations.
💡 Note: We will be working in a Python notebook. Consider every code block a notebook cell.
# How to Scrape Data with Apify
Now, we will use the Apify API to scrape event data of New York City using Facebook Events scraper (opens new window).
import pandas as pd
from apify_client import ApifyClient
# Initialize the ApifyClient with your API token
client = ApifyClient("Enter_your_apify_key_here")
# Prepare the Actor input
run_input = {
"searchQueries": ["Sport New York"],
"startUrls": [],
"maxEvents": 50,
}
# Run the Actor and wait for it to finish
run = client.actor("UZBnerCFBo5FgGouO").call(run_input=run_input)
df_columns = ['Name', 'Datetime', 'Description', 'Users_Going', 'Users_Interested', 'Users_Responded', 'City', 'Organized_By', 'Street_Address']
dataframe1 = pd.DataFrame(columns=df_columns)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
# Use a dictionary comprehension to replace None values with an empty string
row = {
'Name': item.get('name', ''),
'Datetime': item.get('dateTimeSentence', ''),
'Description': item.get('description', ''),
'Users_Going': item.get('usersGoing', ''),
'Users_Interested': item.get('usersInterested', ''),
'Users_Responded': item.get('usersResponded', ''),
'City': item.get('location', {}).get('city', '') if item.get('location') else '',
'Organized_By': item.get('organizedBy', ''),
'Street_Address': item.get('location', {}).get('streetAddress', '') if item.get('location') else ''
}
# Ensure all None values are replaced by an empty string
row = {k: (v if v is not None else '') for k, v in row.items()}
dataframe1 = dataframe1._append(row, ignore_index=True)
# Cleaning the data
dataframe1['Description'] = dataframe1['Description'].replace('\\n', '', regex=True)
This script gives us the upcoming event details in the form of a pandas DataFrame. 💡 Note: Don’t forget to add your Apify API key in the script above. You can find your API token on the Integrations (opens new window) page in the Apify Console.
# Data Pre-processing
When we gather raw data, it comes in various formats. In this script, we will bring the event dates into a single format so that our data filtering can be more efficiently done.
# Import necessary libraries for data manipulation and date parsing
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil import parser
# Function to parse date strings that may represent a range or a single date
def parse_dates(date_str):
# Check if the date string contains a dash, indicating a range
if '-' in date_str:
parts = date_str.split('-')
# If the string splits into two parts, it's a valid range
if len(parts) == 2:
try:
# Parse start and end dates, formatting them to a readable format
start_date = parser.parse(parts[0], fuzzy=True).strftime('%a, %b %d')
end_date = parser.parse(parts[1], fuzzy=True).strftime('%a, %b %d')
return start_date, end_date
except ValueError:
# In case of parsing error, do nothing (will handle below)
pass
# If not a range or if parsing the range failed, try parsing as a single date
try:
parsed_date = parser.parse(date_str, fuzzy=True)
# Format the single date for start_date and format differently for end_date
start_date = parsed_date.strftime('%a, %b %d AT %I:%M %p EDT')
end_date = parsed_date.strftime('%a, %b %d') # Omitting time for end_date
return start_date, end_date
except ValueError:
# Return NaN for both dates if parsing fails
return np.nan, np.nan
# Function to extract detailed date, time, and day from a date string
def extract_date_time_day(date_str):
try:
# Parse the date string, allowing for some flexibility in input format
parsed_date = parser.parse(date_str, fuzzy=True)
# Extract and format the date, time, and day parts
date = parsed_date.strftime('%Y-%m-%d')
day = parsed_date.strftime('%a')
# Determine if the original string included a time component
time_component = parsed_date.strftime('%I:%M %p') not in date_str
time = parsed_date.strftime('%H:%M:%S') if not time_component else np.nan
except ValueError:
# If parsing fails, set date, time, and day to NaN
date, time, day = np.nan, np.nan, np.nan
return date, time, day
# Apply parse_dates function across the dataframe, creating new columns for start and end dates
dataframe1[['Start_Date', 'End_Date']] = dataframe1.apply(lambda row: pd.Series(parse_dates(row['Datetime'])), axis=1)
# Drop rows where Start_Date is NaN, indicating parsing was unsuccessful
dataframe = dataframe1.dropna(subset=['Start_Date'])
# Apply extract_date_time_day to split the start and end dates into separate date, time, and day columns
dataframe['Start_Date'], dataframe['Start_Time'], dataframe['Start_Day'] = zip(*dataframe['Start_Date'].apply(extract_date_time_day))
dataframe['End_Date'], _, dataframe['End_Day'] = zip(*dataframe['End_Date'].apply(extract_date_time_day))
# Remove the original 'Datetime' column as it's no longer needed
dataframe=dataframe.drop(['Datetime'], axis=1)
# Convert 'Start_Date' and 'End_Date' to datetime format, extracting just the date part
dataframe['Start_Date'] = pd.to_datetime(dataframe['Start_Date']).dt.date
dataframe['End_Date'] = pd.to_datetime(dataframe['End_Date']).dt.date
# Convert 'Start_Time' to datetime format, retaining the time information
dataframe['Start_Time'] = pd.to_datetime(dataframe['Start_Time'])
This code snippet uses pandas
with datetime
and dateutil
packages of Python to format the data.
# Generating Embeddings
To deeply understand and search events, we will generate embeddings from their descriptions using the text-embedding-3-small
. These embeddings capture the semantic essence of each event, helping the application to return better results.
# Import OpenAI library for API access.
from openai import OpenAI
# Initialize OpenAI client with an API key.
openai_client = OpenAI(api_key="your_openai_api_key_here")
# Function to get text embeddings
def get_embedding(text, model="text-embedding-3-small"):
return openai_client.embeddings.create(input=text, model=model).data
embeddings = get_embedding(dataframe["Description"].tolist())
# Extract embedding vectors from the embeddings object
vectors = [embedding.embedding for embedding in embeddings]
array = np.array(vectors)
embeddings_series = pd.Series(list(array))
# Add embeddings as a new column in the DataFrame.
dataframe['Description_Embeddings'] = embeddings_series
Now, we will insert the new DataFrame with embeddings to MyScale.
# Connecting with MyScale
As we have discussed at the start we will use the MyScale as a vector database for storing and managing data. Here, we will connect to MyScale in preparation for data storage.
import clickhouse_connect
client = clickhouse_connect.get_client(
host='host_name_here',
port=443,
username='username_here',
password='passwd_here'
)
This connection setup ensures our application can communicate with MyScale, and use SQL’s power for data manipulation and analysis.
💡 Note: See Connection Details (opens new window) for more information on how to connect to the MyScale cluster.
# Create Tables and Indexes Using MyScale
We now create a table according to our DataFrame. All the data will be stored in this table, including the embeddings.
client.command("""
CREATE TABLE default.Events (
Name String,
Description String,
Users_Going Int64,
Users_Interested Int64,
Users_Responded Int64,
City String,
Organized_By String,
Street_Address String,
Start_Date Date32,
End_Date Nullable(Date32),
Start_Time Nullable(DateTime64),
Start_Day String,
End_Day String,
Description_Embeddings Array(Float32),
CONSTRAINT check_data_length CHECK length(Description_Embeddings) = 1536
) ENGINE = MergeTree()
ORDER BY (Name);
""")
The above SQL statements create a table named Events
on the cluster. The CONSTRAINT
makes sure that all the vectors embedding are of the same length 1536
.
# Storing the Data and Creating an Index in MyScale
In this step, we insert the processed data into MyScale. This involves batch-inserting the data to ensure efficient storage and retrieval.
batch_size = 10 # Adjust based on your needs
num_batches = len(dataframe) // batch_size
for i in range(num_batches):
start_idx = i * batch_size
end_idx = start_idx + batch_size
batch_data = dataframe[start_idx:end_idx]
# print(batch_data["Description_Embeddings"])
client.insert("default.Events", batch_data.to_records(index=False).tolist(), column_names=batch_data.columns.tolist())
print(f"Batch {i+1}/{num_batches} inserted.")
client.command("""
ALTER TABLE default.Events
ADD VECTOR INDEX vector_index Description_Embeddings
TYPE MSTG
""")
Using pandas, the above code efficiently transfers our prepared dataset into the MyScale database.
# Data Analysis Using MyScale
Finally, we use MyScale’s analytical capabilities to perform analysis and enable semantic search. By executing SQL queries, we can analyze events based on topics, locations, and dates. So, let’s try to write some queries.
# Simple SQL query
Let’s first try to get the top 10 results from the table.
results=client.query("""
SELECT Name,Description FROM default.Events LIMIT 10
""")
for row in results.named_results():
print(row["Name"])
print(row['Description'])
This query will simply return the top 10 results from the events
table.
# Discover Events by Semantic Relevance
Let’s try to find the top 10 upcoming events with a vibe similar to a reference event, like this, “One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady”. This is achieved by comparing semantic embeddings of event descriptions, ensuring a match in themes and emotions.
embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embedding=embeddings[0].embedding
results = client.query(f"""
SELECT Name, Description,
distance(Description_Embeddings, {embedding}) as dist FROM default.Events ORDER BY dist LIMIT 10
""")
for row in results.named_results():
print("Title of the event ", row["Name"])
print("Description of the event ", row['Description'])
print("Distance : ", row["dist"])
# Trending Events by Popularity
This query ranks the top 10 events by the number of attendees and interested users, highlighting popular events from large city festivals to major conferences. It is ideal for those seeking to join large, energetic gatherings.
results = client.query(f"""
SELECT Name, City, Users_Going, Users_Interested, Users_Responded
FROM default.Events
ORDER BY Users_Going DESC, Users_Interested DESC
LIMIT 10
""")
for row in results.named_results():
print("Event Name ", row["Name"])
print("City ", row["City"])
print("Users Going ", row["Users_Going"])
print("Interested Users ", row["Users_Interested"])
# Popular Local Events in New York
Combining relevance and popularity, this query identifies similar events in New York City related to a specific event and ranks them by attendance, offering a curated list of events that reflect the city's vibrant culture and attract local interest.
embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embeddi=embeddings[0].embedding
results = client.query(f"""
SELECT Name,City, Description, Users_Going,distance(Description_Embeddings, {embeddi}) as dist
FROM default.Events
WHERE City LIKE '%New York%' and dist < 1.5
ORDER BY Users_Going DESC,dist
LIMIT 10
""")
for row in results.named_results():
print("Event Name ", row["Name"])
print("Description ", row["Description"])
print("Users Going ", row["Users_Going"])
# Leading Event Organizers
This query ranks the top 10 event organizers by the total number of attendees and interested users, highlighting those who excel in creating compelling events and attracting large audiences. It provides insights for event planners and attendees interested in top-tier events.
# Which client has attracted the most number of users
results = client.query(f"""
SELECT Organized_By, SUM(Users_Going + Users_Interested) AS Total_Users
FROM default.Events
GROUP BY Organized_By
ORDER BY Total_Users DESC
Limit 10
""")
for row in results.named_results():
print("Event Name ", row["Organized_By"])
print("Total_Users ", row["Total_Users"])
# Implement RAG
Previously, we have explored MyScale for data analysis, highlighting its capabilities in enhancing our data workflows. Moving forward, we'll go one step ahead by implementing Retrieval-Augmented Generation (RAG), an innovative framework combining an external knowledge base with LLMs. This step will help you to understand your data better and find more detailed insights. Next, you'll see how to use RAG with MyScale, which will make working with data more interesting and productive.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
query="Can you please suggest me some events related to basketball"
# Use the get_embedding method defined above, it accepts a list of sentences
embeddings=get_embedding([query])
embeddings=embeddings[0].embedding
results = client.query(f"""
SELECT Name, City, Users_Going, Description, distance(Description_Embeddings, {embeddings}) as dist
FROM default.Events
ORDER BY Users_Going DESC,dist
LIMIT 5
""")
PROMPT_TEMPLATE = """
Your objective is to formulate a response to a question using only the information provided below:
{context}
---
Your task is to carefully analyze the provided context and provide an answer to the following question based solely on the information given:
{question}
"""
# Combine the descriptions of the top results.
descriptions = [row["Description"] for row in results.named_results()]
context_text = "\n\n---\n\n".join(descriptions)
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query)
model = ChatOpenAI(openai_api_key="your_api_key_here")
response_text = model.predict(prompt)
formatted_response = f"Response: {response_text}\n"
print(formatted_response)
Throughout this blog, we have observed that MyScale is much more than a vector database that can be used to develop all kinds of applications. We can use it as a simple SQL database or for advanced AI applications, covering the majority of the development domain. We encourage you to at least give it a try and explore the advanced features by signing up for the free tier and get 5 million free vector storage.
# Conclusion
We have explored the abilities and functionalities of MyScale with Apify Scraper through the process of developing an event analytics application. MyScale has demonstrated its exceptional capabilities in high-performance vector search while retaining all the functionalities of SQL databases, which helps developers make semantic searches using familiar SQL syntax with much better speed and accuracy.
The capabilities of MyScale are not limited to this application, you can adopt it to develop any AI applications using the RAG (opens new window) method. MyScale outperformed other vector databases (opens new window) in terms of cost, speed, and accuracy, so why not give it a try for your next application?
If you have any feedback or suggestions, please reach out to us on MyScale Discord (opens new window) or Twitter (opens new window).