import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegressionPreface
The best way to learn something is to teach it. In this post, I will introduce you to Mode, a data analytics platform that simplifies the process of analyzing and visualizing data. Whether you’re a data scientist, analyst, or business user, Mode provides a collaborative environment to work with data efficiently.
Typically, a user connects Mode to a database like Google BigQuery, Amazon Redshift, or Snowflake. Once connected, users can write SQL queries to extract and manipulate data. Mode’s SQL editor provides features like syntax highlighting, auto-completion, and query history to enhance the coding experience.
For this tutorial, I will use Mode’s default sample database, which contains flight data.
You don’t need to connect a database to follow along with this tutorial. Simply create a free Mode account, and you can access the sample database directly by calling the tutorial schema.
1. Querying Data
First things first, let’s write a simple SQL query to retrieve data from the sample database. SQL (Structured Query Language) is the standard language for managing and manipulating relational databases. For this example, let’s look at tutorial.flights, which contains information about flights, including departure and arrival times, delays, and distances.
Suppose we want to find the average arrival delay for flights grouped by flight distance categories (short haul, medium haul, long haul). We can define these categories as follows:
SELECT
CASE
WHEN distance < 500 THEN 'Short Haul'
WHEN distance >= 500 AND distance <= 1500 THEN 'Medium Haul'
WHEN distance > 500 THEN 'Long Haul'
END AS flight_type,
COUNT(*) AS total_flights,
AVG(arrival_delay) AS average_delay
FROM
tutorial.flights
GROUP BY
flight_type
ORDER BY
average_delay
In this query, we use a CASE statement to categorize flights based on their distance. We then count the total number of flights and calculate the average arrival delay for each category. The results are grouped by flight_type and ordered by average_delay. By convention, SQL keywords are written in uppercase to distinguish them from column names and other identifiers.
As a result, we get a table showing the average arrival delay for short haul, medium haul, and long haul flights.
| flight_type | total_flights | average_delay |
|---|---|---|
| Long Haul | 74281 | -0.6173 |
| Medium Haul | 297380 | 3.8683 |
| Short Haul | 222181 | 3.9832 |
2. Visualizing Data
Now that we have our data, let’s visualize it using Mode’s built-in charting capabilities. Visualizations help us understand data patterns and trends more intuitively.
Beneath the SQL editor, Mode provides a charting interface where we can create various types of visualizations, such as bar charts, line charts, scatter plots, and more. For our flight delay data, let’s create a bar chart to visualize the average arrival delay for each flight type.

3. Advanced Analysis with Python
Mode supports python notebooks, allowing us to perform more advanced analyses and visualizations. I’m a fan of this integration because it allows us to leverage the power of Python libraries like Pandas, NumPy, Matplotlib, and Seaborn for data analysis and visualization.
By default, SQL query results are available as a Pandas DataFrame in the Python notebook. Specifically, the variable, datasets contains a list of sql query results as DataFrames. Thus, to index the first query result, you would use datasets[0].
Let’s query the flight data again, this time retrieving arrival and departure delays for individual flights. We can then use Python to analyze the relationship between these two variables.
SELECT arrival_delay, departure_delay
FROM tutorial.flights
LIMIT 1000
Now we can move to the Python notebook. Import familiar libraries.
Recall that the SQL query results are stored in datasets. This variable stores all query results as a list of DataFrames. Since we only have two queries, we want to access the first index, datasets[1].
# df = datasets[1]
df = pd.read_csv("flight_arrivals_departure_delays.csv")
df = df.dropna(subset=["arrival_delay", "departure_delay"])
X = df[["departure_delay"]]
y = df["arrival_delay"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
pip = Pipeline([
('scaler', StandardScaler()),
('regressor', LinearRegression())
])
pip.fit(X_train, y_train)
y_pred = pip.predict(X_val)
score = pip.score(X_val, y_val)
print(f"Model R^2 Score: {score:.2f}")
print(f"Model Coefficients: {pip.named_steps['regressor'].coef_}")
print(f"Model Intercept: {pip.named_steps['regressor'].intercept_}")(799, 1) (200, 1) (799,) (200,)
Model R^2 Score: 0.76
Model Coefficients: [17.15344637]
Model Intercept: -8.27784730913642
A R^2 score of 0.76 indicates how well the model explains the variability of the data. The model coefficient is 17.2, meaning for every minute of departure delay, the arrival delay increases by 17.2 minutes on average.
Finally, let’s visualize the relationship between departure delay and arrival delay, along with the regression line from our linear model.
# super impose lm line on scatter plot using seaborn
sns.lmplot(x="departure_delay", y="arrival_delay", data=df, line_kws={'color': 'red'})
plt.title("Arrival Delay vs Departure Delay with Regression Line")
plt.xlabel("Departure Delay (minutes)")
plt.ylabel("Arrival Delay (minutes)")
plt.show()
No surprise here - there is a positive correlation between departure delay and arrival delay. The regression line shows that as departure delays increase, arrival delays tend to increase as well.
Conclusion
In this post, we explored the QVA (Query, Visualize, Analyze) workflow using Mode. We started by querying flight data using SQL, visualized the results with a bar chart, and then performed advanced analysis using Python to build a linear regression model. Mode’s integration of SQL and Python makes it a powerful tool for data analysis and visualization.
Past articles: - Principal Component Analysis - Support Vector Machine - K-Means Clustering
Github: - Running on Numbers