Query, Visualize, and Analyze with Mode

Preface

The best way to learn something is to teach it. In this post, I will introduce you to Mode, a data analytics platform that simplifies the process of analyzing and visualizing data. Whether you’re a data scientist, analyst, or business user, Mode provides a collaborative environment to work with data efficiently.

Typically, a user connects Mode to a database like Google BigQuery, Amazon Redshift, or Snowflake. Once connected, users can write SQL queries to extract and manipulate data. Mode’s SQL editor provides features like syntax highlighting, auto-completion, and query history to enhance the coding experience.

For this tutorial, I will use Mode’s default sample database, which contains flight data.

Tip

You don’t need to connect a database to follow along with this tutorial. Simply create a free Mode account, and you can access the sample database directly by calling the tutorial schema.

1. Querying Data

First things first, let’s write a simple SQL query to retrieve data from the sample database. SQL (Structured Query Language) is the standard language for managing and manipulating relational databases. For this example, let’s look at tutorial.flights, which contains information about flights, including departure and arrival times, delays, and distances.

Suppose we want to find the average arrival delay for flights grouped by flight distance categories (short haul, medium haul, long haul). We can define these categories as follows:

SELECT 
  CASE
    WHEN distance < 500 THEN 'Short Haul'
    WHEN distance >= 500 AND distance <= 1500 THEN 'Medium Haul'
    WHEN distance > 500 THEN 'Long Haul'
  END AS flight_type,
  COUNT(*) AS total_flights,
  AVG(arrival_delay) AS average_delay
FROM 
  tutorial.flights
GROUP BY 
  flight_type
ORDER BY 
  average_delay

In this query, we use a CASE statement to categorize flights based on their distance. We then count the total number of flights and calculate the average arrival delay for each category. The results are grouped by flight_type and ordered by average_delay. By convention, SQL keywords are written in uppercase to distinguish them from column names and other identifiers.

As a result, we get a table showing the average arrival delay for short haul, medium haul, and long haul flights.

Demonstration of pipe table syntax
flight_type	total_flights	average_delay
Long Haul	74281	-0.6173
Medium Haul	297380	3.8683
Short Haul	222181	3.9832

2. Visualizing Data

Now that we have our data, let’s visualize it using Mode’s built-in charting capabilities. Visualizations help us understand data patterns and trends more intuitively.

Beneath the SQL editor, Mode provides a charting interface where we can create various types of visualizations, such as bar charts, line charts, scatter plots, and more. For our flight delay data, let’s create a bar chart to visualize the average arrival delay for each flight type.

Bar Chart of Average Arrival Delay by Flight Type

3. Advanced Analysis with Python

Mode supports python notebooks, allowing us to perform more advanced analyses and visualizations. I’m a fan of this integration because it allows us to leverage the power of Python libraries like Pandas, NumPy, Matplotlib, and Seaborn for data analysis and visualization.

By default, SQL query results are available as a Pandas DataFrame in the Python notebook. Specifically, the variable, datasets contains a list of sql query results as DataFrames. Thus, to index the first query result, you would use datasets[0].

Let’s query the flight data again, this time retrieving arrival and departure delays for individual flights. We can then use Python to analyze the relationship between these two variables.

SELECT arrival_delay, departure_delay 
FROM tutorial.flights
LIMIT 1000

Now we can move to the Python notebook. Import familiar libraries.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

Recall that the SQL query results are stored in datasets. This variable stores all query results as a list of DataFrames. Since we only have two queries, we want to access the first index, datasets[1].

# df = datasets[1]
df = pd.read_csv("flight_arrivals_departure_delays.csv")
df = df.dropna(subset=["arrival_delay", "departure_delay"])
X = df[["departure_delay"]]
y = df["arrival_delay"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

pip = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])
pip.fit(X_train, y_train)
y_pred = pip.predict(X_val)
score = pip.score(X_val, y_val)
print(f"Model R^2 Score: {score:.2f}")
print(f"Model Coefficients: {pip.named_steps['regressor'].coef_}")
print(f"Model Intercept: {pip.named_steps['regressor'].intercept_}")

(799, 1) (200, 1) (799,) (200,)
Model R^2 Score: 0.76
Model Coefficients: [17.15344637]
Model Intercept: -8.27784730913642

A R^2 score of 0.76 indicates how well the model explains the variability of the data. The model coefficient is 17.2, meaning for every minute of departure delay, the arrival delay increases by 17.2 minutes on average.

Finally, let’s visualize the relationship between departure delay and arrival delay, along with the regression line from our linear model.

# super impose lm line on scatter plot using seaborn
sns.lmplot(x="departure_delay", y="arrival_delay", data=df, line_kws={'color': 'red'})
plt.title("Arrival Delay vs Departure Delay with Regression Line")
plt.xlabel("Departure Delay (minutes)")
plt.ylabel("Arrival Delay (minutes)")
plt.show()

No surprise here - there is a positive correlation between departure delay and arrival delay. The regression line shows that as departure delays increase, arrival delays tend to increase as well.

Conclusion

In this post, we explored the QVA (Query, Visualize, Analyze) workflow using Mode. We started by querying flight data using SQL, visualized the results with a bar chart, and then performed advanced analysis using Python to build a linear regression model. Mode’s integration of SQL and Python makes it a powerful tool for data analysis and visualization.

Note

Past articles: - Principal Component Analysis - Support Vector Machine - K-Means Clustering

Github: - Running on Numbers

--- title: "Query, Visualize, and Analyze with Mode" author: "Oliver Chang" email: oliverc1622@gmail.com date: 2025-12-30 # Update this date when you make changes categories: [Mode, SQL, python, data science, data visualization] toc: true format: html: html-math-method: katex code-tools: true image: "flights_main.png" title-block-banner: default --- ## Preface The best way to learn something is to teach it. In this post, I will introduce you to Mode, a data analytics platform that simplifies the process of analyzing and visualizing data. Whether you're a data scientist, analyst, or business user, Mode provides a collaborative environment to work with data efficiently. Typically, a user connects Mode to a database like Google BigQuery, Amazon Redshift, or Snowflake. Once connected, users can write SQL queries to extract and manipulate data. Mode's SQL editor provides features like syntax highlighting, auto-completion, and query history to enhance the coding experience. For this tutorial, I will use Mode's default sample database, which contains flight data. :::{.callout-tip} You don't need to connect a database to follow along with this tutorial. Simply create a free Mode account, and you can access the sample database directly by calling the `tutorial` schema. ::: ## 1. Querying Data First things first, let's write a simple SQL query to retrieve data from the sample database. SQL (Structured Query Language) is the standard language for managing and manipulating relational databases. For this example, let's look at `tutorial.flights`, which contains information about flights, including departure and arrival times, delays, and distances. Suppose we want to find the average arrival delay for flights grouped by flight distance categories (short haul, medium haul, long haul). We can define these categories as follows: ```{sql} SELECT CASE WHEN distance < 500 THEN 'Short Haul' WHEN distance >= 500 AND distance <= 1500 THEN 'Medium Haul' WHEN distance > 500 THEN 'Long Haul' END AS flight_type, COUNT(*) AS total_flights, AVG(arrival_delay) AS average_delay FROM tutorial.flights GROUP BY flight_type ORDER BY average_delay ``` In this query, we use a `CASE` statement to categorize flights based on their distance. We then count the total number of flights and calculate the average arrival delay for each category. The results are grouped by `flight_type` and ordered by `average_delay`. By convention, SQL keywords are written in uppercase to distinguish them from column names and other identifiers. As a result, we get a table showing the average arrival delay for short haul, medium haul, and long haul flights. | flight_type | total_flights | average_delay | |-------------|--------------:|--------------:| | Long Haul | 74281 | -0.6173 | | Medium Haul | 297380 | 3.8683 | | Short Haul | 222181 | 3.9832 | : Demonstration of pipe table syntax ## 2. Visualizing Data Now that we have our data, let's visualize it using Mode's built-in charting capabilities. Visualizations help us understand data patterns and trends more intuitively. Beneath the SQL editor, Mode provides a charting interface where we can create various types of visualizations, such as bar charts, line charts, scatter plots, and more. For our flight delay data, let's create a bar chart to visualize the average arrival delay for each flight type. ![Bar Chart of Average Arrival Delay by Flight Type](bar_chart.png) ## 3. Advanced Analysis with Python Mode supports python notebooks, allowing us to perform more advanced analyses and visualizations. I'm a fan of this integration because it allows us to leverage the power of Python libraries like Pandas, NumPy, Matplotlib, and Seaborn for data analysis and visualization. By default, SQL query results are available as a Pandas DataFrame in the Python notebook. Specifically, the variable, `datasets` contains a list of sql query results as DataFrames. Thus, to index the first query result, you would use `datasets[0]`. Let's query the flight data again, this time retrieving arrival and departure delays for individual flights. We can then use Python to analyze the relationship between these two variables. ```{sql} SELECT arrival_delay, departure_delay FROM tutorial.flights LIMIT 1000 ``` Now we can move to the Python notebook. Import familiar libraries. ```{python} import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LinearRegression ``` Recall that the SQL query results are stored in `datasets`. This variable stores all query results as a list of DataFrames. Since we only have two queries, we want to access the first index, `datasets[1]`. ```{python} # df = datasets[1] df = pd.read_csv("flight_arrivals_departure_delays.csv") df = df.dropna(subset=["arrival_delay", "departure_delay"]) X = df[["departure_delay"]] y = df["arrival_delay"] X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42) print(X_train.shape, X_val.shape, y_train.shape, y_val.shape) pip = Pipeline([ ('scaler', StandardScaler()), ('regressor', LinearRegression()) ]) pip.fit(X_train, y_train) y_pred = pip.predict(X_val) score = pip.score(X_val, y_val) print(f"Model R^2 Score: {score:.2f}") print(f"Model Coefficients: {pip.named_steps['regressor'].coef_}") print(f"Model Intercept: {pip.named_steps['regressor'].intercept_}") ``` A $R^2$ score of 0.76 indicates how well the model explains the variability of the data. The model coefficient is 17.2, meaning for every minute of departure delay, the arrival delay increases by 17.2 minutes on average. Finally, let's visualize the relationship between departure delay and arrival delay, along with the regression line from our linear model. ```{python} # super impose lm line on scatter plot using seaborn sns.lmplot(x="departure_delay", y="arrival_delay", data=df, line_kws={'color': 'red'}) plt.title("Arrival Delay vs Departure Delay with Regression Line") plt.xlabel("Departure Delay (minutes)") plt.ylabel("Arrival Delay (minutes)") plt.show() ``` No surprise here - there is a positive correlation between departure delay and arrival delay. The regression line shows that as departure delays increase, arrival delays tend to increase as well. ## Conclusion In this post, we explored the QVA (Query, Visualize, Analyze) workflow using Mode. We started by querying flight data using SQL, visualized the results with a bar chart, and then performed advanced analysis using Python to build a linear regression model. Mode's integration of SQL and Python makes it a powerful tool for data analysis and visualization. :::{.callout-note} Past articles: - [Principal Component Analysis](https://runningonnumbers.com/posts/principal-component-analysis-python-baseball/) - [Support Vector Machine](https://runningonnumbers.com/posts/support-vector-machine/) - [K-Means Clustering](https://runningonnumbers.com/posts/k-means/) Github: - [Running on Numbers](https://github.com/oliverc1623/Running-On-Numbers-Public) ::: <script async data-uid="5d16db9e50" src="https://runningonnumbers.kit.com/5d16db9e50/index.js"></script>