Unlock the essentials of translating your code from SQL to Python with this quick guide tailored for SQL users. Dive into key insights and streamline your data manipulation process.
In this post, we will compare the implementation of Pandas and SQL for data queries. Weβll explore how to use Pandas in a manner similar to SQL by translating SQL queries into Pandas operations.
Itβs important to note that there are various ways to achieve similar results, and the translation of SQL queries to Pandas will be done by employing some of its core methods.
source image Image by upklyak on Freepik
We aim to explore the diverse Python Pandas methods, focusing on their application through the nycflights13 datasets.
This datasets offer comprehensive information about airlines
, airports
, weather conditions
, and aircraft
for all flights
passing through New York airports in 2013.
Through this exercise, weβll not only explore Pandas functionality but also learn to apply fundamental SQL concepts in a Python data manipulation environment.
The nycflights13 library contains tables with flight data from New York airports in 2023. Below, you can find a high-level representation of an entity-relationship diagram with its five tables.
To install the nycflights13 library, you can use the following command:
!pip install nycflights13
This library provides datasets containing comprehensive information about flights from New York airports in 2023. Once installed, you can easily access and analyze this flight data using various tools and functionalities provided by the nycflights13 package.
In the next code snippet, we are importing essential Python libraries for data analysis. * π Pandas is a library for data manipulation and analysis, * π Numpy provides support for numerical operations * π Nycflights13 is a specialized library containing datasets related to flights from New York airports in 2023.
import pandas as pd
import numpy as np
import nycflights13 as nyc
In the following lines of code, we are assigning two specific datasets from the nycflights13 library to variables.
= nyc.flights
flights = nyc.airlines airlines
The following SQL query retrieves all columns and rows from the βπ©οΈ flightsβ table. In Pandas, the equivalent is simply writing the DataFrame name, in this case, βflights.β For example:
πsql
SELECT * FROM flights;
πpython
flights
To select specific columns from a Pandas DataFrame, you can use the following syntax:
πsql
select
year,
month,
day,
dep_time,
flight,
tailnum,
origin,
dest from flights;
πpython
(
flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest'])
)
To filter all βοΈ flights where the origin is βJFKβ in Pandas, you can use the following code:
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin = 'JFK'
limit 10;
πpython
( flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest'])
"origin=='JFK'")
.query(10)
.head( )
To achieve the same filtering in Pandas for specific criteria: * βοΈ Flights departing from JFK, LGA, or EWR.
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
limit 10;
πpython
( flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest'])
"origin in ['JFK', 'EWR', 'LGA']")
.query(10)
.head( )
To achieve the same filtering in Pandas for specific criteria:
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' ) and dest<>'MIA'
limit 10;
πpython
( flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest'])
"(origin in ['JFK', 'EWR', 'LGA']) and (dest != 'MIA')")
.query(10)
.head( )
To achieve the same filtering in Pandas for specific criteria:
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and distance < = 1000
limit 10;
πpython
( flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest', 'time_hour', 'distance'])
"(origin in ['JFK', 'EWR', 'LGA']) and (dest != 'MIA') and (distance <= 1000)")
.query(10)
.head( )
To achieve the same filtering in Pandas for specific criteria:
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and distance < = 1000
and time_hour between '2013-09-01' and '2012-09-30'
limit 10;
πpython
filter([['year', 'month', 'day', 'dep_time', 'flight',
( flights.'tailnum', 'origin', 'dest', 'time_hour', 'distance'])
.query("(origin in ['JFK', 'EWR', 'LGA'])"
" and (dest != 'MIA')"
" and (distance <= 1000)"
" and ('2013-09-01' <= time_hour <= '2013-09-30')"
)10)
.head( )
To achieve the same filtering in Pandas for specific criteria:
You can use the following code:
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and distance < = 1000
and time_hour between '2013-09-01' and '2012-09-30'
and tailnum like '%N5%'
limit 10;
πpython
(
flights filter(['year', 'month', 'day', 'dep_time', 'flight', 'tailnum',
.'origin', 'dest', 'time_hour'])
.query(" (origin in ['JFK', 'EWR', 'LGA'])"
" and (dest != 'MIA') "
" and ('2013-09-01' <= time_hour <= '2013-09-30')"
" and (tailnum.str.find('N5')>=0)"
)10)
.head( )
To achieve the same filtering in Pandas for specific criteria:
You can use the following code:
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and distance < = 1000
and time_hour between '2013-09-01' and '2012-09-30'
and tailnum like '%N5%'
and dep_time is null
limit 10;
πpython
(
flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest', 'time_hour'])
.query(" (origin in ['JFK', 'EWR', 'LGA'])"
" and (dest != 'MIA') "
" and ('2013-09-01' <= time_hour <= '2013-09-30')"
" and (tailnum.str.find('N5')>=0)"
" and dep_time.isnull()"
)10)
.head( )
The .sort_values() methods in Pandas are equivalent to the ORDER BY clause in SQL.
1οΈβ£.**.sort_values(['origin','dest'], ascending=False)**
: This method sorts the DataFrame based on the βoriginβ and βdestβ columns in descending order (from highest to lowest).
In SQL, this would be similar to the ORDER BY origin DESC, dest DESC
clause.
2οΈβ£.**.sort_values(['day'], ascending=True)**
: This method sorts the DataFrame based on the βdayβ column in ascending order (lowest to highest).
In SQL, this would be similar to the ORDER BY day ASC
clause.
Both methods allow you to sort your DataFrame according to one or more columns, specifying the sorting direction with the ascending parameter. True means ascending order, and False means descending order.
πsql
select year, month , day, dep_time, flight, tailnum, origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and distance < = 1000
and time_hour between '2013-09-01' and '2012-09-30'
and tailnum like '%N5%'
and dep_time is null
order by origin, dest desc
limit 10;
πpython
(
flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest', 'time_hour'])
.query(" (origin in ['JFK', 'EWR', 'LGA'])"
" and (dest != 'MIA') "
" and ('2013-09-01' <= time_hour <= '2013-09-30')"
" and (tailnum.str.find('N5')>=0)"
" and year.notnull()"
)'origin','dest'],ascending=False)
.sort_values([10)
.head(
)
To perform a distinct select in pandas, you need to first execute the entire query, and then apply the drop_duplicates()
method to eliminate all duplicate rows.
πsql
select distinct origin, dest
from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and time_hour between '2013-09-01' and '2012-09-30'
order by origin, dest desc;
πpython
(
flightsfilter(['origin','dest','time_hour'])
.
.query(" (origin in ['JFK', 'EWR', 'LGA'])"
" and (dest != 'MIA') "
" and ('2013-09-01' <= time_hour <= '2013-09-30')"
)filter(['origin','dest'])
.
.drop_duplicates()
)
Now, letβs introduce a new calculated column called βdelay_total,β where we sum the values from the βdep_delayβ and βarr_delayβ columns.
πsql
select
origin,
dest,
time_hour,
dep_delay,
arr_delay,+ flights.arr_delay as delay_total
flights.dep_delay from flights
where origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and time_hour between '2013-09-01' and '2012-09-30';
πpython
(
flightsfilter(['origin', 'dest', 'time_hour', 'dep_delay', 'arr_delay'])
.= flights.dep_delay + flights.arr_delay )
.assign(delay_total
.query(" (origin in ['JFK', 'EWR', 'LGA'])"
" and (dest != 'MIA') "
" and ('2013-09-01' <= time_hour <= '2013-09-30')"
) )
To perform a GROUP BY operation in pandas, weβll use the groupby method, which operates similarly to its SQL counterpart. Similarly, we can employ common aggregate functions such as sum, max, min, mean (equivalent to avg in SQL), and count. Below is a simple example to illustrate this process:
πsql
select
year,
month,
max(dep_delay) as dep_delay,
from flights
group by
year,
month;
πpython
(
flights'year','month'],as_index=False)
.groupby(['dep_delay'].max()
[ )
In the following example, weβll explore how to implement a HAVING clause in pandas, leveraging the query method, as weβve done previously for filtering.
πsql
select
year,
month,
max(dep_delay) as dep_delay,
from flights
group by
year,
month
having max(dep_delay)>1000
πpython
(
flights'year','month'],as_index=False)['dep_delay']
.groupby([max()
.'(dep_delay>1000)') # having
.query( )
When working with pandas and needing to perform multiple calculations on the same column or across different columns, the agg function becomes a valuable tool. It allows you to specify a list of calculations to be applied, providing flexibility and efficiency in data analysis.
Consider the following SQL query:
πsql
select
year,
month,
max(dep_delay) as dep_delay_max,
min(dep_delay) as dep_delay_min,
as dep_delay_mean,
mean(dep_delay) count(*) as dep_delay_count,
max(arr_delay) as arr_delay_max,
min(arr_delay) as arr_delay_min,
sum(arr_delay) as arr_delay_sum
from flights
group by
year,
month
This query retrieves aggregated information from the βflightsβ dataset, calculating various statistics like maximum, minimum, mean, count, and sum for both βdep_delayβ and βarr_delayβ columns. To achieve a similar result in pandas, we use the agg function, which allows us to specify these calculations concisely and efficiently. The resulting DataFrame provides a clear summary of the specified metrics for each combination of βyearβ and βmonth.β
πpython
= (
result
flights'year','month'],as_index=False)
.groupby(['dep_delay':['max','min','mean','count'], 'arr_delay':['max','min','sum']})
.agg({
)
# Concatenate function names with column names
= result.columns.map('_'.join)
result.columns
# Print the results
result
To execute a UNION ALL operation in Pandas, it is necessary to create two DataFrames and concatenate them using the concat method. Unlike SQL, a DataFrame in Pandas can be combined to generate additional columns or additional rows. Therefore, it is essential to define how the concatenation should be performed:
In our example, we will perform the equivalent of a UNION ALL in SQL, so we will use axis=0.
πsql
select
origin,
dest,
time_hour,
dep_delay,
arr_delay,+ flights.arr_delay as delay_total ,
flights.dep_delay 'NYC' group
FROM flights
WHERE origin in ( 'JFK', 'LGA', 'EWR' )
and dest<>'MIA'
and time_hour between '2013-09-01' and '2012-09-30'
ORDER BY flights.dep_delay + flights.arr_delay DESC
LIMIT 3
UNION ALL
select
origin,
dest,
time_hour,
dep_delay,
arr_delay,+ flights.arr_delay as delay_total ,
flights.dep_delay 'MIA' group
FROM flights
WHERE origin in ( 'JFK', 'LGA', 'EWR' )
and time_hour between '2013-07-01' and '2012-09-30'
ORDER BY flights.dep_delay + flights.arr_delay DESC
LIMIT 2;
πpython
= (
Flights_NYC
flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest', 'time_hour',
'dep_delay', 'arr_delay'])
= flights.dep_delay + flights.arr_delay )
.assign(delay_total
.query(" (origin in ['JFK', 'EWR', 'LGA'])"
" and ('2013-09-01' <= time_hour <= '2013-09-30')"
)='NYC')
.assign(group 'delay_total',ascending=False)
.sort_values(3)
.head(
)
= (
Flights_MIAMI
flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest', 'time_hour',
'dep_delay', 'arr_delay'])
= flights.dep_delay + flights.arr_delay )
.assign(delay_total
.query(" (dest in ['MIA', 'OPF', 'FLL'])"
" and ('2013-07-01' <= time_hour <= '2013-09-30')"
)='MIA')
.assign(group 'delay_total',ascending=False)
.sort_values(2)
.head(
)
# union all
=0) pd.concat([ Flights_NYC,Flights_MIAMI],axis
To replicate the CASE WHEN
statement, we can use two different methods from NumPy:
1οΈβ£.
If there are only two conditions, for example, checking if the total delay exceeds 0, then we label it as βDelayedβ; otherwise, we label it as βOn Timeβ.
For this, the np.where
method from NumPy is utilized.
πsql
select
origin,
dest,
time_hour,
dep_delay,
arr_delay,case
(when flights.dep_delay + flights.arr_delay >0 then 'Delayed'
else 'On Time' end) as status ,
FROM flights
LIMIT 5;
πpython
(
flightsfilter(['year', 'month', 'day', 'dep_time',
.'flight', 'tailnum', 'origin', 'dest',
'time_hour', 'dep_delay', 'arr_delay'])
=np.where((flights['dep_delay'] + flights['arr_delay']) > 0, 'Delayed',
.assign(status'On Time'))
5)
.head(
)
2οΈβ£.
In case there are more conditions, such as identifying Miami airports and labeling them as βMIAβ, labeling βATLβ airports that they are in Altanta, and for any other cases, using the label βOTHERβ.
For this, the np.select
method from NumPy is employed.
City | Name | Acronym |
---|---|---|
Miami | Miami International | (MIA) |
Miami | Opa-locka Executive | (OPF) |
Miami | Fort Lauderdale-Hollywood | (FLL) |
Atlanta | Hartsfield-Jackson Atlanta | (ATL) |
Atlanta | DeKalb-Peachtree | (PDK) |
Atlanta | Fulton County | (FTY) |
πsql
select
origin,
dest,
time_hour,
dep_delay,
arr_delay,case
(when dest in ('ATL','PDK','FTY') then 'ATL'
when dest in ('MIA','OPF','FLL') then 'MIA'
else 'Other'
end) as city ,
FROM flights
LIMIT 10;
πpython
(
flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest', 'time_hour',
'dep_delay', 'arr_delay'])
=np.select([flights['dest'].isin(['ATL','PDK','FTY']),
.assign( city'dest'].isin(['MIA', 'OPF', 'FLL'])],
flights['ATL','MIA'],
[='Other')
default
)10)
.head( )
Entity relationship diagram [DER]
When performing a join in Pandas, the merge method should be used.
How: Specifies the type of join to be performed.
Available options: {'left', 'right', 'outer', 'inner', 'cross'}
On: The key on which the tables will be joined. If more than one column is involved, a list should be provided. Examples:
on='year'
='inner', on='tailnum') fligths.merge(planes, how
='inner', on=['year','month','day']) fligths.merge(weather, how
='inner', left_on = 'origin', rigth_on='faa') fligths.merge(airports, how
Hereβs an example using the airlines and flights tables:
πsql
select
year,
f.month,
f.day,
f.
f.dep_time,
f.flight,
f.tailnum,as airport_origen,
f.origin
f.dest,
f.time_hour,
f.dep_delay,
f.arr_delay,
f.carrier,as airline_name
a.name FROM flights f
left join airlines a on f.carrier = a.carrier
LIMIT 5;
The rename method is used to rename columns, similar to the βasβ clause in SQL.
πpython
(
flightsfilter(['year', 'month', 'day', 'dep_time', 'flight',
.'tailnum', 'origin', 'dest', 'time_hour', 'dep_delay',
'arr_delay', 'carrier'])
= 'left', on ='carrier')
.merge(airlines, how = {'name':'airline_name', 'origin':'airport_origen'})
.rename(columns5)
.head(
)
You can find all the code in a π python notebook in the following [link]
If you want to learn more about `Pandas` and `NumPy`β¦
- [Pandas]
- [NumPy]
Other references:
- Image preview reference: [Imagen de vectorjuice en Freepik]
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Mendez (2023, Dec. 17). Romina Mendez: SQL to Python Pandas: A SQL User's Quick Guide. Retrieved from https://r0mymendez.github.io/posts_en/2023-12-17-sql-to-python-pandas-a-sql-users-quick-guide/
BibTeX citation
@misc{mendez2023sql, author = {Mendez, Romina}, title = {Romina Mendez: SQL to Python Pandas: A SQL User's Quick Guide}, url = {https://r0mymendez.github.io/posts_en/2023-12-17-sql-to-python-pandas-a-sql-users-quick-guide/}, year = {2023} }