Use pandas to do joins, grouping, aggregations, and analytics on datasets in Python.
Python’s pandas library, with its fast and flexible data structures, has become the de facto standard for data-centric Python applications, offering a rich set of built-in facilities for analyzing structured data. Built on top of NumPy and integrating with other core Python libraries, such as SQLAlchemy and Matplotlib, pandas leverages these libraries behind the scenes for quick and easy data manipulation, allowing you to take advantage of their functionality with less coding. For example, the read_sql() and to_sql() pandas methods use SQLAlchemy under the hood, providing a unified way to move pandas data into and out of a SQL database.
This article illustrates how you can use pandas to combine datasets, as well as how to group, aggregate, and analyze data in them. For comparison purposes, you’ll also see how these same tasks can be addressed with SQL.
Let’s start with an example: an online outdoor fashion retailer whose business is fulfilling customer sales orders for one or more items. The sample dataset includes information related to sold items, customer orders, and the employees who fulfill those orders. You’ll see how to combine, group, and aggregate this sales data.
Creating database structures for article examples
To follow along with the examples in this article, you need to create several example tables in an Oracle database by executing the pandas_article.sql script that accompanies the article. Also make sure you have the pandas, SQLAlchemy, and cx_Oracle libraries installed in your Python environment. You can install them using the pip command:
pip install pandas
pip install SQLAlchemy
pip install cx_Oracle
For details on how to install pandas, refer to the documentation. For SQLAlchemy installation details, refer to the SQLAlchemy documentation. For details on how to install cx_Oracle, refer to the cx_Oracle Installation page. You might also want to look at the cx_Oracle Initialization page.
Loading data from Oracle Database to pandas DataFrames
After executing the pandas_article.sql script, you should have the orders and details database tables populated with example data. The following script connects to the database and loads the data from the orders and details tables into two separate DataFrames (in pandas, DataFrame is a key data structure designed to work with tabular data):
import pandas as pd
import cx_Oracle
import sqlalchemy
from sqlalchemy.exc import SQLAlchemyError
try:
    engine = sqlalchemy.create_engine("oracle+cx_oracle://usr:pswd@localhost/?service_name=orclpdb1", arraysize=1000)
    orders_sql = """SELECT * FROM orders"""
    df_orders = pd.read_sql(orders_sql, engine)
    details_sql = """SELECT * FROM details"""
    df_details = pd.read_sql(details_sql, engine)
    print(df_orders)
    print(df_details)
except SQLAlchemyError as e:
    print(e)
Note the use of the pandas read_sql() function in the above script. This function removes the burden of explicitly fetching the retrieved rows and then converting them into the DataFrame format; read_sql() does these tasks for you behind the scenes.
In this example, you use the sqlalchemy module to create an engine that connects to an Oracle database. Using a SQLAlchemy engine allows you to pass in the arraysize argument, which will be used when cx_Oracle.Cursor objects are created.
The arraysize attribute of the cx_Oracle.Cursor object is used to tune the number of rows internally fetched and buffered when fetching rows from SELECT statements and REF CURSOR. By default, this attribute is set to 100, which is perfectly acceptable when you need to load a small amount of data from the database. However, when you’re dealing with large amounts of data, you should increase the value of arraysize to reduce the number of round trips between your script and the database and, therefore, improve performance. For further details on how you can use arraysize, refer to the Tuning cx_Oracle documentation page.
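If you work with cx_Oracle directly rather than through SQLAlchemy, the same tuning is done by setting arraysize on the cursor itself. The following minimal sketch assumes the same usr/pswd credentials and service name used for the engine above:

import cx_Oracle

conn = cx_Oracle.connect("usr", "pswd", "localhost/orclpdb1")
cursor = conn.cursor()
# fetch up to 1,000 rows per round trip instead of the default 100
cursor.arraysize = 1000
cursor.execute("SELECT * FROM orders")
rows = cursor.fetchall()
conn.close()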
The script prints the df_orders and df_details DataFrames loaded from the database, producing the following output:
      PONO      ORDATE          EMPL
0  7723510  2020-12-15  John Holland
1  5626234  2020-12-15     Tim Lewis
2  7723533  2020-12-15  John Holland
3  7823675  2020-12-16    Maya Candy
4  5626376  2020-12-16     Tim Lewis
5  5626414  2020-12-17      Dan West
6  7823787  2020-12-17    Maya Candy
7  5626491  2020-12-17      Dan West

       PONO  LINEID          ITEM        BRAND   PRICE  QUANTITY  DISCOUNT
0   7723510       1   Swim Shorts       Hurley   17.95         1         0
1   7723510       2        Jacket       Oakley  142.33         1         0
2   5626234       1         Socks         Vans   16.15         4        15
3   7723533       1         Jeans   Quiksilver   84.90         2        25
4   7723533       2         Socks  Mons Royale   10.90         2         0
5   7723533       3         Socks       Stance   12.85         2        20
6   7823675       1       T-shirt    Patagonia   35.50         3         0
7   5626376       1         Hoody       Animal   44.05         1         0
8   5626376       2  Cargo Shorts       Animal   38.60         1        12
9   5626414       1         Shirt       Volcom   78.55         2         0
10  7823787       1  Boxer Shorts     Superdry   30.45         2        18
11  7823787       2        Shorts        Barts   35.90         1         0
12  5626491       1  Cargo Shorts    Billabong   48.74         1        22
13  5626491       2       Sweater      Dickies   65.95         1         0
Joining DataFrames
Often, the information you need is scattered across several datasets, requiring you to join those datasets before you can query for the necessary data. That is why combining data is typically one of the first steps in a data analysis pipeline. Quite often you need to join two datasets by matching data in a column they have in common. The DataFrame.merge() method is designed for this task: it lets you explicitly specify the columns on which to join two DataFrames, as well as the type of join to perform, to produce the desired result set. By default, merge() creates an inner join on the column that the DataFrames being joined have in common. So you can join the df_orders and df_details DataFrames created in the previous section with the following simple call of merge():
df_orders_details = df_orders.merge(df_details)
If you print the df_orders_details DataFrame, it should look as follows:
       PONO      ORDATE          EMPL  LINEID          ITEM        BRAND   PRICE  QUANTITY  DISCOUNT
0   7723510  2020-12-15  John Holland       1   Swim Shorts       Hurley   17.95         1         0
1   7723510  2020-12-15  John Holland       2        Jacket       Oakley  142.33         1         0
2   5626234  2020-12-15     Tim Lewis       1         Socks         Vans   16.15         4        15
3   7723533  2020-12-15  John Holland       1         Jeans   Quiksilver   84.90         2        25
4   7723533  2020-12-15  John Holland       2         Socks  Mons Royale   10.90         2         0
5   7723533  2020-12-15  John Holland       3         Socks       Stance   12.85         2        20
6   7823675  2020-12-16    Maya Candy       1       T-shirt    Patagonia   35.50         3         0
7   5626376  2020-12-16     Tim Lewis       1         Hoody       Animal   44.05         1         0
8   5626376  2020-12-16     Tim Lewis       2  Cargo Shorts       Animal   38.60         1        12
9   5626414  2020-12-17      Dan West       1         Shirt       Volcom   78.55         2         0
10  7823787  2020-12-17    Maya Candy       1  Boxer Shorts     Superdry   30.45         2        18
11  7823787  2020-12-17    Maya Candy       2        Shorts        Barts   35.90         1         0
12  5626491  2020-12-17      Dan West       1  Cargo Shorts    Billabong   48.74         1        22
13  5626491  2020-12-17      Dan West       2       Sweater      Dickies   65.95         1         0
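For clarity, you could also spell out the join column and the join type explicitly. The following call is a sketch that produces the same result as the default call above, because PONO is the only column the two DataFrames share; passing how='left', how='right', or how='outer' instead would keep unmatched rows from one or both sides:

df_orders_details = df_orders.merge(df_details, on='PONO', how='inner')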
After joining two datasets into a single one, you may still need to modify it before you can perform analysis. In the case of the df_orders_details DataFrame discussed here, you might need to add new columns whose values are calculated from the existing ones. For instance, you might add a TOTAL column that contains the extended item price (price multiplied by quantity, less the discount):
df_orders_details['TOTAL'] = df_orders_details.PRICE * df_orders_details.QUANTITY * (1 - df_orders_details.DISCOUNT/100)
You might also need a column that contains the discount amount for each line item, for example:
df_orders_details['OFF'] = df_orders_details.PRICE * df_orders_details.QUANTITY * (df_orders_details.DISCOUNT/100)
Since all the float columns in the df_orders_details DataFrame contain monetary values, you can round each float column to two decimal places:
df_orders_details = df_orders_details.round(2)
Some columns in the DataFrame may not be needed for the analysis you want to perform. So, you can keep only those columns that are needed. In the df_orders_details DataFrame, for example, if you want to group sales data (both totals and discounts) by order dates and employees, you can keep just these four columns:
df_sales = df_orders_details[['ORDATE','EMPL', 'TOTAL', 'OFF']]
If you print the DataFrame, it will look like this:
        ORDATE          EMPL   TOTAL    OFF
0   2020-12-15  John Holland   17.95   0.00
1   2020-12-15  John Holland  142.33   0.00
2   2020-12-15     Tim Lewis   54.91   9.69
3   2020-12-15  John Holland  127.35  42.45
4   2020-12-15  John Holland   21.80   0.00
5   2020-12-15  John Holland   20.56   5.14
6   2020-12-16    Maya Candy  106.50   0.00
7   2020-12-16     Tim Lewis   44.05   0.00
8   2020-12-16     Tim Lewis   33.97   4.63
9   2020-12-17      Dan West  157.10   0.00
10  2020-12-17    Maya Candy   49.94  10.96
11  2020-12-17    Maya Candy   35.90   0.00
12  2020-12-17      Dan West   38.02  10.72
13  2020-12-17      Dan West   65.95   0.00
How might you arrive at the same data structure in your database schema, which contains the orders and details tables? One simple solution is to create a view over the join of these tables, including in the select list the same columns you had in the df_sales DataFrame:
CREATE VIEW sales_v AS
SELECT
  ordate,
  empl,
  price*quantity*(1-discount/100) AS total,
  price*quantity*(discount/100) AS off
FROM orders
INNER JOIN details ON orders.pono = details.pono;
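To confirm that the view matches the DataFrame, you can load it back into pandas with the engine created earlier. This is just a quick sanity check; note that, depending on the driver, the column names may come back lowercased:

df_sales_v = pd.read_sql("SELECT * FROM sales_v", engine)
print(df_sales_v.head())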
Grouping and aggregating data
Using the DataFrame.groupby() method you can split a DataFrame’s data into subsets (groups) that have matching values for one or more columns, and then apply an aggregate function to each group. In the following example, you group by the ORDATE and EMPL columns in the df_sales DataFrame and then apply the sum() aggregate function to the TOTAL and OFF columns within the formed groups:
df_date_empl = df_sales.groupby(['ORDATE','EMPL']).sum()
The generated DataFrame should look as shown below:
                          TOTAL    OFF
ORDATE     EMPL
2020-12-15 John Holland  329.99  47.59
           Tim Lewis      54.91   9.69
2020-12-16 Maya Candy    106.50   0.00
           Tim Lewis      78.02   4.63
2020-12-17 Dan West      261.07  10.72
           Maya Candy     85.84  10.96
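If you prefer a flat result rather than the hierarchical ORDATE/EMPL index shown above, you can pass as_index=False to groupby(), for example:

df_date_empl_flat = df_sales.groupby(['ORDATE', 'EMPL'], as_index=False).sum()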
One problem with this approach is that the aggregate function you apply to the groupby object is applied to every numeric column of the DataFrame. But what if you need to apply different aggregate functions to different columns? For example, if you want to apply sum() and mean() to the TOTAL column and max() to the OFF column, the agg() method of the groupby object comes in handy:
df_aggs = df_sales.groupby(['ORDATE','EMPL']).agg({'TOTAL': ['sum', 'mean'], 'OFF': 'max'}).round(2)
The above example illustrates how you can select a certain column for aggregation and perform different aggregations per column. If you print df_aggs, it will look as follows:
                          TOTAL           OFF
                            sum    mean   max
ORDATE     EMPL
2020-12-15 John Holland  329.99   66.00 42.45
           Tim Lewis      54.91   54.91  9.69
2020-12-16 Maya Candy    106.50  106.50  0.00
           Tim Lewis      78.02   39.01  4.63
2020-12-17 Dan West      261.07   87.02 10.72
           Maya Candy     85.84   42.92 10.96
You might want to flatten the hierarchical column index. This can be done as follows:
df_aggs.columns = df_aggs.columns.map('_'.join).str.strip()
This will change the column names as shown below:
                         TOTAL_sum  TOTAL_mean  OFF_max
ORDATE     EMPL
...
To generate the same result set with a query to the article sample database, you could issue the following SELECT statement against the sales_v view that you should have created previously:
SELECT
ordate,
empl,
ROUND(SUM(total),2) TOTAL_sum,
ROUND(AVG(total),2) TOTAL_mean,
ROUND(MAX(off),2) OFF_max
FROM
sales_v
GROUP BY
ordate, empl
ORDER BY
ordate;
Analytical processing within groups of data
In practice, you may not always need to view data in summarized format, aggregating a group of rows into a single resulting row as illustrated in the previous example. In contrast, you may need to do some analytical processing within a group of rows so the number of rows in the group remains the same. For example, if you want to compare the salary of each employee in a department with the average salary of the employees in this department, this processing does not imply any reduction in the number of rows in the dataset—the number of rows must match the number of employees, both before and after processing.
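In pandas, this kind of within-group calculation is typically done with the transform() method of a groupby object, which returns a result aligned with the original rows. The following sketch uses a small, hypothetical df_employees DataFrame (not part of the article's sample schema) to compare each salary with the department average:

import pandas as pd

# hypothetical data for illustration only
df_employees = pd.DataFrame({
    'DEPT':   ['Sales', 'Sales', 'IT', 'IT', 'IT'],
    'EMPL':   ['Ann', 'Bob', 'Cid', 'Dee', 'Eve'],
    'SALARY': [4200, 3800, 5100, 4900, 5300]
})
# the department average, repeated for every row in the department
df_employees['DEPT_AVG'] = df_employees.groupby('DEPT')['SALARY'].transform('mean')
df_employees['DIFF'] = df_employees['SALARY'] - df_employees['DEPT_AVG']

The DataFrame keeps all five rows; each row simply gains the department average and the difference from it.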
Let’s illustrate this analytical processing with a second, more complex example. Imagine you want to analyze stock price data for a list of tickers over a certain period of time. To start, you want to weed out the tickers whose price dropped by more than 1% from the previous day’s price at any point in the period. To accomplish this, you need to group the data by ticker symbol, ordering the rows by date within each group. Then you can iterate over the rows in a group, comparing the stock price in the current row with the price in the previous row. If the price in the current row is lower than the price in the previous row by more than 1%, the entire group of rows must be excluded from the result set. This section describes how you could implement this filtering.
The example uses stock data obtained via the yfinance library, a Python wrapper for the Yahoo Finance API, which you can install with the pip command, as follows:
pip install yfinance
In the following script, you get stock data for several popular stocks for a five-day period:
import pandas as pd
import yfinance as yf
frames = []
tickers = ['AAPL', 'TSLA', 'FB', 'ORCL', 'AMZN']
for ticker in tickers:
    tkr = yf.Ticker(ticker)
    hist = tkr.history(period='5d')
    hist['Symbol'] = ticker
    # keep only the ticker symbol and the closing price, renamed to Price
    frames.append(hist[['Symbol', 'Close']].rename(columns={'Close': 'Price'}))
stocks = pd.concat(frames)
yfinance returns the requested dataset as a pandas DataFrame with the Date column as the index. Assuming you are targeting the closing prices only, you keep just the Symbol and Close columns, renaming the latter to Price for clarity. As a result, the data in the DataFrame might look like this:
           Symbol        Price
Date
2020-12-18   AAPL   126.660004
2020-12-21   AAPL   128.229996
2020-12-22   AAPL   131.880005
2020-12-23   AAPL   130.960007
2020-12-24   AAPL   131.970001
2020-12-18   TSLA   695.000000
2020-12-21   TSLA   649.859985
2020-12-22   TSLA   640.340027
2020-12-23   TSLA   645.979980
2020-12-24   TSLA   661.770020
2020-12-18     FB   276.399994
2020-12-21     FB   272.790009
2020-12-22     FB   267.089996
2020-12-23     FB   268.109985
2020-12-24     FB   267.399994
2020-12-18   ORCL    65.059998
2020-12-21   ORCL    64.480003
2020-12-22   ORCL    65.150002
2020-12-23   ORCL    65.300003
2020-12-24   ORCL    64.959999
2020-12-18   AMZN  3201.649902
2020-12-21   AMZN  3206.179932
2020-12-22   AMZN  3206.520020
2020-12-23   AMZN  3185.270020
2020-12-24   AMZN  3172.689941
From the above row set, you need to select the rows related to only those symbols whose price never dropped by more than 1% from the previous day’s price. For this, you need a mechanism that allows you to compare the Price value of a row with the Price value of the previous row within a symbol group. The following line of code implements this mechanism, grouping the row set by symbol and adding a new column that contains the previous day’s price values.
stocks['Prev'] = stocks.groupby(['Symbol'])['Price'].shift(1)
The updated row set should look as follows:
           Symbol        Price         Prev
Date
2020-12-18   AAPL   126.660004          NaN
2020-12-21   AAPL   128.229996   126.660004
2020-12-22   AAPL   131.880005   128.229996
2020-12-23   AAPL   130.960007   131.880005
2020-12-24   AAPL   131.970001   130.960007
2020-12-18   TSLA   695.000000          NaN
2020-12-21   TSLA   649.859985   695.000000
2020-12-22   TSLA   640.340027   649.859985
2020-12-23   TSLA   645.979980   640.340027
2020-12-24   TSLA   661.770020   645.979980
2020-12-18     FB   276.399994          NaN
2020-12-21     FB   272.790009   276.399994
2020-12-22     FB   267.089996   272.790009
2020-12-23     FB   268.109985   267.089996
2020-12-24     FB   267.399994   268.109985
2020-12-18   ORCL    65.059998          NaN
2020-12-21   ORCL    64.480003    65.059998
2020-12-22   ORCL    65.150002    64.480003
2020-12-23   ORCL    65.300003    65.150002
2020-12-24   ORCL    64.959999    65.300003
2020-12-18   AMZN  3201.649902          NaN
2020-12-21   AMZN  3206.179932  3201.649902
2020-12-22   AMZN  3206.520020  3206.179932
2020-12-23   AMZN  3185.270020  3206.520020
2020-12-24   AMZN  3172.689941  3185.270020
The Prev results for the first day of the observation period are NaN because this example does not track what happened before the five-day range.
Now you can find those rows where the ratio of the price to the previous price is less than 99%, for example:
stocks_to_exclude = stocks[stocks['Price']/stocks['Prev'] < .99]
The found rows are the following:
           Symbol       Price        Prev
Date
2020-12-21   TSLA  649.859985  695.000000
2020-12-22   TSLA  640.340027  649.859985
2020-12-21     FB  272.790009  276.399994
2020-12-22     FB  267.089996  272.790009
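As a side note, the same rows can be selected a bit more compactly with the groupby object's pct_change() method, which computes the day-over-day change within each symbol group and makes the helper Prev column unnecessary. A sketch equivalent to the shift()-based filter above:

stocks_to_exclude = stocks[stocks.groupby('Symbol')['Price'].pct_change() < -0.01]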
You can extract the symbols presented in the above rows as follows:
exclude_list = list(set(stocks_to_exclude['Symbol'].tolist()))
Here you extract the values of the Symbol column in the stocks_to_exclude DataFrame, converting those values to a list. To exclude duplicates from this list, you convert it into a set and then back to a list (one of the most popular ways to remove duplicates from a list). Printing exclude_list yields the following:
['TSLA', 'FB']
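A slightly more direct way to get the same list is the pandas unique() method, which returns the distinct values in the order they first appear:

exclude_list = stocks_to_exclude['Symbol'].unique().tolist()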
Next, you need to exclude from the stocks DataFrame those rows whose Symbol field contains one of the above names. This can be implemented as the following one-liner:
stocks_filtered = stocks[~stocks['Symbol'].isin(exclude_list)][['Symbol', 'Price']]
You pass the exclude_list that you created previously to the stocks['Symbol'].isin() method to get a Series of Booleans indicating whether each value in the stocks['Symbol'] Series is in exclude_list. The tilde sign in front of the expression inverts that Boolean Series. Passing the inverted Series to the [] operator of the stocks DataFrame returns all the rows whose Symbol column does not contain a symbol found in exclude_list. So, the resulting DataFrame should look as follows:
           Symbol        Price
Date
2020-12-18   AAPL   126.660004
2020-12-21   AAPL   128.229996
2020-12-22   AAPL   131.880005
2020-12-23   AAPL   130.960007
2020-12-24   AAPL   131.970001
2020-12-18   ORCL    65.059998
2020-12-21   ORCL    64.480003
2020-12-22   ORCL    65.150002
2020-12-23   ORCL    65.300003
2020-12-24   ORCL    64.959999
2020-12-18   AMZN  3201.649902
2020-12-21   AMZN  3206.179932
2020-12-22   AMZN  3206.520020
2020-12-23   AMZN  3185.270020
2020-12-24   AMZN  3172.689941
With the help of analytical SQL, you can get the same result set with a single query to the database. Before you can do this, however, you need to save the unfiltered row set to the article database, which should contain the stocks table for storing this data (refer back to the pandas_article.sql script that you should have run at the very beginning).
To conform to the structure of the stocks database table, you need to modify the stocks DataFrame that contains the unfiltered data of this example. This can be done with the following lines of code:
stocks_to_db = stocks[['Symbol', 'Price']].reset_index().rename(columns={'Date': 'Dt'}).round(2)
stocks_to_db = stocks_to_db.astype({'Dt': str})
In the first line, you specify the columns to include in the result set: Symbol and Price. By resetting the index, you add Date to this column list. To conform to the name of this column in the stocks database table, you rename it to Dt. The round(2) function rounds the values in the Price column to two decimal places. In the second line, you cast the Dt column to the str type, because pandas sets it to datetime by default.
Finally, you need to convert the stocks_to_db DataFrame to a structure that is passable to a method that can do a bulk insert operation. In the following line of code, you convert the stocks_to_db DataFrame to a list of tuples:
data = list(stocks_to_db.itertuples(index=False, name=None))
The following script uses the above list of tuples to upload the data it contains to the database. Storing the data you work with can be useful when you’re going to reuse it.
import cx_Oracle
try:
    conn = cx_Oracle.connect("usr", "pswd", "localhost/orcl")
    cursor = conn.cursor()
    # define the INSERT statement
    query_add_stocks = """INSERT INTO stocks (dt, symbol, price)
                          VALUES (TO_DATE(:1, 'YYYY-MM-DD'), :2, :3)"""
    # insert the stock rows
    cursor.executemany(query_add_stocks, data)
    conn.commit()
except cx_Oracle.DatabaseError as exc:
    err, = exc.args
    print("Oracle-Error-Code:", err.code)
    print("Oracle-Error-Message:", err.message)
finally:
    cursor.close()
    conn.close()
In this script, you connect to the database and obtain a cursor object to interact with it. You use the cursor.executemany() method that inserts all the rows from the data list of tuples into the database in a single round trip.
After the successful execution of the above script, you can issue queries against the stocks table. To get the row set you had in the stocks_filtered DataFrame, you can issue the following query:
SELECT s.*
FROM stocks s
LEFT JOIN (
  SELECT DISTINCT symbol
  FROM (
    SELECT price/LAG(price) OVER (PARTITION BY symbol ORDER BY dt) AS dif,
           symbol
    FROM stocks
  )
  WHERE dif < 0.99
) a ON a.symbol = s.symbol
WHERE a.symbol IS NULL;
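If you want to pull this result straight back into pandas to compare it with the stocks_filtered DataFrame, you can pass the same query to read_sql(). A sketch, reusing the SQLAlchemy engine created at the beginning of the article:

# check_sql holds the SELECT statement shown above, without the trailing semicolon
df_check = pd.read_sql(check_sql, engine)
print(df_check)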
Conclusion
By providing a rich set of methods, data containers, and types, pandas is one of those packages that make Python a great language for data processing and analysis. As you saw in this article, pandas, just like SQL, lets you perform even complex data queries with the help of intuitive and easy-to-use facilities, combining and reshaping original datasets as needed.
Dig deeper
- Learn more about how to install pandas.
- Learn more about how to install SQLAlchemy.
- Learn more about how to install cx_Oracle.
- Read the cx_Oracle Initialization guide.
- Get the sample dataset for this article.
Illustration: Wes Rowell