Aggregating Data
Explore the Jupyter Notebooks
This page includes my notes on this topic.
To see these functions in action, check out the Jupyter notebook here: Sample Notebooks.
Summarizing Numerical Data
The mean is a common way to identify the "center" of numeric data. You can compute it using .mean() on a column. Other summary statistics include median, mode, min, max, variance, standard deviation, sums, and quantiles.
my_dataframe['column_name'].mean()
my_dataframe['column_name'].median()
my_dataframe['column_name'].quantile(0.25) # 25th percentile
Summarizing Dates
Date columns can also have summary statistics. For example, to find the oldest and youngest dates in a column, use .min() and .max().
my_dataframe['date_column'].min()
my_dataframe['date_column'].max()
Using .agg() Method
The .agg() method lets you apply custom summary functions. For example, we can create a compute_30th function which takes the column and computes the 30th percentile. This function can then be used with the agg() function.
def compute_30th(column):
return column.quantile(0.3)
my_dataframe['weight'].agg(compute_30th)
Summarizing Multiple Columns
You can apply .agg() to multiple columns at once, like so:
my_dataframe[['weight', 'height']].agg(compute_30th)
Multiple Summaries
Use .agg() to calculate multiple statistics at once by passing a list of functions:
def compute_40th(column):
return column.quantile(0.4)
my_dataframe['weight'].agg([compute_30th, compute_40th])
Cumulative Sum
The .cumsum() method computes cumulative sums. Each value in the result is the running total up to that row.
my_dataframe['weight'].cumsum()
Cumulative Statistics
Other cumulative methods include .cummax(), .cummin(), and .cumprod(). These return cumulative maximum, minimum, and product values, respectively.
my_dataframe['weight'].cummax()
my_dataframe['weight'].cummin()
my_dataframe['weight'].cumprod()
Avoiding Double Counting
Counting can be tricky when items appear multiple times in your data. Consider the DataFrame below which shows vet visits.
Name Breed Visit Date
0 Ted Labrador 2023-01-10
1 Stella Golden Retriever 2023-01-12
2 Ted Chow Chow 2023-01-15
3 Robin Poodle 2023-01-18
4 Ted Labrador 2023-01-20
5 Stella Golden Retriever 2023-01-22
Note that some pets, like Ted and Stella, have visited multiple times. To count Breeds correctly, we must address duplicates.
-
Dropping Duplicate Names
We can remove rows with duplicate names using
drop_duplicates(). By settingsubsetto the name column, we ensure each name appears only once. However, if two pets share the same name but are different Breeds, this method might not work.vet_dataframe.drop_duplicates(subset='Name')Output:
Name Breed Visit Date0 Ted Labrador 2023-01-101 Stella Golden Retriever 2023-01-123 Robin Poodle 2023-01-18 -
Dropping Duplicate Pairs
To handle duplicates more accurately, drop rows based on both
nameandBreedby passing both columns tosubset. This ensures all unique name-Breed combinations are retained.vet_dataframe.drop_duplicates(subset=['Name', 'Breed'])Output:
Name Breed Visit Date0 Ted Labrador 2023-01-101 Stella Golden Retriever 2023-01-122 Ted Chow Chow 2023-01-153 Robin Poodle 2023-01-18 -
Counting Breeds
Use the
value_counts()method on the Breed column to count each Breed, and addsort=Trueto display the most common Breeds first.vet_dataframe['Breed'].value_counts(sort=True)Output:
BreedLabrador 2Golden Retriever 2Chow Chow 1Poodle 1Name: count, dtype: int64 -
Proportions
Set
normalize=Trueinvalue_counts()to calculate proportions, showing percentages instead of raw counts. For example, 25% of the pets might be Labradors.vet_dataframe['Breed'].value_counts(normalize=True)Output:
BreedLabrador 0.333333Golden Retriever 0.333333Chow Chow 0.166667Poodle 0.166667Name: proportion, dtype: float64
Grouped Summaries
Grouped summaries allow you to get insights by calculating statistics for specific groups in your data. Consider the following dog data:
Name Breed Color Visit Date Weight
0 Ted Labrador Black 2023-01-10 30
1 Stella Golden Retriever Golden 2023-01-12 25
2 Ted Chow Chow Brown 2023-01-15 28
3 Robin Poodle White 2023-01-18 20
4 Ted Labrador Black 2023-01-20 32
5 Stella Golden Retriever Golden 2023-01-22 26
Instead of manually subsetting and calculating statistics for each group, the groupby() method simplifies the process.
-
Grouped Summaries
Group by a single column and calculate statistics. For instance, count visits for each breed:
dogs.groupby("Breed")["Name"].count()Output:
BreedChow Chow 1Golden Retriever 2Labrador 2Poodle 1Name: Name, dtype: int64 -
Multiple Summaries
You can calculate multiple statistics at once using
.agg(). For example, count unique visitors and total visits for each color:dogs.groupby("Color").agg({"Name": "nunique","Visit Date": "count"})Output:
Name Visit DateColorBlack 1 2Brown 1 1Golden 1 2White 1 1 -
Grouping by Multiple Columns
Group by multiple columns to calculate more specific statistics, such as the number of visits by breed and color:
dogs.groupby(["Color", "Breed"])["Name"].count()Output:
Color BreedBlack Labrador 2Brown Chow Chow 1Golden Golden Retriever 2White Poodle 1Name: Name, dtype: int64 -
Aggregating Multiple Columns
You can group by multiple columns and aggregate across multiple columns. For example, count unique names and total visits:
dogs.groupby(["Color", "Breed"]).agg({"Name": "nunique","Visit Date": "count"})Output:
Name Visit DateColor BreedBlack Labrador 1 2Brown Chow Chow 1 1Golden Golden Retriever 1 2White Poodle 1 1
Pivot Tables
Pivot tables offer a powerful way to calculate grouped summaries, similar to those in spreadsheets. Consider the following dog data:
Name Breed Color Visit Date Weight
0 Ted Labrador Black 2023-01-10 30
1 Stella Golden Retriever Golden 2023-01-12 25
2 Ted Chow Chow Brown 2023-01-15 28
3 Robin Poodle White 2023-01-18 20
4 Ted Labrador Black 2023-01-20 32
5 Stella Golden Retriever Golden 2023-01-22 26
-
Basic Pivot Table
Create a pivot table to calculate grouped summaries. For example, to find the mean weight of dogs by color:
dogs.pivot_table(values="Weight", index="Color")Output:
WeightColorBlack 31.0Brown 28.0Golden 25.5White 20.0 -
Custom Statistics
Use the
aggfuncargument to specify a different summary statistic, such as the median:import numpy as npdogs.pivot_table(values="Weight", index="Color", aggfunc=np.median)Output:
WeightColorBlack 31.0Brown 28.0Golden 25.5White 20.0 -
Multiple Statistics
Pass a list of functions to
aggfuncto calculate multiple statistics, such as the mean and median:dogs.pivot_table(values="Weight", index="Color", aggfunc=[np.mean, np.median])Output:
mean medianWeight WeightColorBlack 31.0 31.0Brown 28.0 28.0Golden 25.5 25.5White 20.0 20.0 -
Pivoting on Two Variables
Group by two variables using the
columnsargument. For example, to find mean weights by color and breed:dogs.pivot_table(values="Weight", index="Color", columns="Breed")Missing values appear as
NaNif no data exists for a combination.Breed Chow Chow Golden Retriever Labrador PoodleColorBlack NaN NaN 31.0 NaNBrown 28.0 NaN NaN NaNGolden NaN 25.5 NaN NaNWhite NaN NaN NaN 20.0 -
Filling Missing Values
Fill missing values with
fill_valueto replaceNaNwith a default value:dogs.pivot_table(values="Weight", index="Color", columns="Breed", fill_value=0)Output:
Breed Chow Chow Golden Retriever Labrador PoodleColorBlack 0 0 31.0 0Brown 28.0 0 0 0Golden 0 25.5 0 0White 0 0 0 20.0 -
Adding Totals
Enable the
marginsargument to include row and column totals, summarizing the dataset:dogs.pivot_table(values="Weight", index="Color", columns="Breed", fill_value=0, margins=True)This adds a summary row and column with mean values for each group and the overall dataset.
Breed Chow Chow Golden Retriever Labrador Poodle AllColorBlack 0.0 0.0 31.0 0.0 31.000000Brown 28.0 0.0 0.0 0.0 28.000000Golden 0.0 25.5 0.0 0.0 25.500000White 0.0 0.0 0.0 20.0 20.000000All 28.0 25.5 31.0 20.0 26.833333