Slicing and Indexing
Explore the Jupyter Notebooks
This page includes my notes on this topic.
To see these functions in action, check out the Jupyter notebook here: Sample Notebooks.
Explicit Indexes
Indexes provide an efficient way to manage and subset data in pandas DataFrames. They offer efficient filtering, but they have limitations: they add complexity and break "tidy data" rules.
Consider the dataset below:
Name Breed Color Visit Date Weight
0 Ted Labrador Black 2023-01-10 30
1 Stella Golden Retriever Golden 2023-01-12 25
2 Ted Chow Chow Brown 2023-01-15 28
3 Robin Poodle White 2023-01-18 20
4 Ted Labrador Black 2023-01-20 32
5 Stella Golden Retriever Golden 2023-01-22 26
Indexing operations:
-
Accessing Columns and Index
You can access column names and row indexes directly.
print(dogs.columns) # Access column namesprint(dogs.index) # Access row indexOutput:
Index(['Name', 'Breed', 'Color', 'Visit Date', 'Weight'], dtype='object')RangeIndex(start=0, stop=6, step=1) -
Setting an Index
Set a column (e.g.,
Name) as the index.dogs_indexed = dogs.set_index("Name")print(dogs_indexed)Output:
Breed Color Visit Date WeightNameTed Labrador Black 2023-01-10 30Stella Golden Retriever Golden 2023-01-12 25Ted Chow Chow Brown 2023-01-15 28Robin Poodle White 2023-01-18 20Ted Labrador Black 2023-01-20 32Stella Golden Retriever Golden 2023-01-22 26 -
Resetting the Index
Reset the index to move it back to the DataFrame body.
dogs_reset = dogs_indexed.reset_index()print(dogs_reset)Output:
Name Breed Color Visit Date Weight0 Ted Labrador Black 2023-01-10 301 Stella Golden Retriever Golden 2023-01-12 252 Ted Chow Chow Brown 2023-01-15 283 Robin Poodle White 2023-01-18 204 Ted Labrador Black 2023-01-20 325 Stella Golden Retriever Golden 2023-01-22 26 -
Dropping an Index
Remove the index entirely using the
dropargument.dogs_dropped = dogs_indexed.reset_index(drop=True)print(dogs_dropped)Output:
Breed Color Visit Date Weight0 Labrador Black 2023-01-10 301 Golden Retriever Golden 2023-01-12 252 Chow Chow Brown 2023-01-15 283 Poodle White 2023-01-18 204 Labrador Black 2023-01-20 325 Golden Retriever Golden 2023-01-22 26 -
Subsetting with Indexes
Indexes make filtering data simpler. For example, filter rows where
Nameis "Ted."ted_rows = dogs_indexed.loc["Ted"]print(ted_rows)Output:
Breed Color Visit Date WeightNameTed Labrador Black 2023-01-10 30Ted Chow Chow Brown 2023-01-15 28Ted Labrador Black 2023-01-20 32 -
Multi-Level Indexing
Set multiple columns (e.g.,
BreedandColor) as the index.dogs_multi = dogs.set_index(["Breed", "Color"])print(dogs_multi)Output:
Name Visit Date WeightBreed ColorLabrador Black Ted 2023-01-10 30Golden Retriever Golden Stella 2023-01-12 25Chow Chow Brown Ted 2023-01-15 28Poodle White Robin 2023-01-18 20Labrador Black Ted 2023-01-20 32Golden Retriever Golden Stella 2023-01-22 26 -
Subsetting Multi-Level Indexes
Subset rows using outer and inner indexes.
Outer Index: Filter rows for "Labrador."
labs = dogs_multi.loc["Labrador"]print(labs)Output:
Name Visit Date WeightColorBlack Ted 2023-01-10 30Black Ted 2023-01-20 32Inner Index: Filter rows for specific combinations (e.g., Labrador and Black).
brown_labs = dogs_multi.loc[[("Labrador", "Black")]]print(brown_labs)Output:
Name Visit Date WeightBreed ColorLabrador Black Ted 2023-01-10 30Labrador Black Ted 2023-01-20 32 -
Sorting Indexes
Sort rows by index values using
sort_index().sorted_dogs = dogs_multi.sort_index()print(sorted_dogs)Output:
Name Visit Date WeightBreed ColorChow Chow Brown Ted 2023-01-15 28Golden Retriever Golden Stella 2023-01-12 25Golden Retriever Golden Stella 2023-01-22 26Labrador Black Ted 2023-01-10 30Labrador Black Ted 2023-01-20 32Poodle White Robin 2023-01-18 20
Slicing and Subsetting
Slicing is a way to select consecutive elements or subsets of data from lists and DataFrames. pandas provides versatile methods like .loc and .iloc for this purpose.
Consider the dogs data:
breed color age weight date_of_birth
0 Labrador Brown 3 25 2016-01-01
1 Chow Chow Tan 4 18 2015-06-15
2 Poodle White 2 12 2017-03-20
3 Labrador Black 7 30 2014-09-10
4 Beagle Brown 5 10 2015-12-01
5 Chihuahua Grey 1 3 2018-05-05
Operations:
-
Slicing Lists
You can slice a list by specifying start and end positions with a colon. The end position is exclusive.
breeds = ["Labrador", "Poodle", "Chow Chow", "Golden Retriever", "Labrador", "Chihuahua"]print(breeds[2:5]) # Slice from index 2 to 4print(breeds[:3]) # Slice first three elementsOutput:
['Chow Chow', 'Golden Retriever', 'Labrador']['Labrador', 'Poodle', 'Chow Chow'] -
Sort Index Before Slicing
To slice rows in a DataFrame, sort the index first.
dogs = dogs.set_index(["Breed", "Color"]).sort_index()print(dogs)Output:
Name Visit DateBreed ColorChow Chow Brown Ted 2023-01-15Golden Retriever Golden Stella 2023-01-12Labrador Black Ted 2023-01-10Labrador Black Ted 2023-01-20 -
Slicing Outer Index Level
Use
.loc[]to slice rows by index values. The last value is included.print(dogs.loc["Labrador":"Poodle"])Output:
Name Visit DateBreed ColorLabrador Black Ted 2023-01-10Labrador Black Ted 2023-01-20 -
Slicing Inner Index Levels
For inner index levels, use tuples for start and end positions.
print(dogs.loc[("Labrador", "Black"):("Poodle", "White")])Output:
Name Visit DateBreed ColorLabrador Black Ted 2023-01-10 -
Slicing Columns
Slice columns while keeping all rows by passing a colon for rows.
print(dogs.loc[:, "Name":"Visit Date"])Output:
Name Visit DateBreed ColorChow Chow Brown Ted 2023-01-15Golden Retriever Golden Stella 2023-01-12 -
Slicing Rows and Columns
Simultaneously slice rows and columns by specifying slices for both.
print(dogs.loc["Labrador":"Poodle", "Name":"Visit Date"])Output:
Name Visit DateBreed ColorLabrador Black Ted 2023-01-10 -
Slicing by Dates
Set a date column as the index to slice by date ranges.
dogs = dogs.set_index("Visit Date").sort_index()print(dogs.loc["2023-01-10":"2023-01-15"])Output:
Name Breed ColorVisit Date2023-01-10 Ted Labrador Black2023-01-12 Stella Golden Retriever Golden -
Partial Date Slicing
Slice by year only to include all dates within the range.
print(dogs.loc["2023"])Output:
Name Breed ColorVisit Date2023-01-10 Ted Labrador Black2023-01-12 Stella Golden Retriever Golden -
Subsetting by Position with .iloc
Use
.ilocfor slicing by row and column numbers. The end value is exclusive.print(dogs.iloc[1:3, 0:2])Output:
Name BreedVisit Date2023-01-12 Stella Golden Retriever2023-01-15 Ted Chow Chow
Pivoting and Summary Stats
Pivot tables are created using .pivot_table(). The values argument specifies the column to aggregate, while index and columns define row and column grouping. The default aggregation is mean.
Consider the dogs data:
breed color age weight date_of_birth height
0 Labrador Brown 3 25 2016-01-01 23
1 Chow Chow Tan 4 18 2015-06-15 20
2 Poodle White 2 12 2017-03-20 20
3 Labrador Black 7 30 2014-09-10 23
4 Beagle Brown 5 10 2015-12-01 21
5 Chihuahua Grey 1 3 2018-05-05 18
6 Poodle Brown 3 20 2017-03-20 22
7 Labrador Black 6 28 2014-09-10 24
8 Beagle White 4 12 2015-12-01 21
9 Chow Chow Tan 5 17 2015-06-15 19
-
Pivoting
You can create a pivot table to group and aggregate data by rows and columns.
dogs.pivot_table(values='height', index='breed', columns='color', aggfunc='mean')Output:
color Black Brown Grey Tan WhitebreedBeagle NaN 21.0 NaN NaN 21.0Chihuahua NaN NaN 18.0 NaN NaNChow Chow NaN NaN NaN 19.5 NaNLabrador 23.5 23.0 NaN NaN NaNPoodle NaN 22.0 NaN NaN 20.0 -
.loc[] + slicing is a power combo
Pivot tables are DataFrames with sorted indexes, so slicing and
.loc[]functionality can be used for subsetting.dogs.loc[dogs['breed'] == 'Poodle', ['height']]Output:
height2 206 22 -
The axis argument
Use the
axisargument to specify whether to calculate summary stats across rows (axis=0, default) or columns (axis=1).dogs.mean(axis="index") # Across rowsOutput:
height 22.5 -
Calculating summary stats across columns
Set
axis=1to calculate summary stats across columns, useful when all columns contain the same data type.dogs.mean(axis=1) # Across columnsOutput:
breedLabrador 22.5Poodle 21.0