Slicing and Indexing
Explore the Jupyter Notebooks
This page includes my notes on this topic.
To see these functions in action, check out the Jupyter notebook here: Sample Notebooks.
Explicit Indexes
Indexes provide an efficient way to manage and subset data in pandas DataFrames. They offer efficient filtering, but they have limitations: they add complexity and break "tidy data" rules.
Consider the dataset below:
Name Breed Color Visit Date Weight
0 Ted Labrador Black 2023-01-10 30
1 Stella Golden Retriever Golden 2023-01-12 25
2 Ted Chow Chow Brown 2023-01-15 28
3 Robin Poodle White 2023-01-18 20
4 Ted Labrador Black 2023-01-20 32
5 Stella Golden Retriever Golden 2023-01-22 26
Indexing operations:
-
Accessing Columns and Index
You can access column names and row indexes directly.
print(dogs.columns) # Access column names
print(dogs.index) # Access row indexOutput:
Index(['Name', 'Breed', 'Color', 'Visit Date', 'Weight'], dtype='object')
RangeIndex(start=0, stop=6, step=1) -
Setting an Index
Set a column (e.g.,
Name
) as the index.dogs_indexed = dogs.set_index("Name")
print(dogs_indexed)Output:
Breed Color Visit Date Weight
Name
Ted Labrador Black 2023-01-10 30
Stella Golden Retriever Golden 2023-01-12 25
Ted Chow Chow Brown 2023-01-15 28
Robin Poodle White 2023-01-18 20
Ted Labrador Black 2023-01-20 32
Stella Golden Retriever Golden 2023-01-22 26 -
Resetting the Index
Reset the index to move it back to the DataFrame body.
dogs_reset = dogs_indexed.reset_index()
print(dogs_reset)Output:
Name Breed Color Visit Date Weight
0 Ted Labrador Black 2023-01-10 30
1 Stella Golden Retriever Golden 2023-01-12 25
2 Ted Chow Chow Brown 2023-01-15 28
3 Robin Poodle White 2023-01-18 20
4 Ted Labrador Black 2023-01-20 32
5 Stella Golden Retriever Golden 2023-01-22 26 -
Dropping an Index
Remove the index entirely using the
drop
argument.dogs_dropped = dogs_indexed.reset_index(drop=True)
print(dogs_dropped)Output:
Breed Color Visit Date Weight
0 Labrador Black 2023-01-10 30
1 Golden Retriever Golden 2023-01-12 25
2 Chow Chow Brown 2023-01-15 28
3 Poodle White 2023-01-18 20
4 Labrador Black 2023-01-20 32
5 Golden Retriever Golden 2023-01-22 26 -
Subsetting with Indexes
Indexes make filtering data simpler. For example, filter rows where
Name
is "Ted."ted_rows = dogs_indexed.loc["Ted"]
print(ted_rows)Output:
Breed Color Visit Date Weight
Name
Ted Labrador Black 2023-01-10 30
Ted Chow Chow Brown 2023-01-15 28
Ted Labrador Black 2023-01-20 32 -
Multi-Level Indexing
Set multiple columns (e.g.,
Breed
andColor
) as the index.dogs_multi = dogs.set_index(["Breed", "Color"])
print(dogs_multi)Output:
Name Visit Date Weight
Breed Color
Labrador Black Ted 2023-01-10 30
Golden Retriever Golden Stella 2023-01-12 25
Chow Chow Brown Ted 2023-01-15 28
Poodle White Robin 2023-01-18 20
Labrador Black Ted 2023-01-20 32
Golden Retriever Golden Stella 2023-01-22 26 -
Subsetting Multi-Level Indexes
Subset rows using outer and inner indexes.
Outer Index: Filter rows for "Labrador."
labs = dogs_multi.loc["Labrador"]
print(labs)Output:
Name Visit Date Weight
Color
Black Ted 2023-01-10 30
Black Ted 2023-01-20 32Inner Index: Filter rows for specific combinations (e.g., Labrador and Black).
brown_labs = dogs_multi.loc[[("Labrador", "Black")]]
print(brown_labs)Output:
Name Visit Date Weight
Breed Color
Labrador Black Ted 2023-01-10 30
Labrador Black Ted 2023-01-20 32 -
Sorting Indexes
Sort rows by index values using
sort_index()
.sorted_dogs = dogs_multi.sort_index()
print(sorted_dogs)Output:
Name Visit Date Weight
Breed Color
Chow Chow Brown Ted 2023-01-15 28
Golden Retriever Golden Stella 2023-01-12 25
Golden Retriever Golden Stella 2023-01-22 26
Labrador Black Ted 2023-01-10 30
Labrador Black Ted 2023-01-20 32
Poodle White Robin 2023-01-18 20
Slicing and Subsetting
Slicing is a way to select consecutive elements or subsets of data from lists and DataFrames. pandas provides versatile methods like .loc
and .iloc
for this purpose.
Consider the dogs data:
breed color age weight date_of_birth
0 Labrador Brown 3 25 2016-01-01
1 Chow Chow Tan 4 18 2015-06-15
2 Poodle White 2 12 2017-03-20
3 Labrador Black 7 30 2014-09-10
4 Beagle Brown 5 10 2015-12-01
5 Chihuahua Grey 1 3 2018-05-05
Operations:
-
Slicing Lists
You can slice a list by specifying start and end positions with a colon. The end position is exclusive.
breeds = ["Labrador", "Poodle", "Chow Chow", "Golden Retriever", "Labrador", "Chihuahua"]
print(breeds[2:5]) # Slice from index 2 to 4
print(breeds[:3]) # Slice first three elementsOutput:
['Chow Chow', 'Golden Retriever', 'Labrador']
['Labrador', 'Poodle', 'Chow Chow'] -
Sort Index Before Slicing
To slice rows in a DataFrame, sort the index first.
dogs = dogs.set_index(["Breed", "Color"]).sort_index()
print(dogs)Output:
Name Visit Date
Breed Color
Chow Chow Brown Ted 2023-01-15
Golden Retriever Golden Stella 2023-01-12
Labrador Black Ted 2023-01-10
Labrador Black Ted 2023-01-20 -
Slicing Outer Index Level
Use
.loc[]
to slice rows by index values. The last value is included.print(dogs.loc["Labrador":"Poodle"])
Output:
Name Visit Date
Breed Color
Labrador Black Ted 2023-01-10
Labrador Black Ted 2023-01-20 -
Slicing Inner Index Levels
For inner index levels, use tuples for start and end positions.
print(dogs.loc[("Labrador", "Black"):("Poodle", "White")])
Output:
Name Visit Date
Breed Color
Labrador Black Ted 2023-01-10 -
Slicing Columns
Slice columns while keeping all rows by passing a colon for rows.
print(dogs.loc[:, "Name":"Visit Date"])
Output:
Name Visit Date
Breed Color
Chow Chow Brown Ted 2023-01-15
Golden Retriever Golden Stella 2023-01-12 -
Slicing Rows and Columns
Simultaneously slice rows and columns by specifying slices for both.
print(dogs.loc["Labrador":"Poodle", "Name":"Visit Date"])
Output:
Name Visit Date
Breed Color
Labrador Black Ted 2023-01-10 -
Slicing by Dates
Set a date column as the index to slice by date ranges.
dogs = dogs.set_index("Visit Date").sort_index()
print(dogs.loc["2023-01-10":"2023-01-15"])Output:
Name Breed Color
Visit Date
2023-01-10 Ted Labrador Black
2023-01-12 Stella Golden Retriever Golden -
Partial Date Slicing
Slice by year only to include all dates within the range.
print(dogs.loc["2023"])
Output:
Name Breed Color
Visit Date
2023-01-10 Ted Labrador Black
2023-01-12 Stella Golden Retriever Golden -
Subsetting by Position with .iloc
Use
.iloc
for slicing by row and column numbers. The end value is exclusive.print(dogs.iloc[1:3, 0:2])
Output:
Name Breed
Visit Date
2023-01-12 Stella Golden Retriever
2023-01-15 Ted Chow Chow
Pivoting and Summary Stats
Pivot tables are created using .pivot_table()
. The values
argument specifies the column to aggregate, while index
and columns
define row and column grouping. The default aggregation is mean.
Consider the dogs data:
breed color age weight date_of_birth height
0 Labrador Brown 3 25 2016-01-01 23
1 Chow Chow Tan 4 18 2015-06-15 20
2 Poodle White 2 12 2017-03-20 20
3 Labrador Black 7 30 2014-09-10 23
4 Beagle Brown 5 10 2015-12-01 21
5 Chihuahua Grey 1 3 2018-05-05 18
6 Poodle Brown 3 20 2017-03-20 22
7 Labrador Black 6 28 2014-09-10 24
8 Beagle White 4 12 2015-12-01 21
9 Chow Chow Tan 5 17 2015-06-15 19
-
Pivoting
You can create a pivot table to group and aggregate data by rows and columns.
dogs.pivot_table(values='height', index='breed', columns='color', aggfunc='mean')
Output:
color Black Brown Grey Tan White
breed
Beagle NaN 21.0 NaN NaN 21.0
Chihuahua NaN NaN 18.0 NaN NaN
Chow Chow NaN NaN NaN 19.5 NaN
Labrador 23.5 23.0 NaN NaN NaN
Poodle NaN 22.0 NaN NaN 20.0 -
.loc[] + slicing is a power combo
Pivot tables are DataFrames with sorted indexes, so slicing and
.loc[]
functionality can be used for subsetting.dogs.loc[dogs['breed'] == 'Poodle', ['height']]
Output:
height
2 20
6 22 -
The axis argument
Use the
axis
argument to specify whether to calculate summary stats across rows (axis=0
, default) or columns (axis=1
).dogs.mean(axis="index") # Across rows
Output:
height 22.5
-
Calculating summary stats across columns
Set
axis=1
to calculate summary stats across columns, useful when all columns contain the same data type.dogs.mean(axis=1) # Across columns
Output:
breed
Labrador 22.5
Poodle 21.0