Slicing and Indexing

Updated Aug 17, 2021 ·

Explore the Jupyter Notebooks

This page includes my notes on this topic.
To see these functions in action, check out the Jupyter notebook here: Sample Notebooks.

Explicit Indexes

Indexes provide an efficient way to manage and subset data in pandas DataFrames. They offer efficient filtering, but they have limitations: they add complexity and break "tidy data" rules.

Consider the dataset below:

       Name             Breed           Color   Visit Date  Weight
    Ted           Labrador          Black  2023-01-10      30
 Stella   Golden Retriever          Golden  2023-01-12      25
    Ted          Chow Chow          Brown  2023-01-15      28
  Robin             Poodle          White  2023-01-18      20
    Ted           Labrador          Black  2023-01-20      32
 Stella   Golden Retriever          Golden  2023-01-22      26

Indexing operations:

Accessing Columns and Index

You can access column names and row indexes directly.

print(dogs.columns)  # Access column names
print(dogs.index)    # Access row index

Output:

Index(['Name', 'Breed', 'Color', 'Visit Date', 'Weight'], dtype='object')
RangeIndex(start=0, stop=6, step=1)

Setting an Index

Set a column (e.g., Name) as the index.

dogs_indexed = dogs.set_index("Name")
print(dogs_indexed)

Output:

            Breed           Color   Visit Date  Weight
Name                                                
Ted      Labrador          Black  2023-01-10      30
Stella   Golden Retriever  Golden  2023-01-12      25
Ted      Chow Chow          Brown  2023-01-15      28
Robin    Poodle            White   2023-01-18      20
Ted      Labrador          Black  2023-01-20      32
Stella   Golden Retriever  Golden  2023-01-22      26

Resetting the Index

Reset the index to move it back to the DataFrame body.

dogs_reset = dogs_indexed.reset_index()
print(dogs_reset)

Output:

      Name             Breed           Color   Visit Date  Weight
    Ted           Labrador          Black  2023-01-10      30
 Stella   Golden Retriever          Golden  2023-01-12      25
    Ted          Chow Chow          Brown  2023-01-15      28
  Robin             Poodle          White  2023-01-18      20
    Ted           Labrador          Black  2023-01-20      32
 Stella   Golden Retriever          Golden  2023-01-22      26

Dropping an Index

Remove the index entirely using the drop argument.

dogs_dropped = dogs_indexed.reset_index(drop=True)
print(dogs_dropped)

Output:

            Breed           Color   Visit Date  Weight
    Labrador          Black  2023-01-10      30
 Golden Retriever  Golden  2023-01-12      25
    Chow Chow          Brown  2023-01-15      28
    Poodle            White   2023-01-18      20
    Labrador          Black  2023-01-20      32
 Golden Retriever  Golden  2023-01-22      26

Subsetting with Indexes

Indexes make filtering data simpler. For example, filter rows where Name is "Ted."

ted_rows = dogs_indexed.loc["Ted"]
print(ted_rows)

Output:

            Breed   Color   Visit Date  Weight
Name                                            
Ted      Labrador  Black  2023-01-10      30
Ted      Chow Chow  Brown  2023-01-15      28
Ted      Labrador  Black  2023-01-20      32

Multi-Level Indexing

Set multiple columns (e.g., Breed and Color) as the index.

dogs_multi = dogs.set_index(["Breed", "Color"])
print(dogs_multi)

Output:

                              Name   Visit Date  Weight
Breed            Color                                    
Labrador         Black               Ted  2023-01-10      30
Golden Retriever Golden           Stella  2023-01-12      25
Chow Chow        Brown               Ted  2023-01-15      28
Poodle           White             Robin  2023-01-18      20
Labrador         Black               Ted  2023-01-20      32
Golden Retriever Golden           Stella  2023-01-22      26

Subsetting Multi-Level Indexes

Subset rows using outer and inner indexes.

Outer Index: Filter rows for "Labrador."

labs = dogs_multi.loc["Labrador"]
print(labs)

Output:

            Name   Visit Date  Weight
Color                                  
Black           Ted  2023-01-10      30
Black           Ted  2023-01-20      32

Inner Index: Filter rows for specific combinations (e.g., Labrador and Black).

brown_labs = dogs_multi.loc[[("Labrador", "Black")]]
print(brown_labs)

Output:

                        Name   Visit Date  Weight
Breed      Color                                  
Labrador   Black             Ted  2023-01-10      30
Labrador   Black             Ted  2023-01-20      32

Sorting Indexes

Sort rows by index values using sort_index().

sorted_dogs = dogs_multi.sort_index()
print(sorted_dogs)

Output:

                              Name   Visit Date  Weight
Breed            Color                                    
Chow Chow        Brown               Ted  2023-01-15      28
Golden Retriever Golden           Stella  2023-01-12      25
Golden Retriever Golden           Stella  2023-01-22      26
Labrador         Black               Ted  2023-01-10      30
Labrador         Black               Ted  2023-01-20      32
Poodle           White             Robin  2023-01-18      20

Slicing and Subsetting

Slicing is a way to select consecutive elements or subsets of data from lists and DataFrames. pandas provides versatile methods like .loc and .iloc for this purpose.

Consider the dogs data:

       breed     color  age  weight date_of_birth
 Labrador     Brown    3      25    2016-01-01
Chow Chow       Tan    4      18    2015-06-15
   Poodle     White    2      12    2017-03-20
 Labrador     Black    7      30    2014-09-10
   Beagle     Brown    5      10    2015-12-01
Chihuahua      Grey    1       3    2018-05-05

Operations:

Slicing Lists

You can slice a list by specifying start and end positions with a colon. The end position is exclusive.

breeds = ["Labrador", "Poodle", "Chow Chow", "Golden Retriever", "Labrador", "Chihuahua"]
print(breeds[2:5])  # Slice from index 2 to 4
print(breeds[:3])   # Slice first three elements

Output:

['Chow Chow', 'Golden Retriever', 'Labrador']
['Labrador', 'Poodle', 'Chow Chow']

Sort Index Before Slicing

To slice rows in a DataFrame, sort the index first.

dogs = dogs.set_index(["Breed", "Color"]).sort_index()
print(dogs)

Output:

                           Name Visit Date
Breed          Color                    
Chow Chow      Brown       Ted 2023-01-15
Golden Retriever Golden   Stella 2023-01-12
Labrador        Black       Ted 2023-01-10
Labrador        Black       Ted 2023-01-20

Slicing Outer Index Level

Use .loc[] to slice rows by index values. The last value is included.

print(dogs.loc["Labrador":"Poodle"])

Output:

                           Name Visit Date
Breed          Color                    
Labrador       Black       Ted 2023-01-10
Labrador       Black       Ted 2023-01-20

Slicing Inner Index Levels

For inner index levels, use tuples for start and end positions.

print(dogs.loc[("Labrador", "Black"):("Poodle", "White")])

Output:

                           Name Visit Date
Breed          Color                    
Labrador       Black       Ted 2023-01-10

Slicing Columns

Slice columns while keeping all rows by passing a colon for rows.

print(dogs.loc[:, "Name":"Visit Date"])

Output:

                           Name Visit Date
Breed          Color                    
Chow Chow      Brown       Ted 2023-01-15
Golden Retriever Golden   Stella 2023-01-12

Slicing Rows and Columns

Simultaneously slice rows and columns by specifying slices for both.

print(dogs.loc["Labrador":"Poodle", "Name":"Visit Date"])

Output:

                           Name Visit Date
Breed          Color                    
Labrador       Black       Ted 2023-01-10

Slicing by Dates

Set a date column as the index to slice by date ranges.

dogs = dogs.set_index("Visit Date").sort_index()
print(dogs.loc["2023-01-10":"2023-01-15"])

Output:

               Name            Breed   Color
Visit Date                                  
2023-01-10     Ted         Labrador   Black
2023-01-12  Stella Golden Retriever Golden

Partial Date Slicing

Slice by year only to include all dates within the range.

print(dogs.loc["2023"])

Output:

               Name            Breed   Color
Visit Date                                  
2023-01-10     Ted         Labrador   Black
2023-01-12  Stella Golden Retriever Golden

Subsetting by Position with .iloc

Use .iloc for slicing by row and column numbers. The end value is exclusive.

print(dogs.iloc[1:3, 0:2])

Output:

               Name            Breed
Visit Date                          
2023-01-12  Stella Golden Retriever
2023-01-15     Ted         Chow Chow

Pivoting and Summary Stats

Pivot tables are created using .pivot_table(). The values argument specifies the column to aggregate, while index and columns define row and column grouping. The default aggregation is mean.

Consider the dogs data:

        breed     color  age  weight date_of_birth  height
  Labrador     Brown    3      25    2016-01-01      23
 Chow Chow       Tan    4      18    2015-06-15      20
    Poodle     White    2      12    2017-03-20      20
  Labrador     Black    7      30    2014-09-10      23
    Beagle     Brown    5      10    2015-12-01      21
 Chihuahua      Grey    1       3    2018-05-05      18
    Poodle     Brown    3      20    2017-03-20      22
  Labrador     Black    6      28    2014-09-10      24
    Beagle     White    4      12    2015-12-01      21
 Chow Chow       Tan    5      17    2015-06-15      19

Pivoting

You can create a pivot table to group and aggregate data by rows and columns.

dogs.pivot_table(values='height', index='breed', columns='color', aggfunc='mean')

Output:

color         Black  Brown  Grey   Tan  White
breed
Beagle        NaN    21.0   NaN   NaN   21.0
Chihuahua     NaN    NaN    18.0   NaN   NaN
Chow Chow     NaN    NaN    NaN    19.5  NaN
Labrador      23.5   23.0   NaN    NaN   NaN
Poodle        NaN    22.0   NaN    NaN   20.0

.loc[] + slicing is a power combo

Pivot tables are DataFrames with sorted indexes, so slicing and .loc[] functionality can be used for subsetting.
```
dogs.loc[dogs['breed'] == 'Poodle', ['height']]
```
Output:
```
  height
2	20
6	22
```
The axis argument

Use the axis argument to specify whether to calculate summary stats across rows (axis=0, default) or columns (axis=1).
```
dogs.mean(axis="index")  # Across rows
```
Output:
```
height   22.5
```
Calculating summary stats across columns

Set axis=1 to calculate summary stats across columns, useful when all columns contain the same data type.
```
dogs.mean(axis=1)  # Across columns
```
Output:
```
breed
Labrador  22.5
Poodle    21.0
```

Explore the Jupyter Notebooks​

Explicit Indexes​

Slicing and Subsetting​

Pivoting and Summary Stats​

Explore the Jupyter Notebooks

Explicit Indexes

Slicing and Subsetting

Pivoting and Summary Stats