Pandas

Updated Aug 17, 2021 ·

Dictionaries

A dictionary in Python is a collection of key-value pairs, where each key is unique and is mapped to a value. It allows you to store and retrieve data efficiently using the key.

Keys in a dictionary must be unique and immutable.
Values in a dictionary can be of any data type.

Consider the sample dictionary for Asia, where the keys are countries and the values are their capitals:

asia = {
    'china': 'beijing',
    'india': 'new delhi',
    'japan': 'tokyo',
    'south korea': 'seoul',
    'thailand': 'bangkok',
    'malaysia': 'kuala lumpur',
    'singapore': 'singapore',
    'indonesia': 'jakarta',
    'vietnam': 'hanoi',
    'philippines': 'manila'
}

To get the keys in the dictionary:

print(asia.keys()) 

Output:

dict_keys(['china', 'india', 'japan', 'south korea', 'thailand', 'malaysia', 'singapore', 'indonesia', 'vietnam', 'philippines'])

To print out the value that belongs to "Thailand":

print(asia['thailand']) 

Output:

bangkok

Dictionaries are immutable in the sense that their keys cannot be changed once set. Unlike lists, which allow you to modify elements, dictionaries require you to add or remove key-value pairs to modify their contents.

Modifying Dictionaries

To add an entry to a dictionary, you can simply assign a value to a new key. Using the previous example, we can add "North Korea":

asia["north korea"] = "pyongyang"
print(asia)

Output:

{
  'china': 'beijing', 
  'india': 'new delhi', 
  'japan': 'tokyo', 
  'south korea': 'seoul', 
  'thailand': 'bangkok', 
  'malaysia': 'kuala lumpur', 
  'singapore': 'singapore', 
  'indonesia': 'jakarta', 
  'vietnam': 'hanoi', 
  'philippines': 'manila', 
  'north korea': 'pyongyang'
}

To verify if "North Korea" has been added:

"north korea" in asia

This will return:

True

To delete an entry from the dictionary, use the del keyword:

del asia["north korea"]
print(asia)

Output:

{
  'china': 'beijing', 
  'india': 'new delhi', 
  'japan': 'tokyo', 
  'south korea': 'seoul', 
  'thailand': 'bangkok', 
  'malaysia': 'kuala lumpur', 
  'singapore': 'singapore', 
  'indonesia': 'jakarta', 
  'vietnam': 'hanoi', 
  'philippines': 'manila', 
}

Pandas

Data scientists often work with large datasets, usually in a table format like a spreadsheet. To manage such data in Python, a rectangular data structure is needed. While 2D NumPy arrays are an option, they aren't suited for datasets with mixed data types.

Consider the BRICS table below.

Country	Capital	Area (million km²)	Population (millions)
Brazil	Brasília	8.5	211
Russia	Moscow	17.1	144
India	New Delhi	3.3	1380
China	Beijing	9.6	1393
South Africa	Pretoria	1.2	58

For these cases, the Pandas library is a better choice. It is built on NumPy and it provides advanced tools for data manipulation. In Pandas, tabular data is stored in a DataFrame.

Creating a DataFrame

From a Dictionary of Lists

You can create a DataFrame from a dictionary, where keys are column labels, and values are lists of column data. For example:

import pandas as pd

data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasília", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.5, 17.1, 3.3, 9.6, 1.2],
    "population": [211, 144, 1380, 1393, 58]
}

brics = pd.DataFrame(data)
print(brics)

You can manually set row labels:

brics.index = ["BR", "RU", "IN", "CH", "SA"]
print(brics)

The result is a DataFrame version of the BRICS table.

        country      capital  area (million km²)  population (millions)  
BR       Brazil     Brasília                 8.5                   211  
RU       Russia       Moscow                17.1                   144  
IN        India    New Delhi                 3.3                  1380  
CH        China      Beijing                 9.6                  1393  
SA  South Africa    Pretoria                 1.2                    58  

From a List of Dictionaries

You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row, and the keys correspond to column labels. This approach is helpful when data is naturally structured as rows.

import pandas as pd

# Data represented as a list of dictionaries
data = [
    {"country": "Brazil", "capital": "Brasília", "area": 8.5, "population": 211},
    {"country": "Russia", "capital": "Moscow", "area": 17.1, "population": 144},
    {"country": "India", "capital": "New Delhi", "area": 3.3, "population": 1380},
    {"country": "China", "capital": "Beijing", "area": 9.6, "population": 1393},
    {"country": "South Africa", "capital": "Pretoria", "area": 1.2, "population": 58},
]

brics = pd.DataFrame(data)
brics.index = ["BR", "RU", "IN", "CH", "SA"]
print(brics)

The result is the same BRICS table as shown previously:

        country      capital  area  population
BR       Brazil     Brasília   8.5         211
RU       Russia       Moscow  17.1         144
IN        India    New Delhi   3.3        1380
CH        China      Beijing   9.6        1393
SA  South Africa     Pretoria   1.2          58

From a CSV File

To work with large datasets, it's easier to import them from external files. Let's say you have a CSV file called brics.csv containing the details below:

country,capital,area,population
Brazil,Brasília,8.5,211
Russia,Moscow,17.1,144
India,New Delhi,3.3,1380
China,Beijing,9.6,1393
South Africa,Pretoria,1.2,58

You can use the read_csv function from the Pandas library to load this data into a DataFrame.

brics = pd.read_csv("brics.csv", index_col=0)

The read_csv function reads the CSV file, while the index_col parameter specifies that the first column (country) should be used as row labels (indexes).

From DataFrame to CSV

You can also export a DataFrame to a CSV file using the to_csv method. This allows you to save your data in a widely used format for sharing or further analysis.

brics.tocsv("/path/to/brics_new.csv")

Retrieve Single Column

To select a single column, use square brackets. Using the previous example:

        country      capital  area (million km²)  population (millions)  
BR       Brazil     Brasília                 8.5                   211  
RU       Russia       Moscow                17.1                   144  
IN        India    New Delhi                 3.3                  1380  
CH        China      Beijing                 9.6                  1393  
SA  South Africa    Pretoria                 1.2                    58  

To select just the country column:

brics["country"]

This returns the column as a Pandas Series, a labeled 1D array. You can verify this by using the type function:

type(brics["country"]) 

Output:

pandas.core.series.Series 

To keep the column as a DataFrame, use double brackets:

brics[["country"]]

Output:

        country 
BR       Brazil 
RU       Russia 
IN        India 
CH        China 
SA  South Africa

Checking the type:

type(brics[["country"]]) 

Output:

pandas.core.frame.DataFrame 

Retrieve Multiple Columns

You can also select multiple columns by passing a list of column labels inside double brackets:

brics[["country", "capital"]]

Output:

        country      capital
BR       Brazil     Brasília  
RU       Russia       Moscow 
IN        India    New Delhi 
CH        China      Beijing 
SA  South Africa    Pretoria 

Retrieve Rows

Using the same example:

        country      capital  area (million km²)  population (millions)  
BR       Brazil     Brasília                 8.5                   211  
RU       Russia       Moscow                17.1                   144  
IN        India    New Delhi                 3.3                  1380  
CH        China      Beijing                 9.6                  1393  
SA  South Africa    Pretoria                 1.2                    58  

To select rows, use slicing. For example, to get the 2nd, 3rd, and 4th rows:

brics[1:4]

Remember, slicing in Python is zero-indexed (which means first index is zero) and the end index is exclusive (which means the 2nd argument specified is nto included.)

This will return:

        country      capital  area (million km²)  population (millions)  
RU       Russia       Moscow                17.1                   144  
IN        India    New Delhi                 3.3                  1380  
CH        China      Beijing                 9.6                  1393  

Using `loc`

loc allows you to select rows and columns using labels (the first column)

Single Row: Select Russia's row by its label:
```
brics.loc["RU"]
```
To keep it as a DataFrame, use double brackets:
```
brics.loc[["RU"]]
```
Multiple Rows: Select rows for Russia, India, and China:
```
brics.loc[["RU", "IN", "CN"]]
```
Rows & Columns: Select specific rows and columns, like country and capital:
```
brics.loc[["RU", "IN"], ["country", "capital"]]
```
All Rows, Some Columns: Use : to include all rows:
```
brics.loc[:, ["country", "capital"]]
```

Using `iloc`

iloc uses positions instead of labels.

Single Row: Select the second row:

brics.iloc[1]

Output:

        country      capital  area (million km²)  population (millions)  
RU       Russia       Moscow                17.1                   144  

Multiple Rows: Select rows 2, 3, and 4:

brics.iloc[1:4]

Output:

        country      capital  area (million km²)  population (millions)  
RU       Russia       Moscow                17.1                   144  
IN        India    New Delhi                 3.3                  1380  
CH        China      Beijing                 9.6                  1393  

Rows & Columns: Select specific rows and columns by position:

brics.iloc[1:4, [0, 1]]

Output:

        country      capital
BR       Brazil     Brasília
RU       Russia       Moscow 
IN        India    New Delhi  
CH        China      Beijing   
SA  South Africa    Pretoria  

All Rows, Some Columns: Include all rows but only certain columns:

brics.iloc[:, [0, 1]]

Output:

        country      capital  
RU       Russia       Moscow   
IN        India    New Delhi   
CH        China      Beijing   

Filtering

Consider the previous example:

import pandas as pd

data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasília", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.5, 17.1, 3.3, 9.6, 1.2],
    "population": [211, 144, 1380, 1393, 58]
}

brics = pd.DataFrame(data)
brics.index = ["BR", "RU", "IN", "CH", "SA"]
print(brics) 

Output:

        country      capital  area (million km²)  population (millions)  
BR       Brazil     Brasília                 8.5                   211  
RU       Russia       Moscow                17.1                   144  
IN        India    New Delhi                 3.3                  1380  
CH        China      Beijing                 9.6                  1393  
SA  South Africa    Pretoria                 1.2                    58  

To select the countries with area over 8 million square kilometers:

Select the area column

brics["area"] 

Output:

BR     8.5
RU    17.1
IN     3.3
CH     9.6
SA     1.2
Name: area, dtype: float64 

Perform a comparison

brics["area"] > 8

Output:

BR     True
RU     True
IN    False
CH     True
SA    False
Name: area, dtype: bool

Use result to select the countries.

brics[brics["area"] > 8] 

Output:

    country	capital	  area	population
BR	Brazil	Brasília	8.5	  211
RU	Russia	Moscow	  17.1	144
CH	China	  Beijing	  9.6	  1393

Boolean Operators

Since Pandas is built on top of NumPy, we can use operational operators (like < and >=), as well as boolean operators (and, or, and not). To do boolean operation, use"

np.logical_and()
np.logical_or()
np.logical_not()

Examples:

To get the countries with areas larger than 8 million km² but smaller than 100 million :

import numpy as np
np.logical_and(
  brics["area"] > 8,
  brics["area"] < 10
) 

Output:

BR     True
RU    False
IN    False
CH     True
SA    False
Name: area, dtype: bool 

To display/subset the specific countries:

import numpy as np
np.logical_and(
  brics["area"] > 8,
  brics["area"] < 10
) 

Output:

    country	capital	area	population
BR	Brazil	Brasília	    8.5	211
CH	China	  Beijing	      9.6	1393

Dictionaries​

Modifying Dictionaries​

Pandas​

Creating a DataFrame​

From a Dictionary of Lists​

From a List of Dictionaries​

From a CSV File​

From DataFrame to CSV​

Retrieve Single Column​

Retrieve Multiple Columns​

Retrieve Rows​

Using loc​

Using iloc​

Filtering​

Boolean Operators​

Dictionaries

Modifying Dictionaries

Pandas

Creating a DataFrame

From a Dictionary of Lists

From a List of Dictionaries

From a CSV File

From DataFrame to CSV

Retrieve Single Column

Retrieve Multiple Columns

Retrieve Rows

Using `loc`

Using `iloc`

Filtering

Boolean Operators