Skip to main content

Merging Data

Updated Aug 17, 2021 ·

Overview

In data analysis, merging combines two datasets into one based on a common column or index. Pandas provides powerful functions like merge() for this purpose.

Example tables:

Table A:

xyz
11020
21525
32030

Table B:

xyw
150100
260110
470120

First, make sure to create the dataframes:

import pandas as pd

table_a = pd.DataFrame({
'x': [1, 2, 3],
'y': [10, 15, 20],
'z': [20, 25, 30]
})

table_b = pd.DataFrame({
'x': [1, 2, 4],
'y': [50, 60, 70],
'w': [100, 110, 120]
})

To merge Table A and Table B on column "x":

merged_data = table_a.merge(table_b, on='x')
print(merged_data)

Output:

xy_xzy_yw
1102050100
2152560110
info

To see more examples, please see Merging Data with Pandas Notebook.

Set Custom Suffixes

To distinguish columns with the same name, you can use the suffixes parameter.

merged_data_custom = table_a.merge(table_b, on='x', suffixes=('_A', '_B'))

print(merged_data_custom)

Output:

xy_Azy_Bw
1102050100
2152560110

Merging Multiple DataFrames

Merging multiple DataFrames allows you to combine related data from different tables into a single table for analysis. This is useful when data is spread across multiple sources.

To merge two tables:

merged_df = df1.merge(df2, on='id')

To merge three tables:

merged_df = df1.merge(df2, on='id') \
.merge(df3, on='id')

To merge four tables (and so on):

merged_df = df1.merge(df2, on='id') \
.merge(df3, on='id') \
.merge(df4, on='id')
info

To see more examples, please see Merging Data with Pandas Notebook.

Left Join

A left join keeps all rows from the left table and only matching rows from the right table.

Example: Two tables, "left" and "right," merged on column C

  • All rows from "left" are included
  • Only matching rows from "right" are included
  • If no match, right-side values are null

As an example, we'll use the following tables:

  • movies

    • Contains movie details

    • Title, popularity, and unique ID

      movie_idtitlepopularity
      1Inception90.5
      2Interstellar87.3
      3The Dark Knight95.0
      4Tenet75.2
      5Dunkirk80.4
      6Memento70.1
  • taglines

    • Contains movie taglines:

    • Movie ID and tagline text

      movie_idtagline
      1"Your mind is the scene of the crime."
      2"Mankind was born on Earth. It was never meant to die here."
      3"Welcome to a world without rules."
      5"When 400,000 men couldn’t get home, home came for them."

To combine the tables using a left join:

  • Merge on the movie_id column
  • Use how='left' to specify a left join
  • Movies without a matching tagline get a null value (NaN in pandas)

The code:

import pandas as pd

movies = pd.DataFrame({
'movie_id': [1, 2, 3, 4, 5, 6],
'title': ["Inception", "Interstellar", "The Dark Knight", "Tenet", "Dunkirk", "Memento"],
'popularity': [90.5, 87.3, 95.0, 75.2, 80.4, 70.1]
})

taglines = pd.DataFrame({
'movie_id': [1, 2, 3, 5],
'tagline': [
"Your mind is the scene of the crime.",
"Mankind was born on Earth. It was never meant to die here.",
"Welcome to a world without rules.",
"When 400,000 men couldn’t get home, home came for them."
]
})

merged_table = movies.merge(taglines, on="movie_id", how="left")
print(merged_table)

Expected output:

movie_idtitlepopularitytagline
1Inception90.5"Your mind is the scene of the crime."
2Interstellar87.3"Mankind was born on Earth. It was never meant to die here."
3The Dark Knight95.0"Welcome to a world without rules."
4Tenet75.2NaN (no match found)
5Dunkirk80.4"When 400,000 men couldn’t get home, home came for them."
6Memento70.1NaN (no match found)

To count the number of rows with missing taglines, we can use isnull() function to find the rows then count them using sum().

missing = merged_table['tagline'].isnull().sum()
print(missing)

This will return the number of rows, which is only two rows.

Right Join

A right join returns all rows from the right table and only matching rows from the left table. If no match is found, columns from the left table will be null.

Example tables:

  • movies

    idtitlepopularity
    1Inception90.5
    2Interstellar87.3
    3The Dark Knight95.0
    4Tenet75.2
    5Dunkirk80.4
    6Memento70.1
  • tv_genre

    movie_idgenre
    3Action
    5War
    7TV Movie
    8TV Movie

We use pandas to perform a right join. Note that the movies table uses id, while the tv_genre table uses movie_id.

import pandas as pd

movies = pd.DataFrame({
'id': [1, 2, 3, 4, 5, 6],
'title': ["Inception", "Interstellar", "The Dark Knight", "Tenet", "Dunkirk", "Memento"],
'popularity': [90.5, 87.3, 95.0, 75.2, 80.4, 70.1]
})

tv_genre = pd.DataFrame({
'movie_id': [3, 5, 7, 8],
'genre': ["Action", "War", "TV Movie", "TV Movie"]
})


# Performing the Right Join
merged_table = movies.merge(tv_genre, left_on="id", right_on="movie_id", how="right")
print(merged_table)

Expected output:

idtitlepopularitymovie_idgenre
3The Dark Knight95.03Action
5Dunkirk80.45War
NaNNaNNaN7TV Movie
NaNNaNNaN8TV Movie

Self Join

Merging a table to itself, also called a self join, helps link related data within the same table. Self joins are useful for:

  • Hierarchical data (e.g., employees and managers)
  • Sequential relationships (e.g., logistics tracking)
  • Graph data (e.g., social networks)

As an example, let's use the 'sequels` table which contains:

  • movie_id: Unique ID for each movie
  • title: Movie name
  • sequel: ID of its sequel (if any)
movie_idtitlesequel
862Toy Story863
863Toy Story 210193
10193Toy Story 3NULL

To display movies with their sequels in one row, we merge the table to itself.

import pandas as pd  

sequels = pd.DataFrame({
"movie_id": [862, 863, 10193],
"title": ["Toy Story", "Toy Story 2", "Toy Story 3"],
"sequel": [863, 10193, None]
})

# Self merge
merged = sequels.merge(sequels,
left_on="sequel",
right_on="movie_id",
suffixes=("_org", "_seq"))

# Print only these 2 columns
result = merged[["title_org", "title_seq"]]
print(result)

Output:

     title_org     title_seq
0 Toy Story Toy Story 2
1 Toy Story 2 Toy Story 3

Merging on Indexes

Merging tables isn't limited to columns. We can also merge using indexes, which are often unique identifiers in DataFrames.

Table with an Index

Tables usually have an auto-incrementing index, but we can also set a specific column (like id) as the index.

import pandas as pd  

movies = pd.DataFrame({
"id": [1, 2, 3, 4, 5, 6, 7, 8],
"title": ["Inception", "Titanic", "Avatar", "Interstellar", "The Dark Knight", "Pulp Fiction", "Forrest Gump", "The Matrix"]
}).set_index("id") # Setting 'id' as the index

print(movies)

Output:

                title  
id
1 Inception
2 Titanic
3 Avatar
4 Interstellar
5 The Dark Knight
6 Pulp Fiction
7 Forrest Gump
8 The Matrix

Setting an Index

We can set the index when reading a CSV using index_col in read_csv().

movies = pd.read_csv("movies.csv", index_col="id")  

This ensures id is the index when loading the table.

Merging on Index

We can merge two tables using their index instead of a column. If the index name is id, the merge method accepts it just like a column name.

Create the taglines table:

taglines = pd.DataFrame({  
"id": [1, 2, 3, 4, 5, 6, 7, 8],
"tagline": [
"Your mind is the scene of the crime.",
"Nothing on Earth could come between them.",
"Enter the world of Pandora.",
"Mankind was born on Earth. It was never meant to die here.",
"Welcome to a world without rules.",
"You won't know the facts until you've seen the fiction.",
"Life is like a box of chocolates.",
"Reality is a thing of the past."
]
}).set_index("id")

Now merge the taglines table to the movies table.

merged_df = movies.merge(taglines, on="id", how="left")  
print(merged_df)

Output:

                title                                         tagline  
id
1 Inception Your mind is the scene of the crime.
2 Titanic Nothing on Earth could come between them.
3 Avatar Enter the world of Pandora.
4 Interstellar Mankind was born on Earth. It was never meant to die here.
5 The Dark Knight Welcome to a world without rules.
6 Pulp Fiction You won't know the facts until you've seen the fiction.
7 Forrest Gump Life is like a box of chocolates.
8 The Matrix Reality is a thing of the past.

Here, id is the index, and the output retains the same structure, just with id as the index.

MultiIndex Merge

Some tables have multiple levels of indexing, such as movie and cast IDs. We can merge on multiple index levels just like merging on multiple columns.

Create the samuel and cast tables.

samuel = pd.DataFrame({  
"movie_id": [1, 2, 3, 4, 5, 6, 7, 8],
"cast_id": [101, 102, 103, 104, 105, 106, 107, 108],
"actor": ["Samuel L. Jackson"] * 8
}).set_index(["movie_id", "cast_id"])

cast = pd.DataFrame({
"movie_id": [1, 2, 3, 4, 5, 6, 7, 8],
"cast_id": [101, 102, 103, 104, 105, 106, 107, 108],
"role": ["Agent", "Detective", "Soldier", "Explorer", "Vigilante", "Hitman", "Runner", "Hacker"]
}).set_index(["movie_id", "cast_id"])

Merge the cast table to the samuel table using the two indexes.

merged_df = samuel.merge(cast, on=["movie_id", "cast_id"], how="inner")  
print(merged_df)

Output:

                               actor        role  
movie_id cast_id
1 101 Samuel L. Jackson Agent
2 102 Samuel L. Jackson Detective
3 103 Samuel L. Jackson Soldier
4 104 Samuel L. Jackson Explorer
5 105 Samuel L. Jackson Vigilante
6 106 Samuel L. Jackson Hitman
7 107 Samuel L. Jackson Runner
8 108 Samuel L. Jackson Hacker

Since it's an inner join, only matching movie_id and cast_id pairs appear in the result.

Index Merge with left_on and right_on

If index names differ between tables, we use left_on and right_on while setting left_index and right_index to True.

movies_to_genres = pd.DataFrame({  
"movie_id": [1, 2, 3, 4, 5, 6, 7, 8],
"genre": ["Sci-Fi", "Romance", "Sci-Fi", "Sci-Fi", "Action", "Crime", "Drama", "Sci-Fi"]
}).set_index("movie_id")

merged_df = movies.merge(movies_to_genres,
left_on="id",
right_on="movie_id",
left_index=True,
right_index=True)
print(merged_df)

Output:

                title     genre  
id
1 Inception Sci-Fi
2 Titanic Romance
3 Avatar Sci-Fi
4 Interstellar Sci-Fi
5 The Dark Knight Action
6 Pulp Fiction Crime
7 Forrest Gump Drama
8 The Matrix Sci-Fi

This tells Pandas to match the id index of movies with the movie_id index of movies_to_genres.