Skip to main content

Advanced Merging

Updated Aug 17, 2021 ·

Filtering Joins

Filtering joins help select rows from one table based on matches (or lack of matches) in another table. Unlike mutating joins, they don't combine columns but filter data instead.

Semi Join

A semi join keeps only the rows from the left table that have a match in the right table. To perform a semi-join:

  • Merge the left and right tables on key column using an inner join.
  • Check if key column in left table is in the merged table using isin().
  • Subset the rows of the left table.

Consider the two tables below:

  • genres (genre ID, name)
  • top_tracks (track ID, genre ID)
import pandas as pd

genres = pd.DataFrame({
'gid': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'genre': ['Rock', 'Pop', 'Jazz', 'Hip-Hop', 'Classical', 'EDM', 'Country',
'Blues', 'Metal', 'Reggae', 'Folk', 'Soul']
})

top_tracks = pd.DataFrame({
'tid': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
'gid': [1, 2, 2, 3, 6, 6, 7, 8, 10, 12] # Some genres appear in top_tracks
})

Use inner join to find matching gids, then use the isin() method to compare every gid in the genres table to the gid in the top_tracks table.

genres_tracks = genres.merge(top_tracks, on='gid', how='inner')

# Below line returns a table of boolean values
matchinggids = genres['gid'].isin(genres_tracks['gid'])

# Instead of boolean values, we can get the actual values by subsetting the genres table.
top_genres = genres[genres['gid'].isin(genres_tracks['gid']) ]
print(top_genres)

Output:

      gid      genre
0 1 Rock
1 2 Pop
2 3 Jazz
5 6 EDM
6 7 Country
7 8 Blues
9 10 Reggae
11 12 Soul

Anti Join

An anti join keeps only the rows from the left table that do not have a match in the right table.

Using the previous example, we can use anti join to find which genres are not in the table of top tracks. Setting indicator to True, a _merge column will be added which will show the source of each row, which means some rows might be a match from both tables, while some rows may only be found in the left table.

genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)

gid_list = genres_tracks.loc[
genres_tracks['_merge'] == 'left_only',
'gid']

Finally, use the isin() method to filter only for the rows with gid in the gid_list.

non_top_genres = genres[genres['gid'].isin(gid_list)]
print(not_top_genres)

Output:

   gid      genre
3 4 Hip-Hop
4 5 Classical
8 9 Metal
10 11 Folk

Concatenating Tables Vertically

Sometimes, data for different periods is stored in separate tables. To analyze them together, we need to combine them into one.

  • The tables have the same column names.
  • We use pandas.concat() to stack them vertically. '

Note that the Pandas concat() method can concatenate both vertical and horizontal.

Basic Concatenation

Consider the three sample invoice data from January to March.

import pandas as pd 

# Invoice data for three months
inv_jan = pd.DataFrame([
[101, 1, '2019-01-10', 100],
[102, 2, '2019-01-15', 200],
[103, 3, '2019-01-20', 150]
], columns=['iid', 'cid', 'invoice_date', 'total'])

inv_feb = pd.DataFrame([
[104, 4, '2019-02-05', 170],
[105, 5, '2019-02-12', 250],
[106, 6, '2019-02-18', 300],
[107, 7, '2019-02-25', 190]
], columns=['iid', 'cid', 'invoice_date', 'total'])

inv_mar = pd.DataFrame([
[108, 8, '2019-03-03', 210],
[109, 9, '2019-03-10', 280],
[110, 10, '2019-03-17', 190],
[111, 11, '2019-03-24', 240],
[112, 12, '2019-03-30', 270]
], columns=['iid', 'cid', 'invoice_date', 'total'])

To concatenate these tables:

combined = pd.concat([inv_jan, inv_feb, inv_mar])
print(combined)

Output:

   iid  cid invoice_date  total
0 101 1 2019-01-10 100
1 102 2 2019-01-15 200
2 103 3 2019-01-20 150
0 104 4 2019-02-05 170
1 105 5 2019-02-12 250
2 106 6 2019-02-18 300
3 107 7 2019-02-25 190
0 108 8 2019-03-03 210
1 109 9 2019-03-10 280
2 110 10 2019-03-17 190
3 111 11 2019-03-24 240
4 112 12 2019-03-30 270

Ignoring Index

By default, the index values from the original tables are kept. If the index contains no valuable information, we can ignore it using the ignore_index.

combined_reset = pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=True)
print(combined_reset)

The result will be that the index will go from zero to N-1.

    iid  cid invoice_date  total
0 101 1 2019-01-10 100
1 102 2 2019-01-15 200
2 103 3 2019-01-20 150
3 104 4 2019-02-05 170
4 105 5 2019-02-12 250
5 106 6 2019-02-18 300
6 107 7 2019-02-25 190
7 108 8 2019-03-03 210
8 109 9 2019-03-10 280
9 110 10 2019-03-17 190
10 111 11 2019-03-24 240
11 112 12 2019-03-30 270

Add Labels to Original Tables

We can add labels to track which table each row came from using the keys argument.

combined_keys = pd.concat(
[inv_jan, inv_feb, inv_mar],
ignore_index=False,
keys=['Jan', 'Feb', 'Mar'])
print(combined_keys)

The result is a multi-index table, with label as the first level.

       iid  cid invoice_date  total
Jan 0 101 1 2019-01-10 100
1 102 2 2019-01-15 200
2 103 3 2019-01-20 150
Feb 0 104 4 2019-02-05 170
1 105 5 2019-02-12 250
2 106 6 2019-02-18 300
3 107 7 2019-02-25 190
Mar 0 108 8 2019-03-03 210
1 109 9 2019-03-10 280
2 110 10 2019-03-17 190
3 111 11 2019-03-24 240
4 112 12 2019-03-30 270

Tables with Different Columns

If tables have different columns, concat() includes all columns by default. Missing values appear as NaN.

Let's say the February invoice table has an additional column called 'billing_country`.

inv_feb = pd.DataFrame([
[104, 4, '2019-02-05', 170, 'US'],
[105, 5, '2019-02-12', 250, 'CA'],
[106, 6, '2019-02-18', 300, 'UK'],
[107, 7, '2019-02-25', 190, 'FR']
], columns=['iid', 'cid', 'invoice_date', 'total', 'billing_country'])

combined_diff = pd.concat([inv_jan, inv_feb])
print(combined_diff)

Output:

   iid  cid invoice_date  total billing_country
0 101 1 2019-01-10 100 NaN
1 102 2 2019-01-15 200 NaN
2 103 3 2019-01-20 150 NaN
0 104 4 2019-02-05 170 US
1 105 5 2019-02-12 250 CA
2 106 6 2019-02-18 300 UK
3 107 7 2019-02-25 190 FR

You can also sort the different column names alphabetically using the sort argument.

sort_diff = pd.concat([inv_jan, inv_feb], sort=True)
print(sort_diff)

Output:

  billing_country  cid  iid invoice_date  total
0 NaN 1 101 2019-01-10 100
1 NaN 2 102 2019-01-15 200
2 NaN 3 103 2019-01-20 150
0 US 4 104 2019-02-05 170
1 CA 5 105 2019-02-12 250
2 UK 6 106 2019-02-18 300
3 FR 7 107 2019-02-25 190

Keep Only Matching Columns

To keep only the columns that all tables have in common, set join="inner". The default value is outer, which is why the concat method will always return columns including the non-matching ones.

combined_inner = pd.concat([inv_jan, inv_feb], join="inner")
print(combined_inner)

Now, the billing_country column is gone, and only the common columns remain.

   iid  cid invoice_date  total
0 101 1 2019-01-10 100
1 102 2 2019-01-15 200
2 103 3 2019-01-20 150
0 104 4 2019-02-05 170
1 105 5 2019-02-12 250
2 106 6 2019-02-18 300
3 107 7 2019-02-25 190

Verifying integrity

When merging or concatenating tables, we need to ensure that data relationships are correct. Python provides ways to check for duplicates and unexpected structures.

Why Verify Data?

  • Prevents errors due to unexpected duplicates.
  • Ensures calculations (like averages) are accurate.
  • Helps maintain data consistency.

If an error occurs, clean the data or remove duplicates before merging or concatenating.

Checking Merges

We use merge() to combine tables based on a common column. However, unexpected duplicates can turn a one-to-one merge into a one-to-many. The validate argument helps detect such issues.

  • one_to_one: Ensures each value appears only once in both tables.
  • one_to_many: Allows duplicates in the right table but not in the left.

Consider the two tables below:

import pandas as pd

tracks = pd.DataFrame([
[1, 'Song A'],
[2, 'Song B'],
[3, 'Song C']
], columns=['tid', 'track'])

specs = pd.DataFrame([
[1, 'MP3'],
[2, 'FLAC'],
[2, 'WAV'],
[3, 'AAC']
], columns=['tid', 'format'])

When we attempt to do a one-to-one merge, this will return an error.

merged = pd.merge(tracks, specs, on='tid', validate='one_to_one')

Output:

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

Solution: Use one-to-many merging.

merged = pd.merge(tracks, specs, on='tid', validate='one_to_many')
print(merged)

Output:

   tid   track format
0 1 Song A MP3
1 2 Song B FLAC
2 2 Song B WAV
3 3 Song C AAC

Checking Concatenations

The concat() method stacks tables vertically. The verify_integrity argument checks for duplicate index values and raises an error if found.

Conside the two invoice tables below.

inv_feb = pd.DataFrame([
[8, 1, '2019-02-10', 100],
[9, 2, '2019-02-15', 200]
], columns=['iid', 'cid', 'invoice_date', 'total']).set_index('iid')

inv_mar = pd.DataFrame([
[9, 3, '2019-03-05', 250], # Duplicate iid=9
[10, 4, '2019-03-10', 300]
], columns=['iid', 'cid', 'invoice_date', 'total']).set_index('iid')

When we set the verify_integrity to True, it will check if there are duplicate records between the two tables.

combined = pd.concat([inv_feb, inv_mar], verify_integrity=True)

Expected error:

ValueError: Indexes have overlapping values

Solution: Ignore Integrity Check

combined = pd.concat([inv_feb, inv_mar], verify_integrity=False)
print(combined)

Output:

     cid invoice_date  total
iid
8 1 2019-02-10 100
9 2 2019-02-15 200
9 3 2019-03-05 250
10 4 2019-03-10 300