Advanced Merging

Updated Aug 17, 2021 ·

Filtering Joins

Filtering joins help select rows from one table based on matches (or lack of matches) in another table. Unlike mutating joins, they don't combine columns but filter data instead.

Semi Join

A semi join keeps only the rows from the left table that have a match in the right table. To perform a semi-join:

Merge the left and right tables on key column using an inner join.
Check if key column in left table is in the merged table using isin().
Subset the rows of the left table.

Consider the two tables below:

genres (genre ID, name)
top_tracks (track ID, genre ID)

import pandas as pd

genres = pd.DataFrame({
    'gid': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'genre': ['Rock', 'Pop', 'Jazz', 'Hip-Hop', 'Classical', 'EDM', 'Country', 
              'Blues', 'Metal', 'Reggae', 'Folk', 'Soul']
})

top_tracks = pd.DataFrame({
    'tid': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'gid': [1, 2, 2, 3, 6, 6, 7, 8, 10, 12]  # Some genres appear in top_tracks
})

Use inner join to find matching gids, then use the isin() method to compare every gid in the genres table to the gid in the top_tracks table.

genres_tracks = genres.merge(top_tracks, on='gid', how='inner')

# Below line returns a table of boolean values
matchinggids = genres['gid'].isin(genres_tracks['gid'])

# Instead of boolean values, we can get the actual values by subsetting the genres table.
top_genres = genres[genres['gid'].isin(genres_tracks['gid'])    ]
print(top_genres)

Output:

      gid      genre
  1      Rock
  2       Pop
  3      Jazz
  6       EDM
  7   Country
  8     Blues
 10    Reggae
12      Soul

Anti Join

An anti join keeps only the rows from the left table that do not have a match in the right table.

Using the previous example, we can use anti join to find which genres are not in the table of top tracks. Setting indicator to True, a _merge column will be added which will show the source of each row, which means some rows might be a match from both tables, while some rows may only be found in the left table.

genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)

gid_list = genres_tracks.loc[
                    genres_tracks['_merge'] == 'left_only',
                   'gid']

Finally, use the isin() method to filter only for the rows with gid in the gid_list.

non_top_genres = genres[genres['gid'].isin(gid_list)]
print(not_top_genres)

Output:

   gid      genre
  4   Hip-Hop
  5 Classical
  9    Metal
11     Folk

Concatenating Tables Vertically

Sometimes, data for different periods is stored in separate tables. To analyze them together, we need to combine them into one.

The tables have the same column names.
We use pandas.concat() to stack them vertically. '

Note that the Pandas concat() method can concatenate both vertical and horizontal.

Basic Concatenation

Consider the three sample invoice data from January to March.

import pandas as pd 

# Invoice data for three months
inv_jan = pd.DataFrame([
    [101, 1, '2019-01-10', 100], 
    [102, 2, '2019-01-15', 200], 
    [103, 3, '2019-01-20', 150]
], columns=['iid', 'cid', 'invoice_date', 'total'])

inv_feb = pd.DataFrame([
    [104, 4, '2019-02-05', 170], 
    [105, 5, '2019-02-12', 250], 
    [106, 6, '2019-02-18', 300], 
    [107, 7, '2019-02-25', 190]
], columns=['iid', 'cid', 'invoice_date', 'total'])

inv_mar = pd.DataFrame([
    [108, 8, '2019-03-03', 210], 
    [109, 9, '2019-03-10', 280], 
    [110, 10, '2019-03-17', 190], 
    [111, 11, '2019-03-24', 240], 
    [112, 12, '2019-03-30', 270]
], columns=['iid', 'cid', 'invoice_date', 'total'])

To concatenate these tables:

combined = pd.concat([inv_jan, inv_feb, inv_mar])
print(combined)

Output:

   iid  cid invoice_date  total
101    1   2019-01-10    100
102    2   2019-01-15    200
103    3   2019-01-20    150
104    4   2019-02-05    170
105    5   2019-02-12    250
106    6   2019-02-18    300
107    7   2019-02-25    190
108    8   2019-03-03    210
109    9   2019-03-10    280
110   10   2019-03-17    190
111   11   2019-03-24    240
112   12   2019-03-30    270

Ignoring Index

By default, the index values from the original tables are kept. If the index contains no valuable information, we can ignore it using the ignore_index.

combined_reset = pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=True)
print(combined_reset)

The result will be that the index will go from zero to N-1.

    iid  cid invoice_date  total
 101    1   2019-01-10    100
 102    2   2019-01-15    200
 103    3   2019-01-20    150
 104    4   2019-02-05    170
 105    5   2019-02-12    250
 106    6   2019-02-18    300
 107    7   2019-02-25    190
 108    8   2019-03-03    210
 109    9   2019-03-10    280
 110   10   2019-03-17    190
111   11   2019-03-24    240
112   12   2019-03-30    270

Add Labels to Original Tables

We can add labels to track which table each row came from using the keys argument.

combined_keys = pd.concat(
              [inv_jan, inv_feb, inv_mar], 
              ignore_index=False,
              keys=['Jan', 'Feb', 'Mar'])
print(combined_keys)

The result is a multi-index table, with label as the first level.

       iid  cid invoice_date  total
Jan 0  101    1   2019-01-10    100
    1  102    2   2019-01-15    200
    2  103    3   2019-01-20    150
Feb 0  104    4   2019-02-05    170
    1  105    5   2019-02-12    250
    2  106    6   2019-02-18    300
    3  107    7   2019-02-25    190
Mar 0  108    8   2019-03-03    210
    1  109    9   2019-03-10    280
    2  110   10   2019-03-17    190
    3  111   11   2019-03-24    240
    4  112   12   2019-03-30    270

Tables with Different Columns

If tables have different columns, concat() includes all columns by default. Missing values appear as NaN.

Let's say the February invoice table has an additional column called 'billing_country`.

inv_feb = pd.DataFrame([
    [104, 4, '2019-02-05', 170, 'US'], 
    [105, 5, '2019-02-12', 250, 'CA'], 
    [106, 6, '2019-02-18', 300, 'UK'], 
    [107, 7, '2019-02-25', 190, 'FR']
], columns=['iid', 'cid', 'invoice_date', 'total', 'billing_country'])

combined_diff = pd.concat([inv_jan, inv_feb])
print(combined_diff)

Output:

   iid  cid invoice_date  total billing_country
101    1   2019-01-10    100             NaN
102    2   2019-01-15    200             NaN
103    3   2019-01-20    150             NaN
104    4   2019-02-05    170              US
105    5   2019-02-12    250              CA
106    6   2019-02-18    300              UK
107    7   2019-02-25    190              FR

You can also sort the different column names alphabetically using the sort argument.

sort_diff = pd.concat([inv_jan, inv_feb], sort=True)
print(sort_diff)

Output:

  billing_country  cid  iid invoice_date  total
           NaN    1  101   2019-01-10    100
           NaN    2  102   2019-01-15    200
           NaN    3  103   2019-01-20    150
            US    4  104   2019-02-05    170
            CA    5  105   2019-02-12    250
            UK    6  106   2019-02-18    300
            FR    7  107   2019-02-25    190

Keep Only Matching Columns

To keep only the columns that all tables have in common, set join="inner". The default value is outer, which is why the concat method will always return columns including the non-matching ones.

combined_inner = pd.concat([inv_jan, inv_feb], join="inner")
print(combined_inner)

Now, the billing_country column is gone, and only the common columns remain.

   iid  cid invoice_date  total
101    1   2019-01-10    100
102    2   2019-01-15    200
103    3   2019-01-20    150
104    4   2019-02-05    170
105    5   2019-02-12    250
106    6   2019-02-18    300
107    7   2019-02-25    190

Verifying integrity

When merging or concatenating tables, we need to ensure that data relationships are correct. Python provides ways to check for duplicates and unexpected structures.

Why Verify Data?

Prevents errors due to unexpected duplicates.
Ensures calculations (like averages) are accurate.
Helps maintain data consistency.

If an error occurs, clean the data or remove duplicates before merging or concatenating.

Checking Merges

We use merge() to combine tables based on a common column. However, unexpected duplicates can turn a one-to-one merge into a one-to-many. The validate argument helps detect such issues.

one_to_one: Ensures each value appears only once in both tables.
one_to_many: Allows duplicates in the right table but not in the left.

Consider the two tables below:

import pandas as pd

tracks = pd.DataFrame([
    [1, 'Song A'], 
    [2, 'Song B'], 
    [3, 'Song C']
], columns=['tid', 'track'])

specs = pd.DataFrame([
    [1, 'MP3'], 
    [2, 'FLAC'], 
    [2, 'WAV'], 
    [3, 'AAC']
], columns=['tid', 'format'])

When we attempt to do a one-to-one merge, this will return an error.

merged = pd.merge(tracks, specs, on='tid', validate='one_to_one')

Output:

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

Solution: Use one-to-many merging.

merged = pd.merge(tracks, specs, on='tid', validate='one_to_many')
print(merged)

Output:

   tid   track format
  1  Song A    MP3
  2  Song B   FLAC
  2  Song B    WAV
  3  Song C    AAC 

Checking Concatenations

The concat() method stacks tables vertically. The verify_integrity argument checks for duplicate index values and raises an error if found.

Conside the two invoice tables below.

inv_feb = pd.DataFrame([
    [8, 1, '2019-02-10', 100], 
    [9, 2, '2019-02-15', 200]
], columns=['iid', 'cid', 'invoice_date', 'total']).set_index('iid')

inv_mar = pd.DataFrame([
    [9, 3, '2019-03-05', 250],  # Duplicate iid=9
    [10, 4, '2019-03-10', 300]
], columns=['iid', 'cid', 'invoice_date', 'total']).set_index('iid')

When we set the verify_integrity to True, it will check if there are duplicate records between the two tables.

combined = pd.concat([inv_feb, inv_mar], verify_integrity=True)

Expected error:

ValueError: Indexes have overlapping values

Solution: Ignore Integrity Check

combined = pd.concat([inv_feb, inv_mar], verify_integrity=False)
print(combined)

Output:

     cid invoice_date  total
iid                         
8      1   2019-02-10    100
9      2   2019-02-15    200
9      3   2019-03-05    250
10     4   2019-03-10    300

Filtering Joins​

Semi Join​

Anti Join​

Concatenating Tables Vertically​

Basic Concatenation​

Ignoring Index​

Add Labels to Original Tables​

Tables with Different Columns​

Keep Only Matching Columns​

Verifying integrity​

Checking Merges​

Checking Concatenations​

Filtering Joins

Semi Join

Anti Join

Concatenating Tables Vertically

Basic Concatenation

Ignoring Index

Add Labels to Original Tables

Tables with Different Columns

Keep Only Matching Columns

Verifying integrity

Checking Merges

Checking Concatenations