Tuesday, March 15, 2016

Pandas join without replacement

Leave a Comment

This is a bit hard to explain but I'm going to try my best. What I've got right now is two tables I need to join together, but we don't really have a unique join id. I have a couple columns to join on that is the best I can do, and I just want to know when we don't have equal numbers on both sides of the joins. Right now, if the right table has 1 match to the 2 entries on the left table, that 1 match joins to both entries. That leaves me not knowing the right table only has 1 entry vs the 2 for the left.

What I want is to join a right table to a left (outer), but I don't want to join the right table more than once per entry. So if the right table index 3 could be joined on index 1 and 2 on the left, I only want it to be joined on index 1. Also, if index 3 and index 4 could be joined on index 1 and 2, I want index 1 to be matched with index 3, and index 2 to be matched with index 4. If there is only 1 match (Index 1 --> 3), but Index 2 on the left table could be matched to index 3, I want Index 2 to not be joined.

Examples may best describe this:

a_df = pd.DataFrame.from_dict({1: {'match_id': 2, 'uniq_id': 1}, 2: {'match_id': 2, 'uniq_id': 2}}, orient='index')  In [99]: a_df Out[99]:    match_id  uniq_id 1         2        1 2         2        2   In [100]: b_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 2, 'uniq_id': 4}}, orient='index')  In [101]: b_df Out[101]:    match_id  uniq_id 3         2        3 4         2        4 

In this example, I want a_df to join onto b_df. I want b_df uniq_id 3 to be matched with a_df uniq_id 1, and b_df 4 to a_df 2.

Output would look like this:

Out[106]:    match_id_right  match_id  uniq_id  uniq_id_right 1               2         2        1              3 2               2         2        2              4 

Now let's say we want to join a_df to c_df:

In [104]: c_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 3, 'uniq_id': 4}}, orient='index')  In [105]: c_df Out[105]:    match_id  uniq_id 3         2        3 4         3        4 

In this case, we have match_ids of 2 on a_df, and only 1 match_id of 2 on c_df.

In this case I just want uniq_id 1 to be matched with uniq_id 3, leaving both uniq_id 2 and uniq_id 4 to be unmatched

   match_id_right  match_id  uniq_id  uniq_id_right 1               2         2        1              3 2             NaN         2        2            NaN 4               3       NaN      NaN              4 

1 Answers

Answers 1

Alright guys, so the answer is actually pretty simple.

What you need to do is group each dataframe (left, right) by the matching column(s), and then add a new counter column for each group.

Now you do the outer join and include the counter column, so you'll match on 0,1, but if the right has a 2 then it doesn't match. If the left only has 0, it will match the right but if the right has 0,1, then the right '1' entry doesn't match!

Edit: Code request.

I don't have anything handy, but it is very simple. If you have, say, 2 columns you're matching on ['amount','date'], then you simply do a

left_df['Helper'] = left_df.groupby(['amount','date']).cumcount() right_df['RHelper'] = right_df.groupby(['amount','date']).cumcount() 

Then use the Helper column in the join.

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment