This is a bit hard to explain but I'm going to try my best. What I've got right now is two tables I need to join together, but we don't really have a unique join id. I have a couple columns to join on that is the best I can do, and I just want to know when we don't have equal numbers on both sides of the joins. Right now, if the right table has 1 match to the 2 entries on the left table, that 1 match joins to both entries. That leaves me not knowing the right table only has 1 entry vs the 2 for the left.
What I want is to join a right table to a left (outer), but I don't want to join the right table more than once per entry. So if the right table index 3 could be joined on index 1 and 2 on the left, I only want it to be joined on index 1. Also, if index 3 and index 4 could be joined on index 1 and 2, I want index 1 to be matched with index 3, and index 2 to be matched with index 4. If there is only 1 match (Index 1 --> 3), but Index 2 on the left table could be matched to index 3, I want Index 2 to not be joined.
Examples may best describe this:
a_df = pd.DataFrame.from_dict({1: {'match_id': 2, 'uniq_id': 1}, 2: {'match_id': 2, 'uniq_id': 2}}, orient='index') In [99]: a_df Out[99]: match_id uniq_id 1 2 1 2 2 2 In [100]: b_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 2, 'uniq_id': 4}}, orient='index') In [101]: b_df Out[101]: match_id uniq_id 3 2 3 4 2 4
In this example, I want a_df to join onto b_df. I want b_df uniq_id 3 to be matched with a_df uniq_id 1, and b_df 4 to a_df 2.
Output would look like this:
Out[106]: match_id_right match_id uniq_id uniq_id_right 1 2 2 1 3 2 2 2 2 4
Now let's say we want to join a_df to c_df:
In [104]: c_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 3, 'uniq_id': 4}}, orient='index') In [105]: c_df Out[105]: match_id uniq_id 3 2 3 4 3 4
In this case, we have match_ids of 2 on a_df, and only 1 match_id of 2 on c_df.
In this case I just want uniq_id 1 to be matched with uniq_id 3, leaving both uniq_id 2 and uniq_id 4 to be unmatched
match_id_right match_id uniq_id uniq_id_right 1 2 2 1 3 2 NaN 2 2 NaN 4 3 NaN NaN 4
1 Answers
Answers 1
Alright guys, so the answer is actually pretty simple.
What you need to do is group each dataframe (left, right) by the matching column(s), and then add a new counter column for each group.
Now you do the outer join and include the counter column, so you'll match on 0,1, but if the right has a 2 then it doesn't match. If the left only has 0, it will match the right but if the right has 0,1, then the right '1' entry doesn't match!
Edit: Code request.
I don't have anything handy, but it is very simple. If you have, say, 2 columns you're matching on ['amount','date'], then you simply do a
left_df['Helper'] = left_df.groupby(['amount','date']).cumcount() right_df['RHelper'] = right_df.groupby(['amount','date']).cumcount()
Then use the Helper column in the join.
0 comments:
Post a Comment