Sample:df1 = pd.Dataframe({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 'Age': [34, 18, 44, 27, 30]})#print (df1)df3 = df1.copy()df2 = pd.Dataframe({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 'Sex': ['M', 'M', 'F', 'M', 'F']})#print (df2)
使用map由
Series创建人
set_index:
df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])print (df1) Name Age Sex0 Tom 34 M1 Sara 18 NaN2 Eva 44 F3 Jack 27 M4 Laura 30 NaN
df = df3.merge(df2[['Name','Sex']], on='Name', how='left')print (df) Name Age Sex0 Tom 34 M1 Sara 18 NaN2 Eva 44 F3 Jack 27 M4 Laura 30 NaN
如果需要通过多列映射(例如Year和Code),则需要merge左连接:
df1 = pd.Dataframe({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 'Year':[2000,2003,2003,2004,2007], 'Code':[1,2,3,4,4], 'Age': [34, 18, 44, 27, 30]})print (df1) Name Year Code Age0 Tom 2000 1 341 Sara 2003 2 182 Eva 2003 3 443 Jack 2004 4 274 Laura 2007 4 30df2 = pd.Dataframe({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 'Sex': ['M', 'M', 'F', 'M', 'F'], 'Year':[2001,2003,2003,2004,2007], 'Code':[1,2,3,5,3], 'Val':[21,34,23,44,67]})print (df2) Name Sex Year Code Val0 Tom M 2001 1 211 Paul M 2003 2 342 Eva F 2003 3 233 Jack M 2004 5 444 Michelle F 2007 3 67
#merge by all columnsdf = df1.merge(df2, on=['Year','Code'], how='left')print (df) Name_x Year Code Age Name_y Sex Val0 Tom 2000 1 34 NaN NaN NaN1 Sara 2003 2 18 Paul M 34.02 Eva 2003 3 44 Eva F 23.03 Jack 2004 4 27 NaN NaN NaN4 Laura 2007 4 30 NaN NaN NaN#specified columns - columns for join (Year, Code) need always + appended columns (Val)df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')print (df) Name Year Code Age Val0 Tom 2000 1 34 NaN1 Sara 2003 2 18 34.02 Eva 2003 3 44 23.03 Jack 2004 4 27 NaN4 Laura 2007 4 30 NaN
如果获取错误map意味着按连接列重复,则在这里Name:
df1 = pd.Dataframe({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 'Age': [34, 18, 44, 27, 30]})print (df1) Name Age0 Tom 341 Sara 182 Eva 443 Jack 274 Laura 30df3, df4 = df1.copy(), df1.copy()df2 = pd.Dataframe({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 'Val': [1,2,3,4,5]})print (df2) Name Val0 Tom 1 <-duplicated name Tom1 Tom 2 <-duplicated name Tom2 Eva 33 Jack 44 Michelle 5s = df2.set_index('Name')['Val']df1['New'] = df1['Name'].map(s)print (df1)
InvalidIndexError:重新索引仅对唯一值的Index对象有效
解决方案通过删除重复项Dataframe.drop_duplicates,或dict在最后一次重复匹配中使用map by :
#default keep first values = df2.drop_duplicates('Name').set_index('Name')['Val']print (s)NameTom 1Eva 3Jack 4Michelle 5Name: Val, dtype: int64df1['New'] = df1['Name'].map(s)print (df1) Name Age New0 Tom 34 1.01 Sara 18 NaN2 Eva 44 3.03 Jack 27 4.04 Laura 30 NaN
#add parameter for keep last value s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']print (s)NameTom 2Eva 3Jack 4Michelle 5Name: Val, dtype: int64df3['New'] = df3['Name'].map(s)print (df3) Name Age New0 Tom 34 2.01 Sara 18 NaN2 Eva 44 3.03 Jack 27 4.04 Laura 30 NaN
#map by dictionaryd = dict(zip(df2['Name'], df2['Val']))print (d){'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}df4['New'] = df4['Name'].map(d)print (df4) Name Age New0 Tom 34 2.01 Sara 18 NaN2 Eva 44 3.03 Jack 27 4.04 Laura 30 NaN
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)