pandas: check whether a string column in one DataFrame contains a pair of strings from another DataFrame


Consider this vectorized approach:

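    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    vect = CountVectorizer()
    # Bag-of-words counts for each consumption string (rows: df1, columns: vocabulary)
    X = vect.fit_transform(df1.consumption)
    # Encode each "creature food" pair over the same vocabulary
    Y = vect.transform(df2.creature + ' ' + df2.food)
    # X.dot(Y.T)[i, j] counts tokens shared by df1 row i and df2 pair j; > 1 means both pair words occur
    res = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))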

Result:

    In [67]: res
    Out[67]: array([ True, False,  True, False, False,  True, False,  True, False], dtype=bool)
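For readers who want to run the approach end to end, here is a minimal, self-contained sketch. The contents of `df1` and `df2` are not listed in this post, so the values below are reconstructed from the matrices and output it prints, and the assignment of words to the `creature` and `food` columns is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Assumed sample data, reconstructed from the output shown in this post
df1 = pd.DataFrame({"consumption": [
    "squirrel ate apple", "monkey likes apple", "monkey banana gets",
    "badger gets banana", "giraffe eats grass", "badger apple loves",
    "elephant is huge", "elephant eats banana tree", "squirrel digs in grass",
]})
df2 = pd.DataFrame({
    "creature": ["squirrel", "badger", "monkey", "elephant"],  # assumed column contents
    "food": ["apple", "apple", "banana", "banana"],            # assumed column contents
})

vect = CountVectorizer()
X = vect.fit_transform(df1["consumption"])               # one row per consumption string
Y = vect.transform(df2["creature"] + " " + df2["food"])  # one row per (creature, food) pair

shared = X.dot(Y.T).toarray()    # shared[i, j]: tokens common to string i and pair j
res = (shared > 1).any(axis=1)   # True where some pair has both words in the string
print(res)
```

With this data, `res` should reproduce the array shown above. The sketch uses `.toarray()` instead of `.todense()` only so that the result is a plain NumPy array rather than a `np.matrix`.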

Explanation:

    # get_feature_names_out() was named get_feature_names() in scikit-learn < 1.0
    In [68]: pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
    Out[68]:
       apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
    0      1    1       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
    1      1    0       0       0     0     0         0     0        0      0     0   0   0      1      0       1         0     0
    2      0    0       0       1     0     0         0     1        0      0     0   0   0      0      0       1         0     0
    3      0    0       1       1     0     0         0     1        0      0     0   0   0      0      0       0         0     0
    4      0    0       0       0     0     1         0     0        1      1     0   0   0      0      0       0         0     0
    5      1    0       1       0     0     0         0     0        0      0     0   0   0      0      1       0         0     0
    6      0    0       0       0     0     0         1     0        0      0     1   0   1      0      0       0         0     0
    7      0    0       0       1     0     1         1     0        0      0     0   0   0      0      0       0         0     1
    8      0    0       0       0     1     0         0     0        0      1     0   1   0      0      0       0         1     0

    In [69]: pd.DataFrame(Y.toarray(), columns=vect.get_feature_names_out())
    Out[69]:
       apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
    0      1    0       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
    1      1    0       1       0     0     0         0     0        0      0     0   0   0      0      0       0         0     0
    2      0    0       0       1     0     0         0     0        0      0     0   0   0      0      0       1         0     0
    3      0    0       0       1     0     0         1     0        0      0     0   0   0      0      0       0         0     0
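To see why the `> 1` test works, the product can be computed by hand from the two tables above. This sketch rebuilds both matrices with plain NumPy; the `row` helper is hypothetical and only encodes each string as 0/1 indicators over the vocabulary.

```python
import numpy as np

# Vocabulary, in the column order of the tables above
columns = ["apple", "ate", "badger", "banana", "digs", "eats", "elephant",
           "gets", "giraffe", "grass", "huge", "in", "is", "likes", "loves",
           "monkey", "squirrel", "tree"]

def row(text):
    """Hypothetical helper: 0/1 indicator row for the words in `text`."""
    words = text.split()
    return [int(c in words) for c in columns]

X = np.array([row(s) for s in [
    "squirrel ate apple", "monkey likes apple", "monkey banana gets",
    "badger gets banana", "giraffe eats grass", "badger apple loves",
    "elephant is huge", "elephant eats banana tree", "squirrel digs in grass",
]])
Y = np.array([row(s) for s in [
    "squirrel apple", "badger apple", "monkey banana", "elephant banana",
]])

scores = X @ Y.T                   # scores[i, j]: words shared by string i and pair j
match = (scores > 1).any(axis=1)   # a score of 2 means both words of a pair are present
print(scores)
```

Row 0 scores 2 against the first pair because both `squirrel` and `apple` occur in it. Row 1 scores only 1 against three different pairs, so it is not a match.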

Update:

    In [92]: df1['match'] = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))

    In [93]: df1
    Out[93]:
                     consumption  match
    0         squirrel ate apple   True
    1         monkey likes apple  False
    2         monkey banana gets   True
    3         badger gets banana  False
    4         giraffe eats grass  False
    5         badger apple loves   True
    6           elephant is huge  False
    7  elephant eats banana tree   True
    8     squirrel digs in grass  False
    9        squirrel.eats/apple   True   # <----- NOTE
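The row marked NOTE matches because the default `CountVectorizer` analyzer treats punctuation as a token boundary. The sketch below checks that with `build_analyzer()`. It also illustrates a caveat that follows from using raw counts: a repeated word can push the score above 1 on its own. Passing `binary=True` is one possible way to avoid that. The example strings and the `max_score` helper are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Punctuation acts as a separator in the default analyzer
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer("squirrel.eats/apple")

# Made-up strings: "apple" is repeated, "squirrel" is absent
docs = ["apple pie with apple sauce"]
pairs = ["squirrel apple"]

def max_score(vectorizer):
    """Hypothetical helper: fit on docs, return the highest doc-vs-pair dot product."""
    X = vectorizer.fit_transform(docs)
    Y = vectorizer.transform(pairs)
    return int(X.dot(Y.T).toarray().max())

score_counts = max_score(CountVectorizer())             # raw term counts
score_binary = max_score(CountVectorizer(binary=True))  # presence/absence only
print(tokens, score_counts, score_binary)
```

Here `score_counts` is 2, so the `> 1` test would report a match even though `squirrel` never appears, whereas `score_binary` is 1.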


Original article: http://outofmemory.cn/zaji/5632162.html
