nlkt() loops through each row 3 times.
def nlkt(val):
    val = repr(val)
    clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
    nopunc = [char for char in str(clean_txt) if char not in string.punctuation]
    nonum = [char for char in nopunc if not char.isdigit()]
    words_string = ''.join(nonum)
    return words_string
Also, each time you call nlkt(), these are re-initialized again and again:

stopwords.words('english')
string.punctuation

They should be made global:
stoplist = stopwords.words('english') + list(string.punctuation)
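As a side note, it also pays to convert the combined stoplist to a set, so each membership check is constant-time instead of a linear scan. A minimal sketch (using a small hard-coded stand-in for NLTK's English stopword list, so it runs without the corpus download):

```python
import string

# Stand-in for stopwords.words('english') -- a few entries for illustration.
STOPWORDS = ['is', 'and', 'the', 'this']

# Built once at module level, not on every call.
stoplist = set(STOPWORDS + list(string.punctuation))

def is_kept(word):
    # Membership test against a set is O(1) on average.
    return word.lower() not in stoplist

print([w for w in 'This is foo and bar'.split() if is_kept(w)])
# -> ['foo', 'bar']
```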
Going through it line by line:
val=repr(val)
I'm not sure why this is necessary. But you could easily cast the column to
str type; this should be done outside of the preprocessing function.

Hopefully this is self-explanatory:
>>> import pandas as pd
>>> df = pd.DataFrame([[0, 1, 2], [2, 'xyz', 4], [5, 'abc', 'def']])
>>> df
   0    1    2
0  0    1    2
1  2  xyz    4
2  5  abc  def
>>> df[1]
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> df[1].astype(str)
0      1
1    xyz
2    abc
Name: 1, dtype: object
>>> list(df[1])
[1, 'xyz', 'abc']
>>> list(df[1].astype(str))
['1', 'xyz', 'abc']
Now, moving on to the next line:
clean_txt = [word for word in val.split() if word.lower() not in stopwords.words('english')]
Using str.split() is awkward; a proper tokenizer should be used instead.
Otherwise, your punctuation might get stuck to the preceding word, e.g.
>>> from nltk.corpus import stopwords
>>> from nltk import word_tokenize
>>> import string
>>> stoplist = stopwords.words('english') + list(string.punctuation)
>>> stoplist = set(stoplist)
>>> text = 'This is foo, bar and doh.'
>>> [word for word in text.split() if word.lower() not in stoplist]
['foo,', 'bar', 'doh.']
>>> [word for word in word_tokenize(text) if word.lower() not in stoplist]
['foo', 'bar', 'doh']
Also, the .isdigit() check can be folded into the same condition:
>>> text = 'This is foo, bar, 234, 567 and doh.'
>>> [word for word in word_tokenize(text) if word.lower() not in stoplist and not word.isdigit()]
['foo', 'bar', 'doh']
Putting it all together, your nlkt() should look like this:
def preprocess(text):
    return [word for word in word_tokenize(text)
            if word.lower() not in stoplist and not word.isdigit()]
And you can use DataFrame.apply:
data['Anylize_Text'].apply(preprocess)
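For reference, here is a minimal end-to-end sketch of the whole pipeline. To keep it runnable without NLTK's corpus downloads, a small regex tokenizer and a hard-coded stopword list stand in for word_tokenize and stopwords.words('english'); the sample column values are made up for illustration:

```python
import re
import string
import pandas as pd

# Stand-ins for the NLTK pieces, for illustration only.
STOPWORDS = ['this', 'is', 'and']
stoplist = set(STOPWORDS + list(string.punctuation))

def tokenize(text):
    # Crude stand-in for nltk.word_tokenize: words, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text):
    # Drop stopwords, punctuation, and purely numeric tokens in one pass.
    return [word for word in tokenize(text)
            if word.lower() not in stoplist and not word.isdigit()]

df = pd.DataFrame({'Anylize_Text': ['This is foo, bar and 42.',
                                    'Another doh!']})
print(df['Anylize_Text'].apply(preprocess))
# Row 0 -> ['foo', 'bar']; row 1 -> ['Another', 'doh']
```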