Short Text Classification (1): Building Word Vectors

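My goal is to use TensorFlow to build a classifier that can accurately classify news headlines.

First I need raw headline data, so I scraped nearly 100,000 news headlines from Toutiao (今日头条) to use in the training work that follows.

With the raw headlines in hand, I need to segment them into words to build a corpus; for segmentation I use the third-party library jieba.

The segmented corpus is then used to train word vectors with the Word2vec algorithm; here I use gensim's word2vec.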

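To recap the preparation steps:

My scraped data is stored in MySQL, so I query the headlines, segment them, and write the result to a corpus file: yuliao.txt.

jieba's segmentation is already quite good, but it is still not accurate enough for some trending new words, personal names and the like, so it is worth supplying jieba with a custom vocabulary.

In user_dict.txt I defined the words that jieba failed to segment correctly, and then loaded that dictionary into the program.

Running the load_data method then generates the corpus file.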
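A minimal sketch of what such a corpus-writing step might look like in Python 3. The function name write_corpus and its parameters are hypothetical, not the author's load_data, and the MySQL query is left out: the headlines are assumed to arrive as an iterable of strings. The segmenter is passed in as a callable so that the sketch does not depend on any particular library.

```python
def write_corpus(titles, corpus_path, segment):
    """Write one segmented title per line, tokens separated by spaces.

    `segment` is any callable that turns a string into a list of tokens
    (for example jieba.lcut). It is injected so this sketch has no hard
    dependency on a particular segmenter.
    """
    written = 0
    with open(corpus_path, "w", encoding="utf-8") as corpus_file:
        for title in titles:
            tokens = [tok.strip() for tok in segment(title)]
            tokens = [tok for tok in tokens if tok]  # drop whitespace-only tokens
            if tokens:
                corpus_file.write(" ".join(tokens) + "\n")
                written += 1
    return written  # number of lines actually written
```

With jieba installed, one would presumably call jieba.load_userdict("user_dict.txt") once and then pass jieba.lcut as the segment argument. The one-sentence-per-line, space-separated layout is the format that gensim's LineSentence reads.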

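Import gensim, load the corpus file, and start training the model.

The trained model is saved to a file, so that next time it can be loaded directly from disk without retraining.

To see how well the model performs, I ran the print_most_similar test method.

The results are acceptable; with a larger corpus they would be better.

In a trained model, words with similar meanings end up at nearby positions in the vector space, so training a classifier on word vectors essentially groups together sentences with similar meanings.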
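To make "nearby positions in the vector space" concrete, here is a small sketch of cosine similarity, the measure word2vec tools commonly use to compare word vectors. The three-dimensional vectors are made up purely for illustration; real word vectors have far more dimensions and come from training.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only
toy_vectors = {
    "football": [0.9, 0.1, 0.0],
    "basketball": [0.8, 0.2, 0.1],
    "sedan": [0.0, 0.2, 0.9],
}

sports_pair = cosine_similarity(toy_vectors["football"], toy_vectors["basketball"])
mixed_pair = cosine_similarity(toy_vectors["football"], toy_vectors["sedan"])
print(sports_pair > mixed_pair)  # the two sports words point in a more similar direction
```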

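In the end, of course, what we need is the vector representation of a given word.

The original word2vec code was written in C; there is also a Python version, which is integrated into the excellent gensim framework.

I use these features in my own open-source semantic network project, graph-mind (really just a small toy I wrote myself). You can use the wrappers I built there to carry out some operations with almost no effort. Below I share how to call them, along with some lessons learned from the code.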

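1. Some class member variables:

    import multiprocessing

    def __init__(self, modelPath, _size=100, _window=5, _minCount=1, _workers=multiprocessing.cpu_count()):
        self.modelPath = modelPath    # file on disk where the trained model is stored
        self._size = _size            # dimensionality of the word vectors
        self._window = _window        # size of the context window
        self._minCount = _minCount    # words rarer than this are ignored
        self._workers = _workers      # number of worker threads used for training

modelPath is the file on disk where the trained word2vec model is stored (a model that lives only in memory never feels safe). _size is the dimensionality of the word vectors, and _window is the size of the context window scanned during training. _minCount is passed to gensim as min_count, the frequency threshold below which words are ignored; I simply leave it at the default given here. _workers is passed as workers, the number of worker threads used for training, and defaults to the number of processor cores on the current machine. For now it is enough to keep these parameters in mind.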
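One practical caveat about these parameters: the keyword names used throughout this article follow the older gensim releases it was written against. gensim 4.0 renamed size to vector_size. The helper below is a hypothetical sketch (word2vec_kwargs is not part of gensim or of graph-mind) that builds the keyword arguments for either naming scheme.

```python
def word2vec_kwargs(size=100, window=5, min_count=1, workers=1, gensim_major=4):
    """Build keyword arguments for gensim's Word2Vec constructor.

    gensim_major is the major version of the installed gensim; from 4.0 on
    the vector dimensionality is passed as vector_size instead of size.
    """
    size_key = "vector_size" if gensim_major >= 4 else "size"
    return {
        size_key: size,
        "window": window,
        "min_count": min_count,
        "workers": workers,
    }
```

It would be used as Word2Vec(sentences, **word2vec_kwargs(...)), with gensim_major taken from something like int(gensim.__version__.split(".")[0]).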

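2. Initializing and training the word2vec model for the first time

The core function for this is initTrainWord2VecModel, which takes two parameters, corpusFilePath and safe_model: the path of the training corpus, and whether to use "safe mode" for the initial training. Safe mode is explained later; first, the code: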

    import warnings
    from gensim.models.word2vec import Word2Vec, LineSentence

    def initTrainWord2VecModel(self, corpusFilePath, safe_model=False):
        '''
        init and train a new w2v model
        (corpusFilePath can be the path of a corpus file or directory, or an
        already opened file object; later it may also accept sentences directly.

        about safe_model:
        if safe_model is true, training starts from the first file and the
        remaining files are fed in through incremental updates; this keeps
        memory usage safe but is slower.
        if safe_model is false, all corpus lines are loaded into one
        sentences list and trained in a single pass.)
        '''
        extraSegOpt().reLoadEncoding()

        fileType = localFileOptUnit.checkFileState(corpusFilePath)
        if fileType == u'error':
            warnings.warn('load file error!')
            return None

        # NOTE: size= is the keyword used by gensim < 4.0 (renamed to vector_size in 4.0)
        model = None
        if fileType == u'opened':
            print('training model from singleFile!')
            model = Word2Vec(LineSentence(corpusFilePath), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'file':
            corpusFile = open(corpusFilePath, u'r')
            print('training model from singleFile!')
            model = Word2Vec(LineSentence(corpusFile), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'directory':
            corpusFiles = localFileOptUnit.listAllFileInDirectory(corpusFilePath)
            print('training model from listFiles of directory!')
            if safe_model:
                # safe mode: train on the first file, then update with the rest one by one
                model = Word2Vec(LineSentence(corpusFiles[0]), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
                for file in corpusFiles[1:]:
                    model = self.updateW2VModelUnit(model, file)
            else:
                sentences = self.loadSetencesFromFiles(corpusFiles)
                model = Word2Vec(sentences, size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)
        elif fileType == u'other':
            # TODO add sentences list directly
            pass

        if model is None:
            # nothing was trained (e.g. the 'other' branch), so there is nothing to save
            return None

        model.save(self.modelPath)
        model.init_sims()
        print('producing word2vec model ... ok!')
        return model

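First comes some housekeeping: the function checks what kind of thing the input path refers to and handles each type differently, which should be easy to follow. Taking the case where corpusFilePath is an already opened file object, the code that creates the word2vec model is:

    model = Word2Vec(LineSentence(corpusFilePath), size=self._size, window=self._window, min_count=self._minCount, workers=self._workers)

It really is that simple; the function only grew as long as it is above to make the code more robust. The problem arises when a directory holds a very large number of training documents: loading everything into memory at once may not be a good idea (I have not studied whether gensim's Word2Vec constructor already accounts for this; it is just habitual caution). So I added the safe_model parameter to decide whether the initial training runs in "safe mode". In safe mode only the first corpus file is loaded at the start, and the remaining initial training documents are merged into the original model through incremental training.

In the code above, corpusFilePath can be an already opened file object, the path of a single file, or the path of a directory; the checkFileState function determines which. Another function, updateW2VModelUnit, performs the incremental training that updates the w2v model and is described in detail below. The loadSetencesFromFiles function loads all sentences from every corpus file in a directory; it is in the source code and is simple enough that I will not go into it.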
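On the memory concern: gensim's Word2Vec accepts any iterable of token lists, as long as it can be iterated more than once (one pass to build the vocabulary, further passes to train). A corpus spread over many files can therefore be streamed from disk instead of being loaded into one list. The class below is a minimal Python 3 sketch of such an iterable; DirectorySentences is a hypothetical name, and it assumes one whitespace-separated sentence per line in UTF-8 files.

```python
import os

class DirectorySentences:
    """Restartable iterable: one token list per non-empty line of every file in a directory."""

    def __init__(self, directory):
        self.directory = directory

    def __iter__(self):
        # Files are reopened on every pass, so the object can be iterated repeatedly
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:
                        yield tokens
```

An instance can be passed wherever the code above passes a sentences list. As far as I know, gensim also ships LineSentence (one file) and PathLineSentences (a directory) that stream in the same way, so a hand-written class like this is mainly useful when the file layout needs custom handling.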

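3. Incrementally training and updating the word2vec model

One reason for training the w2v model incrementally was mentioned above: to avoid loading the entire training corpus into memory at once. Another is to cope with a corpus that keeps growing. gensim naturally offers a solution for this, called as follows: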

    def updateW2VModelUnit(self, model, corpusSingleFilePath):
        '''
        (only can be a singleFile)
        '''
        fileType = localFileOptUnit.checkFileState(corpusSingleFilePath)
        if fileType == u'directory':
            warnings.warn('can not deal a directory!')
            return model

        # NOTE: written against an old gensim release; newer releases also require
        # total_examples and epochs arguments when calling train()
        if fileType == u'opened':
            trainedWordCount = model.train(LineSentence(corpusSingleFilePath))
            # str() is needed: concatenating a str with a number raises TypeError
            print('update model, update words num is: ' + str(trainedWordCount))
        elif fileType == u'file':
            corpusSingleFile = open(corpusSingleFilePath, u'r')
            trainedWordCount = model.train(LineSentence(corpusSingleFile))
            print('update model, update words num is: ' + str(trainedWordCount))
        else:
            # TODO add sentences list directly (same as last function)
            pass

        return model

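After a quick check of the file type, calling the model object's train method updates the model. The method takes the sentences of the new corpus and returns a count of the words it trained on (not the number of words newly added to the vocabulary). When the function finishes it returns the updated model. In the source code, below this function, there is an enhanced version that handles the same kinds of file argument as in section 2; I will not cover it here.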
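For newer gensim releases the update takes two calls rather than one: as far as I know, train on its own only adjusts vectors of words already in the vocabulary, so new words are first registered with build_vocab(..., update=True), and train must be told how many examples and epochs to run. The helper below is a sketch under those assumptions; update_word2vec is a hypothetical name, and model is expected to be a gensim Word2Vec instance.

```python
def update_word2vec(model, new_sentences):
    """Extend the vocabulary with new_sentences, then continue training on them.

    new_sentences must be iterable more than once (a list, or a LineSentence),
    because it is scanned once for the vocabulary and again for training.
    """
    # Register any words the model has not seen before
    model.build_vocab(new_sentences, update=True)
    # corpus_count was refreshed by build_vocab; reuse the model's own epoch setting
    return model.train(
        new_sentences,
        total_examples=model.corpus_count,
        epochs=model.epochs,
    )
```

The return value is whatever train returns, which in recent releases is a pair of word counts rather than a single number.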

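4. Basic queries

Once you are sure that the model is fully trained and will not be updated again, you can lock it. This precomputes the normalized vectors used by similarity queries, which is said to speed up later lookups, but from then on your model is read-only.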

    def finishTrainModel(self, modelFilePath=None):
        '''
        warning: after this, the model is read-only (can't be updated)
        '''
        if modelFilePath is None:
            modelFilePath = self.modelPath
        model = self.loadModelfromFile(modelFilePath)
        model.init_sims(replace=True)
        # return the locked model so that the caller can query it
        return model

As you can see, the so-called lock is simply init_sims with its replace parameter set to True.

Next come some query methods for the word2vec model:

    def getWordVec(self, model, wordStr):
        '''
        get the word's vector as arrayList type from w2v model
        '''
        return model[wordStr]

    def queryMostSimilarWordVec(self, model, wordStr, topN=20):
        '''
        MSimilar words basic query function
        return 2-dim List [0] is word [1] is double-prob
        '''
        similarPairList = model.most_similar(wordStr.decode('utf-8'), topn=topN)
        return similarPairList

    def culSimBtwWordVecs(self, model, wordStr1, wordStr2):
        '''
        two words similar basic query function
        return double-prob
        '''
        similarValue = model.similarity(wordStr1.decode('utf-8'), wordStr2.decode('utf-8'))
        return similarValue

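These methods are all very simple, essentially one line each; in the source code each is again followed by a companion version that works from a model file. getWordVec returns the word2vec vector of the query word itself, which prints as a plain numeric array. queryMostSimilarWordVec returns the N words most closely related to the query word together with their similarity scores, as a two-dimensional list (the docstring spells this out). culSimBtwWordVecs returns the similarity between two given words as a double.

5. Computing with Word2Vec word vectors

Anyone who has studied the theory behind w2v knows that word vectors can be added and subtracted. gensim provides a method built on this property, called as follows: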

    def queryMSimilarVecswithPosNeg(self, model, posWordStrList, negWordStrList, topN=20):
        '''
        pos-neg MSimilar words basic query function
        return 2-dim List [0] is word [1] is double-prob
        '''
        # Python 2.7: decode the utf-8 byte strings to unicode before querying
        posWordList = [wordStr.decode('utf-8') for wordStr in posWordStrList]
        negWordList = [wordStr.decode('utf-8') for wordStr in negWordStrList]
        pnSimilarPairList = model.most_similar(positive=posWordList, negative=negWordList, topn=topN)
        return pnSimilarPairList

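Because I am on Python 2.7, the incoming word lists are first decoded to unicode. posWordList can be thought of as the set of words that contribute positively to the result and negWordList as the set that contributes negatively. Both are passed to the most_similar method together with the topN number of answers to return, and the result has the same form as that of queryMostSimilarWordVec in section 4. Mathematically, you can think of the operation as adding up the vectors of the positive words, subtracting the vectors of the negative words, and then looking up the words closest to the resulting vector.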
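The arithmetic can be sketched in a few lines of plain Python. This is a toy re-implementation for illustration, not gensim's code: it averages the unit-length positive vectors, subtracts the unit-length negative ones, and ranks the remaining words by cosine similarity to the result. The vectors in the usage note are made up.

```python
import math

def _unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def most_similar(vectors, positive=(), negative=(), topn=3):
    """Rank words by cosine similarity to (positive words) minus (negative words)."""
    signed = [(word, 1.0) for word in positive] + [(word, -1.0) for word in negative]
    if not signed:
        raise ValueError("need at least one positive or negative word")
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, sign in signed:
        for i, x in enumerate(_unit(vectors[word])):
            query[i] += sign * x / len(signed)  # signed mean of unit vectors
    query = _unit(query)
    used = {word for word, _ in signed}  # the input words are never returned
    scored = [
        (word, sum(a * b for a, b in zip(query, _unit(vec))))
        for word, vec in vectors.items()
        if word not in used
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors with axes (royalty, male, female), for illustration only
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}
ranking = most_similar(toy, positive=["king", "woman"], negative=["man"])
print(ranking[0][0])  # queen
```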

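The next operation is my own invention. Suppose I want to express the relationship between two words, or two groups of words, in the same top-N "word, relatedness" form as above. This is how I do it: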

    def copeMSimilarVecsbtwWordLists(self, model, wordStrList1, wordStrList2, topN_rev=20, topN=20):
        '''
        range word vec res for two wordList from source to target
        use wordVector to express the relationship between src-wordList and tag-wordList
        first, use the tag-wordList as neg-wordList to get the rev-wordList,
        then use the src-wordList and the rev-wordList as the new src-tag-wordList
        topN_rev is topN of rev-wordList and topN is the final topN of relationship vec
        '''
        # queryMSimilarVecswithPosNeg decodes its inputs itself, so pass the raw
        # utf-8 strings; decoding here as well would decode them twice
        revSimilarPairList = self.queryMSimilarVecswithPosNeg(model, [], wordStrList2, topN_rev)
        # the returned words are unicode; re-encode them so the helper can decode them again
        revWordList = [pair[0].encode('utf-8') for pair in revSimilarPairList]
        stSimilarPairList = self.queryMSimilarVecswithPosNeg(model, wordStrList1, revWordList, topN)
        return stSimilarPairList

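The idea is as follows. First, one of the two word groups is passed as negWordList to the queryMSimilarVecswithPosNeg function above, producing a top-N group of intermediate words. These intermediate words are then used together with the other group in a second queryMSimilarVecswithPosNeg call. This is easy to understand: the first step yields the reverse result of treating one group as negWordList, and combining that reverse result with the other group produces a "two negatives make a positive" effect. In this way the relationship between the two word groups can be expressed as a top-N list of "word, relatedness" pairs.

The overall LDA workflow is: first obtain the document collection and segment it with a word segmentation tool to get sequences of tokens; second, assign an ID to every word, which is what corpora.Dictionary does; once IDs are assigned, count each word's frequency and build sparse vectors of the form "word ID: frequency"; finally, train an LDA model on them.

That is the segmentation step. Each sentence or paragraph then becomes a list of words, with the following result: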

    [['美国', '输给', '中国女排', '输给', '郎平'],
     ['美国', '无缘', '四强', '主教练'],
     ['中国女排', '晋级', '世锦赛', '四强', '主教练', '郎平', '执教', '艺术'],
     ['买', 'MPV', 'SUV', '跑', '长途'],
     ['跑', '长途', 'SUV', '轿车', '差距'],
     ['家用', '轿车', '买']]

{'中国女排': 0, '美国': 1, '输给': 2, '郎平': 3, '主教练': 4, '四强': 5, '无缘': 6, '世锦赛': 7, '执教': 8, '晋级': 9, '艺术': 10, 'MPV': 11, 'SUV': 12, '买': 13, '跑': 14, '长途': 15, '差距': 16, '轿车': 17, '家用': 18}

The corpus is then built from (word ID, frequency) pairs:

    [[(0, 1), (1, 1), (2, 2), (3, 1)],
     [(1, 1), (4, 1), (5, 1), (6, 1)],
     [(0, 1), (3, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
     [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
     [(12, 1), (14, 1), (15, 1), (16, 1), (17, 1)],
     [(13, 1), (17, 1), (18, 1)]]
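The ID assignment and the (word ID, frequency) encoding shown above can be reproduced in plain Python. This is a sketch of the idea, not gensim's implementation: build_dictionary and doc2bow are stand-in names, and giving the new words of each document their IDs in sorted order is an assumption that happens to reproduce the IDs listed above.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign consecutive integer IDs to words, document by document."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):  # new words of a document get IDs in sorted order
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Encode one document as a sorted list of (word ID, frequency) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [
    ['美国', '输给', '中国女排', '输给', '郎平'],
    ['美国', '无缘', '四强', '主教练'],
    ['中国女排', '晋级', '世锦赛', '四强', '主教练', '郎平', '执教', '艺术'],
    ['买', 'MPV', 'SUV', '跑', '长途'],
    ['跑', '长途', 'SUV', '轿车', '差距'],
    ['家用', '轿车', '买'],
]
token2id = build_dictionary(docs)
corpus = [doc2bow(doc, token2id) for doc in docs]
print(corpus[0])  # [(0, 1), (1, 1), (2, 2), (3, 1)]
```

With gensim itself, the equivalent steps would be corpora.Dictionary(docs) and dictionary.doc2bow(doc), and the resulting corpus is what gets passed to LdaModel together with id2word and num_topics.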

LdaModel(num_terms=19, num_topics=2, decay=0.5, chunksize=2000)

num_topics = 2 was set earlier, so there are two topics here. Clearly the first is a car-related topic and the second a sports-related topic.

(0, '0.089 "跑" + 0.088 "SUV" + 0.088 "长途" + 0.069 "轿车"')

(1, '0.104 "美国" + 0.102 "输给" + 0.076 "中国女排" + 0.072 "郎平"')

The topic that each document in the corpus above belongs to:

    (array([[5.13748   , 0.86251986],
            [0.6138436 , 4.386156  ],
            [8.315966  , 0.68403417],
            [5.387934  , 0.612066  ],
            [5.3367395 , 0.6632605 ],
            [0.59680593, 3.403194  ]], dtype=float32), None)

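美国教练坦言，没输给中国女排，是输给了郎平 (The US coach admits they lost not to the Chinese women's volleyball team but to Lang Ping)
    topic 0 inferred value: 0.62
    topic 1 inferred value: 5.38

美国无缘四强，听听主教练的评价 (The US misses the semifinals; hear what the head coach has to say)
    topic 0 inferred value: 1.35
    topic 1 inferred value: 3.65

中国女排晋级世锦赛四强，全面解析主教练郎平的执教艺术 (The Chinese women's volleyball team reaches the World Championship semifinals: a full analysis of head coach Lang Ping's coaching art)
    topic 0 inferred value: 0.82
    topic 1 inferred value: 8.18

为什么越来越多的人买MPV，而放弃SUV？跑一趟长途就知道了 (Why are more and more people buying MPVs and giving up SUVs? One long-distance trip will tell you)
    topic 0 inferred value: 1.63
    topic 1 inferred value: 4.37

跑了长途才知道，SUV和轿车之间的差距 (Only after a long-distance drive do you see the gap between an SUV and a sedan)
    topic 0 inferred value: 0.65
    topic 1 inferred value: 5.35

家用的轿车买什么好 (Which sedan is a good buy for family use)
    topic 0 inferred value: 3.38
    topic 1 inferred value: 0.62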
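The inferred values above do not sum to 1. They look like the unnormalized per-document topic weights (often called gamma) that gensim's LdaModel.inference returns; that reading is an assumption based on the (array(...), None) shape of the output shown earlier. Under that assumption each pair sums to roughly the number of tokens in the headline plus the total prior weight, for example 0.62 + 5.38 = 6 for a five-token headline, which would fit a prior of 0.5 per topic. Dividing each value by the row total turns the weights into proportions, as in this small sketch (normalize_gamma is a hypothetical helper):

```python
def normalize_gamma(gamma_row):
    """Turn one row of unnormalized topic weights into proportions that sum to 1."""
    total = sum(gamma_row)
    return [value / total for value in gamma_row]

# Values copied from the first headline in the listing above
proportions = normalize_gamma([0.62, 5.38])
print([round(p, 3) for p in proportions])
```

If only the proportions are needed, gensim's get_document_topics returns them directly for a bag-of-words document.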

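I ran this several times. Perhaps because the texts are so short, the results were rather poor and the classification was quite inaccurate.

中国女排将在郎平的率领下向世界女排三大赛的三连冠发起冲击 (Led by Lang Ping, the Chinese women's volleyball team will make a bid for a third straight title in the three major world women's volleyball tournaments)
    topic 0 inferred value: 2.40
    topic 1 inferred value: 0.60

Relation value between [长途] (long-distance) and [topic 0]: 1.61%
Relation value between [长途] (long-distance) and [topic 1]: 7.41%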

Original reference: http://www.pianshen.com/article/636768367/

Sharing is welcome; when reposting, please credit the source: 内存溢出

Original address: http://outofmemory.cn/tougao/8145612.html
