1. Deep contextualized word representations
Abstract: We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
2. GloVe: Global Vectors for Word Representation
Abstract: Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
3. SQuAD: 100,000+ Questions for Machine Comprehension of Text
Abstract: We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available.
5. Sequence to Sequence Learning with Neural Networks
Abstract: Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
6. The Stanford CoreNLP Natural Language Processing Toolkit
Abstract: We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.
7. Distributed Representations of Words and Phrases and their Compositionality
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
8. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
Abstract: Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.
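Data preprocessing
What the model can talk about also depends on the corpus you choose. If you already have raw chat data, you can use SQL keyword queries to pull out some conversations, that is, select a small corpus from the large one for training. Judging from several papers, much of the algorithmic work happens at the data preprocessing level. For example, Mechanism-Aware Neural Machine for Dialogue Response Generation describes extracting small corpora from a large one and then fusing them to train dialogue with a distinctive character.
For English, you need to know NLTK, which provides corpus loading, corpus normalization, corpus classification, part-of-speech (PoS) tagging, semantic extraction, and more.
Another powerful library is CoreNLP. Open-sourced by Stanford, it stands out for entity tagging and semantic extraction, and it supports multiple languages.
The following mainly covers two topics: Chinese word segmentation and word embedding.
Chinese word segmentation
Many SDKs for Chinese word segmentation are available, there are many segmentation algorithms, and plenty of articles compare the performance of the different SDKs. Sample code for Chinese word segmentation follows.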
# coding: utf8
'''
Segmenter for Chinese sentences.
'''
import jieba
import langid


def segment_chinese_sentence(sentence):
    '''
    Return the sentence with its words separated by spaces.
    '''
    # cut_all=False selects jieba's accurate mode.
    seg_list = jieba.cut(sentence, cut_all=False)
    # Return text (str) so that print() shows readable output.
    return " ".join(seg_list).strip()


def process_sentence(sentence):
    '''
    Segment only Chinese sentences; return anything else unchanged.
    '''
    # langid.classify returns a (language, score) tuple.
    if langid.classify(sentence)[0] == 'zh':
        return segment_chinese_sentence(sentence)
    return sentence


if __name__ == "__main__":
    print(process_sentence('飞雪连天射白鹿'))
    print(process_sentence('I have a pen.'))
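The code above uses langid to check whether the sentence is Chinese and then segments it with jieba.
In terms of features, jieba supports full mode, accurate mode, and search engine mode.
Full mode: outputs every possible word.
Accurate mode: the most probable segmentation.
Search engine mode: re-segments the long words produced by accurate mode.
How jieba is implemented
It works in three main steps:
1. Load the dictionary and build the dictionary space in memory.
Each line of the dictionary holds one word, a space, its frequency, a space, and its part of speech.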
上诉书 3 n
上诉人 3 n
上诉期 3 b
上诉状 4 n
上课 650 v
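The dictionary space is built with a Python dict, organized as a prefix dictionary.
A prefix dictionary is used because the structure has only one level, word: freq, which is efficient and saves space. Prefixes that are not words themselves are stored with a frequency of 0. For example, the word "dog" is stored like this:
{
    "d": 0,
    "do": 0,
    "dog": 1  # the value is the word frequency
}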
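The prefix dictionary described above can be sketched in a few lines. This is an illustration only, not jieba's source: `build_prefix_dict` is a name I made up, and the input is the handful of sample dictionary lines shown earlier.

```python
def build_prefix_dict(lines):
    """Build a flat {string: frequency} dict from 'word freq tag' lines.

    Every proper prefix of a word is stored with frequency 0 unless the
    prefix is itself a word in the dictionary.
    """
    freq = {}
    total = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, count = parts[0], int(parts[1])
        freq[word] = count
        total += count
        # Register each proper prefix without overwriting a real word.
        for end in range(1, len(word)):
            freq.setdefault(word[:end], 0)
    return freq, total


sample_lines = ["上诉书 3 n", "上诉人 3 n", "上诉期 3 b", "上诉状 4 n", "上课 650 v"]
prefix_dict, total = build_prefix_dict(sample_lines)
print(prefix_dict)
print(total)
```

A single dict lookup then answers both "is this a word?" (frequency above 0) and "could this still grow into a word?" (key present), which is what the segmentation step relies on.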
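The dictionary space is mainly used to build a directed acyclic graph for the input sentence, which is then segmented by an algorithm. Which algorithm is used depends on the mode: full, accurate, or search engine.
2. To segment an input sentence, first build a directed acyclic graph.
A directed acyclic graph is abbreviated DAG (pronounced /ˈdæɡ/).
[Figure 3-2] DAG
The DAG is directly relevant to the later steps of computing the maximum-probability path and recognizing new words with an HMM.
3. Traverse the DAG according to the mode. In accurate mode, for example, the traversal finds the path with the largest total weight, where the weights come from the word frequencies defined in the dictionary. For text that does not appear in the dictionary, runs of consecutive single characters may form new words, which are then recognized with an HMM and the Viterbi algorithm.
Accurate-mode segmentation: the maximum-probability path is solved with dynamic programming.
Maximum-probability path: find route = (w1, w2, w3, ..., wn) that maximizes Σ weight(wi), where weight(wi) is the frequency of word wi.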
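Steps 2 and 3 can be sketched as follows. This is a simplified illustration under my own assumptions, not jieba's code: the toy frequencies are invented, the function names are hypothetical, each word is scored by the log of its relative frequency so that the scores add up along a path, and new-word recognition with an HMM is left out.

```python
import math

# Toy prefix dictionary (invented frequencies); a value of 0 marks a pure prefix.
toy_freq = {
    "研": 0, "研究": 100, "研究生": 20,
    "生": 30, "生命": 80, "命": 10,
    "起": 0, "起源": 50,
}
toy_total = sum(toy_freq.values())


def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = []
        end = start
        fragment = sentence[start]
        # Keep extending while the fragment is a known word or prefix.
        while end < n and fragment in freq:
            if freq[fragment] > 0:
                ends.append(end)
            end += 1
            fragment = sentence[start:end + 1]
        if not ends:
            ends.append(start)  # unknown character: treat it as a one-character word
        dag[start] = ends
    return dag


def best_route(sentence, dag, freq, total):
    """Dynamic programming from right to left: best (score, end) for each start."""
    n = len(sentence)
    log_total = math.log(total)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1]) or 1) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence, freq, total):
    """Follow the best route from the start of the sentence and collect the words."""
    route = best_route(sentence, build_dag(sentence, freq), freq, total)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


print(build_dag("研究生命起源", toy_freq))
print(segment("研究生命起源", toy_freq, toy_total))
```

Working right to left means that the best score for every suffix is already known when a word ending just before it is considered, so each edge of the DAG is examined only once.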
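For more detail you will need to read the jieba source code.
Custom dictionaries
jieba's default dictionary is built from the 1998 People's Daily segmented corpus, an MSR segmented corpus, and some novels in txt form. Developers can add their own dictionaries as long as they follow the dictionary format.
jieba also provides an API for adding individual words.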
Word embedding
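A language model trained with machine learning runs on neural networks that compute with numbers, so text has to be encoded at the input and decoded at the output. Word embedding is the means of doing this encoding and decoding.
[Figure 3-3] Word embedding, Ref. #7
Word embedding is a way of representing text numerically. Representations include one-hot, bag of words, N-gram, distributed representations, co-occurrence matrices, and others.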
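Three of the representations just listed (one-hot, bag of words, and a co-occurrence matrix) are simple enough to sketch with plain Python lists. The two toy sentences and all function names below are my own, chosen only for illustration.

```python
from collections import Counter


def build_vocab(sentences):
    """Assign each distinct word an id, in alphabetical order."""
    words = sorted({word for sentence in sentences for word in sentence})
    return {word: index for index, word in enumerate(words)}


def one_hot(word, vocab):
    """Vector with a single 1 at the word's id."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


def bag_of_words(sentence, vocab):
    """Vector of word counts; word order is discarded."""
    counts = Counter(sentence)
    return [counts[word] for word in sorted(vocab, key=vocab.get)]


def cooccurrence(sentences, vocab, window=1):
    """Count how often two words appear within `window` positions of each other."""
    size = len(vocab)
    matrix = [[0] * size for _ in range(size)]
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i:
                    matrix[vocab[word]][vocab[sentence[j]]] += 1
    return matrix


toy_sentences = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]
vocab = build_vocab(toy_sentences)
print(one_hot("like", vocab))
print(bag_of_words(["i", "like", "i"], vocab))
print(cooccurrence(toy_sentences, vocab)[vocab["i"]][vocab["like"]])
```

All three are sparse and grow with the vocabulary size, which is part of the motivation for the dense distributed representations discussed next.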
Word2vec
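In recent years word2vec has been widely adopted. Word2vec takes articles or other corpora as input and outputs a vector space built from the vocabulary of the corpus. For details, see the article on the mathematical principles of word2vec (word2vec数学原理解析).
Using word2vec
After installation you get the word2vec command-line tool.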
word2vec -train "data/review.txt" \
-output "data/review.model" \
-cbow 1 \
-size 100 \
-window 8 \
-negative 25 \
-hs 0 \
-sample 1e-4 \
-threads 20 \
-binary 1 \
-iter 15
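-train "data/review.txt" trains the model on the given corpus
-cbow 1 uses the CBOW model; set it to 0 to use the skip-gram model
-size 100 sets the dimensionality of the word vectors to 100
-window 8 sets the training window to 8, that is, the eight words before and the eight words after each word are considered
-negative 25 -hs 0 chooses between negative sampling and hierarchical softmax (HS); here 25 negative samples are used and HS is turned off
-sample 1e-4 sets the subsampling threshold
-threads 20 sets the number of threads
-binary 1 saves the output model in binary format
-iter 15 sets the number of iterations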
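To make the -window parameter concrete, here is a small sketch of how a fixed window pairs each word with its neighbours. It is a simplification for illustration, not the tool's implementation: `context_windows` is a hypothetical name, and any sampling the real tool applies to the window or to frequent words is not reproduced.

```python
def context_windows(tokens, window):
    """Yield (center, context) pairs with up to `window` words on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


tokens = ["the", "food", "was", "really", "good"]
for center, context in context_windows(tokens, window=2):
    # CBOW predicts the center word from its context;
    # skip-gram predicts each context word from the center word.
    print(center, context)
```

With -cbow 1 each pair becomes one training example that predicts the center word from the context, while with -cbow 0 the direction is reversed.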
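Once training finishes you have a model. With it you can look up the vector of each word, compute distances between words, and combine words in an arithmetic expression to find related words. For example:
vector("Paris") - vector("France") + vector("UK") ≈ vector("London")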
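The analogy lookup can be sketched with cosine similarity over a handful of hand-made vectors. The four-dimensional toy vectors below are invented so that the arithmetic works out exactly; real word2vec vectors are learned, have far more dimensions, and only approximately satisfy such relations. The function names are my own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c):
    """Return the word nearest to vector(b) - vector(a) + vector(c)."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Invented dimensions: (is a capital, France, UK, Germany).
toy_vectors = {
    "France": [0, 1, 0, 0],
    "Paris": [1, 1, 0, 0],
    "UK": [0, 0, 1, 0],
    "London": [1, 0, 1, 0],
    "Germany": [0, 0, 0, 1],
    "Berlin": [1, 0, 0, 1],
}
print(analogy(toy_vectors, "France", "Paris", "UK"))
```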
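For a different corpus, you can either train a separate word vector model or reuse a model that has already been trained.
Another recommended tool for training word vector spaces: GloVe.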
Seq2Seq
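In 2014, Sequence to Sequence Learning with Neural Networks proposed using deep learning, with RNN and LSTM networks, to train a translation system, and achieved a breakthrough. The approach has since been applied much more widely, for example to question answering, image captioning, speech recognition, and poetry writing. Seq2Seq implements the mapping [encoder + decoder -> target], and the paper describes the implementation clearly.
[Figure 3-4] Seq2Seq, Ref. #1
Many articles also explain how it works. Although I studied its structure while using Seq2Seq, I do not yet feel able to fully understand and explain it. Here are two impressions:
a. An RNN preserves the sequential nature of language, much as a CNN suits data that has spatial shape: the design of the mathematical model matches the physical model.
[Figure 3-5] RNN, Ref. #6
b. The complexity of the LSTM cell matches the complexity of natural language processing.
[Figure 3-6] LSTM, Ref. #6
The reason is that people have tried many other ways of passing state in place of the LSTM cell, and the results were also good.
[Figure 3-7] GRU, Ref. #6
One alternative to the LSTM is the GRU. As long as the RNN cell is complex enough, it works well.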
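To give a feel for what "passing state" means in a GRU cell, here is a minimal sketch of one step with a scalar input and a scalar hidden state. It is a teaching toy under my own assumptions: real cells use weight matrices and vectors, the parameter names are invented, and the sign convention for the update gate differs between papers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU step for a scalar input x and a scalar previous state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate decides how much old state feeds into it.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(inputs, p, h0=0.0):
    """Feed a sequence through the cell and return the final state."""
    h = h0
    for x in inputs:
        h = gru_step(x, h, p)
    return h


params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}
print(run_gru([1.0, -1.0, 0.5], params))
```

The final state depends on the order of the inputs, which is the sense in which a recurrent cell keeps the sequential character of language.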
Training a language model with DeepQA2
Preparation: download the project.
git clone https://github.com/Samurais/DeepQA2.git
cd DeepQA2
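open README.md # install the dependencies as described in README.md
DeepQA2 splits the work into three stages:
Data preprocessing: from the corpus to the data dictionary.
Model training: from the data dictionary to the language model.
Serving: from the language model to a REST API.
Preprocessing
DeepQA2 uses the Cornell Movie Dialogs Corpus as its demo corpus.
The raw data consists of movie_lines.txt and movie_conversations.txt. See README.txt for how these two files are organized.
deepqa2/dataset/preprocesser.py is the module that turns these two files into the data dictionary.
train_max_length_enco is the maximum question length and train_max_length_deco is the maximum answer length. Anything in the corpus beyond these lengths is truncated.
Running the program generates the file dataset-cornell-20.pkl, which loads into Python as a dictionary:
word2id stores {word: id}, where word is a word and id is an int that identifies it.
id2word stores {id: word}.
trainingSamples stores the question-answer pairs.
For example: [[[1,2,3],[4,5,6]], [[7,8,9], [10, 11, 12]]]
1, 2, 3 ... 12 are all word ids.
[1,2,3] and [4,5,6] form one question-answer pair. [7,8,9] and [10, 11, 12] form another.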
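The shape of this data dictionary can be reproduced with a short sketch. This is not the code in preprocesser.py: `build_dataset` is a hypothetical helper, the toy pairs are invented, truncation simply keeps the first tokens, and any special tokens the real project may reserve (padding, start, end, unknown) are left out.

```python
def build_dataset(pairs, max_length_enco, max_length_deco):
    """Turn tokenized (question, answer) pairs into word ids and id pairs."""
    word2id, id2word, samples = {}, {}, []

    def encode(tokens, limit):
        ids = []
        for token in tokens[:limit]:  # tokens beyond the limit are dropped
            if token not in word2id:
                new_id = len(word2id)
                word2id[token] = new_id
                id2word[new_id] = token
            ids.append(word2id[token])
        return ids

    for question, answer in pairs:
        samples.append([encode(question, max_length_enco),
                        encode(answer, max_length_deco)])
    return {"word2id": word2id, "id2word": id2word, "trainingSamples": samples}


toy_pairs = [
    (["how", "are", "you"], ["fine", "thanks"]),
    (["are", "you", "ok", "today"], ["yes"]),
]
dataset = build_dataset(toy_pairs, max_length_enco=3, max_length_deco=2)
print(dataset["trainingSamples"])
```

A dictionary like this can be written to and read from a .pkl file with the standard pickle module.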
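Start training
cp config.sample.ini config.ini # modify keys
python deepqa2/train.py
config.ini is the configuration file; create it by modifying config.sample.ini. Training time depends on the number of epochs, the learning rate, the max length, and the number of dialogue pairs.
deepqa2/train.py is about 100 lines long. It loads the data dictionary, initializes the TensorFlow session, saver, and writer, initializes the neural network model, iterates over the epochs, and saves the model to disk.
The session holds the network graph, which is made up of placeholders, variables, cells, layers, and outputs.
The saver saves the model and can also restore it. The model is a session whose variables have been instantiated.
The writer collects the loss function values or any other data the developer is interested in. Its results are written to disk and can then be viewed with TensorBoard.
Model
Building the model involves the inputs, the state, the softmax, and the outputs.
Define the loss function and use AdamOptimizer for the iterations.
Finally, see the loop section of the training code.
Each training run stores the model under the save path, in a folder named from the machine's hostname and a timestamp.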
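A folder name built from hostname and timestamp might be generated as in the sketch below. DeepQA2's own naming code is not shown here, so the format is an assumption inferred from the example folder deeplearning.cobra.vulcan.20170127.175256 used in the serving commands, and `make_save_dir` is a hypothetical name.

```python
import os
import socket
import time


def make_save_dir(base="save", hostname=None, now=None):
    """Return '<base>/<hostname>.<YYYYmmdd>.<HHMMSS>' for the current run."""
    hostname = hostname or socket.gethostname()
    stamp = time.strftime("%Y%m%d.%H%M%S", now or time.localtime())
    return os.path.join(base, "%s.%s" % (hostname, stamp))


# The directory is only named here; creating it is left to the caller,
# for example with os.makedirs(path, exist_ok=True).
print(make_save_dir())
```

Including the hostname keeps runs from different machines apart when the save folder is shared, and the timestamp keeps successive runs on the same machine from overwriting each other.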
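Serving
TensorFlow provides a standard serving module, TensorFlow Serving. But after studying it for a long time, and even reading C++ Essentials for the purpose, I still have not got it working, and the community widely complains that TensorFlow Serving is hard to learn and hard to use. After training finishes, start the service with the script below. The serve part of DeepQA2 still calls TensorFlow's Python API.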
cd DeepQA2/save/deeplearning.cobra.vulcan.20170127.175256/deepqa2/serve
cp db.sample.sqlite3 db.sqlite3
python manage.py runserver 0.0.0.0:8000
Test
POST /api/v1/question HTTP/1.1
Host: 127.0.0.1:8000
Content-Type: application/json
Authorization: Basic YWRtaW46cGFzc3dvcmQxMjM=
Cache-Control: no-cache
{"message": "good to know"}
response
{
"rc": 0,
"msg": "hello"
}
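The same request can be built from Python with only the standard library. This is a sketch: `build_question_request` and `ask` are names I made up, the username and password are simply what the Basic header above decodes to, and `ask` only works while the server from the previous step is running.

```python
import base64
import json
import urllib.request


def build_question_request(message, username, password,
                           base_url="http://127.0.0.1:8000"):
    """Build the POST request for /api/v1/question with HTTP Basic auth."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v1/question",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def ask(message, username, password):
    """Send the question and return the decoded JSON reply."""
    request = build_question_request(message, username, password)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_question_request("good to know", "admin", "password123")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Calling `ask("good to know", "admin", "password123")` against a running server should return a dictionary with the rc and msg fields shown in the response above.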
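The core serving code is in serve/api/chatbotmanager.py.
Using the scripts
scripts/start_training.sh starts training
scripts/start_tensorboard.sh starts TensorBoard
scripts/start_serving.sh starts the service
Evaluation of the model
The code is currently highly maintainable, which was the reason for refactoring it from the DeepQA project: data preprocessing, training, and serving are now more clearly separated. New changes can be added under deepqa2/models and then wired into train.py and chatbotmanager.py.
Areas for improvement
a. Create models/rnn2.py that uses dropout. DeepQA already uses dropout.
b. TensorFlow 0.12.x (rc) already provides a seq2seq network, so the model could be updated to the TF version.
c. Fused training. The model currently has only one corpus. A new model should be designed that supports one large corpus and one small corpus with different weights, as described in Mechanism-Aware Neural Machine for Dialogue Response Generation.
d. Support running on multiple machines and multiple GPUs.
e. The training results are currently all QA pairs, whereas a single question can have several answers.
f. There is currently no way to test accuracy. One idea is to supply distractors during training: at present there are only correct answers, and if wrong answers are also supplied (the more the better), the recall_at_k method can be used for testing.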
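The recall_at_k idea in item f can be sketched as follows. The setup is an assumption on my part: for each question the model scores one correct answer plus several distractors, the correct answer is always placed at index 0 of the score list, and the scores themselves are invented.

```python
def recall_at_k(score_lists, k):
    """Fraction of questions whose correct answer ranks in the top k.

    Each inner list holds the model's scores for the candidates of one
    question; index 0 is the correct answer, the rest are distractors.
    """
    hits = 0
    for scores in score_lists:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if 0 in ranked[:k]:
            hits += 1
    return hits / len(score_lists)


toy_scores = [
    [0.9, 0.1, 0.3],  # correct answer ranked first
    [0.2, 0.8, 0.5],  # correct answer ranked third
    [0.4, 0.6, 0.1],  # correct answer ranked second
]
for k in (1, 2, 3):
    print(k, recall_at_k(toy_scores, k))
```

With more distractors per question the metric becomes harder to satisfy by chance, which is why item f notes that more wrong answers are better.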
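This is what I learned from 机器人家; I hope it is useful to you.
For Chinese, the main tools are: NLTK, HanLP, Ansj, THULAC, jieba (结巴分词), FNLP, HIT LTP (哈工大LTP), CAS ICTCLAS (中科院ICTCLAS分词), GATE, SnowNLP, NEU NiuTrans (东北大学NiuTrans), and NLPIR. For English, the main tools are: NLTK, Gensim, TextBlob, Stanford NLP, and spaCy. For open-source English NLP tools, see mainly the Stack Overflow discussion "java or python for nlp".
HanLP: HanLP is a Java toolkit made up of a series of models and algorithms, whose goal is to popularize the use of natural language processing in production environments. HanLP is full-featured, efficient, and cleanly architected, ships with up-to-date corpora, and is customizable.
Language: Java. Repository: hankcs/HanLP. Developer: 大快公司. License: Apache-2.0.
Features: a great many, chiefly Chinese word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, automatic summarization, phrase extraction, pinyin conversion, simplified/traditional Chinese conversion, text recommendation, dependency parsing, text classification, sentiment analysis, word2vec, and corpus tools.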