Reproducing the CS224N (2019) Assignment 1


This assignment mainly covers:

  • Cosine similarity
  • Two ways of obtaining word vectors
    • Count-based (word co-occurrence matrix + SVD)
    • Prediction-based (word2vec)

Full assignment materials: CS 224N | Home

 

I. Environment and Data Issues

1. Installing gensim

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple gensim

2. Loading the reuters dataset

Running the assignment code as-is raises an error when loading the reuters corpus. The fix is as follows:

Download the data directly:

Link: https://pan.baidu.com/s/1zo87Uq7_dKbu773ZT_cwTw   extraction code: 9jni

In the directory from which you launch Jupyter Notebook, create a folder named nltk_data, and put the downloaded reuters corpus inside it.

Run the cell again and the error is gone 🎉
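If the corpus still cannot be found after this, a small check of my own (not from the assignment) is to list the directories nltk actually searches and, if needed, add the new folder explicitly; nltk ultimately expects a corpora/reuters entry under one of these directories:

import nltk

print(nltk.data.path)                  # directories nltk searches for corpora
nltk.data.path.append('./nltk_data')   # add the folder created next to the notebook, if it is not already listed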

3. Loading the 'word2vec-google-news-300' dataset errors out yet again

Solution:

See this CSDN blog post: "unable to read local cache 'C:\\Users\\kingS/gensim-data\\information.json' during fallback, connec…" (紧到长不胖's blog, CSDN)

4. A few snippets fail to run because of gensim version differences (the original notebook targets gensim 3.x, while this environment uses a newer 4.x release).

The failing spots are the ones that use the old vocabulary API (for example wv_from_bin.vocab.keys(), which was removed in 4.x).

Solution reference: Migrating from Gensim 3.x to 4 · RaRe-Technologies/gensim Wiki · GitHub
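For reference, these are the renames from that migration guide that affect this notebook (a sketch of the relevant rows; see the wiki for the full table):

# gensim 3.x                    ->  gensim 4.x
# wv.vocab.keys()               ->  wv.key_to_index.keys()
# wv.index2word                 ->  wv.index_to_key
# wv.vocab[word].count          ->  wv.get_vecattr(word, "count")
#
# Example (assuming a KeyedVectors object `wv` has already been loaded):
# vocab = list(wv.key_to_index.keys())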

After a great many rounds of tinkering, the Assignment 1 code finally reproduces end to end~

II. Understanding the Assignment

1. Imports

# All Import Statements Defined Here
# Note: Do not add to this list.
# All the dependencies you need, can be installed by running .
# ----------------

import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
# import nltk
# nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)
# ----------------

2. Part 1: Count-Based Word Vectors (10 points)

Overall approach:

  1. From the words of the documents in the corpus, first build a vocabulary (unique words, sorted).
  2. Based on the vocabulary and the corpus, build a vector for every word, i.e. the co-occurrence matrix.
  3. Reduce the dimensionality of the co-occurrence matrix to get the final word vectors.
  4. Visualize the result.

2.1 Build the Vocabulary (from the Corpus)

2.1.1 Definition of the co-occurrence matrix

2.1.2 Traverse every document, collect all words, deduplicate and sort them, and record the number of distinct words.

def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)

This reads the 578 documents of the "crude" category.

The words of every document are loaded and lower-cased.

START_TOKEN and END_TOKEN are added at the beginning and end of every document.

Next, compute the distinct words in the corpus and sort them.

# Compute the distinct words in the corpus, then sort them
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    for everylist in corpus:
        corpus_words.extend(everylist)
    corpus_words = sorted(set(corpus_words))
    num_corpus_words = len(corpus_words)

    # ------------------

    return corpus_words, num_corpus_words
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------

# Define toy corpus
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
test_corpus_words, num_corpus_words = distinct_words(test_corpus)
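The official notebook follows this with a comparison against a hand-written answer; a minimal reconstruction of that check (my own sketch, not the original cell) looks like this:

# Rebuild the expected answer from the toy corpus itself and compare
ans_test_corpus_words = sorted(set("START All that glitters isn't gold END".split(" ")
                                   + "START All's well that ends well END".split(" ")))
assert test_corpus_words == ans_test_corpus_words, "Incorrect list of distinct words"
assert num_corpus_words == len(ans_test_corpus_words), "Incorrect number of distinct words"
print("Passed! The toy corpus has {} distinct words.".format(num_corpus_words))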

2.2 Build the Co-occurrence Matrix

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    word2Ind = {k: v for (k, v) in zip(words, range(num_words))}
    M = np.zeros((num_words, num_words))
    # Walk through the corpus: for every document, visit each word; for each (center) word,
    # find its window and count every word inside that window.
    # Note that each word has two indices: its position in the document and its row/column
    # in the vocabulary. The counts in M are indexed by the vocabulary index, so we have to
    # convert between the two.

    # Iterate over the corpus
    for every_doc in corpus:
        for cword_doc_ind, cword in enumerate(every_doc):  # each word and its position in the current document
            # Look up the center word's index in the vocabulary
            cword_dic_ind = word2Ind[cword]

            # Window boundaries: the start is the current position minus window_size,
            # the (exclusive) end is the current position plus window_size + 1
            window_start = cword_doc_ind - window_size
            window_end = cword_doc_ind + window_size + 1

            # Visit every position inside the window and record the counts in M.
            # Watch the boundaries: near the beginning or end of the document the window
            # is clipped, so each position must be checked before it is used.
            for j in range(window_start, window_end):
                # The first two conditions keep us inside the document;
                # the last one skips the center word itself
                if j >= 0 and j < len(every_doc) and j != cword_doc_ind:
                    oword = every_doc[j]             # the context word
                    oword_dic_ind = word2Ind[oword]  # its vocabulary index
                    M[cword_dic_ind, oword_dic_ind] += 1

    # ------------------

    return M, word2Ind
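One extra check of my own (not part of the assignment): because each center/context pair is counted in both directions, the raw co-occurrence matrix must be symmetric. Assuming the toy test_corpus from the previous sanity check is still defined:

# Extra sanity check: the co-occurrence count matrix should equal its transpose
M_check, word2Ind_check = compute_co_occurrence_matrix(test_corpus, window_size=1)
assert np.allclose(M_check, M_check.T), "Co-occurrence matrix should be symmetric"
print("Co-occurrence matrix is symmetric, as expected.")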

2.3 Reduce the Dimensionality of the Co-occurrence Matrix

sklearn.decomposition.TruncatedSVD — scikit-learn 1.1.0 documentation

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=2020)
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced


# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# In fact we only check that your M_reduced has the right dimensions.
# ---------------------

# Define toy corpus and run student code
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)

# Test proper dimensions
assert (M_test_reduced.shape[0] == 10), "M_reduced has {} rows; should have {}".format(M_test_reduced.shape[0], 10)
assert (M_test_reduced.shape[1] == 2), "M_reduced has {} columns; should have {}".format(M_test_reduced.shape[1], 2)

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)
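For intuition about the "U * S" mentioned in the docstring: TruncatedSVD.fit_transform returns the k left singular vectors scaled by the k largest singular values. A tiny numpy sketch of my own computing the same quantity with an exact SVD (column signs may differ from TruncatedSVD, which uses a randomized algorithm):

import numpy as np

A = np.array([[0., 1., 1.],
              [1., 0., 2.],
              [1., 2., 0.]])
U, S, Vt = np.linalg.svd(A)
k = 2
A_reduced = U[:, :k] * S[:k]   # rows are the k-dimensional embeddings, i.e. U_k * S_k
print(A_reduced.shape)         # (3, 2), the same shape TruncatedSVD(n_components=2).fit_transform(A) would give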

2.4 Visualize After Dimensionality Reduction

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, k)): matrix of k-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    # For each requested word, look up its row in M_reduced to get its (x, y) coordinates
    for word in words:
        word_dic_index = word2Ind[word]
        x = M_reduced[word_dic_index][0]
        y = M_reduced[word_dic_index][1]
        plt.scatter(x, y, marker='x', color='red')
        # plt.text() places a label slightly above and to the right of the point;
        # `word` is the annotation text and fontsize controls its size
        plt.text(x + 0.0002, y + 0.0002, word, fontsize=9)
    plt.show()
    # ------------------
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# The plot produced should look like the "test solution plot" depicted below. 
# ---------------------

print ("-" * 80)
print ("Outputted Plot:")

M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)

print ("-" * 80)

 Part 1 Summary:

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)

Reference on norms: np.linalg.norm() usage summary (小k同学's blog, CSDN)
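A tiny illustration of my own (not from the assignment) of what the normalization cell above does:

import numpy as np

X = np.array([[3.0, 4.0],
              [0.0, 2.0]])
lengths = np.linalg.norm(X, axis=1)    # one L2 norm per row: array([5., 2.])
X_unit = X / lengths[:, np.newaxis]    # broadcasting divides each row by its own norm
print(np.linalg.norm(X_unit, axis=1))  # [1. 1.] -- every row now has unit length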

3. Part 2: Prediction-Based Word Vectors (15 points)

Here we use a pre-trained Word2Vec embedding matrix to try out some interesting effects and see what word vectors are actually good for. The vectors are downloaded via the gensim package:

def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each of length 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("word2vec-google-news-300")
#     vocab = list(wv_from_bin.vocab.keys())        # gensim 3.x API (removed in 4.x)
    vocab = list(wv_from_bin.key_to_index.keys())   # gensim 4.x API
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin
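The cells that follow assume the vectors have already been loaded into wv_from_bin; in the notebook this is done with a cell along these lines (loading the 3-million-word model takes a few minutes):

# -----------------------------------------------------------------
# Run this cell to load the word2vec vectors (may take several minutes)
# -----------------------------------------------------------------
wv_from_bin = load_word2vec()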

Dimensionality reduction and visualization:

def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.key_to_index.keys())   # gensim 4.x; this was wv_from_bin.vocab.keys() in 3.x
    print("Shuffling words ...")
    random.shuffle(words)
    words = words[:10000]       # keep 10,000 randomly chosen words
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind

# -----------------------------------------------------------------
# Run Cell to Reduce 300-Dimensional Word Embeddings to k Dimensions
# Note: This may take several minutes
# -----------------------------------------------------------------
M, word2Ind = get_matrix_of_vectors(wv_from_bin)
M_reduced = reduce_to_k_dim(M, k=2)         # reduce to 2 dimensions

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced, word2Ind, words)

 

 

4. Cosine Similarity

Based on cosine similarity we can find a word's nearest neighbors, explore synonyms and antonyms, and even do analogy reasoning and other fun things.
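As a quick sanity check of my own (assuming wv_from_bin has been loaded as above), cosine similarity is just the dot product of two vectors divided by the product of their norms, and gensim's similarity() computes the same number:

import numpy as np

u = wv_from_bin['oil']
v = wv_from_bin['petroleum']
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)                                      # cosine similarity computed by hand
print(wv_from_bin.similarity('oil', 'petroleum'))   # gensim's value; should match up to float precision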

4.1 Find the 10 words most similar to a given word:
We can use gensim's most_similar function; see the GenSim documentation.

# ------------------
# Write your polysemous word exploration code here.

wv_from_bin.most_similar("dream")

# ------------------

4.2 Synonyms and Antonyms

When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.

Find three words (w1,w2,w3) where w1 and w2 are synonyms and w1 and w3 are antonyms, but Cosine Distance(w1,w3) < Cosine Distance(w1,w2). For example, w1="happy" is closer to w3="sad" than to w2="cheerful".

Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.

You should use the wv_from_bin.distance(w1, w2) function here in order to compute the cosine distance between two words. Please see the GenSim documentation for further assistance.
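A one-line check of my own (not required by the assignment) that distance() really is 1 - similarity():

d = wv_from_bin.distance("happy", "cheerful")
s = wv_from_bin.similarity("happy", "cheerful")
print(d, 1 - s)   # the two numbers should agree up to floating-point error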

# ------------------
# Write your synonym & antonym exploration code here.

w1 = "man"
w2 = "king"
w3 = "woman"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

# ------------------
Synonyms man, king have cosine distance: 0.7705732733011246
Antonyms man, woman have cosine distance: 0.2335987687110901

4.3 Analogies

Word vectors can also solve analogy relations.
For example, China : Beijing = Japan : ?. We can solve it with the code below; note the order of the words in positive and negative: the answer should be similar to Japan and Beijing, and far from China.

# China : Beijing :: Japan : x -- what is x most likely to be?

# Run this cell to answer the analogy -- China : Beijing :: Japan : x
pprint.pprint(wv_from_bin.most_similar(positive=['Beijing', 'Japan'], negative=['China']))
[('Tokyo', 0.8115593791007996),
 ('Osaka', 0.6796455383300781),
 ('Seoul', 0.6568831205368042),
 ('Japanese', 0.6475988030433655),
 ('Nagoya', 0.6425850987434387),
 ('Maebashi', 0.6409165859222412),
 ('Yokohama', 0.626289427280426),
 ('Fukuoka', 0.6085069179534912),
 ('Osaka_Japan', 0.6067586541175842),
 ('Sapporo', 0.6054472923278809)]
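Under the hood this is roughly vector arithmetic followed by a cosine-similarity search. A rough sketch of my own (gensim's most_similar actually uses the 3CosAdd objective on unit-normalized vectors, so the exact scores differ):

import numpy as np

# x should be close to Beijing - China + Japan
target = wv_from_bin['Beijing'] - wv_from_bin['China'] + wv_from_bin['Japan']
tokyo = wv_from_bin['Tokyo']
cos = np.dot(target, tokyo) / (np.linalg.norm(target) * np.linalg.norm(tokyo))
print(cos)   # high cosine similarity, consistent with 'Tokyo' topping the list above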

man : woman :: him : x -- what is x most likely to be?

# ------------------
# Write your analogy exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['woman','him'], negative=['man']))

# ------------------
[('her', 0.694490909576416),
 ('she', 0.6385233402252197),
 ('me', 0.628451406955719),
 ('herself', 0.6239798665046692),
 ('them', 0.5843965411186218),
 ('She', 0.5237804651260376),
 ('myself', 0.4885627031326294),
 ('saidshe', 0.48337963223457336),
 ('he', 0.48184284567832947),
 ('Gail_Quets', 0.4784894585609436)]

4.4 An Incorrect Analogy

tree : leaf :: flower : x -- what is x most likely to be?

# ------------------
# Write your incorrect analogy exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))

# ------------------
[('floral', 0.5532569289207458),
 ('marigold', 0.5291937589645386),
 ('tulip', 0.5213128924369812),
 ('rooted_cuttings', 0.5189827084541321),
 ('variegation', 0.5136324763298035),
 ('Asiatic_lilies', 0.5132641792297363),
 ('gerberas', 0.5106234550476074),
 ('gerbera_daisies', 0.5101010203361511),
 ('Verbena_bonariensis', 0.5070016384124756),
 ('violet', 0.5058107972145081)]

The expected answer is tree : leaf :: flower : petal, but "petal" does not appear anywhere in the results.

4.5 Bias Analysis

Bias in word embeddings matters a great deal (gender bias, racial bias, and so on). Run the code below and analyze two questions:

(a) Which words are most similar to "woman" and "boss" and most dissimilar to "man"?

(b) Which words are most similar to "man" and "boss" and most dissimilar to "woman"?

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit to our word embeddings.

Run the cell below, to examine (a) which terms are most similar to "woman" and "boss" and most dissimilar to "man", and (b) which terms are most similar to "man" and "boss" and most dissimilar to "woman". What do you find in the top 10?

man : woman :: boss : x -- what is x most likely to be?

woman : man :: boss : x -- what is x most likely to be?

# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))
[('bosses', 0.5522644519805908),
 ('manageress', 0.49151360988616943),
 ('exec', 0.45940810441970825),
 ('Manageress', 0.4559843838214874),
 ('receptionist', 0.4474116563796997),
 ('Jane_Danson', 0.44480547308921814),
 ('Fiz_Jennie_McAlpine', 0.4427576959133148),
 ('Coronation_Street_actress', 0.44275563955307007),
 ('supremo', 0.4409853219985962),
 ('coworker', 0.43986251950263977)]

[('supremo', 0.6097397804260254),
 ('MOTHERWELL_boss', 0.5489562749862671),
 ('CARETAKER_boss', 0.5375303030014038),
 ('Bully_Wee_boss', 0.5333974957466125),
 ('YEOVIL_Town_boss', 0.5321705341339111),
 ('head_honcho', 0.5281980037689209),
 ('manager_Stan_Ternent', 0.525971531867981),
 ('Viv_Busby', 0.5256163477897644),
 ('striker_Gabby_Agbonlahor', 0.5250812768936157),
 ('BARNSLEY_boss', 0.5238943099975586)]

For the first analogy, man : woman :: boss : ___, a fitting answer would be something like "landlady" (a female boss), but the top 10 only contains words like "manageress" and "receptionist".

For the second analogy, woman : man :: boss : ___, the output is mostly obscure football-manager phrases and hard to make sense of.

4.6 Independent Analysis of Bias in Word Vectors

man : woman :: doctor : x -- what is x most likely to be?

woman : man :: doctor : x -- what is x most likely to be?

# ------------------
# Write your bias exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man','doctor'], negative=['woman']))

# ------------------
[('gynecologist', 0.7093892097473145),
 ('nurse', 0.6477287411689758),
 ('doctors', 0.6471460461616516),
 ('physician', 0.6438996195793152),
 ('pediatrician', 0.6249487996101379),
 ('nurse_practitioner', 0.6218312978744507),
 ('obstetrician', 0.6072013974189758),
 ('ob_gyn', 0.5986713171005249),
 ('midwife', 0.5927063226699829),
 ('dermatologist', 0.5739566683769226)]

[('physician', 0.6463665962219238),
 ('doctors', 0.5858404040336609),
 ('surgeon', 0.5723941326141357),
 ('dentist', 0.552364706993103),
 ('cardiologist', 0.5413816571235657),
 ('neurologist', 0.5271127820014954),
 ('neurosurgeon', 0.5249835848808289),
 ('urologist', 0.5247740149497986),
 ('Doctor', 0.5240625143051147),
 ('internist', 0.5183223485946655)]
