Question 1.1: Implement distinct_words
Approach:
corpus_words: sorted list of the distinct words in the corpus
1. Following the hint, use a list comprehension to collect every word in the corpus into the corpus_words list; a nested for loop also works, but it is slower than a list comprehension.
2. Use set() to remove duplicates, then wrap the result in list() to convert it back to a list.
3. Sort with sorted().
n_corpus_words: number of distinct words in the corpus
1. Take the length of corpus_words with len().
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1

    # ------------------
    # Write your implementation here.
    corpus_words = [word for sen in corpus for word in sen]
    corpus_words = sorted(list(set(corpus_words)))
    n_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, n_corpus_words
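A quick sanity check on a hypothetical toy corpus (illustrative only, not the assignment's Reuters corpus):

toy_corpus = [["all", "that", "glitters", "is", "not", "gold"],
              ["all", "is", "well", "that", "ends", "well"]]
words, n_words = distinct_words(toy_corpus)
print(words)    # ['all', 'ends', 'glitters', 'gold', 'is', 'not', 'that', 'well']
print(n_words)  # 8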
Question 1.2: Implement compute_co_occurrence_matrix
Approach:
word2ind: dictionary mapping each word to its index in matrix M
1. A dict comprehension over words produces it in one pass.
M: co-occurrence matrix
1. Use numpy.zeros() to create a zero matrix whose shape is determined by n_words.
2. A double for loop walks into each sentence of the corpus.
3. Based on the window size, add another for loop; note that range() includes its start but excludes its end.
4. Two if checks test whether the window runs past the sentence boundary.
5. Look up the two words' indices in M and add 1 to the corresponding count in M.
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}

    # ------------------
    # Write your implementation here.
    M = np.zeros((n_words, n_words))
    word2ind = {words[i]: i for i in range(len(words))}
    for sen in corpus:
        for word in range(len(sen)):              # `word` is the position of the center word in the sentence
            for i in range(1, window_size + 1):
                index_i = word2ind[sen[word]]
                if word - i >= 0:                 # left neighbour still inside the sentence
                    index_j = word2ind[sen[word - i]]
                    M[index_i][index_j] += 1
                if word + i <= len(sen) - 1:      # right neighbour still inside the sentence
                    index_j = word2ind[sen[word + i]]
                    M[index_i][index_j] += 1
    # ------------------

    return M, word2ind
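A quick check of the window logic on a hypothetical toy sentence (with window_size=1 each word co-occurs only with its immediate neighbours):

toy_corpus = [["all", "that", "glitters", "is", "not", "gold"]]
M_test, word2ind_test = compute_co_occurrence_matrix(toy_corpus, window_size=1)
print(M_test.shape)   # (6, 6)
print(word2ind_test)  # {'all': 0, 'glitters': 1, 'gold': 2, 'is': 3, 'not': 4, 'that': 5}
# "that" and "glitters" are adjacent in the sentence, so their symmetric counts are both 1
print(M_test[word2ind_test["that"], word2ind_test["glitters"]])  # 1.0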
Question 1.3: Implement reduce_to_k_dim
Approach:
M_reduced: the k-dimensional word-embedding matrix after dimensionality reduction
1. Following the hint, use sklearn.decomposition.TruncatedSVD; calling fit_transform(M) directly returns the reduced matrix U * S. (Taking components_.T instead gives the right singular vectors without the singular-value scaling, which does not match the U * S described in the docstring.)
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    from sklearn.decomposition import TruncatedSVD
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)  # fit_transform returns U * S directly
    # ------------------

    print("Done.")
    return M_reduced
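Continuing the toy example above (only the shape is deterministic; the values depend on the randomized SVD solver):

M_test_reduced = reduce_to_k_dim(M_test, k=2)
print(M_test_reduced.shape)  # (6, 2): one 2-dimensional embedding per distinct word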
Question 1.4: Implement plot_embeddings
Approach:
Adapt the example code to produce the plot.
1. From the test case and its expected output, the main work is determining each word's x and y coordinates.
2. Look up each word's row in M_reduced via word2ind; element [0] of that row is the x coordinate and element [1] is the y coordinate.
3. Modify the provided plotting code accordingly.
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    # Write your implementation here.
    for word in words:
        idx = word2ind[word]      # row of this word in M_reduced
        x = M_reduced[idx][0]
        y = M_reduced[idx][1]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()
    # ------------------
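A minimal usage sketch with made-up 2-D embeddings, just to exercise the function (the test words and coordinates here are hypothetical):

import numpy as np
M_fake = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2ind_fake = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
plot_embeddings(M_fake, word2ind_fake, ['test1', 'test2', 'test3', 'test4', 'test5'])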
Question 1.5: Co-Occurrence Plot Analysis [written]
grain and corn cluster together in the 2-D space.
grain and grains do not cluster together.
Part 2: Prediction-Based Word Vectors
Question 2.1: GloVe Plot Analysis [written]
Differences: judged against the unit circle centered at (0, 0), the words in this plot are spread much more evenly, the outline of the circle is clearly visible, and some synonyms cluster together more tightly; in the Part 1 plot produced from the co-occurrence matrix, all the words are bunched on the right side of the unit circle and synonyms do not cluster.
Question 2.2: Words With Multiple Meanings [code]
# ------------------
# Write your implementation here.
word = "code"
wv_from_bin.most_similar(word)
# ------------------
[written]
Reason: this is probably related to the corpus; a polysemous word's common sense appears much more frequently than its rarer senses.
Question 2.3: Synonyms & Antonyms [code]
# ------------------
# Write your implementation here.
w1 = "advantage"
w2 = "virtue"
w3 = "disadvantage"
dis_1 = wv_from_bin.distance(w1, w2)
dis_2 = wv_from_bin.distance(w1, w3)
print("Cosine Distance ({},{}): {}".format(w1, w2, dis_1))
print("Cosine Distance ({},{}): {}".format(w1, w3, dis_2))
# ------------------
[written]
Reason: possibly the contexts in which w1 and w3 occur match each other more closely than those of w1 and w2.
Question 2.4: Analogies with Word Vectors [written]
g + w - m
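A hedged sketch of the computation behind this answer, assuming (per the assignment prompt) that g, w, and m denote the GloVe vectors for grandfather, woman, and man, and that wv_from_bin is the gensim KeyedVectors object loaded earlier; the answer word x is the one whose vector has the highest cosine similarity with g + w - m:

import numpy as np

# Assumed reading of the answer above: g = grandfather, w = woman, m = man.
g = wv_from_bin['grandfather']
w = wv_from_bin['woman']
m = wv_from_bin['man']
target = g + w - m

def cos_sim(a, b):
    # cosine similarity between two 1-D vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare a candidate completion of the analogy against the target vector.
print(cos_sim(wv_from_bin['grandmother'], target))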
Question 2.5: Finding Analogies [code]
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))
# ------------------
Question 2.6: Incorrect Analogy [code]
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))
# ------------------
Question 2.7: Guided Analysis of Bias in Word Vectors [written]
Dolls appear frequently among the words associated with girls' toys.
Robots and manufacturing appear more among the words associated with boys' toys.
Question 2.8: Independent Analysis of Bias in Word Vectors [code]
# ------------------
# Write your implementation here.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'engineer'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'engineer'], negative=['woman']))
# ------------------
[('technician', 0.5853330492973328),
('engineers', 0.5717717409133911),
('educator', 0.5450620055198669),
('engineering', 0.48699596524238586),
('contractor', 0.4856792092323303),
('nurse', 0.48517873883247375),
('schoolteacher', 0.4825061857700348),
('teacher', 0.47406384348869324),
('mechanic', 0.4704253673553467),
('married', 0.4676802158355713)]
[('engineers', 0.5697532892227173),
('engineering', 0.5532492995262146),
('mechanic', 0.537360429763794),
('technician', 0.47810807824134827),
('officer', 0.4660565257072449),
('inventor', 0.46498754620552063),
('scientist', 0.46378421783447266),
('worked', 0.46068844199180603),
('colonel', 0.45147472620010376),
('commander', 0.4491448998451233)]
[written]
Taking woman, man, and engineer as the example, the results show that woman is associated mostly with education, nursing, and being married, while man is associated mostly with various technical occupations.
Question 2.9: Thinking About Bias [written]
1. The corpus is not broad enough and contains too much repeated content, but the bias mainly comes from the perceptions of the people who produced the text.
2. One way to measure it: count how often bias-related words appear in each document and rank the documents in descending order (see the sketch below).
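A rough sketch of the counting idea in point 2 (the bias_terms list and the corpus format are placeholder assumptions; this is only one simple way to operationalize the suggestion):

from collections import Counter

# Hypothetical list of bias-related terms; in practice this list would have to be curated carefully.
bias_terms = {"nurse", "housewife", "doll"}

def rank_documents_by_bias(corpus):
    """corpus: list of documents, each a list of word strings (same format as Part 1)."""
    scores = []
    for doc_id, doc in enumerate(corpus):
        counts = Counter(word for word in doc if word in bias_terms)
        scores.append((doc_id, sum(counts.values())))
    # documents containing the most bias-related words come first
    return sorted(scores, key=lambda item: item[1], reverse=True)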