How to compute skipgrams in Python?


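In the paper linked by the OP, the following string:

Insurgents killed in ongoing fighting

yields:

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}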
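To make the definition behind these two sets concrete, here is a minimal brute-force sketch. It assumes that "k-skip" means at most k tokens are skipped in total between the first and last word of the n-gram, and that n is at least 1. The name `skipgrams_by_definition` is made up for illustration and is not part of NLTK.

```python
from itertools import combinations


def skipgrams_by_definition(tokens, n, k):
    """Brute force: keep every n-token subsequence that skips at most k tokens."""
    tokens = list(tokens)
    result = []
    for idx in combinations(range(len(tokens)), n):
        # Number of tokens skipped between the first and last chosen positions.
        skipped = (idx[-1] - idx[0]) - (n - 1)
        if skipped <= k:
            result.append(tuple(tokens[i] for i in idx))
    return result


text = "Insurgents killed in ongoing fighting".split()
print(len(skipgrams_by_definition(text, 2, 2)))  # 9 bigrams, as in the paper
print(len(skipgrams_by_definition(text, 3, 2)))  # 10 trigrams, as in the paper
```

This enumerates every index combination, so it is only useful as a reference for checking a faster implementation on short inputs.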

Slightly modifying NLTK's ngrams code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):

from itertools import chain, combinations


def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    # Optionally pad either end with n-1 copies of pad_symbol.
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    return sequence


def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
    sequence_length = len(sequence)
    sequence = iter(sequence)
    sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

    if sequence_length + pad_left + pad_right < k:
        raise Exception("The length of sentence + padding(s) < skip")

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

    history = []
    nk = n+k

    # Return point for recursion.
    if nk < 1:
        return
    # If n+k is longer than the sequence, reduce k by 1 and recur.
    elif nk > sequence_length:
        for ng in skipgrams(list(sequence), n, k-1):
            yield ng
        # The iterator is exhausted now. Stop here so that next() below
        # cannot raise StopIteration (a RuntimeError in Python 3.7+).
        return

    while nk > 1:  # Collect the first n+k-1 items into the history.
        history.append(next(sequence))
        nk -= 1

    # Iteratively drop the first item in the history and pick up the next,
    # yielding the skipgrams for each iteration.
    for item in sequence:
        history.append(item)
        current_token = history.pop(0)
        # Iterate through the rest of the history and
        # pick out all combinations of (n-1)-grams.
        for idx in list(combinations(range(len(history)), n-1)):
            ng = [current_token]
            for _id in idx:
                ng.append(history[_id])
            yield tuple(ng)

    # Recursively yield the skipgrams for the rest of the sequence,
    # where len(sequence) < n+k.
    for ng in list(skipgrams(history, n, k-1)):
        yield ng

Let's do some doctests to match the example in the paper:

>>> text = "Insurgents killed in ongoing fighting".split()
>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
>>> two_skip_bigrams
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
>>> two_skip_trigrams
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

Note, however, that if n+k > len(sequence), the call yields the same result as skipgrams(sequence, n, k-1) (this is not a bug, it's a fail-safe feature), e.g.

>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
>>> four_skip_fourgrams = list(skipgrams(text, n=4, k=4))
>>> four_skip_fivegrams = list(skipgrams(text, n=5, k=4))
>>>
>>> print(len(three_skip_trigrams), three_skip_trigrams)
10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
>>> print(len(three_skip_fourgrams), three_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fourgrams), four_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fivegrams), four_skip_fivegrams)
1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]
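The fallback can be read as a clamp on k: the skip is reduced until n+k fits the sequence. The sketch below only predicts which k ends up being used for an unpadded sequence. `effective_skip` is a hypothetical helper, not part of the skipgrams code, and clamping at 0 for sequences shorter than n is an assumption made here for tidiness.

```python
def effective_skip(length, n, k):
    """Skip value that still matters for an unpadded sequence of `length` tokens."""
    # Reducing k until n + k <= length is the same as capping k at length - n.
    return max(min(k, length - n), 0)


# The five-token example sentence from the doctests:
print(effective_skip(5, 3, 3))  # 2, so k=3 trigrams equal k=2 trigrams
print(effective_skip(5, 4, 3))  # 1
print(effective_skip(5, 4, 4))  # 1, so k=3 and k=4 four-grams coincide
print(effective_skip(5, 5, 4))  # 0, a single plain five-gram
```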

The function allows n == k but does not allow n < k, as these lines show:

if n < k:
    raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

For the sake of understanding, let's unpack the "mystical" line:

for idx in list(combinations(range(len(history)), n-1)):
    pass  # Do something

Given a list of unique items, combinations produces this:

>>> from itertools import combinations
>>> x = [0,1,2,3,4,5]
>>> list(combinations(x,2))
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

And since the indices of a list of tokens are always unique, e.g.

>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0)  # i.e. 'this'
>>> list(range(len(sent)))
[0, 1, 2, 3]

it's possible to compute the possible combinations (without replacement) of the range:

>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

If we map the indices back to the list of tokens:

>>> [tuple(sent[_id] for _id in idx) for idx in combinations(range(len(sent)), 2)]
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]

Then we prepend current_token to get the skipgrams for the current token and its context + skip window:

>>> [tuple([current_token]) + tuple(sent[_id] for _id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]

After that, we move on to the next word and repeat.
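The walk-through above (take the current token, combine it with every choice of n-1 tokens from the window that follows, then advance) can also be sketched without recursion. This is an illustrative variant under the assumption that the window is the n+k-1 tokens after the current one. `windowed_skipgrams` is a made-up name and not the function from the answer or from NLTK, and it does no padding or argument checking.

```python
from itertools import combinations


def windowed_skipgrams(tokens, n, k):
    """Yield k-skip-n-grams by sliding over the tokens one position at a time."""
    tokens = list(tokens)
    for i, current_token in enumerate(tokens):
        # Context + skip window: up to n + k - 1 tokens after the current one.
        window = tokens[i + 1 : i + n + k]
        # Choose the remaining n - 1 tokens from the window, keeping their order.
        for rest in combinations(window, n - 1):
            yield (current_token,) + rest


text = "Insurgents killed in ongoing fighting".split()
for gram in windowed_skipgrams(text, n=2, k=2):
    print(gram)
```

Near the end of the sentence the slice simply gets shorter, which plays the role of the recursive tail handling in the longer function.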



Source: 内存溢出

Original address: http://outofmemory.cn/zaji/5630683.html
