Python: thefuzz and difflib explained

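Preface: My recent work has involved text-matching computations, and earlier projects used fuzzy matching too, but I had never looked closely at how it works. This time I went through how the fuzzy-match scores are actually computed. Edit-distance computation is not covered here.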

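thefuzz:

A fuzzy string matching package for Python; a Java implementation also exists. In the version examined here, it is essentially built on difflib.
Repository: https://github.com/seatgeek/thefuzz
Install: pip install thefuzz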

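difflib: a standard-library module for computing the differences between two sequences such as strings. Its main class is SequenceMatcher.

SequenceMatcher: its ratio method computes a similarity score, and get_opcodes returns the edit operations, which come in four kinds (insert, delete, replace, equal).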

1. difflib

How ratio is computed:

Where T is the total number of elements in both sequences, and
M is the number of matches, this is 2.0*M / T.

Explanation: T is the sum of the lengths of the two sequences, and M is the number of matched elements.
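To make the formula concrete, here is a minimal sketch that recomputes the score by hand. It assumes M can be read off as the total size of the blocks returned by get_matching_blocks; the helper name ratio_by_hand is made up for illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a: str, b: str) -> float:
    """Recompute the similarity score as 2.0 * M / T (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the last block is a size-0 sentinel)
    m = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of elements in both sequences
    t = len(a) + len(b)
    # Two empty sequences are treated as identical
    return 2.0 * m / t if t else 1.0


print(ratio_by_hand("abc", "a"))                 # 0.5
print(SequenceMatcher(None, "abc", "a").ratio())  # 0.5
```

Both calls should agree, since ratio() is defined by the same 2.0*M / T expression.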

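"abc" and "a": only a matches, so the result is (2*1)/(3+1) = 0.5

"aabc" and "a": again only one a matches. The first string has two a's, but the single a in the second string is used up once it has been matched, so the result is (2*1)/(4+1) = 0.4

"aabc" and "aa": the first and the second a both match, two elements in total, so the result is (2*2)/(4+2) = 2/3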
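The three worked examples can be checked directly against SequenceMatcher. This is just a verification snippet; the variable names are arbitrary.

```python
from difflib import SequenceMatcher

pairs = [("abc", "a"), ("aabc", "a"), ("aabc", "aa")]

# Map each pair to its ratio() score
scores = {pair: SequenceMatcher(None, *pair).ratio() for pair in pairs}

for (a, b), score in scores.items():
    print(f"{a!r} vs {b!r}: {score:.4f}")
# 'abc' vs 'a': 0.5000
# 'aabc' vs 'a': 0.4000
# 'aabc' vs 'aa': 0.6667
```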

How get_opcodes works:

It describes how to turn one sequence into the other using four kinds of operations: insert, delete, replace, and equal.
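A short example of get_opcodes, using a pair of strings chosen so that all four operation kinds appear. Each opcode is a tuple (tag, i1, i2, j1, j2), meaning a[i1:i2] corresponds to b[j1:j2].

```python
from difflib import SequenceMatcher

a, b = "qabxcd", "abycdf"
opcodes = SequenceMatcher(None, a, b).get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    # Show which slice of a maps to which slice of b for each operation
    print(f"{tag:7} a[{i1}:{i2}]={a[i1:i2]!r} -> b[{j1}:{j2}]={b[j1:j2]!r}")
# delete  a[0:1]='q' -> b[0:0]=''
# equal   a[1:3]='ab' -> b[0:2]='ab'
# replace a[3:4]='x' -> b[2:3]='y'
# equal   a[4:6]='cd' -> b[3:5]='cd'
# insert  a[6:6]='' -> b[5:6]='f'
```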

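2. thefuzz

Four ways of computing a fuzzy match:

simple matching (Ratio), partial matching (Partial Ratio), order-insensitive matching (Token Sort Ratio), and deduplicated subset matching (Token Set Ratio).

Ratio: calls difflib's SequenceMatcher directly.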
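As a rough illustration of Ratio and Partial Ratio, the sketch below uses only difflib. It is not the library's code: ratio_sketch and partial_ratio_sketch are hypothetical names, the empty-input handling is a simplifying assumption, and the partial version is modeled on the common approach of aligning the shorter string against same-length windows of the longer one.

```python
from difflib import SequenceMatcher


def ratio_sketch(s1: str, s2: str) -> int:
    """SequenceMatcher score scaled to an integer in 0..100."""
    if not s1 or not s2:  # assumption: empty input scores 0
        return 0
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best score of the shorter string against windows of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0
    best = 0.0
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    for block in blocks:
        # Align the window so that this matching block lines up in both strings
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(ratio_sketch("abc", "xxabcxx"))          # 60
print(partial_ratio_sketch("abc", "xxabcxx"))  # 100
```

The plain score is penalized by the extra characters around "abc", while the partial score is not, because the comparison is restricted to a window of the same length as the shorter string.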

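token_sort_ratio, token_set_ratio:

Both apply some light preprocessing and then call Ratio. token_sort_ratio splits each string on whitespace, sorts the tokens, and joins them back together before comparing.
token_set_ratio splits both strings on whitespace to get two token sets a and b, then:
- sorts and joins a&b (the intersection) to get sorted_sect
- appends the sorted a-b (a difference) to sorted_sect to get combined_1to2
- appends the sorted b-a (the other difference) to sorted_sect to get combined_2to1
- takes the maximum of ratio(sorted_sect, combined_1to2), ratio(sorted_sect, combined_2to1), and ratio(combined_1to2, combined_2to1).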
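The steps above can be sketched with difflib alone. This is a simplified illustration, not the library's implementation: the function names are made up, and preprocessing is reduced to lowercasing and whitespace splitting, which is an assumption.

```python
from difflib import SequenceMatcher


def score(s1: str, s2: str) -> int:
    """SequenceMatcher score scaled to 0..100."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_sort_sketch(s1: str, s2: str) -> int:
    """Sort the tokens of each string, rejoin them, then compare."""
    sorted1 = " ".join(sorted(s1.lower().split()))
    sorted2 = " ".join(sorted(s2.lower().split()))
    return score(sorted1, sorted2)


def token_set_strings(s1: str, s2: str):
    """Build sorted_sect, combined_1to2 and combined_2to1."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    sorted_sect = " ".join(sorted(a & b))
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(a - b))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(b - a))).strip()
    return sorted_sect, combined_1to2, combined_2to1


def token_set_sketch(s1: str, s2: str) -> int:
    """Maximum of the three pairwise scores."""
    sect, c12, c21 = token_set_strings(s1, s2)
    return max(score(sect, c12), score(sect, c21), score(c12, c21))


s1, s2 = "new york mets", "mets new york fans"
print(token_set_strings(s1, s2))
# ('mets new york', 'mets new york', 'mets new york fans')
print(token_set_sketch(s1, s2))  # 100
```

Because every token of s1 also appears in s2, sorted_sect equals combined_1to2 and the set-based score reaches 100, whereas the sort-based score stays below 100 because of the extra token "fans".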

Source code: /opt/anaconda3/lib/python3.8/site-packages/thefuzz.py

@utils.check_for_none
def _token_set(s1, s2, partial=True, force_ascii=True, full_process=True):
    """Find all alphanumeric tokens in each string...
        - treat them as a set
        - construct two strings of the form:
            <sorted_intersection><sorted_remainder>
        - take ratios of those two strings
        - controls for unordered partial matches"""

    if not full_process and s1 == s2:
        return 100

    p1 = utils.full_process(s1, force_ascii=force_ascii) if full_process else s1
    p2 = utils.full_process(s2, force_ascii=force_ascii) if full_process else s2

    if not utils.validate_string(p1):
        return 0
    if not utils.validate_string(p2):
        return 0

    # pull tokens
    tokens1 = set(p1.split())
    tokens2 = set(p2.split())

    intersection = tokens1.intersection(tokens2)
    diff1to2 = tokens1.difference(tokens2)
    diff2to1 = tokens2.difference(tokens1)

    sorted_sect = " ".join(sorted(intersection))
    sorted_1to2 = " ".join(sorted(diff1to2))
    sorted_2to1 = " ".join(sorted(diff2to1))

    combined_1to2 = sorted_sect + " " + sorted_1to2
    combined_2to1 = sorted_sect + " " + sorted_2to1

    # strip
    sorted_sect = sorted_sect.strip()
    combined_1to2 = combined_1to2.strip()
    combined_2to1 = combined_2to1.strip()

    if partial:
        ratio_func = partial_ratio
    else:
        ratio_func = ratio

    pairwise = [
        ratio_func(sorted_sect, combined_1to2),
        ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
    return max(pairwise)

Other functions are omitted here, for example: from thefuzz import process

process.extractOne

process.extract
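To show the general shape of these two helpers (rank a list of choices against a query and return (choice, score) pairs), here is a difflib-only sketch. The names extract_sketch and extract_one_sketch are hypothetical, and the scorer is a plain lowercase SequenceMatcher ratio, which is an assumption; the real functions use their own default scorer and preprocessing.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Case-insensitive SequenceMatcher score scaled to 0..100."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def extract_one_sketch(query, choices):
    """Return the single best (choice, score) pair, or None if no choices."""
    best = extract_sketch(query, choices, limit=1)
    return best[0] if best else None


choices = ["New York Mets", "New York Yankees", "Boston Red Sox"]
print(extract_one_sketch("new york mets", choices))  # ('New York Mets', 100)
print(extract_sketch("new york", choices, limit=2))
```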

~~Note: when using fuzz.token_set_ratio(t1, t2) on Chinese text, t1 and t2 need a space between every character.~~



Original article: http://outofmemory.cn/langs/715881.html
