Python: regular expression does not work as expected


Overview: I am using the following regular expression, which is supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what is wrong?

# INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print re.findall(r'([A-Z]\.)+', text)
# OUTPUT
['A.']

Expected output:

['U.S.A.']

I am following the NLTK Book, chapter 3.7 here, which has a set of regular expressions, but they just don't work. I have tried this in both Python 2.7 and 3.4.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

nltk.regexp_tokenize() should work the same way as re.findall(), but my Python somehow fails to recognise the regex as expected. The regex above actually outputs:

[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]
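The root cause, stated briefly: when a pattern contains capturing groups, re.findall returns what the groups captured rather than the full match, and a repeated group such as ([A-Z]\.)+ keeps only its last repetition, here 'A.'. This minimal pair (not from the original post) shows the difference:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With a capturing group, re.findall returns what the GROUP captured,
# and a repeated group keeps only its final repetition ('A.').
print(re.findall(r'([A-Z]\.)+', text))    # ['A.']

# With a non-capturing group (?:...), findall returns the full match again.
print(re.findall(r'(?:[A-Z]\.)+', text))  # ['U.S.A.']
```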
Solution

Possibly, this has to do with NLTK previously compiling regexes with nltk.internals.compile_regexp_to_noncapturing(), which was removed in v3.1 (see here).
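What that helper effectively did was rewrite plain capturing groups as non-capturing groups before compiling, so findall-based tokenization returned full matches. A rough, hypothetical sketch of the idea (to_noncapturing is an illustrative name, not NLTK's API, and the rewrite below ignores some corner cases such as doubled backslashes):

```python
import re

def to_noncapturing(pattern):
    # Rewrite plain capturing groups "(" as non-capturing "(?:".
    # Skips escaped parens "\(" and groups that already start with "(?".
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

p = to_noncapturing(r'([A-Z]\.)+')
print(p)  # (?:[A-Z]\.)+
print(re.findall(p, 'That U.S.A. poster-print costs $12.40...'))  # ['U.S.A.']
```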

>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...     | \$?\d+(\.\d+)?%?      # numbers, incl. currency and percentages
...     | \w+([-']\w+)*         # words w/ optional internal hyphens/apostrophes
...     | [+/\-@&*]             # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

But it does not work in NLTK v3.1:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...     | \w+([-']\w+)*         # words w/ optional internal hyphens/apostrophes
...     | [+/\-@&*]             # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', ''), ('', '')]

With a slight modification to how you define your regex groups, you can get the same pattern working in NLTK v3.1, using this regex:

pattern = r"""(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    |\d+(?:\.\d+)?%?        # numbers, incl. currency and percentages
    |\w+(?:[-']\w+)*        # words w/ optional internal hyphens/apostrophes
    |(?:[+/\-@&*])          # special characters with meanings
"""
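The same non-capturing rewrite also fixes the original U.S.A. example outside NLTK, with plain re. This is an illustrative sketch (not part of the original answer), with the currency and ellipsis alternatives from the NLTK Book pattern added back in:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Non-capturing groups throughout, so findall returns full matches.
pattern = r"""(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    |\$?\d+(?:\.\d+)?%?     # numbers, incl. currency and percentages
    |\w+(?:[-']\w+)*        # words w/ optional internal hyphens/apostrophes
    |\.\.\.                 # ellipsis
    |(?:[+/\-@&*])          # special characters with meanings
"""
print(re.findall(pattern, text))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```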

In code:

>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                  # set flag to allow verbose regexps
... (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)*      # words w/ optional internal hyphens/apostrophes
... |(?:[+/\-@&*])        # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']

Without NLTK, using Python's re module, we can see that the old regex pattern is not supported natively:

>>> pattern1 = r"""(?x)         # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...     |\$?\d+(\.\d+)?%?       # numbers, incl. currency and percentages
...     |\w+([-']\w+)*          # words w/ optional internal hyphens/apostrophes
...     |[+/\-@&*]              # special characters with meanings
...     |\S\w*                  # any sequence of word characters
... """
>>> text = "My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x)         # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...     |\d+(?:\.\d+)?%?        # numbers, incl. currency and percentages
...     |\w+(?:[-']\w+)*        # words w/ optional internal hyphens/apostrophes
...     |(?:[+/\-@&*])          # special characters with meanings
... """
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
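Not part of the original answer, but worth noting: re.finditer sidesteps findall's group semantics entirely, because match.group(0) is always the full match, whether or not the pattern contains capturing groups. A minimal sketch:

```python
import re

# finditer yields match objects; group(0) is always the whole match,
# so the capturing groups in the old pattern do no harm here.
pattern1 = r"([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]"
text = "My weight is about 68 kg, +/- 10 grams."
print([m.group(0) for m in re.finditer(pattern1, text)])
# ['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
```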

Note: the change in how NLTK's RegexpTokenizer compiles its regexes also makes the examples on NLTK's Regular Expression Tokenizer page obsolete.

Original source: http://outofmemory.cn/langs/1193517.html