# input
import re
text = 'That U.S.A. poster-print costs $12.40...'
print re.findall(r'([A-Z]\.)+', text)
# output
['A.']
Expected output:
['U.S.A.']
I am following the NLTK Book, chapter 3.7 here, which has a set of regular expressions, but it just doesn't work. I have tried it in both Python 2.7 and 3.4.
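For reference, the surprising output can be reproduced with plain re alone, and comes from how re.findall treats capturing groups (a minimal sketch using a trimmed version of the sentence above):

```python
import re

text = 'That U.S.A. poster-print...'

# findall returns what the *group* captured, and a repeated group keeps
# only its last repetition -- so only the final 'A.' survives:
print(re.findall(r'([A-Z]\.)+', text))    # ['A.']

# with a non-capturing group, findall returns the whole match instead:
print(re.findall(r'(?:[A-Z]\.)+', text))  # ['U.S.A.']
```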
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*              # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():-_`]        # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
nltk.regexp_tokenize() works the same way as re.findall(), and I think my Python somehow fails to recognize the regex as expected. The regex listed above outputs:
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]

Solution

Possibly, this is related to regexes previously being compiled with nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see here).
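What that helper effectively did, rewriting capturing groups as non-capturing before compiling, can be sketched with plain re. The helper name and the naive substitution below are my own illustration, not a faithful copy of the NLTK internals:

```python
import re

def to_noncapturing(pattern):
    """Naive sketch: turn each capturing '(' into a non-capturing '(?:'.

    Ignores escaped parens and parens inside character classes --
    for illustration only, not what nltk.internals actually did.
    """
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

old = r"([A-Z]\.)+"
print(to_noncapturing(old))  # (?:[A-Z]\.)+

# with the rewritten pattern, findall yields full matches again:
print(re.findall(to_noncapturing(old), 'That U.S.A. poster-print...'))  # ['U.S.A.']
```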
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \$?\d+(\.\d+)?%?        # numbers, incl. currency and percentages
...   | \w+([-']\w+)*           # words w/ optional internal hyphens/apostrophe
...   | [+/\-@&*]               # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it does not work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \w+([-']\w+)*           # words w/ optional internal hyphens/apostrophe
...   | [+/\-@&*]               # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '')]
With a slight modification to how you define your regex groups, you can get the same pattern to work in NLTK v3.1, using this regex:
pattern = r"""(?x)              # set flag to allow verbose regexps
    (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
    |\d+(?:\.\d+)?%?            # numbers, incl. currency and percentages
    |\w+(?:[-']\w+)*            # words w/ optional internal hyphens/apostrophe
    |(?:[+/\-@&*])              # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x)                        # set flag to allow verbose regexps
... (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%?            # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)*            # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-@&*])              # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser = RegexpTokenizer(pattern)
>>> line = "My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Without NLTK, using Python's re module, we can see that the old regex pattern is not supported natively:
>>> pattern1 = r"""(?x)         # set flag to allow verbose regexps
...     ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
...   | \$?\d+(\.\d+)?%?        # numbers, incl. currency and percentages
...   | \w+([-']\w+)*           # words w/ optional internal hyphens/apostrophe
...   | [+/\-@&*]               # special characters with meanings
...   | \S\w*                   # any sequence of word characters
... """
>>> text = "My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x)         # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...   | \d+(?:\.\d+)?%?         # numbers, incl. currency and percentages
...   | \w+(?:[-']\w+)*         # words w/ optional internal hyphens/apostrophe
...   | (?:[+/\-@&*])           # special characters with meanings
... """
>>> text = "My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
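As a cross-check, re.finditer is unaffected by capturing groups, since match.group(0) is always the full match, so even the old capturing pattern tokenizes correctly this way. This is a workaround sketch with plain re, not what NLTK itself does:

```python
import re

# the old pattern, with its original capturing groups left in place
pattern1 = r"""(?x)             # verbose mode
    ([A-Z]\.)+                  # abbreviations, e.g. U.S.A.
  | \$?\d+(\.\d+)?%?            # numbers, incl. currency and percentages
  | \w+([-']\w+)*               # words w/ optional internal hyphens/apostrophe
  | [+/\-@&*]                   # special characters with meanings
"""
text = "My weight is about 68 kg, +/- 10 grams."

# m.group(0) is the full match regardless of capturing groups
tokens = [m.group(0) for m in re.finditer(pattern1, text)]
print(tokens)
# ['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
```

Note that the result matches the tokenizer output from the NLTK v3.0.5 session above, without rewriting any groups.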
Note: the change in how NLTK's RegexpTokenizer compiles its regexes makes the example at NLTK's Regular Expression Tokenizer obsolete.