Extracting text from an XML document in Python


You can simply strip out all the tags with a regular expression:

>>> import re
>>> txt = """<bookstore>
...     <book category="COOKING">
...         <title lang="english">Everyday Italian</title>
...         <author>Giada De Laurentiis</author>
...         <year>2005</year>
...         <price>300.00</price>
...     </book>
...
...     <book category="CHILDREN">
...         <title lang="english">Harry Potter</title>
...         <author>J K. Rowling </author>
...         <year>2005</year>
...         <price>625.00</price>
...     </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n    \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        625.00'
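The regex above treats the document as plain text, so it can misfire on comments, CDATA sections, or attribute values containing `>`, and it leaves entities such as `&amp;` undecoded. As an alternative, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, assuming the input is well-formed XML. The names `SAMPLE` and `extract_text` are made up for this illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened version of the bookstore sample used above.
SAMPLE = """<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>
    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>"""


def extract_text(xml_string):
    """Return the non-empty text nodes of a well-formed XML string, stripped."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element in document order and yields its text,
    # including whitespace-only chunks between tags, which we filter out.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


pieces = extract_text(SAMPLE)
print(pieces)
```

Unlike the regex version, this raises `xml.etree.ElementTree.ParseError` on malformed input instead of silently returning partial text, which may or may not be what you want.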

However, if you only want to search a file for some text on Linux, you can use grep:

burhan@sandbox:~$ grep "Harry Potter" file.xml
        <title lang="english">Harry Potter</title>

To search a file, either use the grep command above, or open the file and search it in Python:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     # Read the whole file, then strip the tags
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
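The same file search can be done with a parser instead of a regex. The following is only a sketch, assuming the file is well-formed XML. The helper name `xml_contains` is hypothetical, and the demo writes a small throwaway file so that the example runs on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def xml_contains(path, needle):
    """Return True if `needle` occurs inside any single text node of the file."""
    root = ET.parse(path).getroot()
    return any(needle in chunk for chunk in root.itertext())


# Demo: write a tiny document to a temporary file, search it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write("<bookstore><book><title>Harry Potter</title></book></bookstore>")
    demo_path = tmp.name

found = xml_contains(demo_path, "Harry Potter")
missing = xml_contains(demo_path, "Moby Dick")
os.remove(demo_path)

print("It exists" if found else "It does not")
```

Note that this checks each text node separately, so a phrase that is split across two elements will not match. The regex approach above joins all the text first and does not have that limitation.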


Original URL: http://outofmemory.cn/zaji/5675130.html
