How do I parse an HTML file containing Cyrillic symbols?



(As observed, this is a bit vague between system encodings; it did work on Linux, but evidently does not work properly on Windows XP.)

I got it to work by decoding the source string:

tree = html.fromstring(source.decode('utf-8'))

# -*- coding: cp1251 -*-
from lxml import html

filename = "t.html"
fread = open(filename, 'r')
source = fread.read()
tree = html.fromstring(source.decode('utf-8'))
fread.close()

tags = tree.xpath('//span[@class="one" and text()="Text"]')  # This is OK
print "name: ", tags[0].text
print "value: ", tags[0].tail

tags = tree.xpath('//span[@class="two" and text()="Привет"]')  # This is now OK too
print "name: ", tags[0].text
print "value: ", tags[0].tail

This means the actual tree consists entirely of unicode objects. If you only make the xpath argument unicode, without also decoding the source, it finds 0 matches.
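The byte/text mismatch behind those 0 matches can be seen without lxml at all; a minimal sketch in Python 3 syntax (where str is the unicode type):

```python
# "Привет" encoded as cp1251 bytes, as it comes off the disk.
raw = b'\xcf\xf0\xe8\xe2\xe5\xf2'

# An undecoded byte string never compares equal to the text it encodes,
# which is why matching decoded text against undecoded bytes finds nothing.
print(raw == 'Привет')                   # False
print(raw.decode('cp1251') == 'Привет')  # True
```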

BeautifulSoup

In any case, I prefer BeautifulSoup for this sort of thing. Here is my interactive session; I saved the file in cp1251.

>>> from BeautifulSoup import BeautifulSoup
>>> filename = '/tmp/cyrillic'
>>> fread = open(filename, 'r')
>>> source = fread.read()
>>> source  # Scary
'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n'
>>> source = source.decode('cp1251')  # Let's try getting this right.
>>> source
u'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n'
>>> soup = BeautifulSoup(source)
>>> soup  # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning.
<html><body><span class="one">Text</span>some text<span class="two">Привет</span>Текст на русском</body></html>
>>> soup.find('span', 'one').findNextSibling(text=True)
u'some text'
>>> soup.find('span', 'two').findNextSibling(text=True)  # This looks a bit daunting ...
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c'
>>> print _  # ... but it's not, really. Just Unicode chars.
Текст на русском
>>> # Then you may also wish to get things by text:
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True)
Текст на русском
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation.
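For completeness, the "text following a span" lookup above can also be done with only the standard library. A rough Python 3 sketch using html.parser (TailCollector is a made-up helper name, not part of any library):

```python
from html.parser import HTMLParser

class TailCollector(HTMLParser):
    """Collects the text that immediately follows each closing </span>."""

    def __init__(self):
        super().__init__()
        self.after_span = False
        self.tails = []

    def handle_endtag(self, tag):
        if tag == 'span':
            self.after_span = True

    def handle_data(self, data):
        if self.after_span:
            self.tails.append(data)
            self.after_span = False

p = TailCollector()
p.feed('<span class="two">Привет</span>Текст на русском')
print(p.tails)  # ['Текст на русском']
```

This is far cruder than BeautifulSoup's find/findNextSibling, but it shows that once the input has been decoded to text, the Cyrillic content needs no special handling.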

Finally, it may be worth trying
source.decode('cp1251')
instead of
source.decode('utf-8')
when reading the file from the filesystem. lxml may then actually work.
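The difference between the two codecs is easy to check directly. A small Python 3 sketch:

```python
raw = b'\xd2\xe5\xea\xf1\xf2'  # "Текст" encoded in cp1251

# cp1251 decodes the bytes cleanly ...
print(raw.decode('cp1251'))  # Текст

# ... while utf-8 rejects the very same bytes outright.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc.reason)
```

So if decode('utf-8') raises (or silently yields mojibake via a lenient codec), the file was probably saved in a legacy Windows encoding such as cp1251 all along.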


