How do I parse an HTML file containing Cyrillic symbols?



(As observed, this is a bit vague between system encodings; it did work on Linux, but evidently does not work properly on Windows XP.)

I got it to work by decoding the source string:

tree = html.fromstring(source.decode('utf-8'))

# -*- coding: cp1251 -*-
from lxml import html

filename = "t.html"
fread = open(filename, 'r')
source = fread.read()
tree = html.fromstring(source.decode('utf-8'))
fread.close()

tags = tree.xpath('//span[@class="one" and text()="Text"]')  # This is OK
print "name: ", tags[0].text
print "value: ", tags[0].tail

tags = tree.xpath('//span[@class="two" and text()="Привет"]')  # This is now OK too
print "name: ", tags[0].text
print "value: ", tags[0].tail

This means the actual tree consists entirely of unicode objects. If you only make the xpath argument unicode, without also decoding the source, it finds 0 matches.
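The byte/text mismatch behind those 0 matches can be seen without lxml at all; a minimal sketch in Python 3 syntax (where str is the unicode type):

```python
# "Привет" encoded as cp1251 bytes, as it comes off the disk.
raw = b'\xcf\xf0\xe8\xe2\xe5\xf2'

# An undecoded byte string never compares equal to the text it encodes,
# which is why matching decoded text against undecoded bytes finds nothing.
print(raw == 'Привет')                   # False
print(raw.decode('cp1251') == 'Привет')  # True
```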

BeautifulSoup

In any case, I prefer BeautifulSoup for this sort of thing. Here is my interactive session; I saved the file in cp1251.

>>> from BeautifulSoup import BeautifulSoup
>>> filename = '/tmp/cyrillic'
>>> fread = open(filename, 'r')
>>> source = fread.read()
>>> source  # Scary
'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n'
>>> source = source.decode('cp1251')  # Let's try getting this right.
>>> source
u'<html>\n<body>\n<span class="one">Text</span>some text</br>\n<span class="two">\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n'
>>> soup = BeautifulSoup(source)
>>> soup  # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning.
<html><body><span class="one">Text</span>some text<span class="two">Привет</span>Текст на русском</body></html>
>>> soup.find('span', 'one').findNextSibling(text=True)
u'some text'
>>> soup.find('span', 'two').findNextSibling(text=True)  # This looks a bit daunting ...
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c'
>>> print _  # ... but it's not, really. Just Unicode chars.
Текст на русском
>>> # Then you may also wish to get things by text:
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True)
Текст на русском
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation.
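For completeness, the "text following a span" lookup above can also be done with only the standard library. A rough Python 3 sketch using html.parser (TailCollector is a made-up helper name, not part of any library):

```python
from html.parser import HTMLParser

class TailCollector(HTMLParser):
    """Collects the text that immediately follows each closing </span>."""

    def __init__(self):
        super().__init__()
        self.after_span = False
        self.tails = []

    def handle_endtag(self, tag):
        if tag == 'span':
            self.after_span = True

    def handle_data(self, data):
        if self.after_span:
            self.tails.append(data)
            self.after_span = False

p = TailCollector()
p.feed('<span class="two">Привет</span>Текст на русском')
print(p.tails)  # ['Текст на русском']
```

This is far cruder than BeautifulSoup's find/findNextSibling, but it shows that once the input has been decoded to text, the Cyrillic content needs no special handling.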

Finally, it may be worth trying
source.decode('cp1251')
instead of
source.decode('utf-8')
when reading the file from the filesystem. lxml may then actually work.
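The difference between the two codecs is easy to check directly. A small Python 3 sketch:

```python
raw = b'\xd2\xe5\xea\xf1\xf2'  # "Текст" encoded in cp1251

# cp1251 decodes the bytes cleanly ...
print(raw.decode('cp1251'))  # Текст

# ... while utf-8 rejects the very same bytes outright.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed:', exc.reason)
```

So if decode('utf-8') raises (or silently yields mojibake via a lenient codec), the file was probably saved in a legacy Windows encoding such as cp1251 all along.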


