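(As observed, this is a bit fuzzy between system encodings: it evidently does not work properly on Windows XP, although it does on Linux.)
I got it working by decoding the source string,
tree = html.fromstring(source.decode('utf-8')):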
# -*- coding:cp1251 -*-
import lxml
from lxml import html

filename = "t.html"
fread = open(filename, 'r')
source = fread.read()
tree = html.fromstring(source.decode('utf-8'))
fread.close()

tags = tree.xpath('//span[@ and text()="Text"]')  # This OK
print "name: ", tags[0].text
print "value: ", tags[0].tail

tags = tree.xpath('//span[@ and text()="Привет"]')  # This is now OK too
print "name: ", tags[0].text
print "value: ", tags[0].tail
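This means that the actual tree is all unicode objects. If you only make the xpath parameter a unicode, it finds 0 matches.

Beautiful Soup

I prefer using BeautifulSoup for this kind of thing anyway. Here is my interactive session; I saved the file in cp1251.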
>>> from BeautifulSoup import BeautifulSoup
>>> filename = '/tmp/cyrillic'
>>> fread = open(filename, 'r')
>>> source = fread.read()
>>> source  # Scary
'<html>\n<body>\n<span >Text</span>some text</br>\n<span >\xcf\xf0\xe8\xe2\xe5\xf2</span>\xd2\xe5\xea\xf1\xf2 \xed\xe0 \xf0\xf3\xf1\xf1\xea\xee\xec</br>\n</body>\n</html>\n'
>>> source = source.decode('cp1251')  # Let's try getting this right.
>>> source
u'<html>\n<body>\n<span >Text</span>some text</br>\n<span >\u041f\u0440\u0438\u0432\u0435\u0442</span>\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c</br>\n</body>\n</html>\n'
>>> soup = BeautifulSoup(source)
>>> soup  # OK, that's looking right now. Note the </br> was dropped as that's bad HTML with no meaning.
<html><body><span >Text</span>some text<span >Привет</span>Текст на русском</body></html>
>>> soup.find('span', 'one').findNextSibling(text=True)
u'some text'
>>> soup.find('span', 'two').findNextSibling(text=True)  # This looks a bit daunting ...
u'\u0422\u0435\u043a\u0441\u0442 \u043d\u0430 \u0440\u0443\u0441\u0441\u043a\u043e\u043c'
>>> print _  # ... but it's not, really. Just Unicode chars.
Текст на русском
>>> # Then you may also wish to get things by text:
>>> print soup.find(text=u'Привет').findParent().findNextSibling(text=True)
Текст на русском
>>> # You can't get things by attributes and the contained NavigableString at the same time, though. That may be a limitation.
Finally, when the content comes from the filesystem, it may be worth trying source.decode('cp1251') instead of source.decode('utf-8'). lxml may then actually work.
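If you are not sure which of the two encodings a file was saved in, one option is to try them in order and keep the first that decodes cleanly. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above. The helper name decode_with_fallback and the order of candidate encodings are my own assumptions, not anything lxml or BeautifulSoup provides.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1251")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the byte string without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # Not valid in this encoding, so try the next candidate.
            continue
    raise ValueError("none of the candidate encodings matched")


# The cp1251 bytes for the word from the example file.
raw = b"\xcf\xf0\xe8\xe2\xe5\xf2"
text, used = decode_with_fallback(raw)
print(text, used)
```

UTF-8 is tried first because it rejects most byte sequences that are not valid UTF-8, whereas cp1251 accepts almost any byte string, so putting cp1251 first would hide real UTF-8 input. The decoded text can then be passed to html.fromstring or BeautifulSoup as in the examples above.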