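I am trying to retrieve all of the languages that Wikipedia supports and write them to a text file by traversing the tables on List_of_Wikipedias.
Here is my Python code so far, which only tries to retrieve one of the tables: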
import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET", "/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table

main()
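On my machine this only prints an empty list. To speed things up, I cached the page locally and used:
wikipage = open("wikipage.html")
root = etree.parse(wikipage)
But this made no difference (apart from the obvious speedup). I have also tried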
root.find('table')
and:
for element in root.iter():
    print("%s - %s" % (element.tag, element.text))
which successfully prints out all of the elements, so I know the tree is being built.
What am I doing wrong?
Any help would be appreciated.
Thanks.
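Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is the most frequently asked question about XPath, and there are many good answers under the xpath tag on SO. Just search for them.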
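To make the namespace issue concrete, here is a minimal sketch. It assumes lxml is installed and uses a tiny made-up XHTML fragment (not the real wiki page) to show why an unprefixed `//table` matches nothing while a prefixed query does.

```python
from lxml import etree

# Hypothetical stand-in for the XHTML page, not the real wiki markup.
XHTML = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <body><table><tr><td>1</td><td>English</td></tr></table></body>
</html>"""

root = etree.fromstring(XHTML)

# lxml reports element names in Clark notation, namespace URI included.
table_tag = root[0][0].tag
print(table_tag)

# An unprefixed name in XPath 1.0 means "no namespace", so nothing matches.
unqualified = root.xpath("//table")

# Binding a prefix to the namespace URI lets the same query match.
ns = {"x": "http://www.w3.org/1999/xhtml"}
qualified = root.xpath("//x:table", namespaces=ns)

print(len(unqualified), len(qualified))
```

This also explains why iterating over the tree works while `//table` does not: the elements exist, but their names are `{namespace}table` rather than plain `table`.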
Here is a complete solution:
Use:
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()
where you have registered the XHTML namespace ("http://www.w3.org/1999/xhtml") bound to the prefix "x".
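In lxml, one way to register the prefix is to pass a `namespaces` mapping to `xpath()`. The sketch below assumes a small hypothetical table shaped like the statistics page (a header row, then rows whose second cell holds the language name); the `SAMPLE` data and the `language_names` helper are illustrative names, not part of the original answer.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical miniature of the table: a header row, then data rows
# whose second cell holds the language name inside a link.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td><a href="#en">English</a></td></tr>
<tr><td>2</td><td><a href="#de">German</a></td></tr>
</table>
</body></html>"""


def language_names(xml_bytes):
    """Return the text nodes found in the second cell of each data row."""
    root = etree.fromstring(xml_bytes)
    nodes = root.xpath(
        "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()",
        namespaces={"x": XHTML_NS},  # bind the prefix "x" to XHTML
    )
    # lxml returns "smart" string objects; convert them to plain str.
    return [str(node) for node in nodes]


names = language_names(SAMPLE)
print(names)
```

The prefix itself is arbitrary: any name works as long as it maps to the same namespace URI that the document declares.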
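When I evaluated this XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html,
I first had to add the following at the start of the document, because I was working locally and did not have the XHTML DTD (you may not need this):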
<!DOCTYPE html [
<!ENTITY uarr "↑">
<!ENTITY darr "↓">
<!ENTITY ccedil "ç">
<!ENTITY oslash "ø">
<!ENTITY aacute "á">
<!ENTITY aring "å">
<!ENTITY agrave "à">
<!ENTITY egrave "è">
<!ENTITY ograve "ò">
<!ENTITY ocirc "ô">
]>
The result of applying the XPath expression above to this document is:
English German French Polish Italian Japanese Spanish Portuguese Dutch Russian Swedish Chinese Catalan Norwegian (Bokmål) Finnish Ukrainian Czech Hungarian Romanian Korean Turkish Vietnamese Indonesian Danish Arabic Esperanto Serbian Lithuanian Slovak Volapük Persian Hebrew Bulgarian Slovenian Malay Waray-Waray Croatian Estonian Newar / Nepal Bhasa Simple English Hindi Galician Thai Basque Norwegian (Nynorsk) Aromanian Greek Haitian Azerbaijani Tagalog Latin Telugu Georgian Macedonian Cebuano Serbo-Croatian Breton Piedmontese Marathi Latvian Luxembourgish Javanese Belarusian (Taraškievica) Welsh Icelandic Bosnian Albanian Tamil Belarusian Bishnupriya Manipuri Aragonese Occitan Bengali Swahili Ido Lombard West Frisian Gujarati Afrikaans Low Saxon Malayalam Quechua Sicilian Urdu Kurdish Cantonese Sundanese Asturian Neapolitan Samogitian Armenian Yoruba Irish Chuvash Walloon Nepali Ripuarian Western Panjabi Kannada Tajik Tarantino Venetian Yiddish Scottish Gaelic Tatar Min Nan Ossetian Uzbek Alemannic Kapampangan Sakha Egyptian Arabic Kazakh Maori Limburgian Amharic Nahuatl Upper Sorbian Gilaki Corsican Gan Mongolian Scots Interlingua Central_Bicolano Burmese Faroese Võro Dutch Low Saxon Sinhalese Turkmen West Flemish Sanskrit Bavarian Malagasy Manx Ilokano Divehi Norman Pangasinan Banyumasan Sorani Romansh Northern Sami Zazaki Mazandarani Wu Friulian Uyghur Ligurian Maltese Bihari Novial Tibetan Anglo-Saxon Kashubian Sardinian Classical Chinese Fiji Hindi Khmer Ladino Zamboanga Chavacano Pali Franco-Provençal/Arpitan Pashto Hakka Cornish Punjabi Navajo Silesian Kalmyk Pennsylvania German Hawaiian Saterland Frisian Interlingue Somali Komi Karachay-Balkar Crimean Tatar Tongan Acehnese Meadow Mari Picard Erzya Lingala Kinyarwanda Extremaduran Guarani Kirghiz Emilian-Romagnol Assyrian Neo-Aramaic Papiamentu Aymara Chechen Lojban Wolof Banjar Bashkir North Frisian Greenlandic Tok Pisin Udmurt Kabyle Tahitian Sranan Zealandic Hill Mari Komi-Permyak Lower Sorbian Abkhazian
Gagauz Igbo Oriya Lao Kongo Avar Moksha Mirandese Romani Old Church Slavonic Karakalpak Samoan Moldovan Tetum Gothic Kashmiri Bambara Inupiak Sindhi Bislama Lak Nauruan Norfolk Inuktitut Pontic Assamese Cherokee Min Dong Swati Palatinate German Hausa Ewe Tigrinya Oromo Zulu Zhuang Venda Tsonga Kirundi Dzongkha Sango Cree Chamorro Luganda Buginese Buryat (Russia) Fijian Chichewa Akan Sesotho Xhosa Fula Tswana Kikuyu Tumbuka Shona Twi Cheyenne Ndonga Sichuan Yi Choctaw Marshallese Afar Kuanyama Hiri Motu Muscogee Kanuri Herero
Do note: every other selected node is a whitespace-only text node. If you don't want these to be selected, use:
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
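As a rough illustration of the difference the `[normalize-space()]` predicate makes, the sketch below runs both expressions on a hypothetical sample whose cells contain whitespace around the link, then writes the cleaned names to a text file as the question intended. The sample markup, the `languages.txt` file name, and the temp-directory location are assumptions made for the example.

```python
import os
import tempfile

from lxml import etree

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample: whitespace around each link yields
# whitespace-only text nodes inside the second cell.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td>
  <a href="#en">English</a>
</td></tr>
<tr><td>2</td><td>
  <a href="#de">German</a>
</td></tr>
</table>
</body></html>"""

root = etree.fromstring(SAMPLE)
base = "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()"

# Every text node, including the whitespace-only ones.
all_text = root.xpath(base, namespaces=NS)

# Only text nodes whose normalized string value is non-empty.
non_blank = root.xpath(base + "[normalize-space()]", namespaces=NS)

languages = [str(text).strip() for text in non_blank]

# Assumed output location: a file in the system temp directory.
output_path = os.path.join(tempfile.gettempdir(), "languages.txt")
with open(output_path, "w", encoding="utf-8") as out:
    out.write("\n".join(languages) + "\n")

print(len(all_text), len(non_blank), languages)
```

Filtering inside the XPath expression keeps the Python side simple; stripping and skipping empty strings in Python after the query would be an equally reasonable choice.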