The following command extracts the content of the title tag:
grep -io '<title>.*</title>' test.html | sed 's/<[^>]*>//g'
I don't really understand the syntax of the description part, so I don't know how to extract it.
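For what it's worth, here is a minimal sketch of pulling out both values with Python's standard-library html.parser. It assumes "description" means the `<meta name="description" content="...">` tag, which the question does not confirm. The class name HeadInfoParser, the helper extract_head_info and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collect the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.description = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <TITLE> matches too.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Assumption: the description lives in <meta name="description">.
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_head_info(html):
    """Return (title, description); description is None if the tag is absent."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description


# Hypothetical sample page, not the original test.html.
sample = """<html><head>
<TITLE>Example page</TITLE>
<meta name="description" content="A short summary of the page.">
</head><body><p>Hello</p></body></html>"""

title, description = extract_head_info(sample)
print(title)
print(description)
```

Unlike the grep/sed one-liner, this does not depend on the title sitting on a single line, and it leaves the case of the title text untouched.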
I only tested this briefly; OP can give it a try.
# Python 3; requires the third-party packages requests and lxml.
import requests
from lxml import etree

# Download the page and parse it into an element tree.
r = requests.get('http://best.pconline.com.cn/')
html = r.text
xmlhtml = etree.HTML(html)

# Each XPath targets the blocks whose id starts with "topic".
content = xmlhtml.xpath('//div[starts-with(@id,"topic")]/div[1]/a[2]/text()')
urllist = xmlhtml.xpath('//div[starts-with(@id,"topic")]/div[1]/a[2]/@href')
lastime = xmlhtml.xpath('//div[starts-with(@id,"topic")]/div[2]/div[2]/span[2]/text()')

data_text = list(content)
data_url = list(urllist)
data_time = [t.strip() for t in lastime]

# Print one "title, url, time" line per entry.
for i in range(len(data_text)):
    print("%s, %s, %s" % (data_text[i], data_url[i], data_time[i]))
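One thing to watch: the loop above indexes three separately queried lists by position, so it assumes every topic block yields exactly one title, one link and one time. If a list comes back shorter, the indexing raises an IndexError. A minimal sketch of a more forgiving pairing, using itertools.zip_longest on made-up sample values (the helper name pair_rows and the data are hypothetical, not from the site):

```python
from itertools import zip_longest


def pair_rows(titles, urls, times, missing=""):
    """Combine three parallel lists into rows, padding shorter lists with `missing`."""
    return [
        (title, url, stamp.strip())
        for title, url, stamp in zip_longest(titles, urls, times, fillvalue=missing)
    ]


# Hypothetical sample values standing in for the XPath results.
titles = ["Deal A", "Deal B"]
urls = ["http://example.com/a", "http://example.com/b"]
times = [" 10:30 "]  # one timestamp is missing

rows = pair_rows(titles, urls, times)
for title, url, stamp in rows:
    print("%s, %s, %s" % (title, url, stamp))
```

Padding only avoids the crash. If a field is missing from a block in the middle of the page, the later rows still shift by one, so querying each topic block on its own is the safer route when the markup is irregular.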