Install lxml: pip install lxml
——XPath usage-----------------------------------------------------------------------------------
Get text:
//tag1[@attr1="value1"]/tag2[@attr2="value2"]/.../text()
Get an attribute value:
//tag1[@attr1="value1"]/tag2[@attr2="value2"]/.../@attrN
eg:-------------------------------------------------------------------
from lxml import html

def parse():
    """Extract content from an HTML file using XPath."""
    # Read the file contents
    with open('./static/index.html', 'r', encoding='utf-8') as f:
        s = f.read()
    selector = html.fromstring(s)
    # Extract the text of the <a> tag
    a = selector.xpath('//div[@id="container"]/a/text()')
    print(a[0])
    # Extract the href attribute
    alink = selector.xpath('//div[@id="container"]/a/@href')
    print(alink[0])

if __name__ == '__main__':
    parse()
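Since the snippet above depends on ./static/index.html existing on disk, the same two queries can be tried on an inline HTML string instead. This is a minimal, self-contained sketch; the sample markup is an assumption standing in for the real file:

```python
from lxml import html

# A small inline document standing in for ./static/index.html
doc = '<div id="container"><a href="https://example.com">Example</a></div>'
selector = html.fromstring(doc)

# //div[@id="container"]/a/text() -> the text inside the <a> tag
text = selector.xpath('//div[@id="container"]/a/text()')
print(text[0])

# //div[@id="container"]/a/@href -> the value of the href attribute
href = selector.xpath('//div[@id="container"]/a/@href')
print(href[0])
```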
sel.xpath() still returns a SelectorList; see the original documentation (Scrapy's Selector):
xpath(query)
Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened. List elements implement Selector interface too.
query is a string containing the XPATH query to apply.
So in practice the next step is to look at the Selector-related methods.
————————————————
<p>
AA
<sub>1</sub>
<sub>2</sub>
<sub>3</sub>
</p>
<p>
BB
<sub>1</sub>
<sub>2</sub>
<sub>3</sub>
</p>
For the example above, one approach is to first select each p element and then run a second query on its content, which yields the values 1, 2, 3.
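Following that idea, here is a minimal sketch with lxml: select each `<p>` first, then run a relative XPath on each element (note the leading ./ so the second query stays inside that element):

```python
from lxml import html

doc = '''
<p>AA<sub>1</sub><sub>2</sub><sub>3</sub></p>
<p>BB<sub>1</sub><sub>2</sub><sub>3</sub></p>
'''
selector = html.fromstring(doc)

results = []
for p in selector.xpath('//p'):
    # ./text() -> direct text of this <p> (AA or BB)
    label = p.xpath('./text()')[0].strip()
    # ./sub/text() -> text of the <sub> children of this <p>
    subs = p.xpath('./sub/text()')
    results.append((label, subs))
    print(label, subs)
```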
import urllib.request
import re

def getHtml(url):
    # Download the page and decode it (the target site here uses GBK)
    page = urllib.request.urlopen(url)
    html = page.read()
    html = html.decode('GBK')
    return html

def getMeg(html):
    # The pattern is left as a placeholder in the original
    reg = re.compile(r'******')
    meglist = re.findall(reg, html)
    for meg in meglist:
        with open('out.txt', mode='a', encoding='utf-8') as file:
            file.write('%s\n' % meg)

if __name__ == "__main__":
    url = 'http://example.com'  # hypothetical target URL; the original leaves url undefined
    html = getHtml(url)
    getMeg(html)
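The pattern r'******' above is a placeholder in the original. Purely as a hypothetical illustration, a pattern that pulls href values out of HTML could look like this (the sample HTML and the pattern are assumptions, not from the source):

```python
import re

# Hypothetical input standing in for the downloaded page
sample_html = '<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>'

# Non-greedy capture of everything between href=" and the closing quote
reg = re.compile(r'href="(.*?)"')
meglist = re.findall(reg, sample_html)
print(meglist)
```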