I. Installation
pip3 install lxml
pip3 install beautifulsoup4
II. Import and instantiation
Import
from bs4 import BeautifulSoup
Instantiation
- From a local file
fp=open('./test.html','r',encoding='utf-8')
html=BeautifulSoup(fp,'lxml')
- From a network response (page_text is the text returned by a requests call)
html=BeautifulSoup(page_text,'lxml')
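Both ways of building the object can be sketched in one self-contained snippet. The markup string, the variable names, and the use of io.StringIO as a stand-in for a real file are assumptions made for illustration. The built-in 'html.parser' is used so the sketch runs without lxml; 'lxml' is passed the same way.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page or a local file.
markup = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

# From a string, for example the .text of a requests response.
soup_from_text = BeautifulSoup(markup, "html.parser")

# From a file-like object; io.StringIO stands in for open('./test.html', ...).
# The document is parsed inside the constructor, so the handle can be closed afterwards.
with io.StringIO(markup) as fp:
    soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_text.title.string)  # Demo
print(soup_from_file.p.string)      # Hi
```

Using a with block for real files avoids leaving the handle open after parsing.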
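III. Beautiful Soup usage by example
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p name="dromouse">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a></a>,
<a>Lacie</a> and
<a>Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Parse the string so the attribute-style access below works
html=BeautifulSoup(html,'lxml')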
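1. Selecting nodes
1.1 Get a node
result=html.head
# Output
<head><title>The Dormouse's story</title></head>
1.2 When several nodes match, only the first one is returned
result=html.p
# Output
<p name="dromouse">The Dormouse's story</p>
1.3 Nested selection
result=html.head.title
# Output
<title>The Dormouse's story</title>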
2. Extracting information
2.1 Get the node name
result=html.title.name
# Output
title
2.2 Get a node attribute
result=html.p['name']
# Output
dromouse
2.3 Get the node content
result=html.title.string
# Output
The Dormouse's story
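The selection and extraction steps above can be combined into one short sketch. The sample markup here is a cut-down, hypothetical version of the document (the class="title" value is assumed), and 'html.parser' is used so that only bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample; attribute values are for illustration only.
doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title" name="dromouse">The Dormouse's story</p>
<p class="story">...</p></body></html>"""
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p                  # attribute access returns the first <p> only
print(soup.head.title.name)       # title
print(first_p.attrs)              # {'class': ['title'], 'name': 'dromouse'}
print(first_p["name"])            # dromouse
print(first_p["class"])           # ['title'] (class is multi-valued, so a list)
print(first_p.get("id"))          # None; .get() avoids a KeyError for a missing attribute
print(soup.title.string)          # The Dormouse's story
```

Indexing with square brackets raises KeyError when the attribute is absent, so .get() is the safer choice when scraping pages whose markup varies.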
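Example
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a>
<span>Elsie</span>
</a>
HELLO
<a>Lacie</a>
and
<a>Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
# Parse the string so the attribute-style access below works
html=BeautifulSoup(html,'lxml')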
3. Child nodes
3.1 Direct children
- contents: returns a list
result=html.p.contents
# Output
['\n Once upon a time there were three little sisters; and their names were\n ', <a>
<span>Elsie</span>
</a>, '\n HELLO\n ', <a>Lacie</a>, '\n and\n ', <a>Tillie</a>, '\n and they lived at the bottom of a well.\n ']
- children: returns an iterator
result=html.p.children
for i,child in enumerate(result):
    print(i,child)
# Output
0
Once upon a time there were three little sisters; and their names were
1 <a>
<span>Elsie</span>
</a>
2
HELLO
3 <a>Lacie</a>
4
and
5 <a>Tillie</a>
6
and they lived at the bottom of a well.
3.2 All descendants
- descendants: returns an iterator
result=html.p.descendants
for i,child in enumerate(result):
    print(i,child)
# Output
0
Once upon a time there were three little sisters; and their names were
1 <a>
<span>Elsie</span>
</a>
2
3 <span>Elsie</span>
4 Elsie
5
6
HELLO
7 <a>Lacie</a>
8 Lacie
9
and
10 <a>Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.
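The difference between contents, children and descendants is easiest to see by counting nodes. The following is a minimal sketch on a hypothetical one-line paragraph, parsed with 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical compact sample: two links, the first wrapping a <span>.
doc = "<p>intro <a><span>Elsie</span></a> and <a>Lacie</a></p>"
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

direct = list(p.children)          # same nodes as p.contents, but produced lazily
everything = list(p.descendants)   # depth-first walk: children, grandchildren, ...

print(len(p.contents), len(direct), len(everything))  # 4 4 7

# Text nodes are included in both; filter on Tag to keep only elements.
direct_tags = [node.name for node in direct if isinstance(node, Tag)]
all_tags = [node.name for node in everything if isinstance(node, Tag)]
print(direct_tags)  # ['a', 'a']
print(all_tags)     # ['a', 'span', 'a']
```

The text nodes are what produce the extra numbered entries in the outputs above, which is why filtering on Tag is a common first step.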
4. Sibling nodes
4.1 Previous sibling
result=html.a.previous_sibling
# Output
Once upon a time there were three little sisters; and their names were
4.2 Next sibling
result=html.a.next_sibling
# Output
HELLO
4.3 All preceding siblings
result=list(enumerate(html.a.previous_siblings))
# Output
[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
4.4 All following siblings
result=list(enumerate(html.a.next_siblings))
# Output
[(0, '\n HELLO\n '), (1, <a>Lacie</a>), (2, '\n and\n '), (3, <a>Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
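Because text between tags counts as a sibling, next_sibling often returns a string rather than the next element. A small sketch of the difference follows; the id values are hypothetical and exist only to tell the links apart, and 'html.parser' is used so the snippet needs only bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line sample; the ids are added for illustration only.
doc = "<p>Once <a id='a1'>Elsie</a> HELLO <a id='a2'>Lacie</a> and <a id='a3'>Tillie</a></p>"
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

print(repr(first.previous_sibling))  # 'Once '
print(repr(first.next_sibling))      # ' HELLO ' (a text node, not the next <a>)

# find_next_sibling / find_next_siblings skip text nodes and match tags only.
nxt = first.find_next_sibling("a")
later = [a["id"] for a in first.find_next_siblings("a")]
print(nxt["id"])  # a2
print(later)      # ['a2', 'a3']
```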
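Example
html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a>Elsie</a>
</p>
<p class="story">...</p>
'''
# Parse the string so the attribute-style access below works
html=BeautifulSoup(html,'lxml')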
5. Parent nodes
5.1 Direct parent
result=html.a.parent
# Output
<p class="story">
Once upon a time there were three little sisters; and their names were
<a>Elsie</a>
</p>
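5.2 All ancestors
result=list(enumerate(html.a.parents))
# Output
[(0, <p class="story">
Once upon a time there were three little sisters; and their names were
<a>Elsie</a>
</p>), (1, <body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a>Elsie</a>
</p>
<p class="story">...</p>
</body>), (2, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a>Elsie</a>
</p>
<p class="story">...</p>
</body></html>), (3, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a>Elsie</a>
</p>
<p class="story">...</p>
</body></html>)]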
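Printing whole ancestors is verbose; listing only their names makes the chain easier to read and shows why the last two entries above look identical: the final ancestor is the BeautifulSoup object itself, which prints like the html tag. A minimal sketch on a hypothetical sample, using 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical compact sample with an explicit <html>/<body> wrapper.
doc = "<html><body><p class='story'>text <a>Elsie</a></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")
link = soup.a

print(link.parent.name)  # p

# parents walks upward until it reaches the BeautifulSoup object,
# whose name is the special string '[document]'.
names = [ancestor.name for ancestor in link.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```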
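Example
html='''
<p class="story">
Once upon a time there were three little sisters; and their names were
<a>Bob</a><a>Lacie</a>
</p>
'''
# Parse the string so the attribute-style access below works
html=BeautifulSoup(html,'lxml')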
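6. Extracting information
6.1 Access attributes such as string and attrs
result=html.a.next_sibling.string
# Output
Lacie
6.2 Convert a generator that yields several nodes into a list, pick an element, then read its attributes
print(list(html.a.parents)[0].attrs['class'])
# Output
['story']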
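Example
html='''
<h4>Hello</h4>
<ul class="list" id="list-1">
<li>Foo</li>
<li>Bar</li>
<li>Jay</li>
</ul>
<ul class="list" id="list-2">
<li>Foo</li>
<li>Bar</li>
</ul>
'''
# Parse the string so the method calls below work
html=BeautifulSoup(html,'lxml')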
7. Method selectors
7.1 find
- Returns only the first matching element
result=html.find(name='ul')
# Output
<ul class="list" id="list-1">
<li>Foo</li>
<li>Bar</li>
<li>Jay</li>
</ul>
- Related methods are used the same way, as is find_all below:
find_parent / find_parents
find_next_sibling / find_next_siblings
find_previous_sibling / find_previous_siblings
find_next / find_all_next
find_previous / find_all_previous
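A short sketch of how a few of these directional variants behave. The markup is a hypothetical list structure (the div wrapper, class and id values are assumptions), parsed with 'html.parser' so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; wrapper, class and id values are for illustration only.
doc = """<div class="panel"><ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul class="list" id="list-2"><li>Foo</li><li>Bar</li></ul></div>"""
soup = BeautifulSoup(doc, "html.parser")

first_ul = soup.find(name="ul")      # first match only
bar = first_ul.find_all("li")[1]     # the second <li> of the first list

print(bar.find_parent("ul")["id"])             # list-1
print(bar.find_next_sibling("li").string)      # Jay
print(bar.find_previous_sibling("li").string)  # Foo
print(first_ul.find_next("li").string)         # Foo (next element in document order)
print(soup.find("table"))                      # None when nothing matches
```

Since find returns None on a miss, chained calls such as soup.find('table').find('tr') need a None check first.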
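7.2 find_all: find every match
- Passing name='ul' queries all ul nodes and returns them as a list
result=html.find_all(name='ul')
# Output
[<ul class="list" id="list-1">
<li>Foo</li>
<li>Bar</li>
<li>Jay</li>
</ul>, <ul class="list" id="list-2">
<li>Foo</li>
<li>Bar</li>
</ul>]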
- Each result is a Tag, so queries can be nested
result=html.find_all(name='ul')
for ul in result:
    print(ul.find_all(name='li'))
# Output
[<li>Foo</li>, <li>Bar</li>, <li>Jay</li>]
[<li>Foo</li>, <li>Bar</li>]
- Iterate over every li node and get its text
result=html.find_all(name='ul')
for ul in result:
    items=ul.find_all(name='li')
    for li in items:
        print(li.string)
# Output
Foo
Bar
Jay
Foo
Bar
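The nested loop above can also be written as a comprehension that groups the item text by list. This is a sketch on a hypothetical two-list sample (the id values are assumed), using 'html.parser'; limit is a standard find_all argument that stops the search early.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the id values are for illustration only.
doc = """<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul id="list-2"><li>Foo</li><li>Bar</li></ul>"""
soup = BeautifulSoup(doc, "html.parser")

# Map each list id to the text of its items.
items = {
    ul["id"]: [li.string for li in ul.find_all("li")]
    for ul in soup.find_all("ul")
}
print(items)  # {'list-1': ['Foo', 'Bar', 'Jay'], 'list-2': ['Foo', 'Bar']}

# limit caps the number of results returned.
first_three = [li.get_text() for li in soup.find_all("li", limit=3)]
print(first_three)  # ['Foo', 'Bar', 'Jay']
```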
7.3 Query nodes by attribute
result=html.find_all(id='list-1')
# Output
[<ul class="list" id="list-1">
<li>Foo</li>
<li>Bar</li>
<li>Jay</li>
</ul>]
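- Special case: class is a reserved word in Python, so the keyword argument takes a trailing underscore: class_
result=html.find_all(class_='list')
# Output
[<ul class="list" id="list-1">
<li>Foo</li>
<li>Bar</li>
<li>Jay</li>
</ul>, <ul class="list" id="list-2">
<li>Foo</li>
<li>Bar</li>
</ul>]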
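Attribute filters can be passed as keyword arguments or through the attrs dictionary, which also avoids the class_ spelling. A minimal sketch follows; the list-small class and the list-2 id are hypothetical additions that show matching against a multi-valued class, and 'html.parser' is used so only bs4 is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; "list-small" and "list-2" are assumed for illustration.
doc = (
    '<ul class="list" id="list-1"><li>Foo</li></ul>'
    '<ul class="list list-small" id="list-2"><li>Bar</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

by_id = soup.find_all(id="list-1")
by_class = soup.find_all(class_="list")            # matches any one of the classes
by_attrs = soup.find_all(attrs={"class": "list"})  # same filter, dictionary form
small = soup.find_all("ul", class_="list-small")   # tag name combined with class

print(len(by_id), len(by_class), len(by_attrs))  # 1 2 2
print(small[0]["id"])                            # list-2
```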
7.4 text: match text
- Accepts either a regular expression or a string
import re
result=html.find_all(text=re.compile('oo'))
# Output
['Foo', 'Foo']
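In Beautiful Soup 4.4.0 and later the same filter is documented under the name string, with text kept as the older spelling. The sketch below assumes such a version and uses a hypothetical flat list parsed with 'html.parser'; it also shows that combining a tag name with string returns tags rather than text nodes.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical flat list for illustration.
doc = "<ul><li>Foo</li><li>Bar</li><li>Jay</li><li>Foo</li></ul>"
soup = BeautifulSoup(doc, "html.parser")

by_regex = soup.find_all(string=re.compile("oo"))  # text nodes containing "oo"
exact = soup.find_all(string="Bar")                # a plain string must match exactly
tags = soup.find_all("li", string=re.compile("^B"))  # <li> tags whose text matches

print(by_regex)  # ['Foo', 'Foo']
print(exact)     # ['Bar']
print(tags)      # [<li>Bar</li>]
```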
8. CSS selectors
8.1 Nested selection
result=html.select('ul')
for ul in result:
    print(ul.select('li'))
8.2 Get attributes
- Attributes can still be read the same way as before
result=html.select('ul')
for ul in result:
    print(ul['id'])
8.3 Get text
- The string attribute still works as before
result=html.select('li')
for li in result:
    print(li.string)
- Or use get_text()
result=html.select('li')
for li in result:
    print(li.get_text())
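The three CSS selector steps can be condensed into one sketch. The markup, including the panel and element class names and the list-2 id, is hypothetical, and 'html.parser' is used so the snippet needs only bs4 (select relies on the soupsieve package that is installed alongside recent bs4 releases).

```python
from bs4 import BeautifulSoup

# Hypothetical sample; class and id values are for illustration only.
doc = """<div class="panel"><ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list" id="list-2"><li class="element">Jay</li></ul></div>"""
soup = BeautifulSoup(doc, "html.parser")

ids = [ul["id"] for ul in soup.select("ul")]                     # attribute access
texts = [li.get_text() for li in soup.select("#list-1 .element")]  # descendant selector
first = soup.select_one("ul.list li")                            # first match or None

print(ids)           # ['list-1', 'list-2']
print(texts)         # ['Foo', 'Bar']
print(first.string)  # Foo
```

select always returns a list, while select_one returns a single Tag or None, which mirrors the find_all and find pair.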