Python and Web Scraping 02: HTML-Related Content


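1. Regular expressions

1. regex
Example: a regular expression for a family of strings
Rule: a appears at least once; b is repeated exactly 5 times; c is repeated an even number of times (possibly zero); the string ends with d or e.

Expression: aa*bbbbb(cc)*(d|e)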
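
The rule above can be checked quickly with Python's re module. This is only a small sketch: the helper name matches_rule and the sample strings are made up for illustration, and re.fullmatch is used so that the whole string has to follow the rule.

```python
import re

# One or more a, exactly five b, an even number of c (possibly zero), then d or e.
pattern = re.compile(r'aa*bbbbb(cc)*(d|e)')

def matches_rule(text):
    """Return True if the entire string follows the rule."""
    return pattern.fullmatch(text) is not None

# Hypothetical sample strings, chosen to hit each part of the rule.
for candidate in ['aaaabbbbbccccd', 'abbbbbe', 'aabbbbbcccd', 'bbbbbccd']:
    print(candidate, matches_rule(candidate))
```

The third sample fails because it has three c's (an odd count), and the fourth fails because it has no leading a.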
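2. Common regular expression symbols

Symbol    Meaning
*         Matches the preceding item zero or more times
+         Matches the preceding item one or more times
[]        Matches any one of the characters listed inside the brackets
()        Groups a subexpression
{m,n}     Matches the preceding item between m and n times
[^]       Matches any one character that is not listed inside the brackets
|         Matches either of the alternatives separated by the vertical bar
.         Matches any single character (by default, anything except a newline)
^         Anchors the match at the start of the string
\         Escape character
$         Anchors the match at the end of the string
?!        Negative lookahead, written (?!...): the text that follows must not match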
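
A few of these symbols can be tried out directly in Python. The patterns and input strings below are made-up illustrations, not taken from the book.

```python
import re

# * allows zero repetitions, + requires at least one.
star_ok = re.fullmatch(r'ab*', 'a') is not None
plus_ok = re.fullmatch(r'ab+', 'a') is not None

# [] picks characters from a set, [^] picks characters outside the set.
in_set = re.findall(r'[abc]', 'a1b2c3')
not_in_set = re.findall(r'[^abc]', 'a1b2c3')

# {m,n} bounds the repetition count: four a's is too many for {2,3}.
bounded_ok = re.fullmatch(r'a{2,3}', 'aaaa') is not None

# (?!...) is a negative lookahead: keep words that do not start with "test".
not_test = re.findall(r'\b(?!test)\w+', 'test1 data test2 info')

print(star_ok, plus_ok, in_set, not_in_set, bounded_ok, not_test)
```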

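3. Other notes
Regular expression syntax is not universal! Python and Java share most of the basics, but they differ in some details, so check the documentation of the language you are using.
Example: [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net) is a regular expression for email addresses.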
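
To see what this email pattern accepts and rejects, here is a small sketch. The helper is_email and the addresses are hypothetical. Note that the pattern is deliberately simple: it only allows one domain label and four top-level domains.

```python
import re

email_pattern = re.compile(r'[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)')

def is_email(text):
    """Return True if the whole string matches the simple email pattern."""
    return email_pattern.fullmatch(text) is not None

# Hypothetical addresses for illustration only.
samples = [
    'first.last+tag@example.com',  # accepted
    'user@example.io',             # rejected: .io is not in the allowed list
    'user@sub.example.com',        # rejected: only one domain label is allowed
    'user-name@example.org',       # rejected: "-" is not in the character class
]
for address in samples:
    print(address, is_email(address))
```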

2. Using BeautifulSoup

1. Example code

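from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Download the page and parse it with Python's built-in HTML parser
html = urlopen('https://pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# Use a raw string so the backslashes reach the regex engine unchanged;
# "/" does not need escaping in a Python regular expression
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])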

Output:

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg

Explanation: the product images are found directly by their file paths.

Both the re module and re.compile matter a lot here!
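
The pattern can also be tried on its own, without downloading the page. In this sketch the list of paths is partly hypothetical (the last entry is invented), and re.search stands in for the attribute matching that BeautifulSoup performs, which as far as I understand also looks for the pattern anywhere in the attribute value.

```python
import re

src_pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# The first three paths appear in the outputs in this post; the last one is made up.
paths = [
    '../img/gifts/logo.jpg',
    '../img/gifts/img1.jpg',
    '../img/gifts/img6.jpg',
    '../img/other/img1.jpg',
]

# Keep only the paths in which the pattern is found.
matched = [path for path in paths if src_pattern.search(path)]
print(matched)
```

The logo is filtered out because its file name does not start with img, which is why it is missing from the output above.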

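3. Getting attributes

To get all attributes of a tag, run print(bs.html.body.img.attrs). The result is: {'src': '../img/gifts/logo.jpg', 'style': 'float:left;'}
To get the value of one attribute, run print(bs.html.body.img.attrs['src']). The result is: ../img/gifts/logo.jpg
The general pattern is myTag.attrs and myTag.attrs['shuxing'], where 'shuxing' stands for the attribute name: the first returns a dictionary object, and the second extracts one value from it.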
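
The same calls can be tried offline on a tiny HTML string. The fragment below is a made-up stand-in for the real page, built only from the attribute values printed above. The last line uses Tag.get, which returns None instead of raising KeyError when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the logo image on the example page.
html = '<html><body><img src="../img/gifts/logo.jpg" style="float:left;"></body></html>'
bs = BeautifulSoup(html, 'html.parser')

img = bs.html.body.img
all_attrs = img.attrs          # dictionary of every attribute on the tag
src_value = img.attrs['src']   # one value looked up in that dictionary
alt_value = img.get('alt')     # missing attribute: None rather than an error

print(all_attrs)
print(src_value)
print(alt_value)
```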


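4. Lambda expressions

A lambda expression is essentially a function, and it can be passed as a variable into another function.

Example code:

print(bs.find_all(lambda tag: len(tag.attrs) == 2))
## Gets every tag that has exactly two attributes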

Result (the HTML markup of the matched tags and the leading part of each price were lost in this copy of the output; only the text survives):

[,
Vegetable Basket
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!
.00
,
Russian Nesting Dolls
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!
,000.52
,
Fish Painting
If something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!
,005.00
,
Dead Parrot
This is an ex-parrot! Or maybe he's only resting?
.50
Mystery Box
, Keep your friends guessing!
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
.50
]

Code: print(bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?'))
Result: [Or maybe he's only resting?] (a list holding the matching tag; its markup was lost in this copy)
Code: print(bs.find_all('', text='Or maybe he\'s only resting?'))
Result: ["Or maybe he's only resting?"]
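The book says these two calls have the same effect, but the results are actually different: the lambda version returns the matching tag object, while the text version returns the matching string itself!
The book's explanation: BeautifulSoup allows certain types of functions to be passed as arguments to find_all. The only restriction is that these functions must take a tag object as an argument and return a boolean.

In other words, every tag is fed to the function: tags for which it returns True are kept, and the rest are dropped!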
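
The difference between the two calls can be reproduced offline. The HTML below is a made-up fragment (the ids and class names are assumptions, not copied from the real page), and string= is used as the keyword, which newer BeautifulSoup versions prefer over text=.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up fragment loosely modelled on one row of the example page.
html = '''<table>
<tr id="gift4" class="gift"><td>Dead Parrot</td>
<td>This is an ex-parrot! <span class="note">Or maybe he's only resting?</span></td></tr>
</table>'''
bs = BeautifulSoup(html, 'html.parser')
target = "Or maybe he's only resting?"

# Function filter: called once per tag, keeps the tags where it returns True.
tags = bs.find_all(lambda tag: tag.get_text() == target)
# String filter: returns the matching text nodes, not the tags around them.
strings = bs.find_all(string=target)
# Another function filter: tags with exactly two attributes.
two_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)

print(tags)
print(strings)
print([tag.name for tag in two_attrs])
```

Here the first call yields the span tag, the second yields the bare string, and only the tr row has exactly two attributes.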

PS: That's all for today. I've been trying to lose weight lately, wish me luck!!!! And wish me a smooth graduation!!!!

Original article: http://outofmemory.cn/langs/568673.html
