These are my notes on the 蚂蚁学Python video course, kept for my own later review.
Video link: BV1bK411A7tV
The instructor's source code
These notes correspond to P5 of the video.
Table of contents - Python concurrency notes 3
- P5: Implementing a producer-consumer crawler in Python
- 1. The multi-component Pipeline architecture
- 2. Architecture of the producer-consumer crawler
- 3. queue.Queue for multi-threaded data communication
- 3.1. Importing the library
- 3.2. Creating a Queue
- 3.3. Adding elements
- 3.4. Getting elements
- 3.5. Checking queue status
- 4. Implementing the producer-consumer crawler
- 5. Analyzing the output log
1. The multi-component Pipeline architecture
Complex tasks are rarely completed in one shot; they are usually broken into many intermediate steps and finished step by step.
An architecture in which modules cooperate to process data is called a Pipeline, and the intermediate modules (processors) are called Processors.
The producer hands its results to the consumer as intermediate data, and the consumer consumes them.
The producer takes the input data as its raw material, and the consumer's output is the final output data.
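As a small illustration of the idea (my own sketch, not from the video), a pipeline can be modeled as a chain of processor functions where each stage's output feeds the next stage's input:
# Minimal Pipeline sketch: each Processor is a plain function.
def run_pipeline(data, processors):
    for process in processors:
        data = process(data)  # this stage's output is the next stage's input
    return data

# Hypothetical two-stage pipeline: split text into words, then count them.
stages = [str.split, len]
print(run_pipeline("break big jobs into small steps", stages))  # prints 6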
2. Architecture of the producer-consumer crawler
3. queue.Queue for multi-threaded data communication
queue.Queue provides thread-safe data communication between multiple threads.
3.1. Importing the library
import queue
3.2. Creating a Queue
q = queue.Queue()
3.3. Adding elements
q.put(item)
3.4. Getting elements
item = q.get()  # blocks until an element is available
3.5. Checking queue status
# how many elements are in the queue
q.qsize()
# whether the queue is empty
q.empty()
# whether the queue is full
q.full()
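To see these calls working together, here is a minimal sketch (my own addition, not from the video) of two threads communicating safely through one Queue:
# Two threads share a Queue: one puts numbers in, the other takes them out.
import queue
import threading

q = queue.Queue()

def producer():
    for i in range(5):
        q.put(i)                 # add an element

def consumer():
    for _ in range(5):
        item = q.get()           # blocks until an element is available
        print("got", item, "qsize =", q.qsize())

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()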
4. Implementing the producer-consumer crawler
First install the required packages (besides bs4, the code below also uses requests):
pip3 install requests beautifulsoup4
# blog_spider.py
import requests
from bs4 import BeautifulSoup

# Do not use the instructor's original list below. Everything after "#" in a
# URL is a fragment that the browser handles locally and never sends to the
# server, so every request would fetch only the first page of data.
# urls = [f"https://www.cnblogs.com/#p{page}" for page in range(1, 51)]
# Use these urls instead:
urls = [
    f"https://www.cnblogs.com/sitehome/p/{page}"
    for page in range(1, 50 + 1)
]

# Producer: its product is the page's HTML
def craw(url):
    r = requests.get(url)
    # return the HTML of this page
    return r.text

# Consumer: parses the HTML; href is the post link, get_text() is the title
def parse(html):
    # parse the HTML with Python's built-in parser
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]

if __name__ == '__main__':
    for result in parse(craw(urls[3])):
        print(result)
The following script goes in a separate file and imports blog_spider:
import queue
import blog_spider
import time
import random
import threading

# the ": queue.Queue" annotations declare the parameter types for IDE hints
def do_craw(url_queue: queue.Queue, html_queue: queue.Queue):
    while True:
        # take one url
        url = url_queue.get()
        # fetch the HTML for this url
        html = blog_spider.craw(url)
        # put the result into html_queue
        html_queue.put(html)
        # log progress
        print(threading.current_thread().name, f"craw{url}",
              "url_queue_size=", url_queue.qsize())
        # random sleep so requests are not fast enough to get the ip banned
        time.sleep(random.randint(1, 2))

# write the results into a file; the file object is passed in as fout
def do_parse(html_queue: queue.Queue, fout):
    while True:
        # get the data the producers put into html_queue
        html = html_queue.get()
        # parse the HTML
        results = blog_spider.parse(html)
        # write every tuple in the results list into the fout file
        for result in results:
            fout.write(str(result) + "\n")
        # log progress
        print(threading.current_thread().name, "results_size=", len(results),
              "html_queue_size=", html_queue.qsize())
        # random sleep to pace the consumer
        time.sleep(random.randint(1, 2))

if __name__ == '__main__':
    # create the Queue objects
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    # put every url from blog_spider.urls into url_queue
    for url in blog_spider.urls:
        url_queue.put(url)
    # 3 producer threads:
    for idx in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue),
                             name=f"craw{idx}")
        t.start()
    # create/open the file object that stores the results
    # remember to specify the encoding here as well
    fout = open("02_data.txt", "w", encoding='utf-8')
    # 2 consumer threads:
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
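One limitation of the script above: the workers loop forever, so the program never exits cleanly and fout is never closed. A common fix (my own sketch, not part of the video's code) is to make the workers daemon threads, have each worker call task_done() after finishing a get(), and wait with queue.join() in the main thread:
# Sketch of a cleaner shutdown. Assumes do_craw calls url_queue.task_done()
# and do_parse calls html_queue.task_done() after processing each item.
for idx in range(3):
    threading.Thread(target=do_craw, args=(url_queue, html_queue),
                     name=f"craw{idx}", daemon=True).start()
for idx in range(2):
    threading.Thread(target=do_parse, args=(html_queue, fout),
                     name=f"parse{idx}", daemon=True).start()
url_queue.join()     # block until every url has been crawled
html_queue.join()    # block until every page has been parsed
fout.close()         # now it is safe to close the output file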
5. Analyzing the output log
craw0, craw1, and craw2 are the 3 producer threads.
parse0 and parse1 are the 2 consumer threads.
url_queue holds the input data (the raw material); it steadily shrinks as the producers process it, and each produced piece of HTML is put into html_queue.
So url_queue_size keeps decreasing, while html_queue_size fluctuates, because the newly added HTML is continuously taken (get) by the consumers.
Later on, because there are more producer threads than consumer threads, the consumers cannot keep up and html_queue_size gradually grows. Once the producers drain url_queue (url_queue_size reaches 0), html_queue_size stops growing and only shrinks as the consumers work through it, eventually reaching 0.
craw1 crawhttps://www.cnblogs.com/#p2 url_queue_size= 47
craw2 crawhttps://www.cnblogs.com/#p3 url_queue_size= 47
craw0 crawhttps://www.cnblogs.com/#p1 url_queue_size= 47
parse0 results_size= 20 html_queue_size= 1
parse1 results_size= 20 html_queue_size= 1
parse0 results_size= 20 html_queue_size= 0
craw2 crawhttps://www.cnblogs.com/#p4 url_queue_size= 46
parse1 results_size= 20 html_queue_size= 0
craw0 crawhttps://www.cnblogs.com/#p6 url_queue_size= 44
craw1 crawhttps://www.cnblogs.com/#p5 url_queue_size= 44
parse0 results_size= 20 html_queue_size= 1
parse1 results_size= 20 html_queue_size= 0
craw2 crawhttps://www.cnblogs.com/#p7 url_queue_size= 43
parse0 results_size= 20 html_queue_size= 0
craw0 crawhttps://www.cnblogs.com/#p8 url_queue_size= 41
parse1 results_size= 20 html_queue_size= 0
craw1 crawhttps://www.cnblogs.com/#p9 url_queue_size= 40
craw2 crawhttps://www.cnblogs.com/#p10 url_queue_size= 40
parse0 results_size= 20 html_queue_size= 1
craw1 crawhttps://www.cnblogs.com/#p11 url_queue_size= 38
craw2 crawhttps://www.cnblogs.com/#p12 url_queue_size= 38
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
craw0 crawhttps://www.cnblogs.com/#p13 url_queue_size= 37
craw2 crawhttps://www.cnblogs.com/#p14 url_queue_size= 36
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
craw1 crawhttps://www.cnblogs.com/#p15 url_queue_size= 35
craw0 crawhttps://www.cnblogs.com/#p16 url_queue_size= 34
craw2 crawhttps://www.cnblogs.com/#p17 url_queue_size= 33
parse0 results_size= 20 html_queue_size= 3
parse1 results_size= 20 html_queue_size= 2
craw1 crawhttps://www.cnblogs.com/#p18 url_queue_size= 32
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
craw0 crawhttps://www.cnblogs.com/#p19 url_queue_size= 31
craw1 crawhttps://www.cnblogs.com/#p20 url_queue_size= 29
craw2 crawhttps://www.cnblogs.com/#p21 url_queue_size= 29
craw1 crawhttps://www.cnblogs.com/#p22 url_queue_size= 27
craw2 crawhttps://www.cnblogs.com/#p23 url_queue_size= 27
parse0 results_size= 20 html_queue_size= 5
parse1 results_size= 20 html_queue_size= 4
craw0 crawhttps://www.cnblogs.com/#p24 url_queue_size= 26
parse0 results_size= 20 html_queue_size= 4
parse1 results_size= 20 html_queue_size= 3
craw0 crawhttps://www.cnblogs.com/#p25 url_queue_size= 25
craw1 crawhttps://www.cnblogs.com/#p26 url_queue_size= 23
craw2 crawhttps://www.cnblogs.com/#p27 url_queue_size= 23
parse0 results_size= 20 html_queue_size= 5
craw0 crawhttps://www.cnblogs.com/#p28 url_queue_size= 22
parse1 results_size= 20 html_queue_size= 5
craw2 crawhttps://www.cnblogs.com/#p29 url_queue_size= 21
parse0 results_size= 20 html_queue_size= 5
craw1 crawhttps://www.cnblogs.com/#p30 url_queue_size= 20
craw2 crawhttps://www.cnblogs.com/#p31 url_queue_size= 19
craw0 crawhttps://www.cnblogs.com/#p32 url_queue_size= 18
parse1 results_size= 20 html_queue_size= 7
craw2 crawhttps://www.cnblogs.com/#p33 url_queue_size= 17
parse0 results_size= 20 html_queue_size= 6
parse1 results_size= 20 html_queue_size= 6
craw0 crawhttps://www.cnblogs.com/#p34 url_queue_size= 15
craw1 crawhttps://www.cnblogs.com/#p35 url_queue_size= 15
parse1 results_size= 20 html_queue_size= 7
craw2 crawhttps://www.cnblogs.com/#p36 url_queue_size= 14
parse0 results_size= 20 html_queue_size= 7
craw0 crawhttps://www.cnblogs.com/#p37 url_queue_size= 12
craw1 crawhttps://www.cnblogs.com/#p38 url_queue_size= 12
parse0 results_size= 20 html_queue_size= 8
parse1 results_size= 20 html_queue_size= 7
craw1 crawhttps://www.cnblogs.com/#p39 url_queue_size= 10
craw2 crawhttps://www.cnblogs.com/#p40 url_queue_size= 10
parse0 results_size= 20 html_queue_size= 8
parse1 results_size= 20 html_queue_size= 7
craw0 crawhttps://www.cnblogs.com/#p41 url_queue_size= 9
parse1 results_size= 20 html_queue_size= 7
craw1 crawhttps://www.cnblogs.com/#p42 url_queue_size= 7
craw2 crawhttps://www.cnblogs.com/#p43 url_queue_size= 7
parse0 results_size= 20 html_queue_size= 8
craw0 crawhttps://www.cnblogs.com/#p44 url_queue_size= 6
parse1 results_size= 20 html_queue_size= 8
craw0 crawhttps://www.cnblogs.com/#p45 url_queue_size= 4
craw1 crawhttps://www.cnblogs.com/#p46 url_queue_size= 3
craw2 crawhttps://www.cnblogs.com/#p47 url_queue_size= 3
parse0 results_size= 20 html_queue_size= 9
parse1 results_size= 20 html_queue_size= 9
craw0 crawhttps://www.cnblogs.com/#p48 url_queue_size= 2
parse0 results_size= 20 html_queue_size= 9
craw1 crawhttps://www.cnblogs.com/#p49 url_queue_size= 0
craw2 crawhttps://www.cnblogs.com/#p50 url_queue_size= 0
parse1 results_size= 20 html_queue_size= 10
parse0 results_size= 20 html_queue_size= 9
parse1 results_size= 20 html_queue_size= 8
parse0 results_size= 20 html_queue_size= 7
parse1 results_size= 20 html_queue_size= 6
parse1 results_size= 20 html_queue_size= 5
parse0 results_size= 20 html_queue_size= 4
parse1 results_size= 20 html_queue_size= 3
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
parse0 results_size= 20 html_queue_size= 0