Goal: use multiprocessing combined with multithreading to scrape 25,000 images from pic.netbian.com in a short time.
Problem: the program produced no results, raised no errors, and exited normally, which made it look healthy. After extensive checking and debugging, the cause turned out to be: pic.netbian.com refreshes its cookie every 30 minutes. With a stale cookie, requests for the first-level listing pages only return an interstitial page whose source contains "跳转中" (redirecting), so the XPath expressions match nothing; the same thing happens when no cookie is sent at all.
Fix: the only fix is to swap the stale cookie for a fresh one; the problem does not appear when running without multithreading or multiprocessing. My guess is that a valid cookie lets the request skip the "redirecting" interstitial and reach the target page directly, so this is probably not a deliberate anti-scraping measure by the site owner. Even with multiprocessing wrapping multithreading, only about 15,000 images can be fetched within the 30-minute cookie window, far short of the target, so I am switching to Selenium. Still experimenting...
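Before switching tools entirely, the stale-cookie symptom described above can at least be detected instead of failing silently. Below is a minimal stdlib-only sketch; `is_stale_cookie_page`, `fetch_listing`, and the `refresh_cookie` callback are hypothetical names, and the detection heuristic (the page contains "跳转中" or lacks the expected `id="main"` markup) is an assumption based on the symptom described above, not the site's documented behavior.

```python
def is_stale_cookie_page(html: str) -> bool:
    """Heuristic: the stale-cookie interstitial contains the text
    '跳转中' (redirecting) and lacks the listing markup the spider
    expects, so either condition flags the page as unusable."""
    return '跳转中' in html or 'id="main"' not in html


def fetch_listing(url, fetch, refresh_cookie, max_retries=2):
    """fetch(url) -> html string; refresh_cookie() obtains a new cookie.

    Retries the request whenever the interstitial is detected, instead
    of letting XPath silently match nothing and the run 'succeed'."""
    for _ in range(max_retries + 1):
        html = fetch(url)
        if not is_stale_cookie_page(html):
            return html
        refresh_cookie()  # e.g. re-login or pull cookies from a browser
    raise RuntimeError('still seeing the redirect page after refreshing the cookie')
```

Raising an exception here is the point: the original bug was precisely that nothing failed loudly when the cookie went stale.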
Below is the first version of the code. It has the most bugs and algorithmic problems; I keep it here so I can look back at it later. -- recorded 1.23

# -*- coding: UTF-8 -*-
"""
@Author: 王散 Creative
@Time: 2022/1/22 18:50
@IDE_Name/Software: PyCharm
@File: 应对彼岸图网的极限反爬
"""
import requests
from lxml import etree
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import threading
from multiprocessing import Lock


def task(url):
    # lock = threading.Lock()
    Squence = 0
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.469"
                      "2.99 Safari/537.36",
    }
    resp1 = requests.get(url=url, headers=header)
    resp1.encoding = 'gbk'
    # print(resp1.text)
    tree = etree.HTML(resp1.text)
    analysis1 = tree.xpath('//*[@id="main"]/div[3]/ul/li//a/@href')
    analysis2 = tree.xpath('//*[@id="main"]/div[3]/ul/li/a/b/text()')
    for ItemTwo in analysis1:
        url_two_page = 'https://pic.netbian.com' + ItemTwo
        resp2 = requests.get(url=url_two_page, headers=header)
        # time.sleep(0.5)
        resp2.encoding = 'gbk'
        tree_two = etree.HTML(resp2.text)
        analysis3 = tree_two.xpath('//*[@id="img"]/img/@src')
        for ItemThree in analysis3:
            url_image_page = 'https://pic.netbian.com' + ItemThree
            resp3 = requests.get(url=url_image_page, headers=header)
            # lock.acquire()
            image_file = open(f'D:\\python_write_file\\爬虫\\NumberTwoImage\\彼岸网爬的好图2\\{analysis2[Squence]}.jpg', 'wb')
            image_file.write(resp3.content)
            image_file.close()
            # lock.release()
            print(f'{analysis2[Squence]}==>爬取完毕')
            # lock.acquire()
            Squence = Squence + 1
            # lock.release()


# def main(num):
#     with ThreadPoolExecutor(252) as exe_Pool:
#         for item in range(num, num+126):
#             lock.acquire()
#             if item == 1:
#                 exe_Pool.submit(task, 'https://pic.netbian.com/new/')
#                 lock.release()
#             else:
#                 exe_Pool.submit(task, f'https://pic.netbian.com/new/index_{item}.html')
#                 lock.release()
#     return num+30


if __name__ == "__main__":
    # num = 1
    # lock = Lock()
    with ProcessPoolExecutor(45) as Process_Pool:
        for item in range(1, 1261):
            if item == 1:
                Process_Pool.submit(task, 'https://pic.netbian.com/new/')
            else:
                Process_Pool.submit(task, f'https://pic.netbian.com/new/index_{item}.html')
    # for number in range(1, 11):
    #     lock.acquire()
    #     num = Process_Pool.submit(main, num)
    #     lock.release()
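One of the bugs in this first version worth noting: the `Squence` counter pairs each downloaded image with a title from `analysis2` by position, which breaks whenever the two XPath results differ in length or order. A cleaner sketch pairs hrefs and titles with `zip` and also sanitizes the title for use as a Windows file name; `iter_named_links` is a hypothetical helper I made up for illustration, not part of the original script.

```python
import re


def iter_named_links(hrefs, titles):
    """Pair each detail-page href with its caption from the listing page.

    zip() keeps the pairing correct even if one list is shorter, which
    the manual Squence counter in the first version does not guarantee.
    """
    for href, title in zip(hrefs, titles):
        # Replace characters Windows forbids in file names with '_'.
        safe = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
        yield 'https://pic.netbian.com' + href, safe + '.jpg'
```

With this, the inner download loop can iterate over `(url, filename)` pairs directly instead of indexing `analysis2[Squence]` and hoping the counts line up.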