You can clone the full source code from GitHub.
GitHub: https://github.com/williamzxl/Scrapy_CrawlMeiziTu
Official Scrapy documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
Working through the official documentation once is basically enough to learn how to use the framework.
Step 1:
Before you start crawling, you must create a new Scrapy project. Enter the directory where you want to store the code and run:

scrapy startproject CrawlMeiziTu

This command creates a CrawlMeiziTu directory with the following contents:
CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            __init__.py
            ...

Then enter the project and generate the spider skeleton:

cd CrawlMeiziTu
scrapy genspider Meizitu http://www.meizitu.com/a/list_1_1.html
After genspider runs, the project directory contains the new spider file:
CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            Meizitu.py
            __init__.py
            ...
The files we mainly edit are settings.py, items.py, pipelines.py, and the spider under spiders/ (the original post marked them with arrows in a screenshot).

main.py was added by hand afterwards; it contains just two lines, purely to make running the crawl convenient:

from scrapy import cmdline

cmdline.execute("scrapy crawl Meizitu".split())
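main.py needs to run from inside the project (Scrapy locates the project through scrapy.cfg), so place it in the project root next to scrapy.cfg. If you prefer the shell, the equivalent is:

cd CrawlMeiziTu
scrapy crawl Meizitu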
Step 2: Edit settings.py as shown below.
BOT_NAME = 'CrawlMeiziTu'

SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'

ITEM_PIPELINES = {
    'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}

IMAGES_STORE = 'D://pic2'
DOWNLOAD_DELAY = 0.3

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
ROBOTSTXT_OBEY = True
The key things to set are USER_AGENT, the download path (IMAGES_STORE), and the download delay (DOWNLOAD_DELAY).
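These values are also available at runtime through a spider's self.settings attribute. A quick sketch, using a hypothetical demo spider (not part of this project), just to show the API:

import scrapy


class SettingsDemoSpider(scrapy.Spider):
    name = "settings_demo"  # hypothetical, for illustration only
    start_urls = ["http://www.meizitu.com/a/list_1_1.html"]

    def parse(self, response):
        # self.settings exposes everything defined in settings.py
        store = self.settings.get('IMAGES_STORE')         # 'D://pic2'
        delay = self.settings.getfloat('DOWNLOAD_DELAY')  # 0.3
        self.logger.info('saving to %s with a %.1fs delay', store, delay)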
Step 3: Edit items.py.
Items store the information scraped by the spider. Since we are crawling an image-gallery site, we capture each gallery's title, the image links, the tags, and so on.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title is used as the folder name
    Title = scrapy.Field()
    url = scrapy.Field()
    Tags = scrapy.Field()
    # links to the images
    src = scrapy.Field()
    # alt holds the image names
    alt = scrapy.Field()
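To get a feel for how an Item behaves, a small sketch (Items act like dicts whose keys are restricted to the declared fields; the values here are made up):

from CrawlMeiziTu.items import CrawlmeizituItem

item = CrawlmeizituItem()
item['Title'] = ['sample gallery']          # lists, because XPath .extract() returns lists
item['src'] = ['http://example.com/1.jpg']
print(item['Title'])                        # ['sample gallery']
# item['foo'] = 1  # would raise KeyError: the field 'foo' is not declared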
Step 4: Edit pipelines.py.
The pipeline processes the information collected in the items: for example, it derives folder and file names from Title, and downloads each image from its link.
# -*- coding: utf-8 -*-
import os

import requests

from CrawlMeiziTu.settings import IMAGES_STORE


class CrawlmeizituPipeline(object):

    def process_item(self, item, spider):
        fold_name = "".join(item['Title'])  # gallery title
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            # the site checks this cookie; without it the downloaded images cannot be viewed
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
        }
        images = []
        # all images are stored in a single folder
        dir_path = '{}'.format(IMAGES_STORE)
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.mkdir(dir_path)
        # log galleries that yielded no image links so they can be checked later
        if len(item['src']) == 0:
            with open('..//check.txt', 'a+') as fp:
                fp.write("".join(item['Title']) + ":" + "".join(item['url']))
                fp.write("\n")
        for jpg_url, name, num in zip(item['src'], item['alt'], range(0, 100)):
            file_name = name + str(num)
            file_path = '{}//{}.jpg'.format(dir_path, file_name)
            images.append(file_path)
            # skip files that are already on disk
            if os.path.exists(file_path):
                continue
            with open(file_path, 'wb') as f:
                req = requests.get(jpg_url, headers=header)
                f.write(req.content)
        return item
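As an aside, Scrapy also ships a built-in ImagesPipeline that downloads through the framework's own concurrent downloader and deduplicates by URL. A minimal sketch of what switching to it would involve (this is not how this project works; ImagesPipeline requires Pillow and expects image_urls/images fields on the item by default):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'D://pic2'

# items.py would additionally need:
#     image_urls = scrapy.Field()
#     images = scrapy.Field()
# and the spider would fill item['image_urls'] instead of item['src'].

The hand-rolled requests approach above is easier to follow, but it blocks on every download, while ImagesPipeline downloads asynchronously.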
Step 5: Edit the spider itself, spiders/Meizitu.py.
This is the most important part, the main crawl logic:
# -*- coding: utf-8 -*-
import time

import scrapy

from CrawlMeiziTu.items import CrawlmeizituItem
# from CrawlMeiziTu.items import CrawlmeizituItemPage


class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    # allowed_domains = ["meizitu.com/"]
    start_urls = []
    last_url = []
    # resume from the last URL recorded in url.txt
    with open('..//url.txt', 'r') as fp:
        crawl_urls = fp.readlines()
        for start_url in crawl_urls:
            last_url.append(start_url.strip('\n'))
    start_urls.append("".join(last_url[-1]))

    def parse(self, response):
        selector = scrapy.Selector(response)
        next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        # '下一页' is the site's "next page" link text
        if '下一页' in next_pages_text:
            next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
            # record the next list page so the crawl can be resumed later
            with open('..//url.txt', 'a+') as fp:
                fp.write('\n')
                fp.write(next_url)
                fp.write("\n")
            request = scrapy.http.Request(next_url, callback=self.parse)
            time.sleep(2)  # crude throttle; note this blocks the reactor
            yield request
        # NOTE: the class value inside @class was lost in the original post;
        # "tit" is assumed here as the gallery-title class
        all_info = selector.xpath('//h3[@class="tit"]/a')
        # follow the link of each image gallery
        for info in all_info:
            links = info.xpath('//h3[@class="tit"]/a/@href').extract()
            for link in links:
                request = scrapy.http.Request(link, callback=self.parse_item)
                time.sleep(1)
                yield request

    # collect the information of each gallery
    def parse_item(self, response):
        item = CrawlmeizituItem()
        selector = scrapy.Selector(response)
        image_Title = selector.xpath('//h2/a/text()').extract()
        image_url = selector.xpath('//h2/a/@href').extract()
        # NOTE: this class value was also lost; "metaRight" is assumed
        image_Tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
        # galleries use one of two page layouts, so try both
        if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
            image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
        else:
            image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
            pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
        else:
            pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
        item['Title'] = image_Title
        item['url'] = image_url
        item['Tags'] = image_Tags
        item['src'] = image_src
        item['alt'] = pic_name
        print(item)
        time.sleep(1)
        yield item
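Because the spider reads its start URL from url.txt (via the relative path '..//url.txt', so the file's exact location depends on the directory you run from), the file has to be seeded with the first list page before the first run; parse() then keeps appending every discovered "next page" URL to it. A first run might look like this:

echo http://www.meizitu.com/a/list_1_1.html > url.txt
python main.py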
Summary

The above is the implementation of using Python's Scrapy framework to crawl a whole site's images and save them locally. I hope it helps; if you have any questions, feel free to leave a comment and I will reply promptly!