I. Scrapy Overview

1. Scrapy is a very popular Python-based web crawling framework that can be used to crawl websites and extract structured data from their pages.
2. Basic architecture: the engine sits at the center and coordinates the other components over well-defined data flows.
3. Components: the engine, scheduler, downloader, spiders, and item pipelines, plus downloader middlewares and spider middlewares.
4. Data processing flow: the engine pulls start requests from the spider, queues them in the scheduler, hands them to the downloader, feeds the responses back to the spider's parse callbacks, and routes any yielded items to the pipelines (yielded requests go back to the scheduler).

II. Opening a Scrapy Project in PyCharm

1. Create a Scrapy project
- At the target directory, hold Shift, right-click, and open a Command Prompt there
- Create the project: run `scrapy startproject project_name` (project_name is your own choice; avoid Chinese characters). The generated layout is sketched after this list.
- Create a spider: run `scrapy genspider example example.com` (example is the spider's name; example.com is the domain the spider is allowed to crawl)
- Locate the newly created Scrapy project, right-click it, and open it with PyCharm
- Create a virtual environment: in PyCharm, open Settings, find Python Interpreter, and choose Add to create a new virtual environment
- In the PyCharm terminal, run `pip install scrapy` to install Scrapy into the new environment
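After these steps the project follows Scrapy's standard layout (a sketch; project_name stands in for whatever name you chose above):

```
project_name/
├── scrapy.cfg            # deploy/configuration file
└── project_name/
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # downloader and spider middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spider modules (genspider creates files here)
```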
b. Write the items file

```python
import scrapy


class DoubanMovieItem(scrapy.Item):
    # Fields that hold the scraped data (an Item is essentially a dict)
    title = scrapy.Field()
    rating = scrapy.Field()
    motto = scrapy.Field()
```

c. Write the page-parsing code
```python
import scrapy
from scrapy import Request
from scrapy.http import Response

from day32.items import DoubanMovieItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    def parse(self, resp: Response):
        selectors = resp.css('#content > div > div.article > ol > li > div > div.info')
        for selector in selectors:  # type: scrapy.Selector
            # Hand the engine an item object (which carries data), as opposed
            # to a Request object (which points at another URL)
            item = DoubanMovieItem()
            item['title'] = selector.css('div.hd > a > span:nth-child(1)::text').extract_first()
            item['rating'] = selector.css('div.bd > div > span.rating_num::text').extract_first()
            item['motto'] = selector.css('div.bd > p.quote > span::text').extract_first()
            # Yield the item back to the engine
            yield item
        # Follow the "next page" link, if there is one (on the last page the
        # link is absent and extract_first() returns None)
        selector = resp.css('#content > div > div.article > div.paginator > span.next')
        href = selector.xpath('./a/@href').extract_first()
        if href:
            # Yield a Request; the engine feeds the new page back into this
            # same parse method via the callback
            yield Request(
                url=f'https://movie.douban.com/top250{href}',
                callback=self.parse
            )
```
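Before hard-coding selectors like these, it can help to try them interactively in Scrapy's shell (a usage note, not part of the original; douban may return 403 unless the USER_AGENT from step d is configured):

```
scrapy shell "https://movie.douban.com/top250"
>>> response.css('span.rating_num::text').extract_first()
```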
d. Write the settings file

- Add a user agent to the default request headers

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
```
- Whether to obey the site's robots.txt

```python
ROBOTSTXT_OBEY = False
```
- Concurrency: the maximum number of requests Scrapy performs in parallel (Scrapy is asynchronous rather than multithreaded)

```python
CONCURRENT_REQUESTS = 4
```
- Set a download delay; with RANDOMIZE_DOWNLOAD_DELAY enabled, the actual wait is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY

```python
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
```
- Enable the classes defined in the middlewares file

```python
DOWNLOADER_MIDDLEWARES = {
    'day32.middlewares.DoubanDownloaderMiddleware': 543,
}
```

The number controls the order of execution: middlewares with smaller numbers process requests first, and responses pass back through them in reverse (requests: small to large; responses: large to small).
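To make the ordering concrete, here is a hedged sketch with two hypothetical middlewares (the class names are illustrative, not from the original). With DOWNLOADER_MIDDLEWARES mapping FirstMiddleware to 100 and SecondMiddleware to 200, one request/response cycle prints First, Second, then Second, First:

```python
class FirstMiddleware:
    def process_request(self, request, spider):
        print('First.process_request')    # smaller number: runs first on the way out
        return None

    def process_response(self, request, response, spider):
        print('First.process_response')   # runs last on the way back
        return response


class SecondMiddleware:
    def process_request(self, request, spider):
        print('Second.process_request')   # runs second on the way out
        return None

    def process_response(self, request, response, spider):
        print('Second.process_response')  # runs first on the way back
        return response
```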
- Enable the classes defined in the pipelines file

```python
ITEM_PIPELINES = {
    'day32.pipelines.MovieItemPipeline': 300,
}
```

With multiple pipelines, items flow through them in increasing order of their number (smaller numbers run first; the conventional range is 0-1000).

e. Write the middlewares file
- Add headers parameters in process_request (the original example sets a cookie; its name and value are left blank in the source)

```python
def process_request(self, request: Request, spider):
    if spider.name == 'douban':
        request.cookies[''] = ''
    return None
```
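Request headers themselves can be rewritten at the same hook; a minimal sketch (the header name and value are illustrative assumptions, not from the original):

```python
def process_request(self, request: Request, spider):
    if spider.name == 'douban':
        # Illustrative only: attach a Referer header to every douban request
        request.headers['Referer'] = 'https://movie.douban.com/'
    return None
```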
2. Run the spider
- In the Terminal, run `scrapy crawl spider_name` (spider_name must match the name attribute defined in the spider)
- Running `scrapy crawl spider_name -o file_name.csv` writes the scraped data straight to a CSV file
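The -o flag infers the feed format from the file extension, so other formats work the same way (standard Scrapy behavior):

```
scrapy crawl douban -o movies.json   # JSON array
scrapy crawl douban -o movies.jl    # JSON lines
scrapy crawl douban -o movies.xml   # XML
```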
The MovieItemPipeline enabled in the settings above persists the items, writing the Top250 data to an Excel workbook via openpyxl (which must be installed in the environment):

```python
import openpyxl


class MovieItemPipeline:

    def open_spider(self, spider):
        # Runs once when the spider starts: prepare the workbook
        if spider.name == 'douban':
            self.workbook = openpyxl.Workbook()
            self.sheet = self.workbook.active
            self.sheet.title = 'Top250'
            self.sheet.append(('标题', '评分', '名句'))

    def process_item(self, item, spider):
        # Runs for every yielded item: append one row
        if spider.name == 'douban':
            self.sheet.append((item['title'], item['rating'], item['motto']))
        return item

    def close_spider(self, spider):
        # Runs once when the spider closes: save the file
        if spider.name == 'douban':
            self.workbook.save('movie_top250.xlsx')
```

V. Writing and Applying a Downloader Middleware

1. Parsing dynamic content
- Scrapy's downloader cannot fetch dynamically rendered content, so you need a custom downloader middleware that drives a browser via Selenium to scrape it.
- The key method to write is process_request
```python
import time

from scrapy import Request
from scrapy.http import Response, HtmlResponse
from selenium import webdriver


class Image360DownloaderMiddleware:

    # Called automatically when the object is created
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.browser = webdriver.Chrome(options=options)

    # Called automatically when nothing references the object any more
    # (i.e. when the downloader middleware is destroyed)
    def __del__(self):
        self.browser.close()

    def process_request(self, request: Request, spider):
        if spider.name == 'image360':
            self.browser.get(request.url)
            for y in range(500, 10001, 500):
                # Execute JavaScript to scroll down and trigger lazy loading
                self.browser.execute_script(f'window.scrollTo(0, {y})')
                time.sleep(0.5)
            # The body argument matters most: it carries the rendered page.
            # The encoding argument is required.
            return HtmlResponse(
                url=request.url,
                request=request,
                encoding='utf-8',
                headers=request.headers,
                body=self.browser.page_source
            )

    def process_response(self, request: Request, response: Response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass
```
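Relying on __del__ to shut the browser down is fragile, since Python makes no guarantee about when (or whether) it runs. A common alternative, sketched here as an assumption rather than part of the original, connects to Scrapy's spider_closed signal and calls quit(), which tears down the whole driver rather than just the current window (process_request would stay as above):

```python
from scrapy import signals
from selenium import webdriver


class Image360DownloaderMiddleware:

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.browser = webdriver.Chrome(options=options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Ask Scrapy to invoke spider_closed when the spider finishes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.browser.quit()
```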
2. Configure settings

```python
DOWNLOADER_MIDDLEWARES = {
    'day30.middlewares.Image360DownloaderMiddleware': 500,
    'day30.middlewares.DoubanDownloaderMiddleware': 543,
}
```

3. Write the items file
```python
import scrapy


class ImageItem(scrapy.Item):
    url = scrapy.Field()
```

4. The spider
```python
import scrapy

from day30.items import ImageItem


class Image360Spider(scrapy.Spider):
    name = 'image360'
    allowed_domains = ['image.so.com']
    start_urls = ['https://image.so.com/z?ch=car']

    # response is the HtmlResponse returned by the custom downloader middleware
    def parse(self, response):
        sources = response.xpath('//img/@src').extract()
        for image_source in sources:  # type: str
            if not image_source.endswith('.gif'):
                item = ImageItem()
                item['url'] = image_source
                yield item
```
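The spider stops at yielding image URLs; actually downloading the files is not covered in the original. A minimal sketch of one way to do it (an assumption, reusing Scrapy's built-in ImagesPipeline, which requires Pillow) could look like this:

```python
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class SaveImagePipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # ImageItem stores a single URL under 'url' instead of the default
        # 'image_urls' list, so map it to a download request by hand
        yield Request(item['url'])
```

Wiring it up would take something like ITEM_PIPELINES = {'day30.pipelines.SaveImagePipeline': 300} and IMAGES_STORE = 'images' in settings.py (both names are illustrative).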