1. Installation
pip install configparser  # dependency
pip install Scrapy
2. A simple example from the official site
https://docs.scrapy.org/en/latest/intro/overview.html
#!/usr/bin/env python
# coding=utf-8
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # Emit one dict per quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next" link, if there is one, and parse that page too.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
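On Linux (CentOS 6-10):
$ cd
$ vi quotes_spider.py
Paste in the code above, then type :wq to save and exit vi. Next, run the spider:
$ scrapy runspider quotes_spider.py -o ./quotes.json
When the run finishes, quotes.json is created in the current directory.
$ cat quotes.json
This prints the contents of the output file: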
[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
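The \u201c and \u201d sequences in the output are ordinary JSON escapes for curly quotation marks, not corrupted text. As a minimal sketch, the snippet below parses two entries taken from the output above with Python's standard json module; the inline string stands in for the contents of quotes.json so the example runs without the file.

```python
import json

# Two entries taken from the spider's output. A raw string keeps the
# backslash escapes exactly as they appear in the file on disk.
raw = r'''[
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"}
]'''

# json.loads turns the \uXXXX escapes back into the real characters.
quotes = json.loads(raw)

for quote in quotes:
    print(f'{quote["author"]}: {quote["text"]}')
```

To read the real file instead, the same parsing step would be json.load applied to an open file handle for quotes.json.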
3. Execution flow
3.1. Running the command
$ scrapy runspider quotes_spider.py -o ./quotes.json
Scrapy looks for a Spider defined in quotes_spider.py (here the Spider subclass QuotesSpider) and then runs it through the crawler engine.
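3.2. Spider entry point
The start_urls attribute defined in QuotesSpider is the spider's initial list of URLs. The spider takes each URL from this list, creates a Request for it, receives the Response, and then calls the default callback method parse with that Response as its argument.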
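3.3. Handling the response
In the parse method, CSS selectors extract the quote elements of interest, and each extracted quote is turned into a dict. For the link to the next page, another Request is created and scheduled, again with parse as its callback.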
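The flow described in 3.2 and 3.3 can be modelled in a few lines of plain Python. The sketch below is a toy, not Scrapy's actual implementation: Request, Response, ToySpider, run and the FAKE_SITE dictionary are all hypothetical stand-ins invented for illustration, and the "download" is just a dictionary lookup. It only shows the shape of the loop: start URLs become requests, each response goes to its callback, and the callback yields either items or further requests.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (quotes on the page, next-page URL or None).
FAKE_SITE = {
    "/tag/humor/": (["quote A", "quote B"], "/tag/humor/page/2/"),
    "/tag/humor/page/2/": (["quote C"], None),
}


class Request:
    """Toy stand-in for a request: a URL plus the callback for its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class Response:
    """Toy stand-in for a downloaded page."""

    def __init__(self, url, quotes, next_page):
        self.url = url
        self.quotes = quotes
        self.next_page = next_page


class ToySpider:
    start_urls = ["/tag/humor/"]

    def parse(self, response):
        # Yield one dict per quote found on the page.
        for text in response.quotes:
            yield {"text": text}
        # Yield a follow-up request whose callback is parse again.
        if response.next_page is not None:
            yield Request(response.next_page, self.parse)


def run(spider):
    """Process requests in FIFO order and collect the scraped items."""
    queue = deque(Request(url, spider.parse) for url in spider.start_urls)
    items = []
    while queue:
        request = queue.popleft()
        quotes, next_page = FAKE_SITE[request.url]  # stands in for the download
        response = Response(request.url, quotes, next_page)
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)  # schedule the next page
            else:
                items.append(result)  # a scraped item
    return items


items = run(ToySpider())
print(items)
```

The point of the design is that parse never fetches anything itself. It only describes what to do with a response, and the surrounding loop decides when each request runs, which is the same division of labour the real engine and scheduler provide.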