Scrapy --- a Lagou Ajax crawler


For my end-of-term project I need to build a full data presentation, and the data has to come from a crawler. After weighing the options, I decided the Scrapy framework was the way to go.

In class the teacher remarked that Lagou is hard to crawl. Well, I couldn't let that stand, so Lagou is exactly what I'm going to crawl.

First:

Open Lagou, press F12, switch to the Network tab, and filter to show only Fetch/XHR requests.

Second, and this is the key step: click through to page 2, then page 3, and watch what new requests appear:

Note: in the screenshot, the red circle marks the data loaded when the page first opens, the blue circle marks the data loaded when you click page 2, and the black circle marks the data for page 3. Pick the request with the largest size!

Those are the positionAjax.json requests. Click one of them, open the Preview tab, and you should see something like the figure below:

Then expand content, positionResult, and result in turn; the data we want sits in entries 0 through 14.

A quick aside: if you're wondering how I found this, I just stared at the site's responses over and over until I spotted where the data lives. Clearly I still know far too little about Ajax...
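To make that nesting concrete, here is a minimal sketch of drilling into the same path in plain Python. The saved positionAjax.json file is my stand-in for the response body you see in the Preview tab; it is not part of the original walkthrough.

import json

# assumes you have saved the body of one positionAjax.json response to a local file
raw = open('positionAjax.json', encoding='utf-8').read()
payload = json.loads(raw)
jobs = payload["content"]["positionResult"]["result"]    # the list we are after
print(len(jobs))                                          # typically 15 entries (indices 0 to 14)
print(jobs[0]["positionName"], jobs[0]["salary"])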

With the data located, it's time for the code (the fun part):

In a terminal at the directory where you want the project, run the creation command:

scrapy startproject jobs_crawl_lagou

Then cd jobs_crawl_lagou (twice, because startproject nests a package of the same name inside the project folder),

cd spiders

and then generate the spider (I passed www.baidu.com as a throwaway domain; allowed_domains gets commented out later anyway):

scrapy genspider lagou_spider www.baidu.com

That produces the file structure shown below (the csv file is only there because I had already run the spider):
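Since the original screenshot isn't reproduced here, this is roughly what a freshly generated project looks like; lagou_spider.py appears after the genspider command, and lagou_spider.csv only shows up once the spider has been run from the project root:

jobs_crawl_lagou/
├── scrapy.cfg
├── lagou_spider.csv
└── jobs_crawl_lagou/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── lagou_spider.py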

Before writing any code you have to be clear about how the URL does or doesn't change from page to page:

On Lagou, however, every results page posts to the same Ajax endpoint, https://www.lagou.com/jobs/v2/positionAjax.json; only the form data changes.

So we write the url like this:

start_urls = 'https://www.lagou.com/jobs/v2/positionAjax.json'

The Ajax endpoint expects a POST request, so we override the start_requests() method and use scrapy.FormRequest, which URL-encodes the data dict into the POST body just like the browser's XHR does:

def start_requests(self):
    for pn in range(1, 301):    # crawl 300 pages
        headers = { 
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
            'referer': 'https://www.lagou.com/wn/jobs?px=new&pn=2&fromSearch=true&kd=%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88',
            'cookie': 'user_trace_token=20211108091120-8b0afd09-e510-42ce-a04c-f23917df57a4; _ga=GA1.2.418688220.1636333882; LGUID=20211108091121-eaa74fc9-4fdf-4643-a0e2-115418dc5c7f; RECOMMEND_TIP=true; index_location_city=%E5%85%A8%E5%9B%BD; __lg_stoken__=33bc62cc28b671aedf0e11535f4cbff311ae530554395f66a58a3d64ca49c4c49bc0d3d26568c5d7a60d1e2de619166f2e7e321b42a4bc3a5207f2b9cfc5f3e1deb6421fcb8e; _gid=GA1.2.356362994.1636624612; SEARCH_ID=b2b4c1c76447482aac70fee163179cab; JSESSIonID=ABAAAECABIEACCA678016FE3945EC4E6C42B8A2CFC2D753; WEBTJ-ID=20211112230816-17d14b1310445b-0bc56c987c842e-57b1a33-1327104-17d14b13105545; X_HTTP_TOKEN=ed095ffe505ce6c5696927636173744ecc1824c2d7; _gat=1; privacyPolicyPopup=false; PRE_UTM=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; sensorsdata2015session=%7B%7D; LGSID=20211112230816-1fb915d8-980f-4df0-807f-4db720dfe0cf; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Doy4wudHBJeFCLVmRef7GwCu%5FM5uL2pbjTdWtXzorEpy%26wd%3D%26eqid%3D87562fb90002ea7e00000006618e835d; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1636624612,1636625176,1636642907,1636729697; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1636729697; TG-TRACK-CODE=index_search; LGRID=20211112230821-9c4a26bd-2a4c-453f-be6a-f2e1bb964c86; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217cfd198b39e27-07e0cb7bb065aa-57b1a33-1327104-17cfd198b3aa6d%22%2C%22%24device_id%22%3A%2217cfd198b39e27-07e0cb7bb065aa-57b1a33-1327104-17cfd198b3aa6d%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24os%22%3A%22Windows%22%2C%22%24browser%22%3A%22Chrome%22%2C%22%24browser_version%22%3A%2295.0.4638.69%22%7D%7D'
        }
        data = {
            'first': 'true',
            'needAddtionalResult': 'false',
            'city': '全国',
            'px': 'new',
            'pn': str(pn),
            'fromSearch': 'true',
            'kd': '大数据开发工程师',
        }
        yield scrapy.FormRequest(url=self.start_urls, formdata=data, headers=headers, callback=self.parse)

Please note: do not copy my User-Agent, referer, or cookie values in headers; grab your own from the request headers shown in DevTools (as in the figure above).

Grab the form data yourself as well; pn is the page number.
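A side note of mine, not part of the original write-up: instead of pasting the raw cookie string into headers, you can turn it into a dict and pass it through the cookies= argument of FormRequest, letting Scrapy's cookie middleware manage it. A minimal sketch, where cookie_str stands for whatever you copied out of DevTools:

cookie_str = 'user_trace_token=xxx; _ga=xxx'    # paste your own cookie string here
cookies = dict(pair.split('=', 1) for pair in cookie_str.split('; '))
# then drop 'cookie' from headers and build the request with:
# scrapy.FormRequest(url=self.start_urls, formdata=data, headers=headers,
#                    cookies=cookies, callback=self.parse)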

 

Next comes parsing the data:

def parse(self, response):
    text = response.json()["content"]["positionResult"]["result"]
    print(text)
    for i in text:
        item = JobsCrawlLagouItem()
        # company full name
        item["companyFullName"] = i["companyFullName"]
        # company size
        item["companySize"] = i["companySize"]
        # industry / field
        item["industryField"] = i["industryField"]
        # financing stage
        item["financeStage"] = i["financeStage"]
        # position name
        item["positionName"] = i["positionName"]
        # required work experience
        item["workYear"] = i["workYear"]
        # salary
        item["salary"] = i["salary"]
        # education requirement
        item["education"] = i["education"]
        yield item
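One caveat from me rather than from the original post: response.json() only exists on newer Scrapy releases (it was added around Scrapy 2.2). On an older install you can build the same structure yourself inside parse:

import json

text = json.loads(response.text)["content"]["positionResult"]["result"]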

All of the code above lives in lagou_spider.py.

Next up is settings.py:

Change ROBOTSTXT_OBEY to False:

ROBOTSTXT_OBEY = False
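While you're in settings.py, it is probably also worth throttling the crawl a little so that 300 rapid POSTs don't get your cookie blocked. This is my suggestion, not something the original post configures:

DOWNLOAD_DELAY = 1                 # wait roughly one second between requests
RANDOMIZE_DOWNLOAD_DELAY = True    # jitter each delay between 0.5x and 1.5x of that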

Enable the pipelines (the numbers set the order in which they run; lower runs first):

ITEM_PIPELINES = {
    'jobs_crawl_lagou.pipelines.JobsCrawlLagouPipeline': 300,
    'jobs_crawl_lagou.pipelines.savecsvPipeline': 300,
}

The pipeline, pipelines.py (I save the items as CSV):

import csv

class JobsCrawlLagouPipeline:
    # default pass-through pipeline generated by Scrapy; it just hands the item on
    def process_item(self, item, spider):
        return item

class savecsvPipeline(object):
    def __init__(self):
        # utf-8 so characters outside the Windows default codec don't crash the writer
        self.file = open('lagou_spider.csv', 'w', newline='', encoding='utf-8')
        self.csvwriter = csv.writer(self.file)
        # header row
        self.csvwriter.writerow(['公司全名', '公司规模', '招聘领域', '融资状况', '职位名称', '工作经验', '工作薪酬', '学历要求'])

    def process_item(self, item, spider):
        # one CSV row per job item
        self.csvwriter.writerow([item["companyFullName"], item["companySize"], item["industryField"], item["financeStage"], item["positionName"], item["workYear"], item["salary"], item["education"]])
        return item

    def close_spider(self, spider):
        self.file.close()
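As an aside (my note, not the original author's): once everything is wired up, the spider is run from the project root with scrapy crawl lagou_spider. And if you don't mind the CSV columns being named after the item fields rather than the Chinese headers above, Scrapy's built-in feed export can write the file with no custom pipeline at all:

scrapy crawl lagou_spider -o lagou_spider.csv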

items.py (the fields go inside the item class Scrapy generated):

import scrapy

class JobsCrawlLagouItem(scrapy.Item):
    companyFullName = scrapy.Field()    # company full name
    companySize = scrapy.Field()        # company size
    industryField = scrapy.Field()      # industry / field
    financeStage = scrapy.Field()       # financing stage
    positionName = scrapy.Field()       # position name
    workYear = scrapy.Field()           # required work experience
    salary = scrapy.Field()             # salary
    education = scrapy.Field()          # education requirement

Here is all of the code in one place:

1.lagou_spider.py:

import scrapy
from ..items import JobsCrawlLagouItem

class LagouSpiderSpider(scrapy.Spider):
    name = 'lagou_spider'
    # allowed_domains = ['www.baidu.com']
    start_urls = 'https://www.lagou.com/jobs/v2/positionAjax.json'
    def start_requests(self):
        for pn in range(1, 301):    # crawl 300 pages
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
                'referer': 'https://www.lagou.com/wn/jobs?px=new&pn=2&fromSearch=true&kd=%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%BC%80%E5%8F%91%E5%B7%A5%E7%A8%8B%E5%B8%88',
                'cookie': 'user_trace_token=20211108091120-8b0afd09-e510-42ce-a04c-f23917df57a4; _ga=GA1.2.418688220.1636333882; LGUID=20211108091121-eaa74fc9-4fdf-4643-a0e2-115418dc5c7f; RECOMMEND_TIP=true; index_location_city=%E5%85%A8%E5%9B%BD; __lg_stoken__=33bc62cc28b671aedf0e11535f4cbff311ae530554395f66a58a3d64ca49c4c49bc0d3d26568c5d7a60d1e2de619166f2e7e321b42a4bc3a5207f2b9cfc5f3e1deb6421fcb8e; _gid=GA1.2.356362994.1636624612; SEARCH_ID=b2b4c1c76447482aac70fee163179cab; JSESSIonID=ABAAAECABIEACCA678016FE3945EC4E6C42B8A2CFC2D753; WEBTJ-ID=20211112230816-17d14b1310445b-0bc56c987c842e-57b1a33-1327104-17d14b13105545; X_HTTP_TOKEN=ed095ffe505ce6c5696927636173744ecc1824c2d7; _gat=1; privacyPolicyPopup=false; PRE_UTM=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; sensorsdata2015session=%7B%7D; LGSID=20211112230816-1fb915d8-980f-4df0-807f-4db720dfe0cf; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Doy4wudHBJeFCLVmRef7GwCu%5FM5uL2pbjTdWtXzorEpy%26wd%3D%26eqid%3D87562fb90002ea7e00000006618e835d; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1636624612,1636625176,1636642907,1636729697; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1636729697; TG-TRACK-CODE=index_search; LGRID=20211112230821-9c4a26bd-2a4c-453f-be6a-f2e1bb964c86; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2217cfd198b39e27-07e0cb7bb065aa-57b1a33-1327104-17cfd198b3aa6d%22%2C%22%24device_id%22%3A%2217cfd198b39e27-07e0cb7bb065aa-57b1a33-1327104-17cfd198b3aa6d%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Flink%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24os%22%3A%22Windows%22%2C%22%24browser%22%3A%22Chrome%22%2C%22%24browser_version%22%3A%2295.0.4638.69%22%7D%7D'
            }
            data = {
                'first': 'true',
                'needAddtionalResult': 'false',
                'city': '全国',
                'px': 'new',
                'pn': str(pn),
                'fromSearch': 'true',
                'kd': '大数据开发工程师',
            }
            yield scrapy.FormRequest(url=self.start_urls, formdata=data, headers=headers, callback=self.parse)
    def parse(self, response):
        text = response.json()["content"]["positionResult"]["result"]
        print(text)
        for i in text:
            item = JobsCrawlLagouItem()
            # company full name
            item["companyFullName"] = i["companyFullName"]
            # company size
            item["companySize"] = i["companySize"]
            # industry / field
            item["industryField"] = i["industryField"]
            # financing stage
            item["financeStage"] = i["financeStage"]
            # position name
            item["positionName"] = i["positionName"]
            # required work experience
            item["workYear"] = i["workYear"]
            # salary
            item["salary"] = i["salary"]
            # education requirement
            item["education"] = i["education"]
            yield item

2.pipelines.py:

import csv

class JobsCrawlLagouPipeline:
    # default pass-through pipeline generated by Scrapy; it just hands the item on
    def process_item(self, item, spider):
        return item

class savecsvPipeline(object):
    def __init__(self):
        # utf-8 so characters outside the Windows default codec don't crash the writer
        self.file = open('lagou_spider.csv', 'w', newline='', encoding='utf-8')
        self.csvwriter = csv.writer(self.file)
        # header row
        self.csvwriter.writerow(['公司全名', '公司规模', '招聘领域', '融资状况', '职位名称', '工作经验', '工作薪酬', '学历要求'])

    def process_item(self, item, spider):
        # one CSV row per job item
        self.csvwriter.writerow([item["companyFullName"], item["companySize"], item["industryField"], item["financeStage"], item["positionName"], item["workYear"], item["salary"], item["education"]])
        return item

    def close_spider(self, spider):
        self.file.close()

3.items.py:

import scrapy

class JobsCrawlLagouItem(scrapy.Item):
    # name = scrapy.Field()
    companyFullName = scrapy.Field()    # company full name
    companySize = scrapy.Field()        # company size
    industryField = scrapy.Field()      # industry / field
    financeStage = scrapy.Field()       # financing stage
    positionName = scrapy.Field()       # position name
    workYear = scrapy.Field()           # required work experience
    salary = scrapy.Field()             # salary
    education = scrapy.Field()          # education requirement

4. settings.py (nothing beyond the two changes shown earlier):

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'jobs_crawl_lagou.pipelines.JobsCrawlLagouPipeline': 300,
    'jobs_crawl_lagou.pipelines.savecsvPipeline': 300,
}

Run the spider and the resulting data looks like this:

 
