Scrapy returns 401



I'm new to Scrapy and have been scraping housing data for Jinhua from 信义居 (jhtmsf.com). The new-home listing and the project detail pages both scrape fine, but the request for a project's per-unit data always comes back with HTTP 401, and I can't figure out why.

Searching suggests that 401 means the request is missing authentication, but what exactly am I supposed to send?

Here is the code:

import datetime
import json
import pandas as pd
import scrapy
from jinhua.items import HouseItem


class ProjectSpider(scrapy.Spider):
    name = 'jhhouse'
    allowed_domains = ['www.jhtmsf.com']  # domain only; including the scheme breaks OffsiteMiddleware filtering
    start_urls = ['https://www.jhtmsf.com/House/GetPageForRoom']
    spiderTime = datetime.datetime.now().strftime('%Y-%m-%d')

    def start_requests(self):
        df = pd.read_excel('../采集结果/py_jhtmsf_loupan.xlsx')
        for idx, row in df.iterrows():
            project_id = row['project_id']
            referer_url = 'https://www.jhtmsf.com/House/Room/' + str(project_id)
            yield scrapy.FormRequest(
                self.start_urls[0],
                headers={'Referer': referer_url},
                # Cookies copied from a browser session. They expire; a stale
                # __RequestVerificationToken / ASP.NET_SessionId pair is a
                # common reason a request like this starts returning 401.
                cookies={'__RequestVerificationToken': 'XpgK_gMXlG71JzgTKt27kPr9ZQE1Ptbm6DhRfN7Ol7OMuuS_p43T6XOKkwg48zNUlI5jSYlJA97oO_KoupElbXV5Zm-1ldmVCCjUkltPB8c1',
                         'ASP.NET_SessionId': 'vwtw2pdanxrkdnfbqimzol1u',
                         'Hm_lvt_88b265ab6b07373c61ffa7d36d6db2c3': '1634609773,1634711852,1634882161,1634883812',
                         'Hm_lpvt_88b265ab6b07373c61ffa7d36d6db2c3': '1634883832'},
                # 'bulid' (sic) is the parameter name the site itself uses
                formdata={'eid': str(project_id), 'bulid': '', 'layer': '', 'status': '0', 'pageNumber': '1', 'pageSize': '15', 'sortName': 'StartDate', 'sortOrder': 'desc'},
                callback=self.parse,
                meta={'project_id': project_id, 'referer_url': referer_url})

    def parse(self, response):
        # First page of results: read the total page count, then request every page.
        jsonBody = json.loads(response.body)
        page = jsonBody["TotalPage"]
        total = jsonBody["Total"]
        project_id = response.meta['project_id']
        referer_url = response.meta['referer_url']
        for pg in range(1, int(page) + 1):
            yield scrapy.FormRequest(
                self.start_urls[0], headers={'Referer': str(referer_url)},
                formdata={'eid': str(project_id), 'bulid': '', 'layer': '', 'status': "0", 'pageNumber': str(pg),
                          'pageSize': "15", 'sortName': 'StartDate', 'sortOrder': 'desc'},
                callback=self.content_parse, meta={'project_id': project_id, 'total': total})

    def content_parse(self, response):
        # One page of unit rows -> one HouseItem per unit.
        jsonBody = json.loads(response.body)
        jrows = jsonBody["Rows"]
        if jrows:
            for row in jrows:
                item = HouseItem()
                item['project_id'] = response.meta['project_id']
                item['total'] = response.meta['total']
                item['area'] = row['Area']
                item['build_nb'] = row['Bulid']
                item['on_layer'] = row['Layer']
                item['price'] = row['Price']
                item['room_nb'] = row['RoomNO']
                item['start_time'] = row['StartDate']
                item['house_status'] = row['Status']
                item['spider_time'] = self.spiderTime
                yield item
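A note on the likely cause, as a sketch rather than a confirmed fix: the `__RequestVerificationToken` cookie suggests the site is ASP.NET MVC with anti-forgery validation, which pairs that cookie with a token embedded in the page as a hidden `<input>`. If the POST body lacks the paired token, or the hard-coded cookies have expired, the server can reject the request with 401. The helper below is hypothetical and assumes the standard hidden-input markup appears on the `/House/Room/<id>` page:

```python
import re

# Standard ASP.NET MVC anti-forgery markup:
# <input name="__RequestVerificationToken" type="hidden" value="..." />
TOKEN_RE = re.compile(r'name="__RequestVerificationToken"[^>]*value="([^"]+)"')

def extract_verification_token(html):
    """Return the anti-forgery token embedded in an ASP.NET page, or None."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None
```

To wire this in, instead of hard-coding cookies, yield a plain `scrapy.Request(referer_url)` first (Scrapy's cookie middleware stores the fresh session cookies automatically), call `extract_verification_token(response.text)` in its callback, and add the result to `formdata` as `'__RequestVerificationToken': token` before yielding the `FormRequest`. Whether this endpoint wants the token in the form body or in a header is worth confirming in the browser's network tab.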

Feel free to share; please credit the source when reposting: 内存溢出

Original: http://outofmemory.cn/zaji/4678736.html
