[Python Crawler] A Hands-On Python Project: Scraping Zongheng Chinese Network

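Table of Contents

Preface
- 📕 Previous Topics
- 🎈 Final Result
- 💡 Development Environment
- ✨ Analyzing the Web Page
- 🎁 Approach
- 🏆 Implementation Steps
- 🎲 Results
- 💯 Complete Code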

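Preface

As someone who is learning Python myself, today I will walk you through a small hands-on project: using Python to save the information you need from a website to your own computer with very little effort.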

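📕 Previous Topics

💡 [Python] Dictionary tutorial (extremely detailed): how will you keep up with everyone else without reading it?
💡 [Python tutorial] A step-by-step guide to connecting to MySQL with the pymysql module for create, read, update, and delete
💡 Selenium automation in practice: saving Bilibili information to Excel
💡 In the time it took my roommate to play one game, I built a Selenium automation script and saved the data to MySQL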

🎈 Final Result

Here is what the finished project produces.

💡 Development Environment

PyCharm
Python 3.8

Main modules

requests
BeautifulSoup
csv

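✨ Analyzing the Web Page

Before writing any code, the first step is to analyze the page and determine whether it is static or dynamic. Knowing what you are dealing with makes it easier to get started, helps you avoid the hard parts of scraping, and saves time. Open the page, right-click and choose Inspect, then search the source for a keyword that is visible on the page. The keyword can be found there, so we can be fairly confident that the site is static and can be scraped with the ordinary approach.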
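
The keyword check described above can also be scripted. The following is only a minimal sketch: the helper name `looks_static` and both HTML snippets are made up for illustration, and in practice the first argument would be the text of a downloaded page.

```python
def looks_static(html: str, keyword: str) -> bool:
    """Return True if the keyword already appears in the raw HTML.

    If text that is visible in the browser is present in the unrendered
    source, the page is most likely rendered on the server (static).
    If it is missing, the content is probably filled in later by JavaScript.
    """
    return keyword in html


# Hypothetical snippets standing in for a downloaded page source.
static_page = '<div class="rank_d_b_name"><a href="#">Some Novel</a></div>'
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(looks_static(static_page, "Some Novel"))   # True
print(looks_static(dynamic_page, "Some Novel"))  # False
```

A positive result is a strong hint rather than proof, since a page can mix server-rendered and script-rendered parts.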

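Now that we know the page is static, let's keep analyzing it to see what else we need. For example, what changes when we turn the page? The number after p in the URL changes, so to implement pagination we only need to change that number.
/p2/
/p3/
/p4/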
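
Since only the page number changes, the list of page URLs can be generated up front. This is a small sketch: the URL pattern is the one used in the complete code at the end of this post, while the names `BASE_URL` and `build_page_urls` are hypothetical.

```python
from typing import List

# URL pattern taken from the complete code at the end of this post;
# only the value of the p parameter changes from page to page.
BASE_URL = "http://www.zongheng.com/rank/details.html?rt=1&d=1&p={page}"


def build_page_urls(first: int, last: int) -> List[str]:
    """Return the ranking-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(page=p) for p in range(first, last + 1)]


urls = build_page_urls(1, 3)
for u in urls:
    print(u)
```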

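All of the novels sit inside the div with class="rankpage_box", one child div per novel. Later we use BeautifulSoup to select these divs and extract the information we need from each one.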
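
To make that structure concrete, here is a hedged sketch that runs BeautifulSoup against a tiny hand-written snippet. The class names come from the complete code at the end of this post, but the markup itself, the example.com addresses, and the dictionary keys are invented for illustration and are far simpler than the real page. The built-in "html.parser" is used here so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup that mimics the structure described above.
SAMPLE_HTML = """
<div class="rankpage_box">
  <div class="rank_d_list borderB_c_dsh clearfix">
    <a href="http://example.com/book/1"><img src="http://example.com/1.jpg"></a>
    <div class="rank_d_b_name"><a href="http://example.com/book/1">Book One</a></div>
    <div class="rank_d_b_ticket">120</div>
  </div>
  <div class="rank_d_list borderB_c_dsh clearfix">
    <a href="http://example.com/book/2"><img src="http://example.com/2.jpg"></a>
    <div class="rank_d_b_name"><a href="http://example.com/book/2">Book Two</a></div>
    <div class="rank_d_b_ticket">95</div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box = soup.find("div", class_="rankpage_box")

books = []
# Matching on one class name is enough to select every novel entry.
for entry in box.find_all("div", class_="rank_d_list"):
    books.append({
        "title": entry.find("div", class_="rank_d_b_name").find("a").text.strip(),
        "cover": entry.find("img").get("src"),
        "tickets": entry.find(class_="rank_d_b_ticket").text.strip(),
        "detail_url": entry.find("a").get("href"),
    })

for book in books:
    print(book["title"], book["tickets"], book["detail_url"])
```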

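🎁 Approach

1. Decide on the target site and the entry URL.
2. Parse the page at the entry URL to get the title of every novel and the href of each detail page.
3. Send a request to each detail-page address.
4. Extract the required information from each detail page.
5. Save all the information to a CSV file that can be opened in Excel.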
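
The five steps can be sketched as one small driver function. This is an outline under stated assumptions, not the code used later in this post: `crawl` and the `fake_*` helpers are hypothetical names, and fetching and parsing are passed in as plain functions so that the control flow can be tried without any network access.

```python
from typing import Callable, Dict, Iterable, List


def crawl(
    page_urls: Iterable[str],
    fetch: Callable[[str], str],
    parse_list: Callable[[str], List[Dict[str, str]]],
    parse_detail: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Walk the listing pages, follow each detail link, and merge the results."""
    rows = []
    for page_url in page_urls:                       # step 1: entry URLs
        listing_html = fetch(page_url)
        for book in parse_list(listing_html):        # step 2: titles and hrefs
            detail_html = fetch(book["detail_url"])  # step 3: request the detail page
            book.update(parse_detail(detail_html))   # step 4: extract extra fields
            rows.append(book)
    return rows                                      # step 5: the caller saves the rows


# Stand-ins for the network and the parsers, for demonstration only.
FAKE_SITE = {"page1": "listing html", "book/1": "detail of book 1"}


def fake_fetch(url: str) -> str:
    return FAKE_SITE[url]


def fake_parse_list(html: str) -> List[Dict[str, str]]:
    return [{"title": "Book One", "detail_url": "book/1"}]


def fake_parse_detail(html: str) -> Dict[str, str]:
    return {"summary": html}


rows = crawl(["page1"], fake_fetch, fake_parse_list, fake_parse_detail)
print(rows)
```

Keeping the fetch and parse steps as separate functions makes each one easy to replace or test on its own.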

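🏆 Implementation Steps

Import the required libraries and send the request.
Note: most major websites now have anti-scraping measures, so the crawler should be disguised to imitate a browser, which makes it less likely that the site identifies the visitor as a crawler program. To do that, set a request header on the crawler by copying the browser's User-Agent into the code.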
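
One detail is easy to get wrong here. In `requests.get(url, params=None, **kwargs)` the second positional argument is `params`, so the header dictionary has to be passed with the `headers=` keyword. The sketch below builds two requests without sending them, so the difference can be inspected offline. The names `HEADERS`, `good`, and `bad` are chosen for illustration; the User-Agent string and the URL are taken from the complete code at the end of this post.

```python
import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    )
}
url = "http://www.zongheng.com/rank/details.html"

# Prepare the requests without sending anything over the network.
good = requests.Request("GET", url, headers=HEADERS).prepare()
bad = requests.Request("GET", url, params=HEADERS).prepare()

print(good.headers["User-Agent"])   # the browser-like User-Agent is set as a header
print("User-Agent" in bad.headers)  # False: no such header was set
print(bad.url)                      # the User-Agent ended up in the query string instead
```

With `headers=`, an actual call would look like `requests.get(url, headers=HEADERS)`.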

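Once we have the page source, create a BeautifulSoup object, find the div for every novel, and loop over them to extract the fields we need. Here we extract the title, the cover image, the monthly ticket count, and the detail-page URL.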

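With the detail-page URL in hand, send a request to it, create another BeautifulSoup object, and extract the remaining information from the detail page. Each novel's dictionary is appended to a list, and that list is the function's return value.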

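Then call the other functions to carry out the related operations. Here, the information is saved to a CSV file that can be opened in Excel.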
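
Because the file is written once per page, it helps to write the header row only when the file is still empty. The following is a minimal sketch of that idea using csv.DictWriter: the function name `append_rows`, the English field names, and the temporary demo file are assumptions made for this example.

```python
import csv
import os
import tempfile

FIELDS = ["title", "cover", "tickets"]  # illustrative column names


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary folder: two separate calls, as if saving two pages.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_rows(demo_path, [{"title": "Book One", "cover": "1.jpg", "tickets": "120"}])
append_rows(demo_path, [{"title": "Book Two", "cover": "2.jpg", "tickets": "95"}])

with open(demo_path, encoding="utf-8", newline="") as f:
    lines = f.read().splitlines()
print(lines)  # ['title,cover,tickets', 'Book One,1.jpg,120', 'Book Two,2.jpg,95']
```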

Saving the covers
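
Saving a cover comes down to writing the downloaded bytes to a numbered file. Here is a hedged sketch of just that part: `save_cover` is a hypothetical helper, and the bytes below are placeholders rather than a real image, which in practice would come from the `.content` of a response.

```python
import os
import tempfile


def save_cover(data: bytes, directory: str, index: int) -> str:
    """Write image bytes to <directory>/<index>.jpg and return the file path."""
    os.makedirs(directory, exist_ok=True)  # create the folder on first use
    path = os.path.join(directory, f"{index}.jpg")
    with open(path, mode="wb") as f:
        f.write(data)
    return path


# Demo with placeholder bytes in a temporary folder.
demo_dir = os.path.join(tempfile.mkdtemp(), "covers")
saved = [save_cover(b"\xff\xd8fake-jpeg-bytes", demo_dir, i) for i in (1, 2)]
print(sorted(os.listdir(demo_dir)))  # ['1.jpg', '2.jpg']
```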

Finally, pagination is implemented in the main function, covering 10 pages in total.

🎲 Results

It is best to run the code on a good network connection.

💯 Complete Code

import requests
from bs4 import BeautifulSoup
import csv
import os.path

num = 1  # running counter used to name the saved cover images
class spider(object):
    # Magic method (constructor)
    def __init__(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'}
        # Instance attribute
        self.header = headers


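    def content(self,url):
        try:
            # Send the request; the headers must be passed by keyword,
            # because the second positional argument of requests.get is params
            response = requests.get(url,headers=self.header)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            print(e)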


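    def get_data(self,response):
        # List that collects one dictionary per novel
        all_list = []
        # Nothing to parse if the request failed
        if not response:
            return all_list
        # Create the BeautifulSoup object
        soup = BeautifulSoup(response,'lxml')
        # All novel divs
        all_data = soup.find('div',class_="rankpage_box").find_all('div',class_="rank_d_list borderB_c_dsh clearfix")
        # Loop over them and extract the information
        for i in all_data:
            item = {}   # dictionary for one novel
            item['title'] = i.find('div',class_="rank_d_b_name").find('a').text
            item['images'] = i.find('a').find('img').get('src')
            item['num'] = i.find(class_="rank_d_b_ticket").text
            # Detail-page URL
            detalis = i.find('a').get('href')


            # Request the detail page
            new_rsponse = requests.get(url=detalis,headers=self.header).text
            # Create the BeautifulSoup object
            new_di = BeautifulSoup(new_rsponse,'lxml')
            # Extract the information; fall back to 'NOT' if an element is missing
            try:
                item['scroc'] = new_di.find(class_="nums").text.replace(' ','')
            except AttributeError:
                item['scroc'] = 'NOT'
            try:
                item['manages'] = new_di.find(class_="book-dec Jbook-dec hide").find('p').text
            except AttributeError:
                item['manages'] = 'NOT'
            # Get the cover image URL from the dictionary
            images = item.get('images')
            # Download and save the cover
            self.save_images(images)
            # Add this novel to the result list
            all_list.append(item)
        return all_list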


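    def save_csv(self,all_list):
        # Write the header only when the file is new or empty, so that
        # appending later pages does not repeat it
        path = '纵横网.csv'
        write_header = not os.path.exists(path) or os.path.getsize(path) == 0
        # Open the file
        with open(path,mode='a+',newline='',encoding='utf-8')as f:
            # Columns: title, cover image, monthly tickets, rating info, description
            writer = csv.DictWriter(
                f,fieldnames=['书名','图片','月票','评分信息','详情']
            )
            if write_header:
                writer.writeheader()
            # Write the rows
            for i in all_list:
                writer.writerow(
                    {
                        '书名':i['title'],
                        '图片': i['images'],
                        '月票': i['num'],
                        '评分信息': i['scroc'],
                        '详情': i['manages'],
                    }
                )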


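    def save_images(self,images):
        global num
        # Create the output folder if it does not exist yet
        if not os.path.exists('./纵横小说/'):
            os.mkdir('./纵横小说/')
        # Request the image (headers passed by keyword, not as params)
        images_re = requests.get(images,headers=self.header).content
        # Save it, using the running counter as the file name
        with open('./纵横小说/' + str(num) + '.jpg',mode='wb')as f:
            f.write(images_re)
            num += 1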


    def main(self):
        for i in range(1,11):
            url = f'http://www.zongheng.com/rank/details.html?rt=1&d=1&p={i}'
            print(f'================ Saving page {i} =============')
            response = self.content(url)
            all_list = self.get_data(response)
            self.save_csv(all_list)


if __name__ == '__main__':
    mood = spider()
    mood.main()

Feel free to share. When reposting, please credit the source: 内存溢出

Original address: http://outofmemory.cn/langs/943572.html
