使用爬虫爬取两种数据：结构化非结构化文本_python

一、非结构化文本的爬取

微博上有一篇关于“#学校里的男生有多温柔#”的话题，点进去一看感觉评论很真实，于是想把评论给爬下来看一看，并生成词云。

刚开始思路是通过网页端微博爬取，通过开发者工具查看分析后，发现并没有看到相关评论。

百度搜索之后得知web做了一些反爬虫策略，不太容易爬取（踩了相当时间的坑）。

但是微博手机端相对容易些，于是转战手机端获取该评论链接，然后使用谷歌浏览器登录该链接，一阵分析后，发现评论是隐藏在这里的，于是获得了相应的url为“

https://m.weibo.cn/comments/hotflow?id=4400976515164341&mid=4400976515164341&max_id_type=

”。

于是使用requests来获得数据。

获得的数据是json格式，需要转换成字典格式，我们想要的数据存储在“text”键值中，这是一个字典格式的数据。

接下来我们做的就是把这些数据爬取下来并保存到本地。

代码如下：

得到的数据为：

接下来把这些数据生成了词云，词云方面借鉴了下，得到的词云图是这样的：

评论比较多，我只取了前2页的，具体代码如下：

import requests
import json
import re
from time import sleep
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np
import jieba
def get_comment(num):
    url = 'https://m.weibo.cn/comments/hotflow?id=4400976515164341&mid=4400976515164341&max_id_type=' + str(num)
    response = requests.get(url)
    #print(response.content.decode())
    content  = json.loads(response.text) #把获得数据转换成字典格式
    #print(content)
    #print(content['data']['data'])
    for i in content['data']['data']:
        text= i['text']
        #print(text)
        label_filter = re.compile(r'.*?', re.S) 
        comment = re.sub(label_filter, '', text)#把一些链接去掉，保留纯文本内容。


        #print(comment)
        with open('温柔男生.txt', "a", encoding="utf8") as f:
            f.write(comment + '\n')
def GetWordCloud():
    path_txt = r'E:\接口自动化\温柔男生.txt' #数据路径
    path_img = r"C:\Users\Administrator\Desktop\timg .jpg" #图片路径（我使用的图片是索隆大大的，哈哈）
    f = open(path_txt, 'r', encoding='UTF-8').read()
    background_image = np.array(Image.open(path_img))
    # 结巴分词，生成字符串，如果不通过分词，无法直接生成正确的中文词云,感兴趣的朋友可以去查一下，有多种分词模式
    # Python join() 方法用于将序列中的元素以指定的字符连接生成一个新的字符串。


    cut_text = " ".join(jieba.cut(f))
    wordcloud = WordCloud(
        # 设置字体，不然会出现口字乱码，文字的路径是电脑的字体一般路径，可以换成别的
        font_path="C:/Windows/Fonts/simfang.ttf",
        background_color="white",
        # mask参数=图片背景，必须要写上，另外有mask参数再设定宽高是无效的
        mask=background_image).generate(cut_text)
    # 生成颜色值
    image_colors = ImageColorGenerator(background_image)
    # 下面代码表示显示图片
    plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
    plt.axis("off")
    plt.show()
if __name__ == '__main__':
   for i in range(2):
       get_comment(i)
       sleep(2)
       
GetWordCloud()

二、结构化数据的爬取

今天使用python来爬取百度学术的论文信息，并且增加了简单的可视化功能，主要内容有：

    1、爬取百度学术上论文信息（主要用到requests库，bs4库中BeautifulSoup模块，pandas，re等）
    2、使用tkinter构建GUI，显示爬取的论文信息
    3、对爬取的数据进行可视化，进行词云展示
    可以在小窗口中输入检索信息(这里输入了检索词：图像重建、深度学习，没有选择输入作者和机构)：

GUI图像如下(界面很丑，只能靠初音撑颜值了)：

使用词云（用的一只小狗，应该还挺可爱）展示的关键词：

对爬取到的论文的发表时间进行了统计，使用柱状图展示：

一、爬取论文

爬取思路：
主要使用了requests库、bs4里面的BeautifulSoup和re模块

    准备工作（剩下的关注后续博文）
    获取数据
    解析数据
    保存数据
    个人认为思路很简单，就是和我们手动去网上找信息类似（找到网址->进入网页->查找信息）：
    a. 给定一个url，使用req = requests.get()函数获取网页信息，然后使用req.text返回解码后的网页信息（如果使用requests.get().content返回的是字节）
    b. 使用正则表达式对获取的网页信息进行匹配，获取自己需要的内容。

c. 保存数据

（1）准备工作

在这一部分先弄清楚准备爬取的网页以及需要在网页上寻找哪些信息！
百度学术的网址为：https://xueshu.baidu.com
进行高级检索时（假设只使用三个信息进行检索）：检索词+作者+机构名称
url可以写为：https://xueshu.baidu.com/s?wd=关键词+author作者+affs%3A%28机构名称%29 （这个东西怎么得到的？很简单）
1）进入百度学术页面
首先打开百度学术的页面，发现最上面的网址是https://xueshu.baidu.com。

2）进入检索后的页面
然后我们选择高级检索，在全部检索词、作者、机构三个位置分别输入图像重建、陈达、南京航空航天大学，如下图所示：

3）点击搜索后会跳转页面，如下图所示，关注一下最上面的https://那一行，观察一下和最开始https://xueshu.baidu.com的区别，可以发现就是在后面加了一点内容，可以换几个检索词试试，很快就能找到规律，https://xueshu.baidu.com/s?wd=关键词+author作者+affs%3A%28机构名称%29，也可以自己把这一串东西里面的关键词、作者、机构改了，然后用浏览器搜索，试试能不能跳转到百度学术的页面。

我们需要爬取的页面是 https://xueshu.baidu.com/s?wd=关键词+author作者+affs%3A%28机构名称%29 这个，而不是https://xueshu.baidu.com/s。

4）输入检索词后跳转的页面有很多论文，然后我们需要点击标题，进入每一个论文的界面，如果是使用爬虫的话，我们就需要知道每一个论文页面的链接是什么。

在上一个图的界面中，可以使用快捷键（Ctrl+Shift+I），然后出现下面这个图的界面。

然后可以在右边找到下面红色框里的这个，在这个页面要做的就是找到每一个论文页面的链接，可以使用下面的方法获取每一个论文页面的链接，其中content是使用requests.get().text()返回的网页内容。

通过上一步得到的链接（需要在前面加上https:），我们可以进入每一个论文的界面，如下所示：

我们需要从界面中爬取的信息有：标题、作者、摘要、关键词、DOI、被引量、年份。

同样使用Ctrl+Shift+I进入开发者工具，选择元素一项，然后在里面找到要爬取的信息，如需要寻找标题，使用正则表达式对content进行匹配即可，具体 *** 作后续再讲。

B站上有个爬虫实战的教程，把爬虫的流程讲的很详细：https://www.bilibili.com/video/BV12E411A7ZQ?p=16

先直接贴出代码吧，后续再进行分析。

主要分为3个模块，一个是爬虫模块，一个是构建GUI模块，一个是数据可视化模块。

爬虫模块中导入了GUI模块和可视化模块，在爬虫模块中运行三个模块。

（一）爬虫模块代码

# -*- codeing = utf-8 -*-
# @Time  :2021/6/28 22:03
# @Author: wangyuhan
# @File  :BaiduScholar.py
# @Software: PyCharm

'''

一、
    检索通式：
    百度学术网址：https://xueshu.baidu.com
    高级检索：检索词+作者+机构名称
    https://xueshu.baidu.com/s?wd=关键词+author作者+affs%3A%28机构名称%29

二、
    匹配规则：使用bs4.BeautifulSoup, lxml库进行解析
    匹配href=re.compile('^//xueshu.baidu.com')

三、
    分析页面：(需要注意：可能有些资料是图书，没有摘要，则跳过)
    作者：re.compile('"{\'button_tp\':\'author\'}">(.*?)')
    摘要：re.compile('
(.*?)')
    关键词：re.compile('target="_blank" >(.*?)')
    DOI: re.compile('data-click="{\'button_tp\':\'doi\'}">\s+(.*?)\s+')
    被引量：re.compile('"{\'button_tp\':\'sc_cited\'}">\s+(\d+)\s+')
    年份：re.compile('\s+(\d+)\s+')
    标题：re.compile('"{\'act_block\':\'main\',\'button_tp\':\'title\'}"\s+>\s+(\S+)\s+')

四、
    不爬取图书，若匹配不到摘要，则放弃该文献
爬虫思路：
1、爬取网页
2、解析数据
3、保存数据

'''
from bs4 import BeautifulSoup
import re
import requests
import os
import pandas as pd
import sys
import random
import time
# 导入自定义模块（GUI组件）
import WindowShow
import DataVis

class Crawler_Paper():
    '''百度学术爬虫类'''
    def __init__(self):
        self.base_url = 'https://xueshu.baidu.com'
        self.header = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
        self.paper_num = 1
        self.data = pd.DataFrame()    # 用于保存文献信息

    def input_message(self, url):
        '''
        输入检索信息，并且修改url
        :param url: 待爬网址
        :return x_url: 返回修改后的url，进行了关键词、作者和机构检索
        '''
        try:
            s = input('是否输入检索词(y/n，不区分大小写)>')
            if s == 'y' or s =='Y':
                keyword = input('检索词>')
                x_url = url + '/s?wd=' + keyword
            elif s !='n' and s != 'N':
                print('Input Error!')
                sys.exit('Goodbye!')

            a = input('是否输入作者(y/n，不区分大小写)>')
            if a == 'y' or a == 'Y':
                author = input('作者>')
                x_url = x_url + 'author' + author
            elif a !='n' and a != 'N':
                print('Input Error!')
                sys.exit('Goodbye!')

            af = input('是否输入机构(y/n，不区分大小写)>')
            if af == 'y' or af =='Y':
                affs = input('机构>')
                x_url = x_url + 'affs%3A%28' + affs + '%29'
            elif af != 'n' and af != 'N':
                print('Input Error!')
                sys.exit('Goodbye!')
            # 返回修改后的url
            return x_url
        # 增强程序的容错率，若没有输入，则退出程序
        except:
            print('No input!')
            sys.exit('Goodbye!')

    def get_page(self, url, params=None):
        '''
        获取网页信息函数，发送请求，获取响应内容
        :param url: 需要进入的网址
        :return req.text: 返回解码后的网页信息
        '''

        # 休眠，避免被对方反爬到
        time.sleep(random.randint(1,5))
        req = requests.get(url, headers=self.header, params=params)

        # 返回解码后的内容：requests.get().text
        # 返回字节：requests.get().content
        return req.text

    def analyze_paper(self, url):
        '''
        对每一个文章的网页进行检索，获取信息
        :param url: 需要检索的文章的url
        :return: msg: 包含文章的标题、作者、DOI、摘要、被引量、年份、关键词信息，若返回None,则表示不为期刊论文，跳过该文献
        '''
        content = self.get_page(url)
        # 匹配结果均为一个列表，关键词、作者、需要保留为List，其余需要转换为str
        # 关键词
        keyword = re.findall('target="_blank" >(.*?)', content)
        if not keyword:
            print('不为论文...')   # 如果检索的不是期刊论文，则跳过
            return None
        msg = dict(keyword=keyword)
        # 文章标题
        title = re.findall('"{\'act_block\':\'main\',\'button_tp\':\'title\'}"\s+>\s+(\S+)\s+', content)
        # 作者
        author = re.findall('"{\'button_tp\':\'author\'}">(.*?)', content)
        # 摘要
        abstract = re.findall('
(.*?)', content)
        # DOI
        DOI = re.findall('data-click="{\'button_tp\':\'doi\'}">\s+(.*?)\s+', content)
        # 被引量
        f = re.findall('"{\'button_tp\':\'sc_cited\'}">\s+(\d+)\s+', content)
        # 年份
        pub_time = re.findall('\s+(\d+)\s+', content)
        # 更新字典
        # 使用try处理异常，可能有些页面会有信息缺失
        print(author)
        try:
            msg.update({'title': title[0],
                        'author': author,
                        'abstract': abstract[0],
                        'DOI': DOI[0],
                        'f': f[0],
                        'time': pub_time[0]
                        })
        except IndexError:
            Ti = None if title==[] else title[0]
            Au = None if author==[] else author[0]
            Ab = None if abstract==[] else abstract[0]
            Doi = None if DOI==[] else DOI[0]
            fac = None if f==[] else f[0]
            Pt = None if pub_time==[] else pub_time[0]
            msg.update({'title': Ti,
                        'author': Au,
                        'abstract': Ab,
                        'DOI': Doi,
                        'f': fac,
                        'time': Pt
                        })
        return msg

        # 核心函数
    def analyze_page(self, content):
        '''
        解析网页信息
        :param content: 获取的网页信息（解码后的）
        '''
        # 获取每一个文章的链接
        soup = BeautifulSoup(markup=content, features='lxml')
        urls_list = soup.find_all('a', href=re.compile('^//xueshu.baidu.com/usercenter/paper'))
        # 对该页面所有文章链接进行迭代
        for url in urls_list:
            message = self.analyze_paper('https:'+url['href'])  # 获取文章页面的信息
            if message:
                '''保存数据'''
                print(f'第{self.paper_num}篇文章爬取中')
                self.save_data(message)     # 保存文献信息
                self.paper_num += 1
            else:
                continue
        # 进行下一页爬取，判断是否还存在下一页的链接
        next_urls = re.findall('

 
 （二）GUI模块代码 
# -*- codeing = utf-8 -*-
# @Time  :2021/6/29 14:12
# @Author:wangyuhan
# @File  :WindowShow.py
# @Software: PyCharm

import tkinter
import tkinter.ttk
import pandas as pd
import PIL.Image
import PIL.ImageTk


class MainForm():
    def __init__(self, datafile='./BaiduScholar/BaiduScholar.xlsx', image_path='./BaiduScholar/miku.png'):
        self.DataFile = datafile
        self.Image_path = image_path
        self.paper_dict = {}  # 文章内容
        self.root = tkinter.Tk()     # 创建窗体
        self.root.title('wangyuhan的程序')
        self.root.geometry("1200x290")  # 设置窗体尺寸
        self.creat_init_list() # 初始化列表
        self.load_data()      # 加载数据
        self.root.mainloop()   # 循环监听


    def creat_init_list(self):
        self.treeview = tkinter.ttk.Treeview(self.root, columns=('number', 'title', 'author', 'keyword', 'f', 'time', 'DOI', 'abstract'),
                                             show="headings", height=20)  # 创建普通列表
        # 配置列
        self.treeview.column('number', width=4, anchor=tkinter.CENTER)
        self.treeview.column('title', width=120, anchor=tkinter.W)   # anchor: 文字在cell里的对齐方式
        self.treeview.column('author', width=60, anchor=tkinter.W)
        self.treeview.column('keyword', width=60, anchor=tkinter.CENTER)
        self.treeview.column('f', width=5, anchor=tkinter.CENTER)
        self.treeview.column('time', width=10, anchor=tkinter.CENTER)
        self.treeview.column('DOI', width=30, anchor=tkinter.CENTER)
        self.treeview.column('abstract', width=220, anchor=tkinter.W)
        # 设置标题
        self.treeview.heading(column='number', text='序号')
        self.treeview.heading(column='title', text='标题')
        self.treeview.heading(column='author', text='作者')
        self.treeview.heading(column='keyword', text='关键词')
        self.treeview.heading(column='f', text='被引量')
        self.treeview.heading(column='time', text='发表年份')
        self.treeview.heading(column='DOI', text='DOI')
        self.treeview.heading(column='abstract', text='摘要')
        '''
        定义滚动条控件
        orient为滚动条的方向，vertical:纵向，horizontal:横向
        command = self.tree.yview   将滚动条绑定到treeview控件的Y轴
        '''
        self.scrollbar = tkinter.Scrollbar(self.root, orient='vertical', command=self.treeview.yview)
        # 滚动条显示
        self.scrollbar.pack(side=tkinter.RIGHT, fill=tkinter.Y)
        # 配置滚动条
        # self.scrollbar.config(command=)
        self.treeview.configure(yscrollcommand=self.scrollbar.set)
        self.treeview.bind("", self.paper_item_show)  # 绑定事件
        self.treeview.pack(fill=tkinter.BOTH)  # 显示组件

    def load_data(self):   # 创建下拉列表框
        self.frame = tkinter.Frame(self.root)
        data = pd.read_excel(self.DataFile)
        len = data.shape[0]   # 返回data的行数
        for i in range(len):
            number = str(i+1)
            title = data.loc[i, 'title']
            author = data.loc[i, 'author']
            keyword = data.loc[i, 'keyword']
            f = data.loc[i, 'f']
            pub_time = int(data.loc[i, 'time']) if data.loc[i, 'time']>0 else data.loc[i, 'time']
            DOI = data.loc[i, 'DOI']
            abstract = data.loc[i, 'abstract']
            self.treeview.insert(parent='', index=tkinter.END,
                                 values=(number, title, author, keyword, f, pub_time, DOI, abstract))

    def paper_item_show(self, event):   # 显示文章详细信息
        for index in self.treeview.selection():
            values = self.treeview.item(index, "values")    # 获得选定内容
            # 规范年份信息，如将2016.0转换为2016
            self.subroot = tkinter.Toplevel()      # 创建子窗体
            self.subroot.title('wangyuhan的小窗口')      # 设置子窗体标题
            self.subroot.geometry('800x400')       # 设置子窗体尺寸
            # picture = PIL.ImageTk.PhotoImage(image=PIL.Image.open(Image_path), size='0.01x0.01')  # 加载图片
            picture = PIL.ImageTk.PhotoImage(image=PIL.Image.open(self.Image_path).resize((290, 380)))  # 加载图片
            label_picture = tkinter.Label(self.subroot, image=picture)
            label_picture.grid(row=0, column=0)    # 显示图片
            sub_frame = tkinter.Frame(self.subroot)  # 定义子Frame
            # 设置标签
            tkinter.Label(sub_frame, text='【标题】' + values[1], font=('微软雅黑', 12), justify='left').grid(row=0, sticky=tkinter.W)
            tkinter.Label(sub_frame, text='【作者】' + values[2], font=('微软雅黑', 12), justify='left').grid(row=1, sticky=tkinter.W)
            tkinter.Label(sub_frame, text='【关键词】' + values[3], font=('微软雅黑', 12), justify='left').grid(row=2, sticky=tkinter.W)
            tkinter.Label(sub_frame, text='【被引量】' + values[4], font=('微软雅黑', 12), justify='left').grid(row=3, sticky=tkinter.W)
            tkinter.Label(sub_frame, text='【年份】' + values[5], font=('微软雅黑', 12), justify='left').grid(row=4, sticky=tkinter.W)
            tkinter.Label(sub_frame, text='【DOI】' + values[6], font=('微软雅黑', 12), justify='left').grid(row=5, sticky=tkinter.W)
            tkinter.Label(sub_frame, text='【摘要】', font=('微软雅黑', 12), justify='left').grid(row=6, sticky=tkinter.W)
            # 创建一个frame
            abstract_frame = tkinter.Frame(sub_frame)
            abstract_listbox = tkinter.Listbox(abstract_frame, height=10, width=50)  # 创建Listbox

            # 需要设置34个中文字符换行
            Length = len(values[7].split('/'))
            list_item = list(values[7])
            text = ['']
            num = 0
            Index = 0
            for i in list_item:
                if (Index+1) % 30 == 0:
                    num += 1
                    text.append(i)
                else:
                    text[num] = text[num]+i
                Index += 1
            for item in text:
                abstract_listbox.insert(tkinter.END, item)

            abstract_scrollbar = tkinter.Scrollbar(abstract_frame)
            abstract_scrollbar.config(command=abstract_listbox.yview)
            abstract_scrollbar.pack(side=tkinter.RIGHT, fill=tkinter.Y)
            abstract_listbox.pack()

            abstract_frame.grid(row=7, sticky=tkinter.W)    # 显示Frame
            sub_frame.grid(row=0, column=1, sticky=tkinter.N)
            self.subroot.mainloop()    # 显示子窗体
 
 三）可视化模块代码 
# -*- codeing = utf-8 -*-
# @Time  :2021/6/30 10:01
# @Author:汪煜晗
# @File  :DataVis.py
# @Software: PyCharm
import numpy as np
import pandas as pd
import wordcloud
import stylecloud
from collections import Counter
import matplotlib.pyplot as plt


def DataProcess(path='./BaiduScholar/BaiduScholar.xlsx'):
    '''
    提取时间信息和关键词，返回时间信息格式为np.array，关键词信息格式为list
    '''
    # 读取数据
    read_data = pd.read_excel(path)
    keyword = []
    for item in read_data['keyword']:   # 读取关键词信息
        k = str(item)
        # 规范关键词格式
        k = k.replace('[', '')
        k = k.replace(']', '')
        k = k.replace('\'', '')
        k = k.replace('；', '、') if '；' in k else k.replace(',', '、') if ',' in k else k
        k = k.split('、')
        if k[-1] == '':
            del k[-1]
        keyword.append(k)
    # 规范时间信息
    t = read_data['time']  # 读取时间信息
    t = t.dropna(axis=0, how='all')  # 清除Nan
    time = np.asarray(t)
    return keyword, time

def cloud_show(KW, save_path='./BaiduScholar/stylecloud.png'):
    result = {}
    for item in KW:
        for words in item:
            word = words.strip()
            print(word)
            if len(word) == 1:    # 单个字不记录
                continue
            else:
                result[word] = result.get(word, '') + ' ' + word   # 重复的词语累加统计量
        result_text = " ".join(result.values())     # 获取全部词汇

        stylecloud.gen_stylecloud(text=result_text, collocations=False,
                                  font_path=r'C:\Windows\Fonts\STXINGKA.TTF',
                                  icon_name='fas fa-dog',
                                  size=600,
                                  output_name=save_path
                                  )
        # cloud = wordcloud.WordCloud(
        #     # 设置字体路径、背景色、宽度、高度
        #     font_path=r'C:\Windows\Fonts\STXINGKA.TTF',
        #     background_color='white', width=1000, height=380
        # )
        # cloud.generate(result_text)    # 加载处理文本
        # cloud.to_file(save_path)                # 保存处理结果

def time_show(TM, save_path='./BaiduScholar/time.png'):
    # 统计个时间文章数量
    num = pd.DataFrame([Counter(TM)])  # 将统计结果的字典类型转换为DataFrame
    num.sort_index(axis=1, ascending=True, inplace=True)      # 对列索引进行升序
    # 可视化
    fig, ax = plt.subplots()
    # ggplot (a popular plotting package for R).
    plt.style.use('ggplot')
    TimeLabel = num.columns.values.tolist()
    TimeLabel = list(map(int, TimeLabel))
    y_num = np.arange(len(TimeLabel))
    timedata = num.iloc[0,:].values
    p1 = ax.barh(y_num, timedata, align='center', color=list(plt.rcParams['axes.prop_cycle'])[2]['color'])
    ax.set_yticks(y_num)
    ax.set_yticklabels(TimeLabel)
    ax.invert_yaxis()  # labels read top-to-bottom
    # 增加标签
    # ax.bar_label(p1, padding=3)
    ax.set_xlabel('number')
    ax.set_title('The number of articles')
    plt.savefig(save_path, dpi=800)
    plt.show()


def main():
    Keyword, time = DataProcess()
    cloud_show(KW=Keyword)
    time_show(TM=time)
 
 
 
 
 
 
					
										


					
						欢迎分享，转载请注明来源：内存溢出
原文地址: http://outofmemory.cn/langs/577999.html

使用爬虫爬取两种数据：结构化非结构化文本

发表评论

评论列表（0条）