A course project of mine required some NLP analysis, and I couldn't find any code online that worked well, so I wrote my own crawler. Given a topic name, it scrapes the posts under that topic, their comments, and information about the posters. In my own testing it has run without any notable problems.
The code in the walkthrough below is fragmented; if you want to test it, use the complete code at the end.
- First, the results: I didn't let the crawl run to completion, but it still collected close to eight thousand rows.
- 1. Environment Setup
- 2. Fetching Weibo Posts
- 3. Fetching Poster Information
- 4. Fetching Comment Data
- 5. Fetching Commenter Information
- 6. Saving to CSV
- 7. Complete Code
1. Environment Setup

Not much to say here. These are the packages used; the only third-party ones are requests and lxml, which can be installed with `pip install requests lxml` (the rest ship with Python).
```python
import requests
from lxml import etree
import csv
import re
import time
import random
from html.parser import HTMLParser
```

2. Fetching Weibo Posts
- First, settle on the main fields to capture: post content, comment count, like count, publication time, and poster name. weibo.com is used as the main data source (simply because its search works well).
- With the target known, the next step is analyzing the page structure; the fields above are what we need to locate in it.
- Paging through the results and watching the URL shows that only the topic name and the page number change, so the scheme is as follows:
```python
baseUrl = 'https://s.weibo.com/weibo?q=%23{}%23&Refer=index'  # defined near the top of the full script

topic = '扫黑风暴'
url = baseUrl.format(topic)
tempUrl = url + '&page=' + str(page)
```
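For a quick sanity check, this is what the constructed URL looks like for an arbitrary page number (`%23` is simply the URL-encoded `#` that wraps the topic into a hashtag; the page number below is made up):

```python
# Sanity check of the URL construction above (the page number is arbitrary).
baseUrl = 'https://s.weibo.com/weibo?q=%23{}%23&Refer=index'  # %23 is '#'
topic = '扫黑风暴'
page = 2
print(baseUrl.format(topic) + '&page=' + str(page))
# -> https://s.weibo.com/weibo?q=%23扫黑风暴%23&Refer=index&page=2
```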
- Knowing the rough structure, we can start extracting the elements:
```python
# NOTE: the XPath class names ("card-wrap", "card", "from") were lost in
# transcription; they are restored here from s.weibo.com's usual search-result
# markup and should be verified against the live page.
for i in range(1, count + 1):
    try:
        # long posts expose their full text in a "feed_list_content_full" node
        contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content_full"]')
        contents = contents[0].xpath('string(.)').strip()  # all text under the node
    except:
        contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content"]')
        try:
            contents = contents[0].xpath('string(.)').strip()
        except:
            continue  # this post is malformed, skip it
    contents = contents.replace('收起全文d', '')  # strip the "collapse" widget text
    contents = contents.replace('收起d', '')
    contents = contents.split(' 2')[0]  # crude cut of trailing date text (kept from the original)
    # poster's name
    name = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/div[1]/div[2]/a')[0].text
    # post URL
    weibo_url = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a/@href')[0]
    url_str = r'.*?com/\d+/(.*)\?refer_flag=\d+_'
    res = re.findall(url_str, weibo_url)
    weibo_url = res[0]
    host_url = 'https://weibo.cn/comment/' + weibo_url
    # publication time
    timeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a')[0].text.strip()
    # like count
    likeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[3]/a/button/span[2]')[0].text
    hostComment = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[2]/a')[0].text
    # an empty counter renders as the bare label ('赞' / '评论 '), meaning 0
    if likeA == '赞':
        likeA = 0
    if hostComment == '评论 ':
        hostComment = 0
    if hostComment != 0:
        print('Crawling comments of post', i, 'on page', page)
        getComment(host_url)
    try:
        hosturl, host_sex, host_location, hostcount, hostfollow, hostfans = getpeople(name)
        row = ['微博', name, hosturl, host_sex, host_location, hostcount, hostfollow, hostfans, contents, timeA, likeA]
        writer.writerow(row)
    except:
        continue
```
This post URL matters a great deal: everything crawled later for a post is located through it.
```python
weibo_url = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a/@href')[0]
url_str = r'.*?com/\d+/(.*)\?refer_flag=\d+_'
res = re.findall(url_str, weibo_url)
weibo_url = res[0]
host_url = 'https://weibo.cn/comment/' + weibo_url
```
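To make the extraction concrete, here is the regex applied to a made-up href of the shape the search page returns (the user id and post id below are invented for illustration):

```python
import re

# A hypothetical href as found on the search page (ids are invented).
href = '//weibo.com/2656274875/L1v0Abcde?refer_flag=1001030103_'
m = re.findall(r'.*?com/\d+/(.*)\?refer_flag=\d+_', href)
print(m[0])                                # L1v0Abcde (the post's identifier)
print('https://weibo.cn/comment/' + m[0])  # the post's comment page on weibo.cn
```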
What we need is the part shown in the figure below (the href carrying the post's identifier).

- Check the page elements to decide whether there is another page:
```python
try:
    if pageCount == 1:
        # on page 1 there is a single pager link ("next page")
        pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a')[0].text
        print(pageA)
        pageCount = pageCount + 1
    elif pageCount == 50:
        # topic search results are capped at 50 pages
        print('No more pages')
        break
    else:
        # from page 2 on, the second link is "next page"
        pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a[2]')[0].text
        pageCount = pageCount + 1
        print(pageA)
except:
    print('No more pages')
    break
```
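The same check can be wrapped in a small helper. This is only a sketch, assuming the two pager XPaths above are still valid; `has_next_page` is not part of the original script:

```python
def has_next_page(html, page_count, max_pages=50):
    """Mirror of the inline pager check above: True if a 'next page' link exists."""
    if page_count >= max_pages:  # topic search results are capped at 50 pages
        return False
    xpath = ('//*[@id="pl_feedlist_index"]/div[5]/div/a'
             if page_count == 1
             else '//*[@id="pl_feedlist_index"]/div[5]/div/a[2]')
    return bool(html.xpath(xpath))
```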
- The resulting fields are:
```python
name, weibo_url, contents, timeA, likeA, hostComment
```

3. Fetching Poster Information
The information on weibo.com is not laid out conveniently, so the profile data is scraped from weibo.cn instead; its page structure is as follows:

- To reach a user's profile page we need its URL. The approach taken here: search the user name obtained in the previous step on weibo.com and take the URL of the first result. Since the scraped names are complete, the first hit is exactly the user we want.
- The relevant code:
```python
url2 = 'https://s.weibo.com/user?q='
while True:
    try:
        # warm up weibo.cn's search endpoint, then search the name on weibo.com
        response = requests.post('https://weibo.cn/search/?pos=search', headers=headers_cn,
                                 data={'suser': '找人', 'keyword': name})
        tempUrl2 = url2 + str(name) + '&Refer=weibo_user'
        print('Search page:', tempUrl2)
        response2 = requests.get(tempUrl2, headers=headers_com)
        html = etree.HTML(response2.content, parser=etree.HTMLParser(encoding='utf-8'))
        # the first result in the list is the exact-name match we want
        hosturl_01 = html.xpath('/html/body/div[1]/div[2]/div/div[2]/div[1]/div[3]/div[1]/div[2]/div/a/@href')[0]
        url_str = '.*?com/(.*)'
        res = re.findall(url_str, hosturl_01)
        hosturl = 'https://weibo.cn/' + res[0]
        break
    except:
        pass  # retried with backoff in the complete code below
```
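When the search fails repeatedly (usually a sign the IP has been rate-limited), the complete code sleeps a long random interval after ten quick retries. The pattern generalizes to a small helper; this is only a sketch, and `with_backoff` is not part of the original script:

```python
import random
import time

def with_backoff(fetch, max_quick_retries=10):
    """Retry `fetch` until it succeeds. After `max_quick_retries` failures,
    sleep a long random interval first (the IP is probably rate-limited).
    Mirrors the retry logic used by getpeople() in the complete code."""
    failures = 0
    while True:
        try:
            return fetch()
        except Exception:
            if failures >= max_quick_retries:
                stop = random.randint(60, 300)
                print('IP seems blocked, sleeping for', stop, 'seconds')
                time.sleep(stop)
            time.sleep(random.randint(0, 10))
            failures += 1
```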
- With the URL in hand, crawl the profile data from weibo.cn:
```python
while True:
    try:
        response = requests.get(hosturl, headers=headers_cn)
        html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
        # post count; the regex (restored from the stripped transcription)
        # peels the number out of strings like '微博[183]'
        hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
        hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
        # following count, e.g. '关注[45]'
        hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
        hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
        # fan count, e.g. '粉丝[1024]'
        hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
        hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
        # gender and location, e.g. '\xa0男/北京'
        host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
        break
    except:
        print('User lookup failed')
        time.sleep(random.randint(0, 10))
try:
    host_sex_locationA = host_sex_location[0].split('\xa0')
    host_sex_locationA = host_sex_locationA[1].split('/')
    host_sex = host_sex_locationA[0]
    host_location = host_sex_locationA[1].strip()
except:
    host_sex_locationA = host_sex_location[1].split('\xa0')
    host_sex_locationA = host_sex_locationA[1].split('/')
    host_sex = host_sex_locationA[0]
    host_location = host_sex_locationA[1].strip()
return hosturl, host_sex, host_location, hostcount, hostfollow, hostfans
```
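To make the parsing concrete, here are the two tricky bits run on made-up profile strings (all values invented; note that `\xa0` is the non-breaking space weibo.cn uses):

```python
import re

# Hypothetical profile strings in weibo.cn's format (values invented).
print(re.match(r'(\S\S\S)(\d+)', '微博[183]').group(2))   # '183'
print(re.match(r'(\S\S\S)(\d+)', '粉丝[1024]').group(2))  # '1024'

info = '\xa0男/北京'                 # gender/location field; '\xa0' is &nbsp;
sex, location = info.split('\xa0')[1].split('/')
print(sex, location.strip())         # 男 北京
```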
4. Fetching Comment Data

- Step one already gave us each post's identifier and its weibo.cn URL, so we can crawl the comments directly from that URL.
- First, a look at the page's general layout.
- The number of comments per page varies, and the comment nodes carry no uniquely identifying attribute, which makes XPath awkward here, so regular expressions are used instead.
- The comment-scraping code is as follows:
```python
page = 0
pageCount = 1
count = []       # comment text
date = []        # timestamp
like_times = []  # likes
user_url = []    # commenter URL
user_name = []   # commenter nickname
while True:
    page = page + 1
    print('Crawling comment page', page)
    if page == 1:
        url = hosturl
    else:
        url = hosturl + '?page=' + str(page)
    print(url)
    try:
        response = requests.get(url, headers=headers_cn)
    except:
        break
    html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
    # NOTE: the HTML tags inside these patterns were stripped in transcription;
    # they are reconstructed here from weibo.cn's typical comment markup and
    # should be verified against the live page source.
    user_re = r'<a href="(/u/\d+)">'              # commenter profile link
    user_name_re = r'<a href="/u/\d+">(.*?)</a>'  # commenter nickname
    co_re = r'<span class="ctt">(.*?)</span>'     # comment body
    zan_re = r'赞\[(\d+)\]'                        # like count
    date_re = r'<span class="ct">(.*?)&nbsp;'      # timestamp
    count_re = r'回复:(.*)'                         # reply text (defined but unused below)
    user_name2 = re.findall(user_name_re, response.text)
    zan = re.findall(zan_re, response.text)
    date_2 = re.findall(date_re, response.text)
    count_2 = re.findall(co_re, response.text)
    user_url2 = re.findall(user_re, response.text)
    flag = len(zan)
    for i in range(flag):
        count.append(count_2[i])
        date.append(date_2[i])
        like_times.append(zan[i])
        user_name.append(user_name2[i])
        user_url.append('https://weibo.cn' + user_url2[i])
    try:
        if pageCount == 1:
            # on the first page the only pager link is "next page"
            pageA = html.xpath('//*[@id="pagelist"]/form/div/a')[0].text
            print('=' * 40, pageA, '=' * 40)
            pageCount = pageCount + 1
        else:
            # from page 2 on, the first link is "next page"
            pageA = html.xpath('//*[@id="pagelist"]/form/div/a[1]')[0].text
            pageCount = pageCount + 1
            print('=' * 40, pageA, '=' * 40)
    except:
        print('No more pages')
        break
print('#' * 20, 'Comments done; crawling commenter info next', '#' * 20)
print(len(like_times), len(count), len(date), len(user_url), len(user_name))
```
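To see how these patterns pull fields out of the raw page, here they are run against a tiny made-up fragment in the shape of weibo.cn's comment markup (all names, ids, and values invented):

```python
import re

# A minimal, made-up fragment shaped like a weibo.cn comment (values invented).
fragment = ('<div class="c" id="C_450001"><a href="/u/1234567890">some_user</a>'
            '<span class="ctt">Great show!</span> 赞[3] '
            '<span class="ct">12月08日 10:20&nbsp;来自网页</span></div>')

print(re.findall(r'<a href="(/u/\d+)">', fragment))             # ['/u/1234567890']
print(re.findall(r'<span class="ctt">(.*?)</span>', fragment))  # ['Great show!']
print(re.findall(r'赞\[(\d+)\]', fragment))                      # ['3']
print(re.findall(r'<span class="ct">(.*?)&nbsp;', fragment))     # ['12月08日 10:20']
```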
5. Fetching Commenter Information

- On weibo.cn the key part of each commenter's URL is available directly, so there is no need to search by name; we can take the URL and crawl straight on:
```python
def findUrl(hosturl):
    while True:
        try:
            print(hosturl)
            response = requests.get(hosturl, headers=headers_cn)
            html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
            # post / following / fan counts, e.g. '微博[183]' -> '183'
            hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
            hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
            hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
            hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
            hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
            hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
            host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
            break
        except:
            print('User lookup failed')
            time.sleep(random.randint(0, 5))
    try:
        host_sex_locationA = host_sex_location[0].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    except:
        host_sex_locationA = host_sex_location[1].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    time.sleep(random.randint(0, 2))
    return host_sex, host_location, hostcount, hostfollow, hostfans
```
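Usage is straightforward once a commenter URL has been collected (the id below is invented for illustration):

```python
# Hypothetical commenter URL collected by getComment() (the id is invented).
sex, location, posts, following, fans = findUrl('https://weibo.cn/u/1234567890')
print(sex, location, posts, following, fans)
```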
6. Saving to CSV

To recap, the overall flow is:
Fetching posts:
1. Get post URLs, user names, and post content from weibo.com.
2. Search each user name on weibo.com to obtain the user's URL.
3. Use the constructed URL to crawl the poster's profile from weibo.cn.
Fetching comments:
1. Build each post's weibo.cn address from the identifier obtained above.
2. Extract the comment fields with regular expressions.
- Create the output file and write the header row:

```python
topic = '扫黑风暴'
url = baseUrl.format(topic)
print(url)
csvfile = open(topic + '.csv', 'a', newline='', encoding='utf-8-sig')  # utf-8-sig so Excel reads it correctly
writer = csv.writer(csvfile)
writer.writerow(['类别', '用户名', '用户链接', '性别', '地区', '微博数', '关注数', '粉丝数', '评论内容', '评论时间', '点赞次数'])
```
- Store a post row:

```python
hosturl, host_sex, host_location, hostcount, hostfollow, hostfans = getpeople(name)
row = ['微博', name, hosturl, host_sex, host_location, hostcount, hostfollow, hostfans, contents, timeA, likeA]
writer.writerow(row)
```
- Store a comment row:

```python
row = ['评论', user_name[i], user_url[i], host_sex, host_location, hostcount, hostfollow, hostfans, count[i], date[i], like_times[i]]
writer.writerow(row)
```

7. Complete Code
```python
# -*- coding: utf-8 -*-
# @Time : 2021/12/8 10:20
# @Author : MinChess
# @File : weibo.py
# @Software: PyCharm
import requests
from lxml import etree
import csv
import re
import time
import random
from html.parser import HTMLParser

# Fill in your own logged-in cookies and user-agent strings.
headers_com = {
    'cookie': '看不到我',
    'user-agent': '看不到我'
}
headers_cn = {
    'cookie': '看不到我',
    'user-agent': '看不到我'
}
baseUrl = 'https://s.weibo.com/weibo?q=%23{}%23&Refer=index'
topic = '扫黑风暴'
csvfile = open(topic + '.csv', 'a', newline='', encoding='utf-8-sig')
writer = csv.writer(csvfile)

# NOTE: the XPath class names ("card-wrap", "card", "from") and the regex
# patterns in getComment() were corrupted in transcription; they are restored
# from the sites' typical markup and should be verified against the live pages.

def getTopic(url):
    page = 0
    pageCount = 1
    while True:
        weibo_content = []
        weibo_liketimes = []
        weibo_date = []
        page = page + 1
        tempUrl = url + '&page=' + str(page)
        print('-' * 36, tempUrl, '-' * 36)
        response = requests.get(tempUrl, headers=headers_com)
        html = etree.HTML(response.text, parser=etree.HTMLParser(encoding='utf-8'))
        count = len(html.xpath('//div[@class="card-wrap"]')) - 2  # the last two cards are not posts
        for i in range(1, count + 1):
            try:
                contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content_full"]')
                contents = contents[0].xpath('string(.)').strip()  # all text under the node
            except:
                contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content"]')
                try:
                    contents = contents[0].xpath('string(.)').strip()
                except:
                    continue  # this post is malformed, skip it
            contents = contents.replace('收起全文d', '')
            contents = contents.replace('收起d', '')
            contents = contents.split(' 2')[0]
            # poster's name
            name = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/div[1]/div[2]/a')[0].text
            # post URL
            weibo_url = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a/@href')[0]
            url_str = r'.*?com/\d+/(.*)\?refer_flag=\d+_'
            res = re.findall(url_str, weibo_url)
            weibo_url = res[0]
            host_url = 'https://weibo.cn/comment/' + weibo_url
            # publication time
            timeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a')[0].text.strip()
            # like count
            likeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[3]/a/button/span[2]')[0].text
            hostComment = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[2]/a')[0].text
            # an empty counter renders as the bare label, meaning 0
            if likeA == '赞':
                likeA = 0
            if hostComment == '评论 ':
                hostComment = 0
            if hostComment != 0:
                print('Crawling comments of post', i, 'on page', page)
                getComment(host_url)
            try:
                hosturl, host_sex, host_location, hostcount, hostfollow, hostfans = getpeople(name)
                row = ['微博', name, hosturl, host_sex, host_location, hostcount, hostfollow, hostfans, contents, timeA, likeA]
                writer.writerow(row)
            except:
                continue
        print('=' * 66)
        try:
            if pageCount == 1:
                pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a')[0].text
                print(pageA)
                pageCount = pageCount + 1
            elif pageCount == 50:
                print('No more pages')
                break
            else:
                pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a[2]')[0].text
                pageCount = pageCount + 1
                print(pageA)
        except:
            print('No more pages')
            break

def getpeople(name):
    findPeople = 0
    url2 = 'https://s.weibo.com/user?q='
    while True:
        try:
            response = requests.post('https://weibo.cn/search/?pos=search', headers=headers_cn,
                                     data={'suser': '找人', 'keyword': name})
            tempUrl2 = url2 + str(name) + '&Refer=weibo_user'
            print('Search page:', tempUrl2)
            response2 = requests.get(tempUrl2, headers=headers_com)
            html = etree.HTML(response2.content, parser=etree.HTMLParser(encoding='utf-8'))
            hosturl_01 = html.xpath('/html/body/div[1]/div[2]/div/div[2]/div[1]/div[3]/div[1]/div[2]/div/a/@href')[0]
            url_str = '.*?com/(.*)'
            res = re.findall(url_str, hosturl_01)
            hosturl = 'https://weibo.cn/' + res[0]
            print('Profile page:', hosturl)
            break
        except:
            if findPeople == 10:
                stop = random.randint(60, 300)
                print('IP seems blocked, sleeping for', stop, 'seconds')
                time.sleep(stop)
                if response.status_code == 200:
                    return  # give up on this user; the caller treats it as a failure
            print('Retrying user search')
            time.sleep(random.randint(0, 10))
            findPeople = findPeople + 1
    while True:
        try:
            response = requests.get(hosturl, headers=headers_cn)
            html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
            # post count; peel the number out of strings like '微博[183]'
            hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
            hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
            # following count
            hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
            hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
            # fan count
            hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
            hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
            # gender and location
            host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
            break
        except:
            print('User lookup failed')
            time.sleep(random.randint(0, 10))
    try:
        host_sex_locationA = host_sex_location[0].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    except:
        host_sex_locationA = host_sex_location[1].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    return hosturl, host_sex, host_location, hostcount, hostfollow, hostfans

def getComment(hosturl):
    page = 0
    pageCount = 1
    count = []       # comment text
    date = []        # timestamp
    like_times = []  # likes
    user_url = []    # commenter URL
    user_name = []   # commenter nickname
    while True:
        page = page + 1
        print('Crawling comment page', page)
        if page == 1:
            url = hosturl
        else:
            url = hosturl + '?page=' + str(page)
        print(url)
        try:
            response = requests.get(url, headers=headers_cn)
        except:
            break
        html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
        # reconstructed patterns; verify against the live page source
        user_re = r'<a href="(/u/\d+)">'
        user_name_re = r'<a href="/u/\d+">(.*?)</a>'
        co_re = r'<span class="ctt">(.*?)</span>'
        zan_re = r'赞\[(\d+)\]'
        date_re = r'<span class="ct">(.*?)&nbsp;'
        count_re = r'回复:(.*)'  # reply text (unused below)
        user_name2 = re.findall(user_name_re, response.text)
        zan = re.findall(zan_re, response.text)
        date_2 = re.findall(date_re, response.text)
        count_2 = re.findall(co_re, response.text)
        user_url2 = re.findall(user_re, response.text)
        flag = len(zan)
        for i in range(flag):
            count.append(count_2[i])
            date.append(date_2[i])
            like_times.append(zan[i])
            user_name.append(user_name2[i])
            user_url.append('https://weibo.cn' + user_url2[i])
        try:
            if pageCount == 1:
                # on the first page the only pager link is "next page"
                pageA = html.xpath('//*[@id="pagelist"]/form/div/a')[0].text
                print('=' * 40, pageA, '=' * 40)
                pageCount = pageCount + 1
            else:
                # from page 2 on, the first link is "next page"
                pageA = html.xpath('//*[@id="pagelist"]/form/div/a[1]')[0].text
                pageCount = pageCount + 1
                print('=' * 40, pageA, '=' * 40)
        except:
            print('No more pages')
            break
    print('#' * 20, 'Comments done; crawling commenter info next', '#' * 20)
    print(len(like_times), len(count), len(date), len(user_url), len(user_name))
    flag = min(len(like_times), len(count), len(date), len(user_url), len(user_name))
    for i in range(flag):
        host_sex, host_location, hostcount, hostfollow, hostfans = findUrl(user_url[i])
        print('Crawling commenter', i + 1, 'of comment page', page)
        # '微博' marks post rows, '评论' marks comment rows in the CSV
        row = ['评论', user_name[i], user_url[i], host_sex, host_location, hostcount, hostfollow, hostfans, count[i], date[i], like_times[i]]
        writer.writerow(row)
        time.sleep(random.randint(0, 2))

def findUrl(hosturl):
    while True:
        try:
            print(hosturl)
            response = requests.get(hosturl, headers=headers_cn)
            html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
            hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
            hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
            hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
            hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
            hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
            hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
            host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
            break
        except:
            print('User lookup failed')
            time.sleep(random.randint(0, 5))
    try:
        host_sex_locationA = host_sex_location[0].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    except:
        host_sex_locationA = host_sex_location[1].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    time.sleep(random.randint(0, 2))
    return host_sex, host_location, hostcount, hostfollow, hostfans

if __name__ == '__main__':
    topic = '扫黑风暴'
    url = baseUrl.format(topic)
    print(url)
    writer.writerow(['类别', '用户名', '用户链接', '性别', '地区', '微博数', '关注数', '粉丝数', '评论内容', '评论时间', '点赞次数'])
    getTopic(url)  # crawl the topic page for posts
```
- There is a fair amount of code, but on closer inspection it is just four functions calling one another; the idea and the methods are simple.
- Some details may not be explained in full, but the overall approach should be clear by now: fetch the data layer by layer. Thanks for bearing with me!