A course project of mine required some NLP analysis, and I couldn't find any code online that worked well, so I wrote my own crawler. Given a topic name, it scrapes the posts under that topic, their comments, and information about the posters. In my own testing it has run without any notable problems.
The code in the walkthrough below is fragmented; if you want to test it, use the complete code at the end.
- First, the results: I didn't let the crawl run to completion, but it still collected close to eight thousand rows.
- 1. Environment Setup
- 2. Fetching Weibo Posts
- 3. Fetching Poster Information
- 4. Fetching Comment Data
- 5. Fetching Commenter Information
- 6. Saving to CSV
- 7. Complete Code
1. Environment Setup

Not much to say here. These are the packages used; the only third-party ones are requests and lxml, which can be installed with `pip install requests lxml` (the rest ship with Python).
```python
import requests
from lxml import etree
import csv
import re
import time
import random
from html.parser import HTMLParser
```

2. Fetching Weibo Posts
- First, settle on the main fields to capture: post content, comment count, like count, publication time, and poster name. weibo.com is used as the main data source (simply because its search works well).
- With the target known, the next step is analyzing the page structure; the fields above are what we need to locate in it.
- Paging through the results and watching the URL shows that only the topic name and the page number change, so the scheme is as follows:
```python
baseUrl = 'https://s.weibo.com/weibo?q=%23{}%23&Refer=index'  # defined near the top of the full script

topic = '扫黑风暴'
url = baseUrl.format(topic)
tempUrl = url + '&page=' + str(page)
```
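For a quick sanity check, this is what the constructed URL looks like for an arbitrary page number (`%23` is simply the URL-encoded `#` that wraps the topic into a hashtag; the page number below is made up):

```python
# Sanity check of the URL construction above (the page number is arbitrary).
baseUrl = 'https://s.weibo.com/weibo?q=%23{}%23&Refer=index'  # %23 is '#'
topic = '扫黑风暴'
page = 2
print(baseUrl.format(topic) + '&page=' + str(page))
# -> https://s.weibo.com/weibo?q=%23扫黑风暴%23&Refer=index&page=2
```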
- Knowing the rough structure, we can start extracting the elements:
```python
# NOTE: the XPath class names ("card-wrap", "card", "from") were lost in
# transcription; they are restored here from s.weibo.com's usual search-result
# markup and should be verified against the live page.
for i in range(1, count + 1):
    try:
        # long posts expose their full text in a "feed_list_content_full" node
        contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content_full"]')
        contents = contents[0].xpath('string(.)').strip()  # all text under the node
    except:
        contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content"]')
        try:
            contents = contents[0].xpath('string(.)').strip()
        except:
            continue  # this post is malformed, skip it
    contents = contents.replace('收起全文d', '')  # strip the "collapse" widget text
    contents = contents.replace('收起d', '')
    contents = contents.split(' 2')[0]  # crude cut of trailing date text (kept from the original)
    # poster's name
    name = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/div[1]/div[2]/a')[0].text
    # post URL
    weibo_url = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a/@href')[0]
    url_str = r'.*?com/\d+/(.*)\?refer_flag=\d+_'
    res = re.findall(url_str, weibo_url)
    weibo_url = res[0]
    host_url = 'https://weibo.cn/comment/' + weibo_url
    # publication time
    timeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a')[0].text.strip()
    # like count
    likeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[3]/a/button/span[2]')[0].text
    hostComment = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[2]/a')[0].text
    # an empty counter renders as the bare label ('赞' / '评论 '), meaning 0
    if likeA == '赞':
        likeA = 0
    if hostComment == '评论 ':
        hostComment = 0
    if hostComment != 0:
        print('Crawling comments of post', i, 'on page', page)
        getComment(host_url)
    try:
        hosturl, host_sex, host_location, hostcount, hostfollow, hostfans = getpeople(name)
        row = ['微博', name, hosturl, host_sex, host_location, hostcount, hostfollow, hostfans, contents, timeA, likeA]
        writer.writerow(row)
    except:
        continue
```
This post URL matters a great deal: everything crawled later for a post is located through it.
```python
weibo_url = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a/@href')[0]
url_str = r'.*?com/\d+/(.*)\?refer_flag=\d+_'
res = re.findall(url_str, weibo_url)
weibo_url = res[0]
host_url = 'https://weibo.cn/comment/' + weibo_url
```
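To make the extraction concrete, here is the regex applied to a made-up href of the shape the search page returns (the user id and post id below are invented for illustration):

```python
import re

# A hypothetical href as found on the search page (ids are invented).
href = '//weibo.com/2656274875/L1v0Abcde?refer_flag=1001030103_'
m = re.findall(r'.*?com/\d+/(.*)\?refer_flag=\d+_', href)
print(m[0])                                # L1v0Abcde (the post's identifier)
print('https://weibo.cn/comment/' + m[0])  # the post's comment page on weibo.cn
```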
What we need is the part shown in the figure below (the href carrying the post's identifier).

- Check the page elements to decide whether there is another page:
```python
try:
    if pageCount == 1:
        # on page 1 there is a single pager link ("next page")
        pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a')[0].text
        print(pageA)
        pageCount = pageCount + 1
    elif pageCount == 50:
        # topic search results are capped at 50 pages
        print('No more pages')
        break
    else:
        # from page 2 on, the second link is "next page"
        pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a[2]')[0].text
        pageCount = pageCount + 1
        print(pageA)
except:
    print('No more pages')
    break
```
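The same check can be wrapped in a small helper. This is only a sketch, assuming the two pager XPaths above are still valid; `has_next_page` is not part of the original script:

```python
def has_next_page(html, page_count, max_pages=50):
    """Mirror of the inline pager check above: True if a 'next page' link exists."""
    if page_count >= max_pages:  # topic search results are capped at 50 pages
        return False
    xpath = ('//*[@id="pl_feedlist_index"]/div[5]/div/a'
             if page_count == 1
             else '//*[@id="pl_feedlist_index"]/div[5]/div/a[2]')
    return bool(html.xpath(xpath))
```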
- The resulting fields are:
```python
name, weibo_url, contents, timeA, likeA, hostComment
```

3. Fetching Poster Information
The information on weibo.com is not laid out conveniently, so the profile data is scraped from weibo.cn instead; its page structure is as follows:

- To reach a user's profile page we need its URL. The approach taken here: search the user name obtained in the previous step on weibo.com and take the URL of the first result. Since the scraped names are complete, the first hit is exactly the user we want.
- The relevant code:
```python
url2 = 'https://s.weibo.com/user?q='
while True:
    try:
        # warm up weibo.cn's search endpoint, then search the name on weibo.com
        response = requests.post('https://weibo.cn/search/?pos=search', headers=headers_cn,
                                 data={'suser': '找人', 'keyword': name})
        tempUrl2 = url2 + str(name) + '&Refer=weibo_user'
        print('Search page:', tempUrl2)
        response2 = requests.get(tempUrl2, headers=headers_com)
        html = etree.HTML(response2.content, parser=etree.HTMLParser(encoding='utf-8'))
        # the first result in the list is the exact-name match we want
        hosturl_01 = html.xpath('/html/body/div[1]/div[2]/div/div[2]/div[1]/div[3]/div[1]/div[2]/div/a/@href')[0]
        url_str = '.*?com/(.*)'
        res = re.findall(url_str, hosturl_01)
        hosturl = 'https://weibo.cn/' + res[0]
        break
    except:
        pass  # retried with backoff in the complete code below
```
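When the search fails repeatedly (usually a sign the IP has been rate-limited), the complete code sleeps a long random interval after ten quick retries. The pattern generalizes to a small helper; this is only a sketch, and `with_backoff` is not part of the original script:

```python
import random
import time

def with_backoff(fetch, max_quick_retries=10):
    """Retry `fetch` until it succeeds. After `max_quick_retries` failures,
    sleep a long random interval first (the IP is probably rate-limited).
    Mirrors the retry logic used by getpeople() in the complete code."""
    failures = 0
    while True:
        try:
            return fetch()
        except Exception:
            if failures >= max_quick_retries:
                stop = random.randint(60, 300)
                print('IP seems blocked, sleeping for', stop, 'seconds')
                time.sleep(stop)
            time.sleep(random.randint(0, 10))
            failures += 1
```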
- With the URL in hand, crawl the profile data from weibo.cn:
```python
while True:
    try:
        response = requests.get(hosturl, headers=headers_cn)
        html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
        # post count; the regex (restored from the stripped transcription)
        # peels the number out of strings like '微博[183]'
        hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
        hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
        # following count, e.g. '关注[45]'
        hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
        hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
        # fan count, e.g. '粉丝[1024]'
        hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
        hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
        # gender and location, e.g. '\xa0男/北京'
        host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
        break
    except:
        print('User lookup failed')
        time.sleep(random.randint(0, 10))
try:
    host_sex_locationA = host_sex_location[0].split('\xa0')
    host_sex_locationA = host_sex_locationA[1].split('/')
    host_sex = host_sex_locationA[0]
    host_location = host_sex_locationA[1].strip()
except:
    host_sex_locationA = host_sex_location[1].split('\xa0')
    host_sex_locationA = host_sex_locationA[1].split('/')
    host_sex = host_sex_locationA[0]
    host_location = host_sex_locationA[1].strip()
return hosturl, host_sex, host_location, hostcount, hostfollow, hostfans
```
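To make the parsing concrete, here are the two tricky bits run on made-up profile strings (all values invented; note that `\xa0` is the non-breaking space weibo.cn uses):

```python
import re

# Hypothetical profile strings in weibo.cn's format (values invented).
print(re.match(r'(\S\S\S)(\d+)', '微博[183]').group(2))   # '183'
print(re.match(r'(\S\S\S)(\d+)', '粉丝[1024]').group(2))  # '1024'

info = '\xa0男/北京'                 # gender/location field; '\xa0' is &nbsp;
sex, location = info.split('\xa0')[1].split('/')
print(sex, location.strip())         # 男 北京
```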
4. Fetching Comment Data

- Step one already gave us each post's identifier and its weibo.cn URL, so we can crawl the comments directly from that URL.
- First, a look at the page's general layout.
- The number of comments per page varies, and the comment nodes carry no uniquely identifying attribute, which makes XPath awkward here, so regular expressions are used instead.
- The comment-scraping code is as follows:
```python
page = 0
pageCount = 1
count = []       # comment text
date = []        # timestamp
like_times = []  # likes
user_url = []    # commenter URL
user_name = []   # commenter nickname
while True:
    page = page + 1
    print('Crawling comment page', page)
    if page == 1:
        url = hosturl
    else:
        url = hosturl + '?page=' + str(page)
    print(url)
    try:
        response = requests.get(url, headers=headers_cn)
    except:
        break
    html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
    # NOTE: the HTML tags inside these patterns were stripped in transcription;
    # they are reconstructed here from weibo.cn's typical comment markup and
    # should be verified against the live page source.
    user_re = r'<a href="(/u/\d+)">'              # commenter profile link
    user_name_re = r'<a href="/u/\d+">(.*?)</a>'  # commenter nickname
    co_re = r'<span class="ctt">(.*?)</span>'     # comment body
    zan_re = r'赞\[(\d+)\]'                        # like count
    date_re = r'<span class="ct">(.*?)&nbsp;'      # timestamp
    count_re = r'回复:(.*)'                         # reply text (defined but unused below)
    user_name2 = re.findall(user_name_re, response.text)
    zan = re.findall(zan_re, response.text)
    date_2 = re.findall(date_re, response.text)
    count_2 = re.findall(co_re, response.text)
    user_url2 = re.findall(user_re, response.text)
    flag = len(zan)
    for i in range(flag):
        count.append(count_2[i])
        date.append(date_2[i])
        like_times.append(zan[i])
        user_name.append(user_name2[i])
        user_url.append('https://weibo.cn' + user_url2[i])
    try:
        if pageCount == 1:
            # on the first page the only pager link is "next page"
            pageA = html.xpath('//*[@id="pagelist"]/form/div/a')[0].text
            print('=' * 40, pageA, '=' * 40)
            pageCount = pageCount + 1
        else:
            # from page 2 on, the first link is "next page"
            pageA = html.xpath('//*[@id="pagelist"]/form/div/a[1]')[0].text
            pageCount = pageCount + 1
            print('=' * 40, pageA, '=' * 40)
    except:
        print('No more pages')
        break
print('#' * 20, 'Comments done; crawling commenter info next', '#' * 20)
print(len(like_times), len(count), len(date), len(user_url), len(user_name))
```
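To see how these patterns pull fields out of the raw page, here they are run against a tiny made-up fragment in the shape of weibo.cn's comment markup (all names, ids, and values invented):

```python
import re

# A minimal, made-up fragment shaped like a weibo.cn comment (values invented).
fragment = ('<div class="c" id="C_450001"><a href="/u/1234567890">some_user</a>'
            '<span class="ctt">Great show!</span> 赞[3] '
            '<span class="ct">12月08日 10:20&nbsp;来自网页</span></div>')

print(re.findall(r'<a href="(/u/\d+)">', fragment))             # ['/u/1234567890']
print(re.findall(r'<span class="ctt">(.*?)</span>', fragment))  # ['Great show!']
print(re.findall(r'赞\[(\d+)\]', fragment))                      # ['3']
print(re.findall(r'<span class="ct">(.*?)&nbsp;', fragment))     # ['12月08日 10:20']
```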
5. Fetching Commenter Information

- On weibo.cn the key part of each commenter's URL is available directly, so there is no need to search by name; we can take the URL and crawl straight on:
```python
def findUrl(hosturl):
    while True:
        try:
            print(hosturl)
            response = requests.get(hosturl, headers=headers_cn)
            html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
            # post / following / fan counts, e.g. '微博[183]' -> '183'
            hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
            hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
            hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
            hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
            hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
            hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
            host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
            break
        except:
            print('User lookup failed')
            time.sleep(random.randint(0, 5))
    try:
        host_sex_locationA = host_sex_location[0].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    except:
        host_sex_locationA = host_sex_location[1].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    time.sleep(random.randint(0, 2))
    return host_sex, host_location, hostcount, hostfollow, hostfans
```
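Usage is straightforward once a commenter URL has been collected (the id below is invented for illustration):

```python
# Hypothetical commenter URL collected by getComment() (the id is invented).
sex, location, posts, following, fans = findUrl('https://weibo.cn/u/1234567890')
print(sex, location, posts, following, fans)
```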
6. Saving to CSV

To recap, the overall flow is:
Fetching posts:
1. Get post URLs, user names, and post content from weibo.com.
2. Search each user name on weibo.com to obtain the user's URL.
3. Use the constructed URL to crawl the poster's profile from weibo.cn.
Fetching comments:
1. Build each post's weibo.cn address from the identifier obtained above.
2. Extract the comment fields with regular expressions.
- Create the output file and write the header row:

```python
topic = '扫黑风暴'
url = baseUrl.format(topic)
print(url)
csvfile = open(topic + '.csv', 'a', newline='', encoding='utf-8-sig')  # utf-8-sig so Excel reads it correctly
writer = csv.writer(csvfile)
writer.writerow(['类别', '用户名', '用户链接', '性别', '地区', '微博数', '关注数', '粉丝数', '评论内容', '评论时间', '点赞次数'])
```
- Store a post row:

```python
hosturl, host_sex, host_location, hostcount, hostfollow, hostfans = getpeople(name)
row = ['微博', name, hosturl, host_sex, host_location, hostcount, hostfollow, hostfans, contents, timeA, likeA]
writer.writerow(row)
```
- Store a comment row:

```python
row = ['评论', user_name[i], user_url[i], host_sex, host_location, hostcount, hostfollow, hostfans, count[i], date[i], like_times[i]]
writer.writerow(row)
```

7. Complete Code
```python
# -*- coding: utf-8 -*-
# @Time : 2021/12/8 10:20
# @Author : MinChess
# @File : weibo.py
# @Software: PyCharm
import requests
from lxml import etree
import csv
import re
import time
import random
from html.parser import HTMLParser

# Fill in your own logged-in cookies and user-agent strings.
headers_com = {
    'cookie': '看不到我',
    'user-agent': '看不到我'
}
headers_cn = {
    'cookie': '看不到我',
    'user-agent': '看不到我'
}
baseUrl = 'https://s.weibo.com/weibo?q=%23{}%23&Refer=index'
topic = '扫黑风暴'
csvfile = open(topic + '.csv', 'a', newline='', encoding='utf-8-sig')
writer = csv.writer(csvfile)

# NOTE: the XPath class names ("card-wrap", "card", "from") and the regex
# patterns in getComment() were corrupted in transcription; they are restored
# from the sites' typical markup and should be verified against the live pages.

def getTopic(url):
    page = 0
    pageCount = 1
    while True:
        weibo_content = []
        weibo_liketimes = []
        weibo_date = []
        page = page + 1
        tempUrl = url + '&page=' + str(page)
        print('-' * 36, tempUrl, '-' * 36)
        response = requests.get(tempUrl, headers=headers_com)
        html = etree.HTML(response.text, parser=etree.HTMLParser(encoding='utf-8'))
        count = len(html.xpath('//div[@class="card-wrap"]')) - 2  # the last two cards are not posts
        for i in range(1, count + 1):
            try:
                contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content_full"]')
                contents = contents[0].xpath('string(.)').strip()  # all text under the node
            except:
                contents = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@node-type="feed_list_content"]')
                try:
                    contents = contents[0].xpath('string(.)').strip()
                except:
                    continue  # this post is malformed, skip it
            contents = contents.replace('收起全文d', '')
            contents = contents.replace('收起d', '')
            contents = contents.split(' 2')[0]
            # poster's name
            name = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/div[1]/div[2]/a')[0].text
            # post URL
            weibo_url = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a/@href')[0]
            url_str = r'.*?com/\d+/(.*)\?refer_flag=\d+_'
            res = re.findall(url_str, weibo_url)
            weibo_url = res[0]
            host_url = 'https://weibo.cn/comment/' + weibo_url
            # publication time
            timeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[1]/div[2]/p[@class="from"]/a')[0].text.strip()
            # like count
            likeA = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[3]/a/button/span[2]')[0].text
            hostComment = html.xpath('//div[@class="card-wrap"][' + str(i) + ']/div[@class="card"]/div[2]/ul[1]/li[2]/a')[0].text
            # an empty counter renders as the bare label, meaning 0
            if likeA == '赞':
                likeA = 0
            if hostComment == '评论 ':
                hostComment = 0
            if hostComment != 0:
                print('Crawling comments of post', i, 'on page', page)
                getComment(host_url)
            try:
                hosturl, host_sex, host_location, hostcount, hostfollow, hostfans = getpeople(name)
                row = ['微博', name, hosturl, host_sex, host_location, hostcount, hostfollow, hostfans, contents, timeA, likeA]
                writer.writerow(row)
            except:
                continue
        print('=' * 66)
        try:
            if pageCount == 1:
                pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a')[0].text
                print(pageA)
                pageCount = pageCount + 1
            elif pageCount == 50:
                print('No more pages')
                break
            else:
                pageA = html.xpath('//*[@id="pl_feedlist_index"]/div[5]/div/a[2]')[0].text
                pageCount = pageCount + 1
                print(pageA)
        except:
            print('No more pages')
            break

def getpeople(name):
    findPeople = 0
    url2 = 'https://s.weibo.com/user?q='
    while True:
        try:
            response = requests.post('https://weibo.cn/search/?pos=search', headers=headers_cn,
                                     data={'suser': '找人', 'keyword': name})
            tempUrl2 = url2 + str(name) + '&Refer=weibo_user'
            print('Search page:', tempUrl2)
            response2 = requests.get(tempUrl2, headers=headers_com)
            html = etree.HTML(response2.content, parser=etree.HTMLParser(encoding='utf-8'))
            hosturl_01 = html.xpath('/html/body/div[1]/div[2]/div/div[2]/div[1]/div[3]/div[1]/div[2]/div/a/@href')[0]
            url_str = '.*?com/(.*)'
            res = re.findall(url_str, hosturl_01)
            hosturl = 'https://weibo.cn/' + res[0]
            print('Profile page:', hosturl)
            break
        except:
            if findPeople == 10:
                stop = random.randint(60, 300)
                print('IP seems blocked, sleeping for', stop, 'seconds')
                time.sleep(stop)
                if response.status_code == 200:
                    return  # give up on this user; the caller treats it as a failure
            print('Retrying user search')
            time.sleep(random.randint(0, 10))
            findPeople = findPeople + 1
    while True:
        try:
            response = requests.get(hosturl, headers=headers_cn)
            html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
            # post count; peel the number out of strings like '微博[183]'
            hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
            hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
            # following count
            hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
            hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
            # fan count
            hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
            hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
            # gender and location
            host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
            break
        except:
            print('User lookup failed')
            time.sleep(random.randint(0, 10))
    try:
        host_sex_locationA = host_sex_location[0].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    except:
        host_sex_locationA = host_sex_location[1].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    return hosturl, host_sex, host_location, hostcount, hostfollow, hostfans

def getComment(hosturl):
    page = 0
    pageCount = 1
    count = []       # comment text
    date = []        # timestamp
    like_times = []  # likes
    user_url = []    # commenter URL
    user_name = []   # commenter nickname
    while True:
        page = page + 1
        print('Crawling comment page', page)
        if page == 1:
            url = hosturl
        else:
            url = hosturl + '?page=' + str(page)
        print(url)
        try:
            response = requests.get(url, headers=headers_cn)
        except:
            break
        html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
        # reconstructed patterns; verify against the live page source
        user_re = r'<a href="(/u/\d+)">'
        user_name_re = r'<a href="/u/\d+">(.*?)</a>'
        co_re = r'<span class="ctt">(.*?)</span>'
        zan_re = r'赞\[(\d+)\]'
        date_re = r'<span class="ct">(.*?)&nbsp;'
        count_re = r'回复:(.*)'  # reply text (unused below)
        user_name2 = re.findall(user_name_re, response.text)
        zan = re.findall(zan_re, response.text)
        date_2 = re.findall(date_re, response.text)
        count_2 = re.findall(co_re, response.text)
        user_url2 = re.findall(user_re, response.text)
        flag = len(zan)
        for i in range(flag):
            count.append(count_2[i])
            date.append(date_2[i])
            like_times.append(zan[i])
            user_name.append(user_name2[i])
            user_url.append('https://weibo.cn' + user_url2[i])
        try:
            if pageCount == 1:
                # on the first page the only pager link is "next page"
                pageA = html.xpath('//*[@id="pagelist"]/form/div/a')[0].text
                print('=' * 40, pageA, '=' * 40)
                pageCount = pageCount + 1
            else:
                # from page 2 on, the first link is "next page"
                pageA = html.xpath('//*[@id="pagelist"]/form/div/a[1]')[0].text
                pageCount = pageCount + 1
                print('=' * 40, pageA, '=' * 40)
        except:
            print('No more pages')
            break
    print('#' * 20, 'Comments done; crawling commenter info next', '#' * 20)
    print(len(like_times), len(count), len(date), len(user_url), len(user_name))
    flag = min(len(like_times), len(count), len(date), len(user_url), len(user_name))
    for i in range(flag):
        host_sex, host_location, hostcount, hostfollow, hostfans = findUrl(user_url[i])
        print('Crawling commenter', i + 1, 'of comment page', page)
        # '微博' marks post rows, '评论' marks comment rows in the CSV
        row = ['评论', user_name[i], user_url[i], host_sex, host_location, hostcount, hostfollow, hostfans, count[i], date[i], like_times[i]]
        writer.writerow(row)
        time.sleep(random.randint(0, 2))

def findUrl(hosturl):
    while True:
        try:
            print(hosturl)
            response = requests.get(hosturl, headers=headers_cn)
            html = etree.HTML(response.content, parser=etree.HTMLParser(encoding='utf-8'))
            hostcount = html.xpath('/html/body/div[4]/div/span')[0].text
            hostcount = re.match(r'(\S\S\S)(\d+)', hostcount).group(2)
            hostfollow = html.xpath('/html/body/div[4]/div/a[1]')[0].text
            hostfollow = re.match(r'(\S\S\S)(\d+)', hostfollow).group(2)
            hostfans = html.xpath('/html/body/div[4]/div/a[2]')[0].text
            hostfans = re.match(r'(\S\S\S)(\d+)', hostfans).group(2)
            host_sex_location = html.xpath('/html/body/div[4]/table/tr/td[2]/div/span[1]/text()')
            break
        except:
            print('User lookup failed')
            time.sleep(random.randint(0, 5))
    try:
        host_sex_locationA = host_sex_location[0].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    except:
        host_sex_locationA = host_sex_location[1].split('\xa0')
        host_sex_locationA = host_sex_locationA[1].split('/')
        host_sex = host_sex_locationA[0]
        host_location = host_sex_locationA[1].strip()
    time.sleep(random.randint(0, 2))
    return host_sex, host_location, hostcount, hostfollow, hostfans

if __name__ == '__main__':
    topic = '扫黑风暴'
    url = baseUrl.format(topic)
    print(url)
    writer.writerow(['类别', '用户名', '用户链接', '性别', '地区', '微博数', '关注数', '粉丝数', '评论内容', '评论时间', '点赞次数'])
    getTopic(url)  # crawl the topic page for posts
```
- There is a fair amount of code, but on closer inspection it is just four functions calling one another; the idea and the methods are simple.
- Some details may not be explained in full, but the overall approach should be clear by now: fetch the data layer by layer. Thanks for bearing with me!