- Prerequisite: when picking a bar (贴吧) to scrape, check whether it has too many posts. Quick test: click "last page" (尾页); if the last page actually loads, the bar is fine to scrape.
- Import the required libraries
```python
import requests
from lxml import etree
```
- First, a rough outline of the approach:

```python
# url
# headers
# send the request, get the response
# extract data from the response (the posts and the next-page url)
# decide whether to stop
```

- Start by setting up the url and the headers.
```python
class Tieba(object):
    def __init__(self, name):
        self.url = "https://tieba.baidu.com/f?ie=utf-8&kw={}".format(name)
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
        }
```
The {} stands for the name of the bar you want to scrape, which makes it easy to switch targets: take the list-page URL and replace the bar name with {}.
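To see the effect of the placeholder in isolation (the bar name here is only an example):

```python
# Build the list-page url for a given bar name via str.format.
base = "https://tieba.baidu.com/f?ie=utf-8&kw={}"

url = base.format("塞纳河")
print(url)  # https://tieba.baidu.com/f?ie=utf-8&kw=塞纳河
```

Swapping in a different bar name produces the matching list page without touching any other code.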
Getting the User-Agent: open your browser's developer tools (F12), reload the page, and copy the User-Agent value from any request's headers.
- Next, send the request and get the response
```python
    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        # dump the raw page to temp.html for inspection/debugging
        with open("temp.html", "wb") as f:
            f.write(response.content)
        return response.content
```
- Then create an Element object. The code:

```python
    def parse_data(self, data):
        # Tieba hides the post list inside HTML comments; strip the markers
        data = data.decode().replace("<!--", "").replace("-->", "")
        html = etree.HTML(data)
        el_list = html.xpath('//*[@id="thread_list"]/li/div/div[2]/div[1]/div[1]/a')
```
The second line is needed because Tieba serves the post list inside HTML comments when it detects a capable browser (Chrome, Baidu's own browser, and the like), letting the browser's JavaScript pull the data out; stripping the comment markers makes the data visible to lxml.
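The effect of that line can be seen on a small hand-made snippet (the markup below is a hypothetical stand-in for Tieba's real page, not copied from it):

```python
from lxml import etree

# Hypothetical snippet mimicking how Tieba wraps the post list in an HTML comment.
raw = b'<html><body><!--<ul id="thread_list"><li><a href="/p/1">post</a></li></ul>--></body></html>'

# Parsed as-is, the commented-out content is invisible to xpath.
hidden = etree.HTML(raw.decode())
print(hidden.xpath('//*[@id="thread_list"]//a/text()'))   # []

# After stripping the comment markers, the same xpath finds the link.
visible = etree.HTML(raw.decode().replace("<!--", "").replace("-->", ""))
print(visible.xpath('//*[@id="thread_list"]//a/text()'))  # ['post']
```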
The last line is the xpath of the post links. To get it: inspect a post in the developer tools, right-click the highlighted element and copy its xpath, then verify it is accurate (check: the first match should be a post title). The copied xpath matches only a single post; to match all of them, delete the positional index node from it (if you can't tell which node, remove them one at a time until every post matches).
- Build a dict for each post and get the next-page url. The code:

```python
        data_list = []
        for el in el_list:
            temp = {}
            temp['title'] = el.xpath("./text()")[0]
            temp['link'] = 'https://tieba.baidu.com' + el.xpath("./@href")[0]
            data_list.append(temp)
        try:
            # next-page link, matched by its link text
            next_url = 'https:' + html.xpath('//a[contains(text(),"下一页>")]/@href')[0]
        except IndexError:
            next_url = None
        return data_list, next_url
```
Note that the positional xpath of the "next page" (下一页) link differs from page to page, so instead of copying it, match the link by its text: //a[contains(text(),"下一页>")]/@href.
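A small demonstration of why the text-based xpath is robust (the paging markup below is a hypothetical stand-in, not Tieba's real footer):

```python
from lxml import etree

# Hypothetical paging footer: the position of the "next page" link varies
# between pages, but its text does not, so we match on the text.
raw = ('<div><a href="//tieba.baidu.com/f?kw=x&amp;pn=0">首页</a>'
       '<a href="//tieba.baidu.com/f?kw=x&amp;pn=100">下一页&gt;</a></div>')

html = etree.HTML(raw)
hrefs = html.xpath('//a[contains(text(),"下一页>")]/@href')
next_url = 'https:' + hrefs[0]
print(next_url)  # https://tieba.baidu.com/f?kw=x&pn=100
```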
- Save the data. For now, just print it:

```python
    def save_data(self, data_list):
        for data in data_list:
            print(data)
```
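If you want to persist the results instead of printing them, one option is to append each post dict as a line of JSON. This is a sketch, not part of the original tutorial; the filename "tieba.jsonl" and the sample post are assumptions:

```python
import json

# Sketch: append each post dict as one JSON line (JSONL format).
def save_data(data_list, path="tieba.jsonl"):
    with open(path, "a", encoding="utf-8") as f:
        for data in data_list:
            # ensure_ascii=False keeps Chinese titles readable in the file
            f.write(json.dumps(data, ensure_ascii=False) + "\n")

# Hypothetical post, shaped like the dicts built in parse_data.
save_data([{"title": "示例帖", "link": "https://tieba.baidu.com/p/1"}])
```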
- The comments explain what each step does:

```python
    def run(self):
        next_url = self.url
        while True:
            # send the request, get the response
            data = self.get_data(next_url)
            # extract data from the response (the posts and the next-page url)
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)
            print(next_url)
            # decide whether to stop
            if next_url is None:
                break
```
- Finally, run it:

```python
if __name__ == '__main__':
    tieba = Tieba("塞纳河")
    tieba.run()
```
Part of the output: each post is printed as a dict with its 'title' and 'link'.
- The complete code is below; fill in the commented placeholders and it will run:

```python
import requests
from lxml import etree


class Tieba(object):
    def __init__(self, name):
        self.url = "#网址".format(name)  # the list-page url, with {} in place of the bar name
        self.headers = {
            "User-Agent": "#你的User-Agent"  # copy it from your browser's developer tools
        }

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        with open("temp.html", "wb") as f:
            f.write(response.content)
        return response.content

    def parse_data(self, data):
        data = data.decode().replace("<!--", "").replace("-->", "")
        html = etree.HTML(data)
        el_list = html.xpath('所爬内容的xpath')  # the xpath of the content to scrape
        data_list = []
        for el in el_list:
            temp = {}
            temp['title'] = el.xpath("./text()")[0]
            temp['link'] = 'https://tieba.baidu.com' + el.xpath("./@href")[0]
            data_list.append(temp)
        try:
            next_url = 'https:' + html.xpath('//a[contains(text(),"下一页>")]/@href')[0]
        except IndexError:
            next_url = None
        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        next_url = self.url
        while True:
            data = self.get_data(next_url)
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)
            print(next_url)
            if next_url is None:
                break


if __name__ == '__main__':
    tieba = Tieba("#吧名")  # put the bar name in the quotes
    tieba.run()
```