Scraping Zhihu Meme Images with the requests Library

I started by learning a few basic operations of the requests library and then wrote a simple scraper with them.

The main call is requests.get(), which requests a page with the HTTP GET method and returns a Response object.

Without request headers the server may answer with a 400 error, so I pass them explicitly: page=requests.get(url='https://www.zhihu.com/question/46508954',headers=hd)
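As a minimal sketch of the point above, the helper below sends the headers and fails loudly on a 4xx/5xx status instead of silently parsing an error page. The name `fetch_html`, the `get` parameter, and the 10-second timeout are assumptions made for illustration; they are not part of the original script.

```python
def fetch_html(url, headers, get=None):
    """Fetch a page as text, raising if the server answers with an error status.

    `get` is a hypothetical hook: it defaults to requests.get, but a stub can
    be passed in so the function can be exercised without network access.
    """
    if get is None:
        import requests  # imported lazily so a stub can be injected instead
        get = requests.get
    resp = get(url, headers=headers, timeout=10)
    resp.raise_for_status()   # raises on 4xx/5xx, e.g. the 400 seen without headers
    resp.encoding = 'utf-8'   # decode the body as UTF-8, as the script does
    return resp.text

# Possible usage with the script's own header dict `hd`:
# html = fetch_html('https://www.zhihu.com/question/46508954', hd)
```

Checking the status right after the request makes a missing or rejected User-Agent visible immediately, rather than later as an empty list of image URLs.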

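I also used a few functions from the os module: os.mkdir(x) creates the directory at path x (its parent directory must already exist), and os.path.exists(path) checks whether the given path exists.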
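The two os calls can be combined into a small "create if missing" helper. This is only a sketch: the name `ensure_dir` is hypothetical, and a temporary directory stands in for the real target folder.

```python
import os
import tempfile

def ensure_dir(path):
    """Create the directory `path` if it does not exist yet and return it."""
    if not os.path.exists(path):
        os.mkdir(path)  # creates `path` itself; the parent must already exist
    return path

# Demo inside a throwaway temporary directory.
base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, 'face'))
ensure_dir(target)  # calling it again is harmless: the folder already exists
print(os.path.isdir(target))
```

When intermediate directories may be missing as well, os.makedirs(path, exist_ok=True) covers both steps in one call.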

Regular expressions are used as well; they are simple enough here that I won't go into detail.

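First, requests.get() fetches the Zhihu question page. Inspecting the HTML shows that every image is wrapped in a <figure> element, so a regular expression can extract all the image URLs. Note that there are two formats, jpg and gif, which have to be told apart.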
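To illustrate the extraction step, the sketch below runs the same pattern the script uses on a small hand-written HTML fragment. The fragment and the pic.example.com URLs are made up for the example; the real page markup may differ.

```python
import re

# Same pattern as in the script: lazily capture an https URL ending in jpg or gif
# between <figure> and </figure>.
FIGURE_IMG = re.compile(r'<figure>.*?(https.*?(?:jpg|gif)).*?</figure>')

# Hypothetical fragment standing in for page.text.
sample_html = (
    '<figure><img src="https://pic.example.com/a1.jpg" width="200"></figure>'
    '<p>text</p>'
    '<figure><img src="https://pic.example.com/b2.gif"></figure>'
)

urls = FIGURE_IMG.findall(sample_html)
for u in urls:
    print(u)
```

Because `.` does not match newlines by default, a <figure> element that spans several lines would be skipped; compiling the pattern with re.S would be one way to handle that case.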

Then each image is downloaded and written to disk as a binary file.
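The saving step can be sketched on its own, without any network access. Here `save_image` is a hypothetical helper, and the byte string is a stand-in for the `content` attribute of a Response; the file naming mirrors the script (zhihu plus index plus extension).

```python
import os
import tempfile

def save_image(content, url, folder, index):
    """Write raw bytes to folder/zhihu<index>.<ext>; return the path, or None
    if the URL is neither a jpg nor a gif."""
    if 'jpg' in url:
        ext = '.jpg'
    elif 'gif' in url:
        ext = '.gif'
    else:
        return None
    path = os.path.join(folder, 'zhihu' + str(index) + ext)
    with open(path, 'wb') as f:  # 'wb': write bytes as-is, no text encoding
        f.write(content)
    return path

folder = tempfile.mkdtemp()
fake_bytes = b'GIF89a' + b'\x00' * 10  # stand-in for res.content
saved = save_image(fake_bytes, 'https://pic.example.com/b2.gif', folder, 0)
print(saved)
```

Opening the file in 'wb' mode matters: image data is not text, so it has to be written byte for byte.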

At first several image requests also came back with a 400 error; adding the headers to those requests fixed it.

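import os
import random
import re

import requests

# Browser-like request headers; without them Zhihu may answer with HTTP 400.
hd = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) 98 Safari/537.36'
}

# Folder the images are saved to.
adr = 'C:/face'


def download(i, url):
    """Download one image and save it as zhihu<i>.jpg or zhihu<i>.gif."""
    if url is None:
        return
    res = requests.get(url, headers=hd)
    # Pick the file extension from the URL and write the body as raw bytes.
    if url.find('jpg') != -1:
        with open(adr + '/zhihu' + str(i) + '.jpg', 'wb') as f:
            f.write(res.content)
    elif url.find('gif') != -1:
        with open(adr + '/zhihu' + str(i) + '.gif', 'wb') as f:
            f.write(res.content)
    else:
        print('error', url)


def gethtml():
    """Fetch the question page, extract the image URLs, and download them."""
    global adr
    page = requests.get(url='https://www.zhihu.com/question/46508954', headers=hd)
    page.encoding = 'utf-8'
    # Each image URL sits inside a <figure> element; capture jpg and gif links.
    pattern = re.compile(r'<figure>.*?(https.*?(?:jpg|gif)).*?</figure>')
    res = pattern.findall(page.text)
    # If the folder already exists, switch to a new name with a random suffix,
    # then make sure the chosen folder exists before writing into it.
    if os.path.exists(adr):
        adr = adr + str(random.randint(1, 1000))
    os.makedirs(adr, exist_ok=True)
    # Download every image and print progress whenever the percentage changes.
    pre, tot = 0, len(res)
    for i, url in enumerate(res):
        download(i, url)
        rate = int((i + 1) / tot * 100)
        if rate != pre:
            pre = rate
            print(str(rate) + '%')


gethtml()
print('Images saved to the ' + adr + ' directory!')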

Feel free to share; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/589650.html
