Fetching Web Page Content with Python


1. A simple download of a single image from a page:
2. The code:

import requests

# Download one image and write the raw bytes to disk.
r = requests.get('http://www.kaotop.com/file/tupian/20220430/uyvkzZ.jpeg')
with open('图片.jpeg', 'wb') as f:
    f.write(r.content)  # the with-block closes the file automatically
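As a small defensive variant (my addition, not in the original post), you can check the HTTP status before writing, so a 404 error page is never saved to disk as a .jpeg:

import requests

r = requests.get('http://www.kaotop.com/file/tupian/20220430/uyvkzZ.jpeg')
r.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
with open('图片.jpeg', 'wb') as f:
    f.write(r.content)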

3. Note: sometimes the downloaded image file cannot be opened:

This happens when the site has anti-scraping checks. The fix is to send a browser User-Agent with the request, e.g. headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}. You can obtain your own browser's string by typing window.navigator.userAgent into the console of the browser's developer tools.
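For context, requests identifies itself as python-requests/x.y.z when no User-Agent is set, which is exactly what such anti-scraping rules match on. You can inspect the default it would send:

import requests

# The default User-Agent that anti-scraping rules typically block,
# e.g. 'python-requests/2.27.1' (the version depends on your install).
print(requests.utils.default_headers()['User-Agent'])

With a browser User-Agent added, the download works: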

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}
# Pass the browser User-Agent so the site serves the real image.
r = requests.get('https://get.wallhere.com/photo/space-galaxy-planet-fantasy-'
                 'wallpaper-desktop-landscape-surreal-1579551.png', headers=headers)
with open('图片.png', 'wb') as f:
    f.write(r.content)  # the with-block closes the file automatically
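For large files it is better to stream the download instead of holding the whole image in memory. A minimal sketch using the same URL and headers (stream=True and iter_content are standard requests features, not part of the original post):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}
r = requests.get('https://get.wallhere.com/photo/space-galaxy-planet-fantasy-'
                 'wallpaper-desktop-landscape-surreal-1579551.png',
                 headers=headers, stream=True)
with open('图片.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):  # write 8 KB at a time
        f.write(chunk)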

The downloaded image:

4. Downloading every image on a page:

Parse the page first to collect the links, then download each image. The code:

import os
import re

import requests
from bs4 import BeautifulSoup

# Browser User-Agent (see section 3), used for every request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'}

def get_content(target):
    # Fetch one wallpaper detail page and return the URL of its full-size image.
    r = requests.get(url=target, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    pictures = soup.find('div', class_='hub-photomodal')
    for picture in pictures.find_all('a'):
        return picture.get('href')  # the first <a> carries the image link

if __name__ == '__main__':
    server = 'https://wallhere.com'
    r = requests.get(server, headers=headers)
    r.encoding = 'utf-8'
    text = BeautifulSoup(r.text, 'lxml')
    # The thumbnail grid: every <a> inside it links to a detail page.
    picture_urls = text.find('div', class_='hub-mediagrid hub-fleximages hub-loadphoto')
    picture_urls = picture_urls.find_all('a')
    os.makedirs('图片', exist_ok=True)  # make sure the output folder exists
    for url in picture_urls:
        href = url.get('href')
        # Keep only links that point at a wallpaper detail page.
        if not re.findall('wallpaper', href):
            continue
        image_url = get_content(server + href)
        if image_url is None:  # no image link found on the detail page
            continue
        picture_r = requests.get(image_url, headers=headers)
        # Name the file after the last '-'-separated piece of the URL, e.g. 1579551.png.
        with open('图片/%s' % image_url.strip().split('-')[-1], 'wb') as file:
            file.write(picture_r.content)
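When downloading many images in one run, a requests.Session reuses the underlying connection and lets you set the headers once; a short pause between requests is also a basic courtesy to the server. A sketch of that pattern (the session and delay are my additions, not in the original script; the URL list reuses the example image from above):

import time

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                      'Chrome/99.0.4844.84 Safari/537.36 OPR/85.0.4341.60'})

image_urls = ['https://get.wallhere.com/photo/space-galaxy-planet-fantasy-'
              'wallpaper-desktop-landscape-surreal-1579551.png']
for image_url in image_urls:
    picture_r = session.get(image_url)  # headers are applied automatically
    with open(image_url.split('-')[-1], 'wb') as f:
        f.write(picture_r.content)
    time.sleep(1)  # avoid hammering the server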
