urllib Usage Examples (Notes)

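1. HTTP request protocol keywords

Request URL: the URL being requested
Request Method: the request method (GET/POST)
Status Code: the status code of the response

Status codes:

200: request succeeded

206: partial content (typical for video streaming)

403: access forbidden

404: page not found

500: internal server error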
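
The status codes above can also be looked up in code. A minimal sketch using the standard library's http.HTTPStatus; the helper name describe_status is an assumption made for illustration, and the phrases it returns are the standard HTTP reason phrases rather than the informal descriptions in the list.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the standard reason phrase for an HTTP status code."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # The code is not in the standard library's table
        return "Unknown status"


for code in (200, 206, 403, 404, 500):
    print(code, describe_status(code))
```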

Remote Address: the remote address (IP:port)
Connection: the connection type
Content-Encoding: the compression applied to the data

Common compression algorithms:

Content-Encoding: gzip

Content-Encoding: compress

Content-Encoding: deflate

Content-Encoding: identity

Content-Encoding: br
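
urllib does not decompress response bodies on its own, so a body that arrives with a Content-Encoding header has to be decoded by hand. The sketch below is one possible way to do that with the standard library; decode_body is a hypothetical helper name, and it only covers gzip, deflate and identity, since the standard library has no decoder for br or compress.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "identity") -> bytes:
    """Undo the compression named by a Content-Encoding header value."""
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return body
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # HTTP "deflate" is normally zlib-wrapped data
        return zlib.decompress(body)
    # "br" and "compress" are not handled in this sketch
    raise ValueError(f"unsupported Content-Encoding: {content_encoding}")


compressed = gzip.compress(b"<html>hello</html>")
print(decode_body(compressed, "gzip"))
```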

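Request header keywords:

Accept: the data types the sender is willing to accept
Accept-Encoding: the compression algorithms the sender supports
Accept-Language: the languages the sender supports
Cache-Control: the caching policy
Cookie: the cookies sent with the request
User-Agent: the user agent (can be set to mimic a browser and get past anti-scraping checks)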
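
As a rough illustration, these request headers can be set on a urllib Request and inspected without sending anything. The URL and header values below are placeholders chosen for this sketch.

```python
import urllib.request as ur

# example.com and the header values are placeholders for illustration
request = ur.Request(
    url="https://example.com/",
    headers={
        "Accept": "text/html",
        "Accept-Encoding": "gzip",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Cache-Control": "no-cache",
    },
)

# Request stores header names via str.capitalize(), so "Accept-Language"
# is kept as "Accept-language"
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Note that get_header() and has_header() expect the capitalized form, for example request.get_header("Accept-language").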

2. Using the urllib module

# Import the module
import urllib.request as ur

# urlopen() returns an HTTPResponse object; read() returns the page content as bytes
ret = ur.urlopen('[URL to request]').read()

print(ret)

# Save the content to a local file
with open('[HTML file name]', 'wb') as f:
    f.write(ret)
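
The same steps can be wrapped in small functions that close the response and set a timeout. This is only a sketch; fetch_bytes and save_page are hypothetical helper names, and the 10-second default timeout is an arbitrary choice.

```python
import urllib.request as ur


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL and return the raw response body."""
    # The with-block closes the response once the body has been read
    with ur.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_page(url: str, path: str) -> int:
    """Write the body of `url` to `path`; return the number of bytes written."""
    body = fetch_bytes(url)
    with open(path, "wb") as f:
        return f.write(body)
```

A call might look like save_page('https://example.com/', 'page.html'), with example.com standing in for the real address. Failed requests raise urllib.error.URLError (HTTPError for responses such as 403, 404 or 500), so real code would usually wrap the call in try/except.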

3. The Request object and URL encoding

1> The Request object

Parameters:

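Parameter: purpose

url: the URL to request.
data: must be of type bytes. If the data is a dict, encode it first with urlencode() from the urllib.parse module, then convert the result to bytes.
headers: a dict of request headers. They can be passed through the headers parameter when the request is constructed, or added afterwards by calling add_header() on the request instance.
origin_req_host: the host name or IP address of the requester.
unverifiable: indicates whether the request is unverifiable, that is, whether the user had no chance to approve it. The default is False, and this parameter rarely needs to be set.
method: a string naming the request method to use, such as GET, POST or PUT.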

# Import the module
import urllib.request as ur

# Request() can carry headers and data
request = ur.Request('[URL of the site to fetch]')

response = ur.urlopen(request).read()

print(response)
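
To make the parameters more concrete, the sketch below builds three requests and prints what urllib would send, without opening any connection. The URL, form fields and header values are placeholders, not part of the original example.

```python
import urllib.parse as up
import urllib.request as ur

# example.com, the form fields and the header value are placeholders
form = up.urlencode({"name": "value"}).encode("utf-8")  # data must be bytes

get_request = ur.Request("https://example.com/")
post_request = ur.Request("https://example.com/", data=form)
put_request = ur.Request(
    "https://example.com/",
    data=form,
    headers={"User-Agent": "demo-client"},
    method="PUT",
)
# Headers can also be added after construction
put_request.add_header("Accept", "application/json")

# Without an explicit method, urllib uses GET when data is None, else POST
for req in (get_request, post_request, put_request):
    print(req.get_method(), req.full_url, req.data)
```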

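2> URL encoding

A URL may contain Chinese (or other non-ASCII) characters. These cannot appear in a URL as-is and have to be percent-encoded. The urllib.parse module handles both the encoding and the decoding.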

# Import the module
import urllib.parse as up

# The data to encode
data = {
    # [URL data as key/value pairs]
}

# URL-encode
url_data = up.urlencode(data)

print(url_data)

# Decode
ret = up.unquote(url_data)

print(ret)
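
A concrete run with Chinese text may make the round trip easier to see. The dictionary keys and the example.com endpoint below are placeholders picked for this sketch.

```python
import urllib.parse as up

# quote() percent-encodes a single string (UTF-8 by default)
encoded_word = up.quote("中文")
print(encoded_word)  # %E4%B8%AD%E6%96%87

# urlencode() builds a whole query string from a dict
query = up.urlencode({"wd": "中文", "page": 1})
print(query)  # wd=%E4%B8%AD%E6%96%87&page=1

# example.com/search is a placeholder endpoint
full_url = "https://example.com/search?" + query
print(full_url)

# unquote() reverses the percent-encoding
print(up.unquote(query))  # wd=中文&page=1

# parse_qs() decodes a query string back into a dict of lists
print(up.parse_qs(query))  # {'wd': ['中文'], 'page': ['1']}
```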

4. HTTP POST requests with the Request object

Example: a Baidu Translate scraper

import urllib.request as ur
import urllib.parse as up
import json


# Build the form data
data = {
    'kw':'python'
}

url_data = up.urlencode(data)    # URL-encode data

request = ur.Request(
    url='https://fanyi.baidu.com/sug',
    data=url_data.encode('utf-8')
)

response = ur.urlopen(request).read()

# print(response)

ret = json.loads(response)
# print(ret)

translate = ret['data'][0]['v']  # take the first result by index
print(translate)
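
The request building and the JSON lookup can be separated so that each part can be checked without contacting the server. The sketch below assumes the response has the shape the example indexes into (a "data" list whose items carry a "v" field); build_sug_request and first_translation are hypothetical helper names, and the sample payload is made up.

```python
import json
import urllib.parse as up
import urllib.request as ur

SUG_URL = "https://fanyi.baidu.com/sug"  # endpoint taken from the example above


def build_sug_request(word: str) -> ur.Request:
    """Build the POST request; passing data makes urllib send a POST."""
    body = up.urlencode({"kw": word}).encode("utf-8")
    return ur.Request(url=SUG_URL, data=body)


def first_translation(raw: bytes):
    """Return the first suggestion's 'v' field, or None if there is none."""
    payload = json.loads(raw)
    items = payload.get("data") or []
    return items[0].get("v") if items else None


# Made-up sample payload in the shape the example above indexes into
sample = json.dumps({"data": [{"v": "sample meaning"}]}).encode("utf-8")
print(build_sug_request("python").get_method())  # POST
print(first_translation(sample))                 # sample meaning
```

Returning None for an empty result avoids the IndexError that ret['data'][0] would raise when the service has no suggestion for the word.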

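5. Disguising requests with headers on the Request object

    headers is a parameter of Request. Two headers are set most often: 1. Cookie, 2. User-Agent. They are used to mimic a browser and get past a website's anti-scraping measures. User-Agent is the one most commonly used to simulate browser access.

User-Agent header templates: https://blog.csdn.net/Bocker_Will/article/details/122641248?spm=1001.2014.3001.5502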

import urllib.request as ur
import user_agent  # the user_agent module prepared beforehand

request = ur.Request(
    url='https://edu.csdn.net/',
    headers={
        'User-Agent':user_agent.get_user_agent_pc()
    }
)

response = ur.urlopen(request).read()
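
The user_agent module imported above is the author's own file and is not shown. The sketch below is a hypothetical stand-in: get_user_agent_pc is assumed to return one desktop User-Agent string at random, and the two strings in the list are illustrative examples, not the author's list. Without such a header, urllib identifies itself as Python-urllib, which many sites reject.

```python
import random
import urllib.request as ur

# Hypothetical stand-in for the author's user_agent module; the strings
# below are illustrative examples, not the author's list
PC_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) "
    "Gecko/20100101 Firefox/95.0",
]


def get_user_agent_pc() -> str:
    """Pick one desktop User-Agent string at random."""
    return random.choice(PC_USER_AGENTS)


request = ur.Request(url="https://edu.csdn.net/")
# add_header() is the alternative to passing headers= to Request()
request.add_header("User-Agent", get_user_agent_pc())
request.add_header("Cookie", "name=value")  # placeholder cookie

# urllib stores the header name as "User-agent"
print(request.get_header("User-agent"))
```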

Feel free to share. When reposting, please credit the source: 内存溢出 (outofmemory.cn)

Original article: https://outofmemory.cn/zaji/5711975.html
