A detailed guide to using the requests library for Python web scraping

Posted by the editor, 2026-07-02

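Requests is a simple, easy-to-use HTTP library written in Python, and it is much more concise to use than urllib. It is a third-party library built on urllib3 and released under the Apache 2.0 open-source license.

Official Requests documentation: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html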

1. Sending GET requests with requests, and common response attributes

1.1. Sending a GET request without parameters

1. Sending a basic GET request
import requests

response = requests.get('http://httpbin.org/get')
print(response.text)  # response.text holds the response body as a string

'''Output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "origin": "114.221.2.90",
  "url": "http://httpbin.org/get"
}
'''

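1.2. Sending a GET request with parameters

1. Sending a GET request with query parameters
Method 1: write the query string directly in the URL.
import requests

response = requests.get("http://httpbin.org/get?name=germey&age=22")
print(response.text)
Method 2: pass a dict through the params argument and let requests build the query string.
import requests

data = {
    'name': 'germey',
    'age': 22
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)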
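
To see that both methods produce the same request, the sketch below prepares the request without sending it and prints the final URL. The helper name build_get_url is made up for this illustration; requests.Request and its prepare() method are part of the library.

```python
import requests


def build_get_url(base_url, params):
    """Return the URL that requests would send for a GET with these params.

    Nothing goes over the network; the request is only prepared.
    """
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


url = build_get_url("http://httpbin.org/get", {"name": "germey", "age": 22})
print(url)  # http://httpbin.org/get?name=germey&age=22
```

Passing a dict is usually the safer choice, because requests takes care of URL-encoding values that contain spaces or non-ASCII characters.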

1.3. Parsing JSON with requests

1. Parsing a JSON response
import requests
import json

response = requests.get("http://httpbin.org/get")
print(type(response.text))
print('-------------------------------')
print(response.json())  # parse the response body as JSON
print('-------------------------------')
print(json.loads(response.text))  # json.loads on the body text gives the same result as above
print('-------------------------------')
print(type(response.json()))
'''Output:
<class 'str'>
-------------------------------
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.20.1'}, 'origin': '114.221.2.90', 'url': 'http://httpbin.org/get'}
-------------------------------
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.20.1'}, 'origin': '114.221.2.90', 'url': 'http://httpbin.org/get'}
-------------------------------
<class 'dict'>
'''
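
response.json() raises an exception when the body is not valid JSON, for example when a site answers with an HTML error page. The sketch below is one possible way to guard against that. The names parse_json and fake_response are invented for this example, and fake_response only builds an offline Response object so the code runs without a network connection.

```python
import io

import requests


def parse_json(response):
    """Return the decoded JSON body, or None when the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # the JSON decode errors raised by requests subclass ValueError
        return None


def fake_response(body):
    """Hypothetical helper: an offline Response whose body comes from memory."""
    resp = requests.Response()
    resp.status_code = 200
    resp.raw = io.BytesIO(body)
    return resp


good = parse_json(fake_response(b'{"args": {}, "url": "http://httpbin.org/get"}'))
bad = parse_json(fake_response(b"<html>400 Bad Request</html>"))
print(good)
print(bad)
```

With a real request, the same function can be called as parse_json(requests.get(url)).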

1.4. Getting a page as text or binary data with a GET request

1. Getting the response in both text and binary form
import requests

response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))  # <class 'str'> <class 'bytes'>
print(response.text)  # the body decoded to a string
print(response.content)  # the raw body as bytes

2. Saving the data returned by a GET request to a local file

import requests

response = requests.get("https://github.com/favicon.ico")
with open('./favicon.ico', 'wb') as f:  # save the response body in the current directory
    f.write(response.content)  # the with block closes the file, so no explicit f.close() is needed
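
response.content holds the whole body in memory, which is fine for a small icon but wasteful for large files. A minimal sketch of a chunked alternative follows, using iter_content from the library. The function name save_response and the in-memory demo response are assumptions made for this example, so that it runs offline.

```python
import io
import os
import tempfile

import requests


def save_response(response, path, chunk_size=8192):
    """Write a response body to disk chunk by chunk and return the byte count."""
    response.raise_for_status()  # stop early on 4xx/5xx responses
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            written += f.write(chunk)
    return written


# Offline stand-in for requests.get(url, stream=True): a Response whose raw
# stream is an in-memory buffer (illustrative only).
demo = requests.Response()
demo.status_code = 200
demo.raw = io.BytesIO(b"\x00\x01" * 10000)

target = os.path.join(tempfile.mkdtemp(), "favicon.ico")
size = save_response(demo, target, chunk_size=4096)
print(size, os.path.getsize(target))
```

For a real download, pass stream=True to requests.get so the body is fetched as it is consumed, for example save_response(requests.get(url, stream=True), './favicon.ico').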

1.5. Adding the headers argument to a GET request

Scrapers usually need to set request headers, because many sites reject requests that lack them; the key header is User-Agent. For example, fetching the Zhihu explore page below fails outright when no headers are sent.

import requests

response = requests.get("https://www.zhihu.com/explore")
print(response.text)
'''
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>

'''

With a headers argument added to the GET request, Zhihu can be accessed normally.

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)
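
The reason the bare request stands out is the default User-Agent that requests sends. The sketch below prints that default and shows one way to set a browser-like value once on a Session, so that every request made through it carries the header. Nothing is sent; the request is only prepared. The constant name BROWSER_UA is chosen for this example.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
)

# What requests sends when no headers are supplied.
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)  # starts with "python-requests/"

# Set the header once on a Session so every request made through it carries it.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})

# prepare_request builds the outgoing request without sending it.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.zhihu.com/explore")
)
print(prepared.headers["User-Agent"])
```

Header names are case-insensitive, so 'user-agent' and 'User-Agent' refer to the same header.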

1.6. Extracting some attributes of the response to a GET request

import requests

response = requests.get('https://www.baidu.com/')
print(type(response))  # type of the response object
print('--------------------------------------')
print(response.status_code)  # HTTP status code of the response
print('--------------------------------------')
print(type(response.text))
print('--------------------------------------')
print(response.text)
print('--------------------------------------')
print(response.cookies)
'''Output:
<class 'requests.models.Response'>
--------------------------------------
200
--------------------------------------
<class 'str'>
--------------------------------------
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç¾åº¦ä¸ä¸ï¼ä½ å°±ç¥é</title></head> <body link=#0000cc>
.......remaining HTML omitted...................
--------------------------------------
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
'''
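
The garbled page title in the output above (ç¾åº¦ä¸ä¸...) is most likely an encoding mismatch: when the server's Content-Type header names no charset, requests falls back to ISO-8859-1 for text responses, while the page itself is UTF-8. The sketch below reproduces the effect offline and shows the usual fix of setting response.encoding before reading response.text. The helper make_response is invented for this example.

```python
import io

import requests


def make_response(body, encoding):
    """Hypothetical helper: an offline Response with a given body and encoding."""
    resp = requests.Response()
    resp.status_code = 200
    resp.raw = io.BytesIO(body)
    resp.encoding = encoding
    return resp


title = "百度一下,你就知道"
resp = make_response(title.encode("utf-8"), "ISO-8859-1")

garbled = resp.text      # UTF-8 bytes decoded with the wrong charset
resp.encoding = "utf-8"  # tell requests which charset the body really uses
fixed = resp.text

print(garbled == title)
print(fixed == title)
```

With a real response the same idea applies: set response.encoding = 'utf-8' (or response.encoding = response.apparent_encoding to let requests guess from the body) before printing response.text.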

2. Sending POST requests with requests, which works much like GET

2.1. Sending a POST request; the form parameters can simply be passed as a dict

import requests

data = {'name': 'germey', 'age': '22'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)

'''Output:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "18",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
  },
  "json": null,
  "origin": "114.221.2.90",
  "url": "http://httpbin.org/post"
}
'''

2.2. Sending a POST request with headers

import requests

data = {'name': 'germey', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.json())

'''Output:
{'args': {}, 'data': '', 'files': {}, 'form': {'age': '22', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Content-Length': '18', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}, 'json': None, 'origin': '114.221.2.90', 'url': 'http://httpbin.org/post'}
'''

2.3. Common attributes of the response

import requests

response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)  # HTTP status code
print(type(response.headers), response.headers)  # response headers
print(type(response.cookies), response.cookies)  # cookies set by the server
print(type(response.url), response.url)  # final URL after any redirects
print(type(response.history), response.history)  # list of redirect responses
3. Other common uses of the requests library

3.1. File upload

1. Upload the favicon.ico file in the current directory to a remote server
import requests

with open('./favicon.ico', 'rb') as f:  # the with block closes the file after the upload
    files = {'file': f}
    response = requests.post("http://httpbin.org/post", files=files)
print(response.text)

'''Output:
{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,AAABAAIAEBAAAAEAIAAoBQAAJgAAACAgAAABACAAKBQAAE4FAAAoAAAAEAAAACAAAAABACAAAAA
.......content omitted...................
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "6665",
    "Content-Type": "multipart/form-data; boundary=64baa48fe6e9aa9985fd4758bc97f1e9",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.20.1"
  },
  "json": null,
  "origin": "114.221.2.90",
  "url": "http://httpbin.org/post"
}
'''

3.2. Getting a site's cookies

import requests

response = requests.get("https://www.baidu.com")
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)
'''Output:
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315
'''

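3.3. Simulating a login with Session()

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')  # the server sets a cookie, which the session stores
response = s.get('http://httpbin.org/cookies')  # the same session sends the cookie back
print(response.text)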
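
What makes the Session work is its cookie jar: cookies received on one request are attached to later requests made through the same session. The sketch below illustrates this offline by placing a cookie in the jar by hand (standing in for the one the server would set) and comparing a request prepared through the session with one prepared on its own. Nothing is sent over the network.

```python
import requests

URL = "http://httpbin.org/cookies"

session = requests.Session()
# Stand-in for the cookie the server would set via /cookies/set/number/123456789.
session.cookies.set("number", "123456789", domain="httpbin.org", path="/")

# Prepared through the session: the stored cookie is attached.
with_session = session.prepare_request(requests.Request("GET", URL))
# Prepared on its own: there is no cookie jar, so no Cookie header.
standalone = requests.Request("GET", URL).prepare()

print(with_session.headers.get("Cookie"))
print(standalone.headers.get("Cookie"))
```

This is why two separate requests.get calls cannot keep a login alive, while two calls on the same Session can.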

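3.4. Certificate verification

1. When a site's certificate cannot be verified locally, a direct request fails. As shown below, requesting the 12306 site raises an SSLError.

import requests

response = requests.get('https://www.12306.cn')
print(response.status_code)

2. In that case you can pass verify=False to the GET request to skip certificate verification, and call urllib3.disable_warnings() to suppress the resulting warnings.

import requests
import urllib3

urllib3.disable_warnings()  # suppress the InsecureRequestWarning
response = requests.get('https://www.12306.cn', verify=False)  # disable certificate verification
print(response.status_code)

3. You can also supply a local certificate and key with the cert argument when sending the GET request (in requests, cert is the client-side certificate presented to the server), as shown below:

import requests

response = requests.get('https://www.12306.cn', cert=('./server.crt', './key'))
print(response.status_code)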

3.5. Proxy settings in requests

1. Setting up proxies
import requests

proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743",
}

response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

2. Proxy setup, method 2: a proxy that requires a username and password
import requests

proxies = {
    "http": "http://user:password@127.0.0.1:9743/",
    "https": "http://user:password@127.0.0.1:9743/",  # needed because the target URL below uses https
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)
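
The keys of the proxies dict are matched against the scheme of the target URL, so a dict with only an "http" entry is not applied to an https:// address. The sketch below checks this offline with select_proxy, a helper in requests.utils; treating it as the lookup requests performs internally is an assumption based on the library's source, and the variable names are chosen for this example.

```python
from requests.utils import select_proxy

http_only = {"http": "http://user:password@127.0.0.1:9743/"}
both = {
    "http": "http://127.0.0.1:9743",
    "https": "http://127.0.0.1:9743",
}

# No "https" entry, so an https URL gets no proxy at all.
no_proxy = select_proxy("https://www.taobao.com", http_only)
# The "http" entry applies to http URLs.
http_proxy = select_proxy("http://httpbin.org/get", http_only)
# With both keys present, the https URL is routed through the proxy.
https_proxy = select_proxy("https://www.taobao.com", both)

print(no_proxy)
print(http_proxy)
print(https_proxy)
```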

3.6. Timeouts and exception handling

import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get("http://httpbin.org/get", timeout=0.8)  # timeout in seconds
    print(response.status_code)
    print('-----------------exception boundary-------------------')
except ReadTimeout:
    print('Ha ha, Timeout')

'''Test result 1:
200
-----------------exception boundary-------------------
'''
'''Test result 2:
Ha ha, Timeout
'''

3.7. Authentication

Some sites present a login prompt first and require a username and password before anything else can be done. In that case, pass the credentials with the auth argument, as shown below:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)
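
HTTPBasicAuth works by adding an Authorization header that contains the username and password in Base64. The sketch below prepares the same request without sending it and decodes the header to show what goes over the wire. Passing a (user, password) tuple as auth is shorthand for HTTPBasicAuth.

```python
import base64

import requests

prepared = requests.Request(
    "GET", "http://120.27.34.24:9001", auth=("user", "123")
).prepare()

header = prepared.headers["Authorization"]
print(header)

# Base64 is an encoding, not encryption: the credentials decode trivially.
scheme, token = header.split(" ", 1)
decoded = base64.b64decode(token).decode("utf-8")
print(scheme, decoded)
```

Because the credentials are only encoded, Basic authentication should be used over HTTPS whenever the server supports it.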

3.8. Common request exception types

When you are not sure which errors may occur, wrap the request in try...except to catch every requests exception:

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get("http://httpbin.org/get", timeout=0.5)
    print(response.status_code)
except ReadTimeout:  # the server did not send data in time
    print('Timeout')
except ConnectionError:  # the connection could not be established
    print('Connection error')
except RequestException:  # base class of all requests exceptions
    print('Error')
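
The order of the except clauses matters because the exception classes form a hierarchy: RequestException is the base class, Timeout covers both ConnectTimeout and ReadTimeout, and ConnectTimeout is also a ConnectionError. In the code above a connect timeout would therefore be reported as 'Connection error'. The sketch below, with a made-up classify function, checks the most specific classes first and runs offline on hand-built exception objects.

```python
from requests.exceptions import (
    ConnectionError,
    ConnectTimeout,
    HTTPError,
    ReadTimeout,
    RequestException,
    Timeout,
)


def classify(exc):
    """Map an exception to a short label, checking specific classes first."""
    if isinstance(exc, Timeout):  # covers ConnectTimeout and ReadTimeout
        return "timeout"
    if isinstance(exc, ConnectionError):  # DNS failure, refused connection, ...
        return "connection error"
    if isinstance(exc, HTTPError):  # raised by response.raise_for_status()
        return "http error"
    if isinstance(exc, RequestException):  # base class of all requests errors
        return "request error"
    return "not a requests error"


samples = (ReadTimeout(), ConnectTimeout(), ConnectionError(), HTTPError(), ValueError())
for exc in samples:
    print(type(exc).__name__, "->", classify(exc))
```

In a try block the same ordering applies: list except Timeout before except ConnectionError, and keep except RequestException last.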

 

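Copyright notice

This article represents only the author's views and does not represent Baidu's position.
This article is published on Baidu Baijia with the author's authorization and may not be reproduced without permission.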
