In the previous post we covered the basics of Python's built-in HTTP library urllib, which is still fairly cumbersome to use. In this post we'll look at how the third-party HTTP library Requests simplifies that work.
Requests
Requests is an HTTP library built on top of urllib3 and released under the Apache2 Licensed open-source license. It is far more convenient than urllib and saves us a great deal of work.
Installation
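Requests is not part of the standard library; it is installed from PyPI with pip:

```
pip install requests
```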
Sending requests
Basic GET request
```python
import requests

response = requests.get('http://httpbin.org/get')
print(response.status_code)
print(response.text)
print(type(response.text))
```
GET request with parameters
```python
import requests

# Query parameters can be written directly into the URL...
response = requests.get('http://httpbin.org/get?foo=bar')
print(response.text)

# ...or passed as a dict and encoded by Requests
params = {'foo': 'bar'}
response = requests.get('http://httpbin.org/get', params=params)
print(response.text)
```
 
JSON parsing
```python
import requests

params = {'foo': 'bar'}
response = requests.get('http://httpbin.org/get', params=params)
print(response.json())
print(type(response.json()))
```
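Note that `response.json()` only works when the body actually contains JSON; otherwise it raises a decoding error. A minimal sketch of handling that case, assuming the httpbin.org/html endpoint, which returns HTML rather than JSON:

```python
import requests

response = requests.get('http://httpbin.org/html')  # this endpoint returns HTML, not JSON
try:
    data = response.json()
except ValueError:
    # response.json() raises a ValueError subclass when the body is not valid JSON
    print('Response body is not JSON')
```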
 
Binary data
```python
import requests

response = requests.get('http://github.com/favicon.ico')
print(response.content)
print(type(response.content))

with open('favicon.ico', 'wb') as f:
    f.write(response.content)
```
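Writing `response.content` in one go is fine for a small file like an icon. For larger downloads, a streamed request avoids holding the whole body in memory; a sketch of that pattern:

```python
import requests

# stream=True defers downloading the body until we iterate over it
response = requests.get('http://github.com/favicon.ico', stream=True)
with open('favicon.ico', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
```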
 
Adding headers

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'Cookie': 'thw=cn; v=0; t=f49d7a120cef747be295966efd96e846; cookie2=53b3047ac2325fbeb90d604ab16d4e58;...'
}
response = requests.get('http://zhihu.com/explore', headers=headers)
print(response.text)
```
 
POST request with parameters
```python
import requests

# Form-encoded data (application/x-www-form-urlencoded)
response = requests.post('http://httpbin.org/post', data={'foo': 'bar'})
print(response.text)

# A raw string sent as the request body
response = requests.post('http://httpbin.org/post', data='foo')
print(response.text)

# A JSON body (application/json), serialized by Requests
response = requests.post('http://httpbin.org/post', json={'foo': 'bar'})
print(response.text)
```
 
Responses
```python
import requests

response = requests.get('http://www.baidu.com')
print(type(response))
print(response.status_code)
print(response.text)
print(response.content)
print(response.cookies)
print(response.headers)
print(response.encoding)
print(response.apparent_encoding)
```
 
Status codes
```python
import requests

response = requests.get('http://www.baidu.com')
if response.status_code == requests.codes.ok:
    print('Successful')

try:
    response = requests.get('http://httpbin.org/get', timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    response.encoding = response.apparent_encoding
    print(response.text)
except requests.RequestException:
    print('An exception occurred')
```
 
Advanced usage
File upload
```python
import requests

url = 'http://httpbin.org/post'
files = {'icon_file': open('github.ico', 'rb')}
response = requests.post(url, files=files)
print(response.text)
```
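The value in the `files` dict can also be a tuple if you want to set the uploaded filename and content type explicitly; a sketch of that form (the `image/x-icon` type here is just illustrative):

```python
import requests

url = 'http://httpbin.org/post'
# (filename, file object, content type)
files = {'icon_file': ('github.ico', open('github.ico', 'rb'), 'image/x-icon')}
response = requests.post(url, files=files)
print(response.text)
```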
 
Session persistence
```python
import requests

# Cookies set by one requests.get() call are not carried over to the next
requests.get('http://httpbin.org/cookies/set/foo/bar')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

# A Session object keeps cookies across requests
session = requests.Session()
session.get('http://httpbin.org/cookies/set/foo/bar')
response = session.get('http://httpbin.org/cookies')
print(response.text)
```
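Besides cookies, a Session reuses the underlying TCP connection and can carry default headers for every request it sends. A small sketch (the `my-crawler/0.1` User-Agent is just a placeholder):

```python
import requests

with requests.Session() as session:
    # Headers set here are sent with every request made through this session
    session.headers.update({'User-Agent': 'my-crawler/0.1'})
    session.get('http://httpbin.org/cookies/set/foo/bar')
    response = session.get('http://httpbin.org/cookies')
    print(response.text)
```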
 
Certificate verification
```python
import requests

# verify=False skips SSL certificate verification
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
```
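With `verify=False`, urllib3 (which Requests depends on) emits an InsecureRequestWarning for every request. If you have decided to skip verification anyway, one way to silence it is a sketch like this:

```python
import requests
import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
```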
 
Proxy settings
```python
import requests

proxies = {
    'http': 'http://112.85.173.34:9999'
}
response = requests.get('http://httpbin.org/get', proxies=proxies)
print(response.text)
```
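The `proxies` dict can hold separate entries for http and https traffic, and credentials can be embedded in the proxy URL if the proxy requires them. A sketch (the `user:password` pair is a placeholder):

```python
import requests

proxies = {
    'http': 'http://user:password@112.85.173.34:9999',
    'https': 'http://user:password@112.85.173.34:9999',
}
response = requests.get('http://httpbin.org/get', proxies=proxies)
print(response.text)
```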
 
Timeout settings
```python
import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get('http://httpbin.org/get', timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print('TIME OUT')
```
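`timeout` also accepts a `(connect, read)` tuple when you want to limit the two phases separately; a sketch:

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

try:
    # 3.05 seconds to establish the connection, 10 seconds to read the response
    response = requests.get('http://httpbin.org/get', timeout=(3.05, 10))
    print(response.status_code)
except (ConnectTimeout, ReadTimeout):
    print('TIME OUT')
```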
 
Authentication
```python
import requests
from requests.auth import HTTPBasicAuth

response = requests.get('http://auth_demo.com', auth=HTTPBasicAuth('user', '123456'))
print(response.status_code)
```
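Requests also accepts a plain tuple for basic auth, which is shorthand for HTTPBasicAuth:

```python
import requests

# Equivalent to auth=HTTPBasicAuth('user', '123456')
response = requests.get('http://auth_demo.com', auth=('user', '123456'))
print(response.status_code)
```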
 
Exception handling
The most common exceptions defined by the requests library include:

- requests.ConnectionError: network connection error, such as a DNS lookup failure or a refused connection
- requests.HTTPError: an HTTP error response
- requests.URLRequired: a valid URL is missing
- requests.TooManyRedirects: the maximum number of redirects was exceeded
- requests.ConnectTimeout: timed out while connecting to the remote server
- requests.Timeout: the request timed out
 
```python
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    response = requests.get('http://httpbin.org/get', timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print('TIME OUT')
except HTTPError:
    print('HTTP Error')
except RequestException:
    print('Error')
```
 
The Robots protocol
The Robots protocol tells every crawler what a site's crawling policy is and asks crawlers to comply with it.
The rules live in a robots.txt file at the site root, for example www.zhihu.com/robots.txt, which states what may and may not be crawled.
```
User-agent: Googlebot
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*

...

User-Agent: *
Disallow: /
```
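The standard library can parse these rules for you, so a crawler can check a URL before fetching it. A minimal sketch using urllib.robotparser against the zhihu.com robots.txt shown above:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser('https://www.zhihu.com/robots.txt')
parser.read()  # download and parse the robots.txt rules

# can_fetch() reports whether the given User-Agent may crawl the URL
print(parser.can_fetch('Googlebot', 'https://www.zhihu.com/search'))
print(parser.can_fetch('*', 'https://www.zhihu.com/explore'))
```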
 
A special note: ignoring the Robots protocol may expose you to legal risk.