Day 102: Asynchronous Python with aiohttp - 纯洁的微笑博客

import aiohttp
import asyncio
from datetime import datetime
async def fetch(client):
    async with client.get('http://httpbin.org/get') as resp:
        assert resp.status == 200
        return await resp.text()
async def main():
    async with aiohttp.ClientSession() as client:
        html = await fetch(client)
        print(html)
loop = asyncio.get_event_loop()
tasks = []
for i in range(30):
    task = loop.create_task(main())
    tasks.append(task)
start = datetime.now()
# Wait for all 30 scheduled tasks rather than for one extra main() call
loop.run_until_complete(asyncio.wait(tasks))
end = datetime.now()
print("Time taken by the aiohttp crawler:")
print(end - start)

# Prints the content returned by the site
Time taken by the aiohttp crawler:
0:00:00.539416
The timing shows that fetching the site asynchronously with aiohttp took only about 0.5 seconds, roughly 80 times faster than the synchronous requests version.
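The gain comes from overlapping the time spent waiting on the network, not from making any single request faster. A minimal sketch of that effect using only the standard library: `fake_request` is a hypothetical stand-in that sleeps for 0.1 seconds instead of calling httpbin.org, and the count of 10 is arbitrary.

```python
import asyncio
import time

async def fake_request(delay=0.1):
    # Hypothetical stand-in for a network call: waits without blocking the loop
    await asyncio.sleep(delay)
    return "ok"

async def run_sequential(n):
    # Await each request before starting the next one
    return [await fake_request() for _ in range(n)]

async def run_concurrent(n):
    # Start all requests, then wait for them together
    return await asyncio.gather(*(fake_request() for _ in range(n)))

def timed(coro_func, n):
    # Run one of the coroutines above and measure the wall-clock time
    start = time.perf_counter()
    results = asyncio.run(coro_func(n))
    return results, time.perf_counter() - start

sequential_results, sequential_time = timed(run_sequential, 10)
concurrent_results, concurrent_time = timed(run_concurrent, 10)
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

The sequential run has to pay for every wait in turn, while the concurrent run pays roughly for the longest single wait.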
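Sharing one session
aiohttp.ClientSession() wraps a connection pool for the session and supports keep-alives by default. The official documentation recommends using a single ClientSession object in a program, rather than creating a new ClientSession for every connection as the example above does, unless the program talks to a large number of different services.
The example above can be modified as follows: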
import aiohttp
import asyncio
from datetime import datetime
async def fetch(client):
    print("Printing the ClientSession object")
    print(client)
    async with client.get('http://httpbin.org/get') as resp:
        assert resp.status == 200
        return await resp.text()
async def main():
    # One ClientSession is shared by all 30 requests
    async with aiohttp.ClientSession() as client:
        tasks = []
        for i in range(30):
            tasks.append(asyncio.create_task(fetch(client)))
        await asyncio.wait(tasks)
loop = asyncio.get_event_loop()
start = datetime.now()
loop.run_until_complete(main())
end = datetime.now()
print("Time taken by the aiohttp crawler:")
print(end - start)
# Repeated 30 times
Printing the ClientSession object
<aiohttp.client.ClientSession object at 0x1094aff98>
Time taken by the aiohttp crawler:
0:00:01.778045
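The timing shows that this run with a single ClientSession took about 3 times as long as the earlier run with a separate ClientSession per request (about 1.78 s versus 0.54 s). Every request printed the same ClientSession object, 0x1094aff98.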
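main() above waits with asyncio.wait(), which returns sets of finished and pending tasks and leaves the response bodies inside the task objects. When the results are needed in request order, asyncio.gather() is one alternative. In this sketch `fake_fetch` is a hypothetical stand-in for fetch(client), so that it runs without network access.

```python
import asyncio

async def fake_fetch(i):
    # Hypothetical stand-in for fetch(client); later requests finish sooner
    await asyncio.sleep(0.01 * (5 - i))
    return f"response {i}"

async def main():
    tasks = [asyncio.create_task(fake_fetch(i)) for i in range(5)]
    # gather() returns results in the order the tasks were passed in,
    # regardless of which one completes first
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(results)
```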
JSON responses
The examples above return the fetched content with resp.text(). For a JSON response, aiohttp can decode the body directly into a Python object with resp.json().
async def fetch(client):
    async with client.get('http://httpbin.org/get') as resp:
        return await resp.json()
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'Python/3.7 aiohttp/3.6.2'}, 'origin': '49.80.42.33, 49.80.42.33', 'url': 'https://httpbin.org/get'}
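When the returned body is not standard JSON, resp.json() accepts a custom decoder through its loads parameter. The decoder is a callable that receives the body text and returns the parsed object, so it can preprocess the text first, for example with str.replace(a, b) to replace a with b.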
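A minimal sketch of such a decoder, assuming the server returns JSON-like text with single quotes. `lenient_loads` is a hypothetical helper, not part of aiohttp, and its quote replacement is deliberately naive.

```python
import json

def lenient_loads(text):
    # Hypothetical preprocessing: turn single quotes into double quotes.
    # This is naive and would corrupt values that contain apostrophes.
    return json.loads(text.replace("'", '"'))

async def fetch(client):
    async with client.get('http://httpbin.org/get') as resp:
        # loads= replaces the default json.loads decoder
        return await resp.json(loads=lenient_loads)

print(lenient_loads("{'a': 1, 'b': [2, 3]}"))
```

If the server also labels the body with a non-JSON Content-Type, resp.json() accepts content_type=None to skip that check.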
aiohttp reads a byte stream with resp.read(); the result can then be saved as a file or image using with open().
async def fetch(client):
    async with client.get('http://httpbin.org/image/png') as resp:
        return await resp.read()
async def main():
    async with aiohttp.ClientSession() as client:
        image = await fetch(client)
        with open("/Users/xxx/Desktop/image.png", 'wb') as f:
            f.write(image)
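resp.read() always returns the whole body. To read only a given number of bytes, use the response's stream reader instead, for example resp.content.read(3) to read up to the first 3 bytes.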
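A partial read goes through resp.content, the response's stream reader. As a sketch, the function below reads only the first 8 bytes and checks them against the PNG file signature. `is_png` and `fetch_is_png` are hypothetical names introduced for illustration.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def is_png(prefix):
    # Every PNG file starts with this fixed 8-byte signature
    return prefix.startswith(PNG_SIGNATURE)

async def fetch_is_png(client):
    async with client.get('http://httpbin.org/image/png') as resp:
        # resp.content is a stream reader; readexactly(8) waits for exactly 8 bytes
        prefix = await resp.content.readexactly(8)
        return is_png(prefix)

print(is_png(PNG_SIGNATURE + b"rest of file"))
print(is_png(b"GIF89a"))
```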
aiohttp supports three ways of passing parameters in the URL: a list of pairs, a dict, or a string.
async def fetch(client):
    params = [('a', 1), ('b', 2)]
    async with client.get('http://httpbin.org/get',params=params) as resp:
        return await resp.text()
Example URL:
http://httpbin.org/get?a=1&b=2
async def fetch(client):
    params = {"a": 1,"b": 2}
    async with client.get('http://httpbin.org/get',params=params) as resp:
        return await resp.text()
Example URL:
http://httpbin.org/get?a=1&b=2
async def fetch(client):
    async with client.get('http://httpbin.org/get',params='q=aiohttp+python&a=1') as resp:
        return await resp.text()
Example URL:
http://httpbin.org/get?q=aiohttp+python&a=1
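Unlike a dict, a list of pairs can repeat a key. The sketch below sends two values for one key. `preview_query` is a hypothetical helper that uses the standard library's urlencode() to show the expected query string; it is not how aiohttp builds the URL internally. The key name tag and its values are made up for illustration.

```python
from urllib.parse import urlencode

def preview_query(params):
    # Stand-in preview using the standard library, not aiohttp's own URL handling
    return urlencode(params)

async def fetch(client):
    # The same key appears twice, which a dict cannot express
    params = [('tag', 'python'), ('tag', 'asyncio')]
    async with client.get('http://httpbin.org/get', params=params) as resp:
        return await resp.text()

print(preview_query([('tag', 'python'), ('tag', 'asyncio')]))
print(preview_query({'a': 1, 'b': 2}))
```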
Custom request headers are passed in much the same way as URL parameters.
async def fetch(client):
    headers = {'content-type': 'application/json', 'User-Agent': 'Python/3.7 aiohttp/3.7.2'}
    async with client.get('http://httpbin.org/get',headers=headers) as resp:
        return await resp.text()
Cookies
Cookies are shared across the whole session, so they should be passed when the ClientSession object is created.
async def fetch(client):
    async with client.get('http://httpbin.org/get') as resp:
        return await resp.text()
async def main():
    cookies = {'cookies': 'this is cookies'}
    async with aiohttp.ClientSession(cookies=cookies) as client:
        html = await fetch(client)
        print(html)
POST requests
All the examples so far submit GET requests. The next one uses POST.
async def fetch(client):
    data = {'a': '1', 'b': '2'}
    async with client.post('http://httpbin.org/post', data = data) as resp:
        return await resp.text()
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "a": "1",
    "b": "2"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "7",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python/3.7 aiohttp/3.6.2"
  },
  "json": null,
  "origin": "49.80.42.33, 49.80.42.33",
  "url": "https://httpbin.org/post"
}
Time taken by the aiohttp crawler:
0:00:00.514402
In the result above, the form field contains exactly the data submitted by the POST request.
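The Content-Length of 7 in the output matches the form-encoded body a=1&b=2. For APIs that expect a JSON body instead, client.post() also accepts a json= argument. The following is a sketch: the two helper variables only preview the request bodies with the standard library, on the assumption that aiohttp serializes json= with json.dumps by default.

```python
import json
from urllib.parse import urlencode

data = {'a': '1', 'b': '2'}

# Preview of the two request bodies, built with the standard library
form_body = urlencode(data)
json_body = json.dumps(data)

async def post_form(client):
    # Sent as application/x-www-form-urlencoded; httpbin echoes it under "form"
    async with client.post('http://httpbin.org/post', data=data) as resp:
        return await resp.text()

async def post_json(client):
    # Sent as application/json; httpbin echoes it under "json"
    async with client.post('http://httpbin.org/post', json=data) as resp:
        return await resp.text()

print(form_body, len(form_body))
print(json_body)
```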
Requests sometimes time out. aiohttp sets the limit through the timeout parameter, given in seconds; the default total timeout is 5 minutes.
async def fetch(client):
    async with client.get('http://httpbin.org/get', timeout=60) as resp:
        return await resp.text()
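aiohttp also provides a ClientTimeout structure for finer control over the limits. A minimal sketch, in which the 10-second connect limit is an arbitrary value chosen for illustration and returning None on timeout is just one possible way to handle the error:

```python
import asyncio
import aiohttp

# 60 seconds for the whole request, at most 10 of them to acquire a connection
TIMEOUT = aiohttp.ClientTimeout(total=60, connect=10)

async def fetch(client):
    try:
        async with client.get('http://httpbin.org/get', timeout=TIMEOUT) as resp:
            return await resp.text()
    except asyncio.TimeoutError:
        # Raised when the request exceeds the configured limits
        return None

print(TIMEOUT.total, TIMEOUT.connect)
```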
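Crawling sites asynchronously with aiohttp takes far less time than the synchronous requests approach. This article covered some commonly used aiohttp features; hopefully they are useful.
  Sample code: Python-100-day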