Asynchronous HTTP Requests With Python & AIOHTTP

Asynchronous programming has become increasingly popular and useful in Python, especially for tasks like web scraping that involve making multiple HTTP requests. In this comprehensive guide, we'll cover the fundamentals of asynchronous programming in Python and demonstrate how to make async HTTP requests using the AIOHTTP library.

What is Asynchronous Programming?

Asynchronous programming refers to code that allows multiple tasks to make progress without each one waiting for the previous to finish. This is enabled through asynchronous functions that can begin executing and then yield control back to the event loop while they wait on long-running operations like API calls or disk I/O.

The key advantage of asynchronous code is that it prevents blocking and allows other work to be done while waiting on network responses or other I/O. This makes asynchronous programs highly performant and scalable for I/O-bound tasks.

Python supports asynchronous programming through the asyncio module and the async/await syntax. A function declared with async def is a coroutine function; the asyncio event loop can run many such coroutines concurrently, and each await inside them marks a point where control is handed back to the loop.
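
As a minimal sketch of the syntax, the two coroutines below each await asyncio.sleep() to simulate I/O. Because the event loop interleaves them, the program finishes in about two seconds rather than three:

import asyncio

async def say_after(delay, message):
    # await suspends this coroutine without blocking the event loop
    await asyncio.sleep(delay)
    print(message)

async def main():
    # Run both coroutines concurrently; total time is ~2s, not 3s
    await asyncio.gather(say_after(1, 'first'), say_after(2, 'second'))

asyncio.run(main())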

Why Use Asynchronous Programming for Web Scraping?

Web scraping often involves making large numbers of network requests to crawl through websites and extract data. Using synchronous requests, each request would have to finish completely before the next one could be sent. This wastes time waiting on network I/O.

With asynchronous requests, multiple requests can be in-flight simultaneously. This allows web scrapers to achieve much higher throughput and fully utilize network bandwidth and remote server capacity.

Other benefits of asynchronous web scraping include:

  • Improved scalability for large crawls
  • Lower memory usage compared to multithreaded scraping
  • Simpler to coordinate and manage requests compared to multiprocessing
  • Avoid blocking the main thread, allowing better responsiveness

AIOHTTP for Asynchronous HTTP Requests

The AIOHTTP library provides a simple API for making HTTP requests asynchronously in Python. It is built on top of the asyncio module and enables writing asynchronous web scrapers easily.

Key features of AIOHTTP include:

  • ClientSession for managing connection pools and cookie persistence
  • Familiar request interface similar to the requests library
  • Async versions of all major HTTP methods (GET, POST, PUT, DELETE, etc.)
  • Both HTTP client and server support in one package
  • Timeout and connection limit settings
  • SSL and proxy support
  • Streaming response data
  • Client tracing hooks for instrumenting requests

In addition to simple requests, AIOHTTP provides more advanced functionality like websockets and connection pooling that make it suitable for building robust, production-ready asynchronous services.

Basic Asynchronous GET Request

Let's walk through a simple example of an asynchronous GET request with AIOHTTP:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://example.com')
        print(html)

asyncio.run(main())

Breaking this example down:

  • We define an async function fetch() that takes a session and a URL.
  • Inside, async with session.get() opens the request and await response.text() reads the body asynchronously.
  • The await keyword suspends execution while waiting on the network response.
  • Meanwhile, the event loop can switch to other tasks until the response arrives.
  • Back in main(), we create a ClientSession and call fetch(), printing the result.

This demonstrates the core async pattern of awaiting IO-bound operations like API calls to prevent blocking.

Batching Multiple Requests

To scrape data from multiple pages, we'll need to fire off requests concurrently. This can be done by creating tasks:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():

    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]

    async with aiohttp.ClientSession() as session:

        tasks = []
        for url in urls:
            task = asyncio.create_task(fetch(session, url))
            tasks.append(task)

        htmls = await asyncio.gather(*tasks)

        for html in htmls:
            print(html)

asyncio.run(main())

We create a list of URLs to fetch and loop through them, launching async tasks for each one using asyncio.create_task(fetch()). This schedules the GET requests to run concurrently.

asyncio.gather() awaits all the tasks together and returns their results in a list once they all complete.

This allows us to efficiently scrape data from multiple pages concurrently instead of waiting for each one sequentially.

Handling Timeouts

Sometimes web servers may hang or take too long to respond. We can implement request timeouts to avoid blocking indefinitely:

import asyncio
import aiohttp

TIMEOUT = 60

async def fetch(session, url):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            return await response.text()
    except asyncio.TimeoutError:
        return None

# main() omitted for brevity

Here we pass the timeout duration in seconds to the request. If the request takes longer than this, an asyncio.TimeoutError is raised, which we handle by returning None.

Timeouts prevent a bad response from one site from blocking the entire scraping process across multiple sites.
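
Recent AIOHTTP releases also accept an aiohttp.ClientTimeout object, which lets the total deadline and the connection timeout be set separately. A minimal sketch:

import asyncio
import aiohttp

# Fine-grained limits: 60s for the whole request, 10s to establish a connection
TIMEOUT = aiohttp.ClientTimeout(total=60, connect=10)

async def fetch(session, url):
    try:
        async with session.get(url, timeout=TIMEOUT) as response:
            return await response.text()
    except asyncio.TimeoutError:
        return None

async def main():
    # The timeout can also be applied session-wide as a default
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        print(await fetch(session, 'https://example.com'))

asyncio.run(main())

Applying the timeout session-wide saves passing a timeout argument on every individual request.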

Limiting Concurrency

AIOHTTP's default connector allows up to 100 simultaneous connections, but nothing stops us from scheduling far more tasks than a target site can comfortably handle. Too many concurrent requests can overload servers or exhaust local resources.

We can restrict concurrency using a Semaphore:

import asyncio
import aiohttp

MAX_CONCURRENT = 100
sem = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch(session, url):
    # The request only starts once a semaphore slot has been acquired
    async with sem, session.get(url) as response:
        return await response.text()

# main() omitted

The semaphore acts as a counter that limits access to a shared resource. We acquire it in the same async with statement as the request, so each request only starts once a slot is available.

Now, no more than 100 requests will execute concurrently. The rest will wait on the semaphore until slots free up as requests complete.
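
Another option is to cap the connection pool itself when creating the session. The sketch below uses aiohttp.TCPConnector, whose limit argument bounds total open connections and limit_per_host bounds connections to any single host:

import asyncio
import aiohttp

async def main():
    # At most 100 open connections overall, and 10 per host
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get('https://example.com') as response:
            print(response.status)

asyncio.run(main())

Capping per-host connections is often the politer choice when most of the requests go to a single site.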

Persisting Sessions

Many websites require logged-in sessions to access private data. AIOHTTP persists cookies across requests when you reuse a single ClientSession:

import asyncio
import aiohttp

LOGIN_DATA = {
    'username': '...',
    'password': '...'
}

async def login(session, url):
    async with session.post(url, data=LOGIN_DATA) as resp:
        return await resp.text()

async def fetch_profile(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main():

    async with aiohttp.ClientSession() as session:
        await login(session, 'https://example.com/login')
        html = await fetch_profile(session, 'https://example.com/profile')
        print(html)

asyncio.run(main())

Here we log in once at the start and then reuse the same session, which now holds the session cookies. This allows access to pages that require authorization.

Sessions avoid having to log in separately for each request. They persist cookies across requests like a normal web browser.
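
A session can also carry default headers and cookies that are sent with every request, which is handy for API tokens or cookies obtained elsewhere. A minimal sketch (the header and cookie values below are placeholders):

import asyncio
import aiohttp

async def main():
    # Defaults applied to every request made through this session
    headers = {'User-Agent': 'my-scraper/1.0', 'Authorization': 'Bearer <token>'}
    cookies = {'sessionid': '<cookie value>'}

    async with aiohttp.ClientSession(headers=headers, cookies=cookies) as session:
        async with session.get('https://example.com/profile') as response:
            print(response.status)

asyncio.run(main())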

Parsing JSON Responses

Many modern APIs return JSON data which can be directly parsed:

import asyncio
import aiohttp
import json

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.json()

async def main():
    async with aiohttp.ClientSession() as session:
        data = await fetch(session, 'https://api.example.com/data')
        print(json.dumps(data, indent=4)) 

asyncio.run(main())

We use response.json() to deserialize the JSON response body directly into a Python dict. Much cleaner than parsing the raw text ourselves.

For APIs, this saves having to handle JSON decoding manually.
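
Sending JSON is just as convenient: passing a dict via the json= argument serializes the body and sets the Content-Type header for you. A minimal sketch against a hypothetical endpoint:

import asyncio
import aiohttp

async def create_item(session, url, payload):
    # json= serializes the dict and sets Content-Type: application/json
    async with session.post(url, json=payload) as response:
        return await response.json()

async def main():
    async with aiohttp.ClientSession() as session:
        result = await create_item(session, 'https://api.example.com/items', {'name': 'widget'})
        print(result)

asyncio.run(main())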

Handling Errors and Retries

Sometimes requests fail due to network issues or service outages. We can implement retries to improve reliability:

import asyncio
import aiohttp

MAX_RETRIES = 3

async def fetch_with_retries(session, url):
    for retry in range(MAX_RETRIES):
        try:
            async with session.get(url) as response:
                return await response.text()
        except Exception:
            if retry < MAX_RETRIES - 1:
                continue
            raise


# main() omitted for brevity

If an exception occurs, we keep retrying until the request has been attempted three times, then propagate the error. This provides basic fault tolerance against intermittent network or server issues.

More advanced retry logic such as exponential backoff can be layered on top, as in the sketch below. The key is wrapping requests in retry logic rather than letting a single transient failure abort the whole crawl.
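
As one way to do this, the sketch below retries with exponentially growing delays plus a little random jitter, and only retries on client or timeout errors:

import asyncio
import random

import aiohttp

MAX_RETRIES = 3

async def fetch_with_backoff(session, url):
    for attempt in range(MAX_RETRIES):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_RETRIES - 1:
                raise
            # Wait 1s, then 2s, then 4s, plus jitter, before the next attempt
            await asyncio.sleep(2 ** attempt + random.random())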

Client-Side Rate Limiting

To avoid overloading servers, we can implement client-side rate limiting using asyncio.sleep():

import asyncio
import aiohttp

DELAY = 1 # seconds between requests

async def fetch(session, url):
    async with session.get(url) as response:
        html = await response.text()
    # Pause after each response so sequential calls are spaced out
    await asyncio.sleep(DELAY)
    return html

This pauses for one second after each response. When requests are awaited one after another, it caps the scraper at roughly one request per second; when many fetch() tasks run concurrently, their sleeps overlap, so the delay alone does not bound the overall rate.

For stricter policies, a shared lock or token-bucket style limiter can space requests out across all tasks, as in the sketch below.
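
A minimal sketch of such a limiter (RateLimiter here is our own helper, not part of AIOHTTP):

import asyncio
import time

import aiohttp

class RateLimiter:
    """Allow roughly max_rate requests per second across all tasks."""

    def __init__(self, max_rate):
        self._interval = 1.0 / max_rate
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        async with self._lock:
            # Sleep just long enough to keep successive requests spaced out
            delay = self._interval - (time.monotonic() - self._last)
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

async def fetch(session, limiter, url):
    await limiter.wait()
    async with session.get(url) as response:
        return await response.text()

async def main():
    limiter = RateLimiter(max_rate=5)  # ~5 requests per second overall
    urls = ['https://example.com/page1', 'https://example.com/page2']
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, limiter, u) for u in urls))
        print(len(pages))

asyncio.run(main())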

Client rate limiting prevents servers from banning us for making too many rapid requests.

Batching Requests for Efficiency

Making individual requests for each URL can be inefficient. To minimize latency, we can batch multiple requests together:

import asyncio
import aiohttp


# Reuses the fetch() coroutine defined earlier
async def bulk_fetch(session, urls):
    tasks = []
    for url in urls:
        task = asyncio.create_task(fetch(session, url))
        tasks.append(task)

    results = await asyncio.gather(*tasks)
    return results

# main() omitted

Rather than awaiting each request in turn, we launch them all as tasks first and then await the batched results together with asyncio.gather().

This lets the requests start together so their network round trips overlap, while still retaining asynchronous concurrency.

For hundreds of URLs, batching dramatically cuts the idle time a sequential scraper would spend waiting between calls.
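
For very large crawls, one common refinement (sketched below, reusing the fetch() helper from earlier) is to work through the URL list in fixed-size slices so that the number of in-flight requests stays bounded:

import asyncio
import aiohttp

BATCH_SIZE = 50

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def crawl(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        # Work through the list one slice at a time
        for i in range(0, len(urls), BATCH_SIZE):
            batch = urls[i:i + BATCH_SIZE]
            results.extend(await asyncio.gather(*(fetch(session, u) for u in batch)))
    return results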

Streaming Response Data

For large responses, we may want to process the data incrementally rather than buffering everything in memory.

AIOHTTP enables streaming using the content attribute:

async def fetch(session, url):
    async with session.get(url) as response:
        # Iterate over the body in fixed-size chunks as it arrives
        async for chunk in response.content.iter_chunked(1024):
            process_data(chunk)  # process_data() is a placeholder for your own handling

The response body can be iterated asynchronously with iter_chunked() to yield pieces of data as they arrive from the server.

This avoids having to load massive responses into memory. The data can be processed and discarded incrementally.

This is useful for pulling key information out of large files or downloading big media assets encountered while scraping.
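
As a concrete example, the sketch below streams a large download straight to disk in 64 KB chunks. The plain file writes are blocking but usually fast enough for local disk; a thread offload or the aiofiles package could be used for heavier disk I/O:

import asyncio
import aiohttp

async def download(session, url, path):
    async with session.get(url) as response:
        # Write the body to disk in 64 KB chunks instead of buffering it all
        with open(path, 'wb') as f:
            async for chunk in response.content.iter_chunked(64 * 1024):
                f.write(chunk)

async def main():
    async with aiohttp.ClientSession() as session:
        await download(session, 'https://example.com/big-file.zip', 'big-file.zip')

asyncio.run(main())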

Concurrency vs Parallelism

An important distinction in asynchronous programming is concurrency vs parallelism:

  • Concurrency means multiple tasks make progress over the same period by interleaving their execution on a single thread. The tasks share CPU time.

  • Parallelism is when multiple tasks literally run at the exact same time across multiple CPUs/cores.

Asyncio enables concurrency using a single-threaded event loop. Because it uses cooperative multitasking, computations run concurrently but not in parallel.

To utilize multiple cores, multiprocessing can be used in addition to asyncio. But concurrency is often sufficient and simpler for IO-heavy workloads.
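
As an illustration of combining the two, CPU-heavy work such as parsing can be handed to a process pool from inside the event loop, so network I/O stays concurrent while parsing runs in parallel. A minimal sketch, with parse_html() standing in for the CPU-bound step:

import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp

def parse_html(html):
    # Stand-in for CPU-bound parsing work
    return len(html)

async def fetch_and_parse(session, url, pool):
    async with session.get(url) as response:
        html = await response.text()
    loop = asyncio.get_running_loop()
    # Hand the CPU-bound step to a worker process so the event loop stays free
    return await loop.run_in_executor(pool, parse_html, html)

async def main():
    with ProcessPoolExecutor() as pool:
        async with aiohttp.ClientSession() as session:
            print(await fetch_and_parse(session, 'https://example.com', pool))

# The guard lets worker processes import this module safely
if __name__ == '__main__':
    asyncio.run(main())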

Pros and Cons of Asynchronous Programming

Here are some key benefits and downsides of async programming:

Pros:

  • Prevent blocking for higher throughput, scalability
  • Often simpler than multithreading: fewer locks and data races to manage
  • Lightweight coroutines have low overhead vs threads
  • Non-blocking design enables reactive, real-time systems

Cons:

  • Asynchronous code can be harder to reason about
  • Callbacks/chaining can obscure program flow vs sync code
  • Debugging and testing is more complex
  • CPU-bound tasks don't benefit from non-blocking calls

The benefits typically outweigh the downsides for IO-heavy tasks like web scraping. But async shouldn't be used blindly or it can make simple code complex.

Conclusion

In this guide, we covered the fundamentals of asynchronous programming in Python and how it can be applied to make non-blocking HTTP requests using AIOHTTP for efficient web scraping.

Key topics included:

  • How asyncio and async/await enable concurrency in Python
  • Why asynchronous requests help speed up web scraping workloads
  • Performing basic and advanced async operations with AIOHTTP
  • Patterns like batching, error handling and rate limiting
  • Differences between concurrency and parallelism

By embracing asynchronous programming, you can make your Python web scrapers significantly faster, more scalable and resilient when dealing with multiple requests.

AIOHTTP provides a clean and idiomatic API for async HTTP backed by the solid foundation of asyncio. It removes much of the complexity traditionally associated with non-blocking I/O.

To learn more about asynchronous programming, check out the official asyncio and AIOHTTP documentation.
