How to Bypass CAPTCHA in Web Scraping Using Python

If you've done any amount of web scraping, you've likely encountered them – those pesky CAPTCHA challenges that grind your automated data collection to a halt.

CAPTCHAs are everywhere, with over 60% of the top 10,000 sites using them as of 2020. They come in all shapes and sizes – distorted text, image-selection grids, garbled audio – but all share the same goal of separating scrapers from real human users.

In this comprehensive guide, we'll cover proven methods for bypassing CAPTCHAs in your Python web scraping projects. I've been working in web data collection for over 5 years and have had to solve my fair share of CAPTCHA puzzles.

Here are the techniques we'll cover:

  • Using CAPTCHA solving services
  • Leveraging scrapers with built-in solving
  • Masking scrapers with proxies
  • Simulating humans with browser automation
  • Employing scrapers with avoidance capabilities

We'll look at code examples, use cases, and recommendations for each method. I'll also share hard-won advice on how to combine multiple solutions for maximum effectiveness.

Let's start at the beginning – what are CAPTCHAs, and why do they cause so many headaches for well-intentioned scrapers?

What is a CAPTCHA and Why It's a Problem

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. The term was coined in the early 2000s by researchers at Carnegie Mellon University.

The goal of a CAPTCHA is to allow humans to pass through while blocking bots and scrapers. This protects online systems from abuse like brute force password cracking or content scraping.

Some examples of common CAPTCHA types:

  • Text CAPTCHAs – The classic distorted text characters that are tough for computers to recognize.
  • Image CAPTCHAs – Selecting all images that match a category like "cars" or "roads".
  • reCAPTCHA – Google's CAPTCHA, which analyzes your behavior and may not show a visible challenge at all.
  • hCaptcha – A popular alternative to reCAPTCHA used by sites like Ticketmaster and GitHub.
  • Audio CAPTCHAs – Spoken letters/numbers with background noise to avoid speech recognition.

[Image: a typical hard-to-read text CAPTCHA]

Sites may show CAPTCHAs during account registration, logins, submitting forms, or when suspicious scraping activity is detected.

This creates a big problem for legitimate scrapers. We just want easy access to public data, while CAPTCHAs constantly interrupt the data collection process.

Manually solving CAPTCHAs at scale is painfully slow. Imagine trying to scrape thousands of pages while having to stop and solve a new challenge every single time.

That's why we need ways to get around these pesky (but effective) bot blockers. Let's look at the various options for bypassing CAPTCHAs in your Python web scrapers.

Option 1: Using a CAPTCHA Solving Service

One of the most popular methods is to use a specialized CAPTCHA solving service. These services employ real humans to manually solve CAPTCHA challenges around the clock.

Here are some of the most widely used CAPTCHA solving services for web scraping:

| Service | Accuracy | Pricing | APIs |
| --- | --- | --- | --- |
| Anti-Captcha | High | $2-3 per 1000 CAPTCHAs | Python, Java, C#, Go, PHP, Ruby |
| DeathByCaptcha | High | $1.39 per 1000 solves | Python, PHP, Ruby, Java, C#, Perl |
| 2Captcha | Good | $2.99 per 1000 solves | PHP, Python, Java, Ruby, C, C#, JS |
| EndCaptcha | Good | $2 per 1000 solves | Python, Java, PHP, C#, Ruby, JS |

The basic approach is:

  1. Your bot or scraper encounters a CAPTCHA and extracts the challenge image/audio.
  2. The CAPTCHA data gets sent to the API of the solving service.
  3. Human solvers working for the service manually solve the challenge.
  4. The API returns the correct solution for you to input on the form.

This allows you to defeat most CAPTCHAs automatically, without any manual effort on your end. The service takes care of the heavy lifting.

Here's a Python example using the Anti-Captcha API:

import base64
import time

import requests

API_KEY = 'YOUR_API_KEY'
PAGE_URL = 'https://targetsite.com/page-to-scrape'

# Get the CAPTCHA image URL from the page
# (assumes the target exposes it in a JSON response)
page = requests.get(PAGE_URL)
captcha_url = page.json()['captcha_image_url']

# Download the image and base64-encode it, as ImageToTextTask expects
captcha_image = base64.b64encode(requests.get(captcha_url).content).decode()

# Send to Anti-Captcha to be solved
api_request = {
    'clientKey': API_KEY,
    'task': {
        'type': 'ImageToTextTask',
        'body': captcha_image,
        'phrase': False,
        'case': False,
        'numeric': 0,
        'math': 0,
        'minLength': 0,
        'maxLength': 0
    }
}

solve_response = requests.post(
    'https://api.anti-captcha.com/createTask',
    json=api_request
)

# Get task ID of the new solving job
task_id = solve_response.json()['taskId']

# Poll the API until the challenge is solved
while True:
    time.sleep(3)
    response = requests.post(
        'https://api.anti-captcha.com/getTaskResult',
        json={'clientKey': API_KEY, 'taskId': task_id}
    )
    if response.json()['status'] == 'ready':
        break

# Submit the CAPTCHA solution
captcha_solution = response.json()['solution']['text']
requests.post(PAGE_URL, data={'captcha_response': captcha_solution})

We extract the CAPTCHA image, send it to Anti-Captcha, poll the API until it's solved, and submit the response.

Pros:

  • Solves any CAPTCHA automatically without human effort
  • Very accurate solutions
  • Excellent for large scraping projects

Cons:

  • Can get expensive at scale depending on pricing model
  • Many CAPTCHA systems now pair challenges with behavioral bot detection, so a correct solution alone isn't always enough
  • Need to handle API integration

I'd recommend CAPTCHA solving services if you anticipate needing to bypass a high volume of challenges. While the costs add up, it really is the most reliable and hands-off method currently available.

Option 2: Using a Scraper With Built-In CAPTCHA Solving

Rather than integrating an external API, some pre-built web scraping tools come with CAPTCHA solving capabilities built-in.

This means you don't have to worry about orchestrating the API calls – the scraper handles it behind the scenes.

Here are some popular Python scraping libraries with built-in solvers:

Scrapy

  • Web scraping framework for Python
  • Has no native solver, but community downloader middlewares wire 2Captcha (or Anti-Captcha) into the request/response cycle so image and reCAPTCHA challenges are solved transparently.
  • Usage (in settings.py – the exact setting names depend on the middleware you install; these are illustrative):
# Enable CAPTCHA solving via a third-party downloader middleware
IMAGES_STORE = 'captcha_images'
CAPTCHA_SOLVER = 'captcha_solver_name'  # e.g. a 2Captcha-backed solver

# Rest of scraper code
# When a CAPTCHA is encountered, the middleware automatically
# calls the 2Captcha API and injects the solution
Python Requests-HTML

  • Python library for rendering pages and interacting with HTML
  • No native reCAPTCHA solver, but its Chromium rendering makes it easy to detect a challenge and hand the sitekey off to a service like 2Captcha
  • Usage (solve_recaptcha() is a placeholder for your solving-service client):
from requests_html import HTMLSession

PAGE_URL = 'https://targetsite.com/page-with-captcha'
session = HTMLSession()

# Request and render the page that may contain a reCAPTCHA
resp = session.get(PAGE_URL)
resp.html.render()

# Detect the challenge and extract its sitekey
recaptcha = resp.html.find('.g-recaptcha', first=True)
if recaptcha:
    site_key = recaptcha.attrs['data-sitekey']
    token = solve_recaptcha(site_key, PAGE_URL)  # call your solving API here

The benefit here is convenience – the libraries handle the CAPTCHA provider integration for you.

But there are some downsides:

  • Limited to just one CAPTCHA solving provider
  • Less control over solving configuration
  • Typically only works for image and ReCaptcha challenges

I'd recommend trying this kind of integrated solving in your existing pipeline first to see if it fits your needs. If more flexibility or scale is needed, you can always integrate an external API directly later.

Option 3: Using Proxies to Bypass CAPTCHAs

Proxies are one of the most reliable methods for avoiding CAPTCHAs altogether.

A proxy acts as an intermediary for your requests. It masks the real IP address and location of your scraper. This makes the target site think it's receiving organic user traffic rather than bots.

Here are some popular proxy services used for web scraping at scale:

| Provider | Network size | Geotargeting | Pricing | Rotations |
| --- | --- | --- | --- | --- |
| Luminati | 30+ million IPs | Yes | $500+ per month | Unlimited |
| Oxylabs | 100+ million IPs | Yes | Pay per GB of data | Unlimited |
| Smartproxy | 40 million IPs | Yes | $75+ per month | Unlimited |
| GeoSurf | 23 million IPs | Yes | $350+ per month | Unlimited |

These are known as residential proxies – IP addresses from real devices like phones, laptops, etc. around the world.

Because the traffic appears to come from many different organic users, the target site is unlikely to detect bot patterns and serve a CAPTCHA.

Here is an example using the Python Requests module with residential proxies from Luminati (the credentials and zone below are placeholders):

import requests

# Luminati residential proxy credentials – replace with your own
USERNAME = 'lum-customer-CUSTOMER_ID-zone-ZONE_NAME'
PASSWORD = 'ZONE_PASSWORD'
PROXY_HOST = 'zproxy.lum-superproxy.io:22225'

proxy_url = f'http://{USERNAME}:{PASSWORD}@{PROXY_HOST}'

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

requests.get('https://targetpage.com', proxies=proxies)

We configure the Luminati proxy credentials and pass them to the Requests module. Now all traffic is routed through Luminati's network, masking our scraper's real IP.

The key is proxy rotation – dynamically changing IPs with each request. This prevents the site from seeing repeat traffic from the same IP and detecting suspicious patterns.
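
As a minimal sketch (assuming you already have a list of proxy endpoints from your provider – most residential services also rotate IPs automatically on their end), per-request rotation can be as simple as cycling through the pool:

import itertools
import requests

# Hypothetical pool of proxy endpoints from your provider
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
]
proxy_pool = itertools.cycle(PROXIES)

urls = ['https://targetpage.com/page/%d' % i for i in range(1, 6)]

for url in urls:
    proxy = next(proxy_pool)  # different IP for each request
    resp = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)
    print(url, resp.status_code)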

Pros of Proxies:

  • Avoid CAPTCHAs without needing to solve anything
  • Residential IPs appear as real user traffic
  • Easy to integrate with existing scraper code

Cons:

  • Adds latency – residential proxies are slower than data center IPs
  • Proxy networks can get blocked by sites enacting IP bans
  • Monthly costs can get high for large projects

I'd recommend proxies as the first line of defense in your CAPTCHA avoidance strategy. While not 100% foolproof, they are highly effective at avoiding challenges while maintaining scraping speed.

Option 4: Browser Automation to Avoid CAPTCHAs

Browser automation tools like Selenium and Puppeteer allow you to programmatically control a real browser such as Chrome or Firefox.

The advantage is that the target site just sees a regular browser acting as a normal user would. This makes CAPTCHA detection much less likely.

Here we'll focus on Selenium to demonstrate how browser automation avoids CAPTCHAs:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

# Navigate to page
driver.get('http://targetpage.com')

# Interact with elements
driver.find_element(By.ID, 'search').send_keys('Hello World')
driver.find_element(By.ID, 'submit').click()

# CAPTCHAs are less likely to appear
# than with plain requests or Scrapy

html = driver.page_source
driver.quit()

This launches a real Chrome browser and navigates to the site. The scraper can then interact naturally with page elements like a human user.

Key Benefits

  • A real browser better mimics organic traffic
  • Clicking, scrolling, and keyboard input appear natural
  • Enables JavaScript rendering

Downsides:

  • Slower page load times than pure HTTP requests
  • CAPTCHAs can still appear if the site is very sensitive

I'd recommend starting off with a headless browser before trying proxies or solving services. Browser automation has come a long way in mimicking human behavior.

For best results, be sure to implement organic wait times and interactions.
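
Below is a minimal sketch of what "organic" behavior can look like with Selenium – the element IDs and delay ranges are illustrative and would be tuned per site:

import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

def human_pause(low=1.0, high=3.5):
    # Random think-time between actions instead of firing them instantly
    time.sleep(random.uniform(low, high))

driver = webdriver.Chrome()
driver.get('http://targetpage.com')
human_pause()

# Scroll down gradually, the way a reader would
for _ in range(3):
    driver.execute_script('window.scrollBy(0, 400);')
    human_pause(0.5, 1.5)

# Type a query one character at a time rather than pasting it
search_box = driver.find_element(By.ID, 'search')  # illustrative element id
for char in 'hello world':
    search_box.send_keys(char)
    time.sleep(random.uniform(0.05, 0.2))

human_pause()
search_box.submit()
driver.quit()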

Option 5: Scrapers With Built-In Avoidance

There are now some scraper tools built specifically with CAPTCHA avoidance capabilities:

ScrapeOps

  • Uses proxy networks, browsers, and AI behavior
  • Also has integrated solving as backup
  • Pricing starts at $199/month

ParseHub

  • Template based web scraper
  • Has crawler engine designed to avoid detection
  • Plans start at $99/month

ProxyCrawl

  • Headless browsers + proxy network
  • Fingerprint randomization to avoid patterns
  • Starts at $49/month

These tools handle CAPTCHA avoidance for you automatically. Benefits are:

  • Avoid spending time building workarounds manually
  • Leverage advanced techniques like AI and fingerprint randomization

Downsides:

  • Monthly subscription fees can add up
  • Limited flexibility compared to custom solutions

I'd recommend purpose-built scrapers once you start scaling up data collection. The subscription costs are usually outweighed by the engineering time you'd otherwise spend building and maintaining your own CAPTCHA avoidance infrastructure.

Combining Multiple Methods for Best Results

For best results bypassing CAPTCHAs, I recommend combining multiple techniques like:

  • Proxies to prevent detection
  • Browser automation to simulate natural behavior
  • Solving APIs to handle challenges as needed

For example, you could route traffic through proxy networks and headless Chrome to avoid most CAPTCHAs, and as a failsafe, integrate the Anti-Captcha API for any challenge that still appears.
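
Here is a rough sketch of that layered flow. The proxy URL is a placeholder, solve_captcha() stands in for whichever solving API you integrate, and the requests.get() call is where the browser-automation layer from Option 4 would slot in:

import requests

PROXY = 'http://user:pass@proxy.example.com:8000'  # placeholder proxy endpoint
PROXIES = {'http': PROXY, 'https': PROXY}

def looks_like_captcha(html):
    # Crude detection based on common challenge markers
    markers = ['g-recaptcha', 'h-captcha', 'cf-challenge']
    return any(marker in html for marker in markers)

def solve_captcha(html, url):
    # Placeholder: extract the sitekey and call your solving service here
    raise NotImplementedError

def fetch(url):
    # Layer 1: masked request through the proxy network
    resp = requests.get(url, proxies=PROXIES, timeout=30)

    # Layer 2 failsafe: only pay for a solve when a challenge actually appears
    if looks_like_captcha(resp.text):
        token = solve_captcha(resp.text, url)
        # How the token is submitted back is site-specific; a form field
        # named 'g-recaptcha-response' is the most common pattern
        resp = requests.post(url, proxies=PROXIES,
                             data={'g-recaptcha-response': token}, timeout=30)
    return resp

page = fetch('https://targetpage.com')
print(page.status_code)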

This defense-in-depth approach provides layers of redundancy:

[Image: CAPTCHA bypass layers]

Some other tips for combining methods:

  • Use different proxy networks for registration vs. data collection
  • Enable browser automation just for CAPTCHA pages
  • Limit API calls for CAPTCHAs encountered sporadically

Take the time to understand your target sites' tolerance for scraping. A site with more basic challenges may be easily fooled with just proxies.

But highly sensitive sites like Facebook or Google will likely require multiple bypass techniques working in concert.

Bypassing CAPTCHAs: Conclusion

CAPTCHAs are one of the biggest roadblocks faced when building web scrapers. But with a variety of techniques available, they can be overcome:

  • Solving APIs provide reliable automation but can get expensive
  • Built-in solvers simplify the integration work
  • Proxies are highly effective for avoidance and prevent detection
  • Browser automation mimics human behavior making CAPTCHAs less likely
  • Scrapers with built-in avoidance handle the heavy lifting automatically

For best results, I recommend combining multiple methods like proxies, browsers, and solving services as needed.

The specific strategies will depend on your target sites' level of bot detection sensitivity. Easy targets may only require simple proxies or browsers.

High-security sites like Google and Cloudflare-protected properties can detect advanced bots. These will likely require multiple layers – proxies, browsers, and solver APIs – to reliably bypass CAPTCHAs at scale.

Hopefully this guide has provided both technical and strategic advice for defeating CAPTCHAs in your web scraping projects. I invite you to contact me if you have any other questions! I'm always happy to chat more about proxy services or my experiences in web data collection.
