How to Bypass CAPTCHA in Web Scraping Using Python

If you've done any amount of web scraping, you've likely encountered them – those pesky CAPTCHA challenges that grind your automated data collection to a halt.

CAPTCHAs are everywhere, with over 60% of the top 10,000 sites using them as of 2020. They come in all shapes and sizes – distorted text, image-selection grids, garbled audio – but all share the same goal of separating scrapers from real human users.

In this comprehensive guide, we'll cover proven methods for bypassing CAPTCHAs in your Python web scraping projects. I've been working in web data collection for over 5 years and have had to solve my fair share of CAPTCHA puzzles.

Here are the techniques we'll cover:

  • Using CAPTCHA solving services
  • Leveraging scrapers with built-in solving
  • Masking scrapers with proxies
  • Simulating humans with browser automation
  • Employing scrapers with avoidance capabilities

We'll look at code examples, use cases, and recommendations for each method. I'll also share hard-won advice on how to combine multiple solutions for maximum effectiveness.

Let's start at the beginning – what are CAPTCHAs, and why do they cause so many headaches for well-intentioned scrapers?

What is a CAPTCHA and Why It's a Problem

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. The term was coined in the early 2000s by researchers at Carnegie Mellon University.

The goal of a CAPTCHA is to allow humans to pass through while blocking bots and scrapers. This protects online systems from abuse like brute force password cracking or content scraping.

Some examples of common CAPTCHA types:

  • Text CAPTCHAs – The classic distorted text characters that are tough for computers to recognize.
  • Image CAPTCHAs – Selecting all images that match a category like "cars" or "roads".
  • reCAPTCHA – Google's CAPTCHA, which analyzes your behavior and may not show a visible challenge at all.
  • hCaptcha – A popular alternative to reCAPTCHA used by sites like Ticketmaster and GitHub.
  • Audio CAPTCHAs – Spoken letters/numbers with background noise to avoid speech recognition.

[Image: a typical hard-to-read text CAPTCHA]

Sites may show CAPTCHAs during account registration, logins, submitting forms, or when suspicious scraping activity is detected.

This creates a big problem for legitimate scrapers. We just want easy access to public data, while CAPTCHAs constantly interrupt the data collection process.

Manually solving CAPTCHAs at scale is painfully slow. Imagine trying to scrape thousands of pages while having to stop and solve a new challenge every single time.

That's why we need ways to get around these pesky (but effective) bot blockers. Let's look at the various options for bypassing CAPTCHAs in your Python web scrapers.

Option 1: Using a CAPTCHA Solving Service

One of the most popular methods is to use a specialized CAPTCHA solving service. These services employ real humans to manually solve CAPTCHA challenges around the clock.

Here are some of the most widely used CAPTCHA solving services for web scraping:

| Service | Accuracy | Pricing | APIs |
| --- | --- | --- | --- |
| Anti-Captcha | High | $2-3 per 1000 CAPTCHAs | Python, Java, C#, Go, PHP, Ruby |
| DeathByCaptcha | High | $1.39 per 1000 solves | Python, PHP, Ruby, Java, C#, Perl |
| 2Captcha | Good | $2.99 per 1000 solves | PHP, Python, Java, Ruby, C, C#, JS |
| EndCaptcha | Good | $2 per 1000 solves | Python, Java, PHP, C#, Ruby, JS |

The basic approach is:

  1. Your bot or scraper encounters a CAPTCHA and extracts the challenge image/audio.
  2. The CAPTCHA data gets sent to the API of the solving service.
  3. Human solvers working for the service manually solve the challenge.
  4. The API returns the correct solution for you to input on the form.

This allows you to defeat most CAPTCHAs automatically, without any manual effort on your end. The service takes care of the heavy lifting.

Here's a Python example using the Anti-Captcha API:

import base64
import time

import requests

API_KEY = 'YOUR_API_KEY'
PAGE_URL = 'https://targetsite.com/page-to-scrape'

# Get the CAPTCHA image URL from the page
# (assumes the target exposes it in a JSON response)
page = requests.get(PAGE_URL)
captcha_url = page.json()['captcha_image_url']

# Download the image and base64-encode it, as ImageToTextTask expects
captcha_image = base64.b64encode(requests.get(captcha_url).content).decode()

# Send to Anti-Captcha to be solved
api_request = {
    'clientKey': API_KEY,
    'task': {
        'type': 'ImageToTextTask',
        'body': captcha_image,
        'phrase': False,
        'case': False,
        'numeric': 0,
        'math': 0,
        'minLength': 0,
        'maxLength': 0
    }
}

solve_response = requests.post(
    'https://api.anti-captcha.com/createTask',
    json=api_request
)

# Get task ID of the new solving job
task_id = solve_response.json()['taskId']

# Poll the API until the challenge is solved
while True:
    time.sleep(3)
    response = requests.post(
        'https://api.anti-captcha.com/getTaskResult',
        json={'clientKey': API_KEY, 'taskId': task_id}
    )
    if response.json()['status'] == 'ready':
        break

# Submit the CAPTCHA solution
captcha_solution = response.json()['solution']['text']
requests.post(PAGE_URL, data={'captcha_response': captcha_solution})

We extract the CAPTCHA image, send it to Anti-Captcha, poll the API until it's solved, and submit the response.

Pros:

  • Solves any CAPTCHA automatically without human effort
  • Very accurate solutions
  • Excellent for large scraping projects

Cons:

  • Can get expensive at scale depending on pricing model
  • Many CAPTCHA systems now pair challenges with behavioral bot detection, so a correct solution alone isn't always enough
  • Need to handle API integration

I'd recommend CAPTCHA solving services if you anticipate needing to bypass a high volume of challenges. While the costs add up, it really is the most reliable and hands-off method currently available.

Option 2: Using a Scraper With Built-In CAPTCHA Solving

Rather than integrating an external API, some pre-built web scraping tools come with CAPTCHA solving capabilities built-in.

This means you don't have to worry about orchestrating the API calls – the scraper handles it behind the scenes.

Here are some popular Python scraping libraries with built-in solvers:

Scrapy

  • Web scraping framework for Python
  • Has no native solver, but community downloader middlewares wire 2Captcha (or Anti-Captcha) into the request/response cycle so image and reCAPTCHA challenges are solved transparently.
  • Usage (in settings.py – the exact setting names depend on the middleware you install; these are illustrative):
# Enable CAPTCHA solving via a third-party downloader middleware
IMAGES_STORE = 'captcha_images'
CAPTCHA_SOLVER = 'captcha_solver_name'  # e.g. a 2Captcha-backed solver

# Rest of scraper code
# When a CAPTCHA is encountered, the middleware automatically
# calls the 2Captcha API and injects the solution
Python Requests-HTML

  • Python library for rendering pages and interacting with HTML
  • No native reCAPTCHA solver, but its Chromium rendering makes it easy to detect a challenge and hand the sitekey off to a service like 2Captcha
  • Usage (solve_recaptcha() is a placeholder for your solving-service client):
from requests_html import HTMLSession

PAGE_URL = 'https://targetsite.com/page-with-captcha'
session = HTMLSession()

# Request and render the page that may contain a reCAPTCHA
resp = session.get(PAGE_URL)
resp.html.render()

# Detect the challenge and extract its sitekey
recaptcha = resp.html.find('.g-recaptcha', first=True)
if recaptcha:
    site_key = recaptcha.attrs['data-sitekey']
    token = solve_recaptcha(site_key, PAGE_URL)  # call your solving API here

The benefit here is convenience – the libraries handle the CAPTCHA provider integration for you.

But there are some downsides:

  • Limited to just one CAPTCHA solving provider
  • Less control over solving configuration
  • Typically only works for image and ReCaptcha challenges

I'd recommend trying this kind of integrated solving in your existing pipeline first to see if it fits your needs. If more flexibility or scale is needed, you can always integrate an external API directly later.

Option 3: Using Proxies to Bypass CAPTCHAs

Proxies are one of the most reliable methods for avoiding CAPTCHAs altogether.

A proxy acts as an intermediary for your requests. It masks the real IP address and location of your scraper. This makes the target site think it's receiving organic user traffic rather than bots.

Here are some popular proxy services used for web scraping at scale:

| Provider | Network size | Geotargeting | Pricing | Rotations |
| --- | --- | --- | --- | --- |
| Luminati | 30+ million IPs | Yes | $500+ per month | Unlimited |
| Oxylabs | 100+ million IPs | Yes | Pay per GB of data | Unlimited |
| Smartproxy | 40 million IPs | Yes | $75+ per month | Unlimited |
| GeoSurf | 23 million IPs | Yes | $350+ per month | Unlimited |

These are known as residential proxies – IP addresses from real devices like phones, laptops, etc. around the world.

Because the traffic appears to come from many different organic users, the target site is unlikely to detect bot patterns and serve a CAPTCHA.

Here is an example using the Python Requests module with residential proxies from Luminati (the credentials and zone below are placeholders):

import requests

# Luminati residential proxy credentials – replace with your own
USERNAME = 'lum-customer-CUSTOMER_ID-zone-ZONE_NAME'
PASSWORD = 'ZONE_PASSWORD'
PROXY_HOST = 'zproxy.lum-superproxy.io:22225'

proxy_url = f'http://{USERNAME}:{PASSWORD}@{PROXY_HOST}'

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

requests.get('https://targetpage.com', proxies=proxies)

We configure the Luminati proxy credentials and pass them to the Requests module. Now all traffic is routed through Luminati's network, masking our scraper's real IP.

The key is proxy rotation – dynamically changing IPs with each request. This prevents the site from seeing repeat traffic from the same IP and detecting suspicious patterns.
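
As a minimal sketch (assuming you already have a list of proxy endpoints from your provider – most residential services also rotate IPs automatically on their end), per-request rotation can be as simple as cycling through the pool:

import itertools
import requests

# Hypothetical pool of proxy endpoints from your provider
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
]
proxy_pool = itertools.cycle(PROXIES)

urls = ['https://targetpage.com/page/%d' % i for i in range(1, 6)]

for url in urls:
    proxy = next(proxy_pool)  # different IP for each request
    resp = requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)
    print(url, resp.status_code)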

Pros of Proxies:

  • Avoid CAPTCHAs without needing to solve anything
  • Residential IPs appear as real user traffic
  • Easy to integrate with existing scraper code

Cons:

  • Adds latency – residential proxies are slower than data center IPs
  • Proxy networks can get blocked by sites enacting IP bans
  • Monthly costs can get high for large projects

I'd recommend proxies as the first line of defense in your CAPTCHA avoidance strategy. While not 100% foolproof, they are highly effective at avoiding challenges while maintaining scraping speed.

Option 4: Browser Automation to Avoid CAPTCHAs

Browser automation tools like Selenium and Puppeteer allow you to programmatically control a real browser such as Chrome or Firefox.

The advantage is that the target site just sees a regular browser acting as a normal user would. This makes CAPTCHA detection much less likely.

Here we'll focus on Selenium to demonstrate how browser automation avoids CAPTCHAs:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

# Navigate to page
driver.get('http://targetpage.com')

# Interact with elements
driver.find_element(By.ID, 'search').send_keys('Hello World')
driver.find_element(By.ID, 'submit').click()

# CAPTCHAs are less likely to appear
# than with plain requests or Scrapy

html = driver.page_source
driver.quit()

This launches a real Chrome browser and navigates to the site. The scraper can then interact naturally with page elements like a human user.

Key Benefits

  • A real browser better mimics organic traffic
  • Clicking, scrolling, and keyboard input appear natural
  • Enables JavaScript rendering

Downsides:

  • Slower page load times than pure HTTP requests
  • CAPTCHAs can still appear if the site is very sensitive

I'd recommend starting off with a headless browser before trying proxies or solving services. Browser automation has come a long way in mimicking human behavior.

For best results, be sure to implement organic wait times and interactions.
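
Below is a minimal sketch of what "organic" behavior can look like with Selenium – the element IDs and delay ranges are illustrative and would be tuned per site:

import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

def human_pause(low=1.0, high=3.5):
    # Random think-time between actions instead of firing them instantly
    time.sleep(random.uniform(low, high))

driver = webdriver.Chrome()
driver.get('http://targetpage.com')
human_pause()

# Scroll down gradually, the way a reader would
for _ in range(3):
    driver.execute_script('window.scrollBy(0, 400);')
    human_pause(0.5, 1.5)

# Type a query one character at a time rather than pasting it
search_box = driver.find_element(By.ID, 'search')  # illustrative element id
for char in 'hello world':
    search_box.send_keys(char)
    time.sleep(random.uniform(0.05, 0.2))

human_pause()
search_box.submit()
driver.quit()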

Option 5: Scrapers With Built-In Avoidance

There are now some scraper tools built specifically with CAPTCHA avoidance capabilities:

ScrapeOps

  • Uses proxy networks, browsers, and AI behavior
  • Also has integrated solving as backup
  • Pricing starts at $199/month

ParseHub

  • Template based web scraper
  • Has crawler engine designed to avoid detection
  • Plans start at $99/month

ProxyCrawl

  • Headless browsers + proxy network
  • Fingerprint randomization to avoid patterns
  • Starts at $49/month

These tools handle CAPTCHA avoidance for you automatically. Benefits are:

  • Avoid spending time building workarounds manually
  • Leverage advanced techniques like AI and fingerprint randomization

Downsides:

  • Monthly subscription fees can add up
  • Limited flexibility compared to custom solutions

I'd recommend purpose-built scrapers once you start scaling up data collection. The subscription costs are usually outweighed by the engineering time you'd otherwise spend building and maintaining your own CAPTCHA avoidance infrastructure.

Combining Multiple Methods for Best Results

For best results bypassing CAPTCHAs, I recommend combining multiple techniques like:

  • Proxies to prevent detection
  • Browser automation to simulate natural behavior
  • Solving APIs to handle challenges as needed

For example, you could route traffic through proxy networks and headless Chrome to avoid most CAPTCHAs, and as a failsafe, integrate the Anti-Captcha API for any challenge that still appears.
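
Here is a rough sketch of that layered flow. The proxy URL is a placeholder, solve_captcha() stands in for whichever solving API you integrate, and the requests.get() call is where the browser-automation layer from Option 4 would slot in:

import requests

PROXY = 'http://user:pass@proxy.example.com:8000'  # placeholder proxy endpoint
PROXIES = {'http': PROXY, 'https': PROXY}

def looks_like_captcha(html):
    # Crude detection based on common challenge markers
    markers = ['g-recaptcha', 'h-captcha', 'cf-challenge']
    return any(marker in html for marker in markers)

def solve_captcha(html, url):
    # Placeholder: extract the sitekey and call your solving service here
    raise NotImplementedError

def fetch(url):
    # Layer 1: masked request through the proxy network
    resp = requests.get(url, proxies=PROXIES, timeout=30)

    # Layer 2 failsafe: only pay for a solve when a challenge actually appears
    if looks_like_captcha(resp.text):
        token = solve_captcha(resp.text, url)
        # How the token is submitted back is site-specific; a form field
        # named 'g-recaptcha-response' is the most common pattern
        resp = requests.post(url, proxies=PROXIES,
                             data={'g-recaptcha-response': token}, timeout=30)
    return resp

page = fetch('https://targetpage.com')
print(page.status_code)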

This defense-in-depth approach provides layers of redundancy:

[Image: CAPTCHA bypass layers]

Some other tips for combining methods:

  • Use different proxy networks for registration vs. data collection
  • Enable browser automation just for CAPTCHA pages
  • Limit API calls for CAPTCHAs encountered sporadically

Take the time to understand your target sites' tolerance for scraping. A site with more basic challenges may be easily fooled with just proxies.

But highly sensitive sites like Facebook or Google will likely require multiple bypass techniques working in concert.

Bypassing CAPTCHAs: Conclusion

CAPTCHAs are one of the biggest roadblocks faced when building web scrapers. But with a variety of techniques available, they can be overcome:

  • Solving APIs provide reliable automation but can get expensive
  • Built-in solvers simplify the integration work
  • Proxies are highly effective for avoidance and prevent detection
  • Browser automation mimics human behavior making CAPTCHAs less likely
  • Scrapers with built-in avoidance handle the heavy lifting automatically

For best results, I recommend combining multiple methods like proxies, browsers, and solving services as needed.

The specific strategies will depend on your target sites' level of bot detection sensitivity. Easy targets may only require simple proxies or browsers.

High-security sites like Google and Cloudflare-protected properties can detect advanced bots. These will likely require multiple layers – proxies, browsers, and solver APIs – to reliably bypass CAPTCHAs at scale.

Hopefully this guide has provided both technical and strategic advice for defeating CAPTCHAs in your web scraping projects. I invite you to contact me if you have any other questions! I'm always happy to chat more about proxy services or my experiences in web data collection.
