How to Build a Web Scraper: The Complete 2021 Guide

Hi there!

Web scraping is one of the most useful and exciting skills you can add to your toolkit as an analyst, developer or data scientist.

In this comprehensive 3000+ word guide, I'll teach you everything you need to know to become a pro at scraping data from websites.

By the end, you'll have all the knowledge needed to build robust scrapers that can extract huge amounts of useful data from the modern web.

Let's get started!

Why You Should Learn Web Scraping

First, what exactly is web scraping?

Web scraping is the process of automatically collecting data from websites using bots instead of manually copying.

Some examples of web scraping include:

  • Extracting product data like prices, descriptions, reviews from ecommerce stores
  • Compiling lists of email addresses or phone numbers from business directories
  • Downloading sales transaction records from internal company tools
  • Archiving social media posts, news articles or blog content for analysis
  • Gathering data from public government sites and documents

As you can see, almost any public website can serve as a data source for scraping.

According to recent surveys, over 50% of data professionals use web scraping as part of their regular data collection workflow. The demand for scraping skills continues to increase year over year.

Why is it so popular?

Here are some of the main reasons you should add web scraping to your skillset:

  • Save time by automating repetitive data collection tasks. Scraping can be thousands of times faster than manual copying and pasting.

  • Access more data than what is feasible via manual browsing or through site exports and APIs.

  • Gain business insights from previously untapped public data sources like competitor sites, review platforms and forums.

  • Conduct large scale research by gathering datasets that would be impossible through other means.

  • Monitor data changes over time by re-scraping sites to check for differences.

  • Increase productivity – let scrapers do the grunt work while you focus on higher value analysis and modeling.

Simply put, web scraping provides access to a wealth of web data that would otherwise be very difficult or impossible to obtain.

Let's look at the key steps involved in building a scraper.

Web Scraping in 4 Simple Steps

The web scraping process can be broken down into 4 main stages:

Step 1: Download the Page

The first step is to download the HTML code of the page you want to scrape.

We use Python's Requests library to handle our HTTP requests and receive the page content.

import requests

url = 'http://example.com'
response = requests.get(url)
html = response.text

By inspecting the response object, we can check the status code to confirm whether the page loaded properly.

We can also print the html to peek at the raw HTML source code.
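For example, a quick sanity check before parsing (raise_for_status() is a Requests helper that raises an HTTPError for failed responses):

print(response.status_code)   # 200 means the page loaded properly
response.raise_for_status()   # raises an HTTPError for 4xx/5xx responses
print(html[:500])             # peek at the first 500 characters of raw HTML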

Step 2: Parse the HTML

Next, we need to parse this raw HTML into a structure we can work with in Python.

The Beautiful Soup library handles parsing HTML and XML documents.

We can use it to analyze and search the document.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Extract the page title
print(soup.title.text)

# Find all anchor tags
anchors = soup.find_all('a')

Beautiful Soup transforms the HTML into a nested data structure of Python objects representing each tag and content element.

There are many ways we can now query and traverse this structure to extract the data we need.

Step 3: Extract Data

This is the heart of web scraping – using Beautiful Soup to identify and extract the specific data points we want.

We can target elements by CSS class, id, tag name or any other attributes.

# Extract text from paragraphs
for p in soup.find_all('p'):
  print(p.text)

# Get anchor href links
for a in anchors:
  print(a['href'])

The key is to inspect the source code to understand how the target data is identified in the HTML structure.

With some knowledge of CSS selectors, we can zero in on the required data points and extract them.
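Beautiful Soup also supports CSS selectors directly through select() and select_one(); here is a short sketch (the class and id names are made-up examples, not from any real site):

# Grab every link inside a hypothetical article container
for link in soup.select('div.article a[href]'):
  print(link['href'])

# select_one() returns the first match, or None if nothing matches
price = soup.select_one('#product .price')
if price is not None:
  print(price.text.strip())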

Step 4: Store the Data

Now that we've collected the scraped data, we can store it in any number of formats like JSON, CSV, Excel, Google Sheets or a database.

For our example, let's output a CSV file:

import csv

headers = ['Name', 'Description', 'Price']
products = []  # List of dicts keyed by the header names

with open('products.csv', 'w', newline='') as f:
  writer = csv.DictWriter(f, fieldnames=headers)
  writer.writeheader()
  writer.writerows(products)
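If you prefer JSON, the same list of dicts can be dumped with the standard library; a minimal sketch:

import json

with open('products.json', 'w') as f:
  json.dump(products, f, indent=2)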

And that's really all there is to the basics of web scraping!

Now let's look at some common real-world applications.

Web Scraping Use Cases and Examples

Web scrapers can extract all kinds of data from the modern web. Let's go through some examples of scraping websites for different purposes:

Scraping Ecommerce Product Data

A very common scraper application is collecting product data from ecommerce stores.

The scraped information can include:

  • Product names, descriptions and images
  • Pricing and availability
  • Technical specifications like size, color, etc.
  • Customer questions and answers
  • Ratings and reviews

This data is invaluable for:

  • Competitor price monitoring and dynamic pricing
  • Building product catalog and search databases
  • Market research and analysis
  • Supply chain and inventory management

For example, here is a simple scraper to extract name, description and pricing from a sample product page:

# Get product name
name = soup.find('h1', id='product-name').text

# Get product description
description = soup.find('div', class_='product-description').text

# Get product price
price = soup.find('span', class_='price').text.replace('$', '')

Scaling up, you can build scrapers that extract thousands of products across ecommerce sites like Amazon, eBay, Walmart, etc.

According to estimates, over 60% of all web scraping is aimed at gathering ecommerce data.

Scraping Business Directory Information

Another great application of web scraping is building business databases.

You can write scripts to visit sites like Yelp, Yellowpages, Google Maps and compile business listings including:

  • Name
  • Address
  • Phone number
  • Email
  • Website
  • Opening hours
  • Descriptions and services

This business information can then be used for lead generation, sales prospecting, research or data enrichment.

Scraping steps would involve:

  • Fetching search result pages across the site
  • Extracting listing data from each search result
  • Concatenating all extracted info into your database

With a scripted scraper, you could build a database of tens of thousands of businesses in short order.
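A rough sketch of that flow is below. The directory URL, query parameters and CSS classes are all hypothetical placeholders; a real scraper would use whatever structure the target site actually exposes:

import csv
import requests
from bs4 import BeautifulSoup

listings = []

# Walk the first five search result pages of a hypothetical directory
for page in range(1, 6):
  resp = requests.get(f'https://directory.example.com/search?q=plumbers&page={page}')
  soup = BeautifulSoup(resp.text, 'html.parser')

  # Each listing card holds the business details we want
  for card in soup.select('div.listing'):
    listings.append({
      'Name': card.select_one('h2.name').text.strip(),
      'Phone': card.select_one('span.phone').text.strip(),
      'Address': card.select_one('p.address').text.strip(),
    })

# Concatenate everything into a CSV "database"
with open('businesses.csv', 'w', newline='') as f:
  writer = csv.DictWriter(f, fieldnames=['Name', 'Phone', 'Address'])
  writer.writeheader()
  writer.writerows(listings)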

Scraping Social Media Feeds

Social networks like Twitter, Instagram and Facebook are a goldmine of data for analysts and researchers.

But their APIs impose usage limits that make it difficult to access all the available data.

This is where scrapers come in. By scraping the content directly from the front-end web pages, you can collect:

  • All posts including images, videos and captions
  • Related comments
  • User info like followers and engagement
  • Hashtags, mentions and links

Researchers have used scrapers to archive event data and misinformation spreading across social networks.

Marketers also analyze scraped social content for interest trends and influencer identification.

Building social scrapers introduces challenges like handling logins, infinite scrolls, dynamic content and avoiding blocks. But the data payoff makes it worth the effort.

Scraping Real Estate Listings

If you've ever searched for apartments or homes online, you know how tedious it is to dig through all the listings.

With a scraper, you can programmatically extract details on thousands of properties across real estate sites like:

  • Zillow
  • Realtor
  • Trulia
  • Craigslist

Your script would grab data like:

  • Property type, bedrooms, bathrooms, size
  • Location, zip code
  • Price, taxes, amenities
  • Photos
  • Agent contact info

This structured real estate data powers applications like:

  • Building pricing models
  • Location analysis
  • Property investment research
  • Real estate apps and search

In fact, many real estate apps and sites source their underlying data from web scrapers!

There are so many more examples where scrapers provide access to useful data – weather forecasts, sports stats, academic research and more!

Now that you have a sense for why and how web scraping is used, let's go deeper into techniques for building robust scrapers.

Advanced Web Scraping Techniques

While the four basic steps are enough to get started, here are some advanced skills that will take your web scraping to the next level.

Handling Javascript Sites

A significant challenge today is that many sites rely heavily on JavaScript to render content.

Since our basic scraper only downloads HTML, any page content created dynamically by JavaScript will be missing.

There are a couple of ways to scrape JavaScript-generated data:

Browser Automation

We can automate an actual browser like Chrome using Selenium. This executes the JavaScript and allows interacting with the fully rendered page.

from selenium import webdriver

browser = webdriver.Chrome()
browser.get(url)

soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()

The downside is browser automation is slower than basic requests.

Headless Browsers

For faster JavaScript rendering, we can drive a headless browser with automation tools like Puppeteer and Playwright.

These run Chromium in the background without ever opening a visible window.
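Selenium can also run Chrome in headless mode; a minimal sketch, assuming chromedriver is installed and url is defined as before:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without a visible window

browser = webdriver.Chrome(options=options)
browser.get(url)
html = browser.page_source
browser.quit()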

API Reverse Engineering

Often JavaScript fetches data by calling APIs. We can intercept these calls and scrape the API directly.

This results in the fastest scraping but requires digging into the site's client-side code.
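For example, if the browser's Network tab shows the page loading data from a JSON endpoint, you can often call that endpoint directly with Requests. The URL and field names below are hypothetical:

import requests

resp = requests.get('https://example.com/api/products?page=1')
data = resp.json()

for item in data['products']:
  print(item['name'], item['price'])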

Scraping Through API Interfaces

More and more sites are locking down their front-end content and restricting scraping.

The best way around this is directly accessing their backend API.

APIs return data in consistent structured formats like JSON or XML instead of arbitrary HTML layouts.

Scraping APIs often lets you avoid dealing with sessions, cookies, blocks and other frontend restrictions.

The challenge is discovering and understanding how to call the APIs correctly. This may require:

  • Network traffic analysis to identify endpoints
  • Reversing API authentication schemes
  • Deducing API parameters and payloads

Platforms like Twitter, Facebook and YouTube provide official APIs with robust documentation.

For others, you'll have to sleuth through conference talks and support forums, and test different access patterns yourself.

But once cracked, APIs provide the most efficient scraping with the richest data.

Scraping Across IP Proxies

A common pitfall in scraping is getting your client IP address blocked by target sites.

This can happen after just a few dozen requests if you lack proper throttling.

Using proxies is crucial to distribute requests across multiple IPs and avoid blocks.

Residential proxies that rotate across thousands of IP addresses work best for scraping. Datacenter proxies are more easily recognized and blocked.

Tools like Scrapy natively integrate with proxies to route requests through them.
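At the level of a single Requests call, routing through a proxy looks like this (the proxy host, port and credentials are placeholders for whatever your provider gives you):

import requests

proxies = {
  'http': 'http://user:pass@proxy.example.com:8000',
  'https': 'http://user:pass@proxy.example.com:8000',
}

response = requests.get(url, proxies=proxies)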

With properly managed proxies, you can scrape at high volumes with far less risk of tripping blocks.

Asynchronous Scraping

Standard Requests calls are synchronous, which means they block execution until the response returns.

This wastes time as the script idles waiting for downloads to complete.

Asynchronous scraping with async libraries like Trio, asyncio and aiohttp speeds up the process by handling requests concurrently.

For IO-bound work like downloading webpages, async scraping delivers huge performance gains through concurrency.

And it's more efficient because it avoids idle blocking.
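A minimal sketch with asyncio and aiohttp (the URL list is a placeholder):

import asyncio
import aiohttp

urls = ['http://example.com/page1', 'http://example.com/page2']

async def fetch(session, url):
  async with session.get(url) as response:
    return await response.text()

async def main():
  async with aiohttp.ClientSession() as session:
    # Fire off all requests concurrently and wait for them together
    pages = await asyncio.gather(*(fetch(session, url) for url in urls))
    print(len(pages), 'pages downloaded')

asyncio.run(main())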

Automated Browser Scrolling

Modern web apps use infinite scroll to lazy load content as the user reaches the bottom.

This means a simple page download won‘t capture all the available data.

Browser automation tools like Selenium allow programmatically scrolling down to dynamically load content.

We can keep scrolling until no new items are returned, extracting data along the way.

Smart waits and sleeps are necessary to avoid freezing the browser or hitting timeouts.
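With Selenium, a common pattern is to scroll, pause for new content, then stop once the page height stops growing; a rough sketch, assuming a browser driver is already open as in the earlier example:

import time

last_height = browser.execute_script('return document.body.scrollHeight')

while True:
  browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
  time.sleep(2)  # give lazy-loaded content time to appear

  new_height = browser.execute_script('return document.body.scrollHeight')
  if new_height == last_height:
    break  # no new content loaded, we have reached the end
  last_height = new_height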

Scraper Monitoring and Debugging

Like any application, scrapers inevitably break over time as sites change their frontend code.

It's critical to have monitoring and logging to detect and debug issues early.

Scraping frameworks like Scrapy have built-in support for logging key events.

You should log details like HTTP responses, failures, items scraped and errors.
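For a hand-rolled scraper, the standard library's logging module covers the basics; a minimal sketch, assuming url is the page being fetched:

import logging
import requests

logging.basicConfig(
  filename='scraper.log',
  level=logging.INFO,
  format='%(asctime)s %(levelname)s %(message)s',
)

response = requests.get(url)
logging.info('GET %s -> %s', url, response.status_code)

if response.status_code != 200:
  logging.error('Failed to fetch %s', url)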

Integrate monitoring tools like Sentry to aggregate logs and alert on anomalies.

Use tools like browser DevTools to manually test selectors and identify problems.

With logging and monitoring, you can fix scrapers before failures impact business needs.

Now that we've covered some advanced techniques, let's review best practices for responsible web scraping.

Scraping Best Practices and Responsibilities

As the amount of data you scrape grows, make sure to keep these good practices in mind:

Check Robots.txt Restrictions

The robots.txt file communicates site owner instructions for scrapers and bots.

Always review this file on a domain before scraping and respect any restrictions.
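Python's standard library can parse robots.txt rules for you; a small sketch:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', 'http://example.com/some-page'):
  print('Allowed to scrape this page')
else:
  print('Disallowed by robots.txt')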

Limit Scraper Frequency

Don't overload sites with an excessive number of concurrent requests.

Use random delays of 1-10+ seconds between requests to distribute load.
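In code, that can be as simple as sleeping a random interval between requests; a sketch, where urls stands for whatever list of pages you are working through:

import random
import time
import requests

for url in urls:
  response = requests.get(url)
  # ... parse and extract here ...
  time.sleep(random.uniform(1, 10))  # polite random delay between requests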

Cache Downloaded Pages

Store already accessed pages in a cache to avoid redownloading duplicate content.
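One option is the third-party requests-cache package, which transparently caches Requests responses on disk; a sketch (the cache name and expiry are arbitrary choices):

import requests
import requests_cache

# Store responses in a local SQLite cache and reuse them for 24 hours
requests_cache.install_cache('scrape_cache', expire_after=86400)

response = requests.get(url)  # served from cache if fetched recently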

Use Random User Agents

Rotate random desktop user agents with each request to appear more human.
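A simple way to do this with Requests (the user agent strings below are truncated placeholders; use full, current strings in practice):

import random
import requests

user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)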

Handle CAPTCHAs

Use services like Anti-Captcha to solve CAPTCHAs programmatically. Or manually solve them as a last resort.

Respect Opt-Outs

Stop scraping any site that explicitly communicates they don‘t allow it.

Credit Sources

When publishing scraped data, make sure to credit the sources properly.

Consult Legal Counsel

If ever in doubt on gray areas, seek professional legal advice.

Following these web scraping best practices will keep your scrapers effective while avoiding problems down the road.

Now let's examine options for taking your web scraping skills to the next level with managed platforms and services.

Cloud Web Scraping Services

As your web scraping needs grow in scale, you will likely outgrow scraping from your own machine.

Cloud scraping services provide the tools and infrastructure for enterprise-grade data collection.

Here are some benefits of using managed scraping platforms:

  • No infrastructure to manage – They handle servers, proxies, browsers etc.

  • Scalable scraping capacity – Easily scale to thousands of concurrent requests.

  • Reliable uptime – Platforms have high availability with built-in redundancy.

  • Managed proxies – Rotating proxies to prevent blocks and maximize throughput.

  • Scraper monitoring – Alerting and analytics to track scraper health.

  • Data pipelines – Integrations to send scraped data to databases and apps.

  • Web scraper API – For direct access without any code.

  • Scraping expertise – Leverage platform best practices and learnings.

Simply put, cloud scraping services enable extracting web data at massive scale while you focus on using the data.

Some popular platforms to check out include ScraperAPI, Octoparse, Scrapy Cloud, Mozenda, ParseHub and Import.io among others.

For truly mission-critical data needs, leveraging a purpose-built scraping platform is highly recommended.

Okay, we've covered a ton of ground on all aspects of web scraping. Let's recap the key takeaways:

Scraping Recap and Key Takeaways

  • Web scraping provides access to vast amounts of public web data.
  • Using Python libraries like Requests and Beautiful Soup makes scraping easy.
  • Follow the four steps for any web page: download, parse, extract, store.
  • Scrape intelligently across domains and IP addresses to avoid blocks.
  • Monitor your scrapers closely with logging and analytics.
  • Scale up scraping capacity safely with cloud platforms.
  • Scraping APIs and frameworks like Scrapy provide robust tools.
  • Always check robots.txt and scrape ethically.

We've packed this guide with everything you need to become a pro at scraping the modern web.

You're now ready to start building scrapers to extract all kinds of useful data across websites.

The world of web scraping will only get more complex as sites employ advanced anti-bot measures.

But with diligence and creativity, you can overcome any obstacle to pull the data you need.

Happy scraping!
