Scrapy is one of the most popular Python packages for web scraping, known for its speed, extensibility, and robustness. However, Scrapy was primarily designed for scraping static websites and struggles with dynamic pages loaded via JavaScript.
To bridge this gap, the Scrapy community maintains Scrapy Playwright – a plugin that combines the power of Scrapy with Playwright, a high-level browser automation library. With Playwright under the hood, Scrapy can render full web pages, interact with JavaScript, fill out forms, click buttons – everything required to scrape modern websites.
In this comprehensive tutorial, we'll walk through the process of setting up a Scrapy Playwright spider from scratch and use it to extract data from a demo e-commerce website with heavy JavaScript usage.
Why Scrapy Playwright?
Before we dive into the code, let's go over the benefits of using Scrapy Playwright for web scraping:
- Headless browser automation – Playwright ships with headless Chromium, Firefox and WebKit browsers out of the box. No more configuring Selenium and browser drivers.
- Built-in device emulation – Easily mimic mobile devices with a single line of code using Playwright's built-in device descriptors.
- Reliable data extraction – Playwright can pull clean data directly from the browser by evaluating JavaScript on the rendered page. No more regex-parsing messy HTML.
- Familiar Puppeteer-style API – The Playwright API closely mirrors the Puppeteer API, making it easy to transition for those experienced with Puppeteer.
- Fast performance – Playwright drives browsers over a single persistent connection and auto-waits for elements, making it one of the faster browser automation libraries available today.
- Smooth debugging – Playwright offers an interactive inspector, video recordings, comprehensive traces and screenshots to resolve issues quickly.
- Cross-platform support – Playwright enables cross-browser testing and works on Windows, macOS, Linux and even Docker.
By combining Scrapy's existing capabilities like handling asynchronous requests and parsing HTML with Playwright's modern browser automation features, Scrapy Playwright unlocks the ability to scrape even the most complex JavaScript-heavy websites with ease.
Prerequisites
Before starting, make sure you have the following installed:
- Python 3.8+ (required by current scrapy-playwright releases)
- Scrapy 2.6+
- Playwright Python package
- A code editor like Visual Studio Code
You can install Scrapy and the scrapy-playwright plugin via pip, then download the browser binaries Playwright drives:
pip install scrapy scrapy-playwright
playwright install chromium
Let's now set up our scraping project!
Setting up a new Scrapy Playwright Project
We'll use the scrapy CLI to generate a new project:
scrapy startproject bookstore
This creates a bookstore folder with the following contents:
bookstore/
├── bookstore/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       └── __init__.py
└── scrapy.cfg
The key files we'll focus on are:
- settings.py – for configuring our scraper
- items.py – for defining scraped data structures
- spiders/ – where our spider code will reside
Defining Items
Items represent the scraped data we want to extract. Let's create a Book item with two fields – name and price:
# bookstore/items.py
import scrapy

class Book(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
Easy enough! The Field() objects declare which attributes our items carry; when we export the scraped items later, Scrapy's feed exports will serialize them to JSON for us.
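On books.toscrape.com the prices come back as strings like "£51.77". As a hedged sketch, an item pipeline might normalize them into floats before export; the parse_price helper below is my own illustration, not part of Scrapy:

```python
def parse_price(raw: str) -> float:
    """Strip surrounding whitespace and a leading currency
    symbol from a scraped price string, returning a float."""
    return float(raw.strip().lstrip("£$€"))

print(parse_price("£51.77"))  # 51.77
```

A real pipeline would call a helper like this in process_item before the item reaches the feed export.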
Creating the Spider
The spider contains the logic to scrape the website. Let's generate a starter one:
cd bookstore
scrapy genspider booksScraper books.toscrape.com
This creates a booksScraper.py file under /spiders with a barebones spider class:
# bookstore/spiders/booksScraper.py
import scrapy

class BooksScraperSpider(scrapy.Spider):
    name = 'booksScraper'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
We'll fill out the parse method shortly to scrape the data. But first, we need to integrate Playwright.
Integrating Playwright with Scrapy
To use Playwright with Scrapy, we have to make a few tweaks to settings.py:
# bookstore/settings.py
BOT_NAME = 'bookstore'

SPIDER_MODULES = ['bookstore.spiders']
NEWSPIDER_MODULE = 'bookstore.spiders'

# Scrapy settings for bookstore project
...

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}

# scrapy-playwright requires the asyncio-based Twisted reactor
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True
}

PLAYWRIGHT_BROWSER_TYPE = 'chromium'
This instructs Scrapy to use Playwright for downloading pages. We launch Playwright in headless mode for faster performance.
With the integrations set up, let's start scraping!
Scraping Page Data
Here's the code to scrape book names and prices from the home page:
import scrapy

from ..items import Book

class BooksScraperSpider(scrapy.Spider):
    name = 'booksScraper'

    def start_requests(self):
        yield scrapy.Request(
            url='http://books.toscrape.com',
            meta={
                'playwright': True,
                'playwright_include_page': True,
            },
        )

    async def parse(self, response):
        page = response.meta['playwright_page']
        books = await page.query_selector_all('.product_pod')

        for book in books:
            # The selectors are relative to each .product_pod element
            name = await book.query_selector('h3 a')
            name = await name.inner_text()
            price = await book.query_selector('.product_price .price_color')
            price = await price.inner_text()
            yield Book(name=name, price=price)

        await page.close()
Let's break down what's happening:
- We make a request to the start URL and enable Playwright with the meta parameters.
- Inside parse(), we grab the Playwright Page instance from response.meta.
- We use CSS selectors and Playwright's query_selector_all method to find all books.
- For each book, we extract the name and price values using Playwright's DOM extraction methods like inner_text().
- We populate the Book item with the scraped data.
- The book is yielded and sent to Scrapy's item pipeline.
- We close the page after scraping is done.
And that's it! Let's run the spider:
scrapy crawl booksScraper -o books.json
This scrapes the page and stores the output in books.json. We should see a file with all the scraped books!
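The feed export is ordinary JSON, so any script can consume it. A minimal sketch of post-processing the output (the sample records below mimic the feed's shape; the "cheapest book" logic is my own illustration):

```python
import json

# A couple of records shaped like Scrapy's JSON feed output.
feed = json.loads('''[
    {"name": "A Light in the Attic", "price": "£51.77"},
    {"name": "Tipping the Velvet", "price": "£53.74"}
]''')

# Find the cheapest book by stripping the currency symbol.
cheapest = min(feed, key=lambda b: float(b["price"].lstrip("£")))
print(cheapest["name"])  # A Light in the Attic
```

In practice you would replace the inline JSON with json.load(open("books.json")).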
Scraping Across Pages
While we can scrape one page, most sites have data split across multiple pages. To scrape them all, we have to:
- Check if a next page button exists
- If so, extract the next page URL
- Yield a request to recursively scrape the next page
Here is how to add pagination support:
async def parse(self, response):
    ...
    # Scraping logic

    next_page = response.css('.next a::attr(href)').get()

    if next_page:
        yield scrapy.Request(
            response.urljoin(next_page),
            meta={'playwright': True}
        )
We find the next page button, extract its URL, and yield a new request to scrape it. This continues until no more next pages remain.
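response.urljoin matters here because the next-page link is relative (e.g. catalogue/page-2.html on this site). It resolves links the same way the standard library's urljoin does, which we can sketch directly:

```python
from urllib.parse import urljoin

# Relative links are resolved against the URL of the page they appear on.
print(urljoin("http://books.toscrape.com/", "catalogue/page-2.html"))
# http://books.toscrape.com/catalogue/page-2.html

# From a catalogue page, a sibling link replaces the last path segment.
print(urljoin("http://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
# http://books.toscrape.com/catalogue/page-3.html
```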
Filling Out Forms
A key advantage of Playwright is interacting with page elements, such as filling out forms. Let's try it out by scraping book data after applying filters.
First, we'll click a category link in the sidebar to filter the books. To do this cleanly, we can pass a list of PageMethod objects (imported from scrapy_playwright.page) in the request meta:
from scrapy_playwright.page import PageMethod

class BooksScraperSpider(scrapy.Spider):
    name = 'booksScraper'

    def start_requests(self):
        yield scrapy.Request(
            url='http://books.toscrape.com',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_methods': [
                    PageMethod('click', '.side_categories li a'),
                    PageMethod('wait_for_selector', '.product_pod'),
                ],
            },
        )
The methods instruct Playwright to:
- Click the first matching category link in the sidebar
- Wait for the filtered products to load
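Under the hood, scrapy-playwright replays each PageMethod against the rendered page before handing the response back to the spider. Here is a toy sketch of that mechanism, with a fake Page standing in for the real browser (FakePage, apply_page_methods, and the minimal PageMethod class are my own stand-ins, not the library's internals):

```python
import asyncio

class PageMethod:
    """Toy stand-in for scrapy_playwright.page.PageMethod:
    it just records a method name plus its arguments."""
    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs

class FakePage:
    """Stand-in for a Playwright Page that records the calls it receives."""
    def __init__(self):
        self.calls = []

    async def click(self, selector):
        self.calls.append(("click", selector))

    async def wait_for_selector(self, selector):
        self.calls.append(("wait_for_selector", selector))

async def apply_page_methods(page, methods):
    # Replay each PageMethod by looking up the named coroutine
    # on the page object and awaiting it with the stored arguments.
    for pm in methods:
        await getattr(page, pm.method)(*pm.args, **pm.kwargs)

page = FakePage()
asyncio.run(apply_page_methods(page, [
    PageMethod("click", ".side_categories li a"),
    PageMethod("wait_for_selector", ".product_pod"),
]))
print(page.calls)
```

The real plugin does considerably more (error handling, screenshots, return values), but the dispatch idea is the same.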
Next, let's fill out a search form. (books.toscrape.com itself has no search box, so treat the selectors below as illustrative.) We can access the Page in parse:
async def parse(self, response):
    page = response.meta['playwright_page']

    await page.fill('#id_q', 'History')
    await page.click('#submit')
    await page.wait_for_load_state()

    # Rest of parsing logic
This fills the search input and submits the form. Note that page.click only waits for the element to become actionable, not for the next page, so follow it with an explicit wait such as page.wait_for_load_state() before scraping the results.
And that's it! With a few extra lines, we can now scrape data after interacting with site elements.
Additional Tips
Here are some additional tips when using Scrapy Playwright:
Set up a proxy – To reduce the chance of IP blocks, route requests through rotating proxies. For example, with the third-party scrapy-proxy-pool package installed, enable it in settings.py:
PROXY_POOL_ENABLED = True
Fail fast on slow pages – Lower DOWNLOAD_TIMEOUT to give up quickly on pages that hang, instead of waiting out Scrapy's default of 180 seconds:
DOWNLOAD_TIMEOUT = 10
Retry on failure – Set RETRY_TIMES
to re-request pages with errors:
RETRY_TIMES = 10
Simulate mobile – Pass playwright_context_kwargs in the request meta (applied when the request creates a new browser context) to emulate a mobile viewport:
meta={
    'playwright_context_kwargs': {
        'viewport': {
            'width': 360,
            'height': 640,
        },
    },
}
Debug interactively – Launch a headed browser in settings.py to watch the scrape, or run the crawl with the PWDEBUG=1 environment variable to open the Playwright Inspector:
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': False
}
Conclusion
And we're done! We walked through a complete Scrapy Playwright scraping workflow – from installation to setup, request handling, and form filling using real browser automation.
Some key takeaways:
- Scrapy Playwright lets you scrape dynamic JavaScript sites with the power of Playwright and Scrapy
- APIs like page.query_selector() make extracting data from rendered pages easy
- Methods like page.click() enable interacting with buttons, forms and more
- Pagination and crawling multiple pages is seamless
- Options like proxies and device emulation boost efficiency
- Interactive debugging provides control over the scraping process
Scrapy Playwright is a game-changer for anyone struggling to scrape modern web applications laden with JavaScript. I hope you found this hands-on tutorial useful! Let me know in the comments if you have any other topics you would like covered.
Happy scraping!