Scrapy is one of the most popular Python packages for web scraping, known for its speed, extensibility, and robustness. However, Scrapy was primarily designed for scraping static websites and struggles with dynamic pages loaded via JavaScript.
To bridge this gap, the Scrapy community maintains Scrapy Playwright – a plugin that combines the power of Scrapy with Playwright, a high-level browser automation library. With Playwright under the hood, Scrapy can render full web pages, interact with JavaScript, fill out forms, click buttons – everything required to scrape modern websites.
In this comprehensive tutorial, we'll walk through the process of setting up a Scrapy Playwright spider from scratch and use it to extract data from a demo e-commerce website with heavy JavaScript usage.
Why Scrapy Playwright?
Before we dive into the code, let's go over the benefits of using Scrapy Playwright for web scraping:
- Headless browser automation – Playwright ships with headless Chromium, Firefox and WebKit browsers out of the box. No more configuring Selenium and browser drivers.
- Built-in device emulation – Easily mimic mobile devices with a single line of code using Playwright's built-in device descriptors.
- Reliable data extraction – Playwright can pull clean data directly from the browser by evaluating JavaScript on the rendered page. No more regex-parsing messy HTML.
- Familiar Puppeteer-style API – The Playwright API closely mirrors the Puppeteer API, making it easy to transition for those experienced with Puppeteer.
- Fast performance – Playwright drives browsers over a single persistent connection and auto-waits for elements, making it one of the faster browser automation libraries available today.
- Smooth debugging – Playwright offers an interactive inspector, video recordings, comprehensive traces and screenshots to resolve issues quickly.
- Cross-platform support – Playwright enables cross-browser testing and works on Windows, macOS, Linux and even Docker.
By combining Scrapy's existing capabilities like handling asynchronous requests and parsing HTML with Playwright's modern browser automation features, Scrapy Playwright unlocks the ability to scrape even the most complex JavaScript-heavy websites with ease.
Prerequisites
Before starting, make sure you have the following installed:
- Python 3.8+ (required by current scrapy-playwright releases)
- Scrapy 2.6+
- Playwright Python package
- A code editor like Visual Studio Code
You can install Scrapy and the scrapy-playwright plugin via pip, then download the browser binaries Playwright drives:
pip install scrapy scrapy-playwright
playwright install chromium
Let's now set up our scraping project!
Setting up a new Scrapy Playwright Project
We'll use the scrapy CLI to generate a new project:
scrapy startproject bookstore
This creates a bookstore folder with the following contents:
bookstore/
├── bookstore/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders/
│       └── __init__.py
└── scrapy.cfg
The key files we'll focus on are:
- settings.py – for configuring our scraper
- items.py – for defining scraped data structures
- spiders/ – where our spider code will reside
Defining Items
Items represent the scraped data we want to extract. Let's create a Book item with two fields – name and price:
# bookstore/items.py
import scrapy

class Book(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
Easy enough! The Field() objects declare which attributes our items carry; when we export the scraped items later, Scrapy's feed exports will serialize them to JSON for us.
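On books.toscrape.com the prices come back as strings like "£51.77". As a hedged sketch, an item pipeline might normalize them into floats before export; the parse_price helper below is my own illustration, not part of Scrapy:

```python
def parse_price(raw: str) -> float:
    """Strip surrounding whitespace and a leading currency
    symbol from a scraped price string, returning a float."""
    return float(raw.strip().lstrip("£$€"))

print(parse_price("£51.77"))  # 51.77
```

A real pipeline would call a helper like this in process_item before the item reaches the feed export.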
Creating the Spider
The spider contains the logic to scrape the website. Let's generate a starter one:
cd bookstore
scrapy genspider booksScraper books.toscrape.com
This creates a booksScraper.py file under /spiders with a barebones spider class:
# bookstore/spiders/booksScraper.py
import scrapy

class BooksScraperSpider(scrapy.Spider):
    name = 'booksScraper'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
We'll fill out the parse method shortly to scrape the data. But first, we need to integrate Playwright.
Integrating Playwright with Scrapy
To use Playwright with Scrapy, we have to make a few tweaks to settings.py:
# bookstore/settings.py
BOT_NAME = 'bookstore'

SPIDER_MODULES = ['bookstore.spiders']
NEWSPIDER_MODULE = 'bookstore.spiders'

# Scrapy settings for bookstore project
...

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}

# scrapy-playwright requires the asyncio-based Twisted reactor
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True
}

PLAYWRIGHT_BROWSER_TYPE = 'chromium'
This instructs Scrapy to use Playwright for downloading pages. We launch Playwright in headless mode for faster performance.
With the integrations set up, let's start scraping!
Scraping Page Data
Here's the code to scrape book names and prices from the home page:
import scrapy

from ..items import Book

class BooksScraperSpider(scrapy.Spider):
    name = 'booksScraper'

    def start_requests(self):
        yield scrapy.Request(
            url='http://books.toscrape.com',
            meta={
                'playwright': True,
                'playwright_include_page': True,
            },
        )

    async def parse(self, response):
        page = response.meta['playwright_page']
        books = await page.query_selector_all('.product_pod')

        for book in books:
            # The selectors are relative to each .product_pod element
            name = await book.query_selector('h3 a')
            name = await name.inner_text()
            price = await book.query_selector('.product_price .price_color')
            price = await price.inner_text()
            yield Book(name=name, price=price)

        await page.close()
Let's break down what's happening:
- We make a request to the start URL and enable Playwright with the meta parameters.
- Inside parse(), we grab the Playwright Page instance from response.meta.
- We use CSS selectors and Playwright's query_selector_all method to find all books.
- For each book, we extract the name and price values using Playwright's DOM extraction methods like inner_text().
- We populate the Book item with the scraped data.
- The book is yielded and sent to Scrapy's item pipeline.
- We close the page after scraping is done.
And that's it! Let's run the spider:
scrapy crawl booksScraper -o books.json
This scrapes the page and stores the output in books.json. We should see a file with all the scraped books!
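The feed export is ordinary JSON, so any script can consume it. A minimal sketch of post-processing the output (the sample records below mimic the feed's shape; the "cheapest book" logic is my own illustration):

```python
import json

# A couple of records shaped like Scrapy's JSON feed output.
feed = json.loads('''[
    {"name": "A Light in the Attic", "price": "£51.77"},
    {"name": "Tipping the Velvet", "price": "£53.74"}
]''')

# Find the cheapest book by stripping the currency symbol.
cheapest = min(feed, key=lambda b: float(b["price"].lstrip("£")))
print(cheapest["name"])  # A Light in the Attic
```

In practice you would replace the inline JSON with json.load(open("books.json")).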
Scraping Across Pages
While we can scrape one page, most sites have data split across multiple pages. To scrape them all, we have to:
- Check if a next page button exists
- If so, extract the next page URL
- Yield a request to recursively scrape the next page
Here is how to add pagination support:
async def parse(self, response):
    ...
    # Scraping logic

    next_page = response.css('.next a::attr(href)').get()

    if next_page:
        yield scrapy.Request(
            response.urljoin(next_page),
            meta={'playwright': True}
        )
We find the next page button, extract its URL, and yield a new request to scrape it. This continues until no more next pages remain.
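response.urljoin matters here because the next-page link is relative (e.g. catalogue/page-2.html on this site). It resolves links the same way the standard library's urljoin does, which we can sketch directly:

```python
from urllib.parse import urljoin

# Relative links are resolved against the URL of the page they appear on.
print(urljoin("http://books.toscrape.com/", "catalogue/page-2.html"))
# http://books.toscrape.com/catalogue/page-2.html

# From a catalogue page, a sibling link replaces the last path segment.
print(urljoin("http://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
# http://books.toscrape.com/catalogue/page-3.html
```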
Filling Out Forms
A key advantage of Playwright is interacting with page elements, such as filling out forms. Let's try it out by scraping book data after applying filters.
First, we'll click a category link in the sidebar to filter the books. To do this cleanly, we can pass a list of PageMethod objects (imported from scrapy_playwright.page) in the request meta:
from scrapy_playwright.page import PageMethod

class BooksScraperSpider(scrapy.Spider):
    name = 'booksScraper'

    def start_requests(self):
        yield scrapy.Request(
            url='http://books.toscrape.com',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_methods': [
                    PageMethod('click', '.side_categories li a'),
                    PageMethod('wait_for_selector', '.product_pod'),
                ],
            },
        )
The methods instruct Playwright to:
- Click the first matching category link in the sidebar
- Wait for the filtered products to load
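Under the hood, scrapy-playwright replays each PageMethod against the rendered page before handing the response back to the spider. Here is a toy sketch of that mechanism, with a fake Page standing in for the real browser (FakePage, apply_page_methods, and the minimal PageMethod class are my own stand-ins, not the library's internals):

```python
import asyncio

class PageMethod:
    """Toy stand-in for scrapy_playwright.page.PageMethod:
    it just records a method name plus its arguments."""
    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs

class FakePage:
    """Stand-in for a Playwright Page that records the calls it receives."""
    def __init__(self):
        self.calls = []

    async def click(self, selector):
        self.calls.append(("click", selector))

    async def wait_for_selector(self, selector):
        self.calls.append(("wait_for_selector", selector))

async def apply_page_methods(page, methods):
    # Replay each PageMethod by looking up the named coroutine
    # on the page object and awaiting it with the stored arguments.
    for pm in methods:
        await getattr(page, pm.method)(*pm.args, **pm.kwargs)

page = FakePage()
asyncio.run(apply_page_methods(page, [
    PageMethod("click", ".side_categories li a"),
    PageMethod("wait_for_selector", ".product_pod"),
]))
print(page.calls)
```

The real plugin does considerably more (error handling, screenshots, return values), but the dispatch idea is the same.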
Next, let's fill out a search form. (books.toscrape.com itself has no search box, so treat the selectors below as illustrative.) We can access the Page in parse:
async def parse(self, response):
    page = response.meta['playwright_page']

    await page.fill('#id_q', 'History')
    await page.click('#submit')
    await page.wait_for_load_state()

    # Rest of parsing logic
This fills the search input and submits the form. Note that page.click only waits for the element to become actionable, not for the next page, so follow it with an explicit wait such as page.wait_for_load_state() before scraping the results.
And that's it! With a few extra lines, we can now scrape data after interacting with site elements.
Additional Tips
Here are some additional tips when using Scrapy Playwright:
Set up a proxy – To reduce the chance of IP blocks, route requests through rotating proxies. For example, with the third-party scrapy-proxy-pool package installed, enable it in settings.py:
PROXY_POOL_ENABLED = True
Fail fast on slow pages – Lower DOWNLOAD_TIMEOUT to give up quickly on pages that hang, instead of waiting out Scrapy's default of 180 seconds:
DOWNLOAD_TIMEOUT = 10
Retry on failure – Set RETRY_TIMES
to re-request pages with errors:
RETRY_TIMES = 10
Simulate mobile – Pass playwright_context_kwargs in the request meta (applied when the request creates a new browser context) to emulate a mobile viewport:
meta={
    'playwright_context_kwargs': {
        'viewport': {
            'width': 360,
            'height': 640,
        },
    },
}
Debug interactively – Launch a headed browser in settings.py to watch the scrape, or run the crawl with the PWDEBUG=1 environment variable to open the Playwright Inspector:
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': False
}
Conclusion
And we're done! We walked through a complete Scrapy Playwright scraping workflow – from installation to setup, request handling, and form filling using real browser automation.
Some key takeaways:
- Scrapy Playwright lets you scrape dynamic JavaScript sites with the power of Playwright and Scrapy
- APIs like page.query_selector() make extracting data from rendered pages easy
- Methods like page.click() enable interacting with buttons, forms and more
- Pagination and crawling multiple pages is seamless
- Options like proxies and device emulation boost efficiency
- Interactive debugging provides control over the scraping process
Scrapy Playwright is a game-changer for anyone struggling to scrape modern web applications laden with JavaScript. I hope you found this hands-on tutorial useful! Let me know in the comments if you have any other topics you would like covered.
Happy scraping!