A Beginner's Guide to Scraping JavaScript Websites with Scrapy and Splash

Modern websites increasingly rely on JavaScript to load content dynamically. While useful for users, this can make scraping more challenging. In this comprehensive guide, we'll learn how to integrate Scrapy with Splash to scrape tricky JavaScript sites with ease!

The Rise of JavaScript-Heavy Websites

First, let's quickly review the trends that have made complex JavaScript-rendered sites so common:

  • Adoption of front-end frameworks – Libraries like React, Vue, and Angular make it easy to build complex UIs that fetch and render data in the browser, all powered by JavaScript.

  • Shift towards web apps – Many sites now function more as interactive apps rather than static pages. JavaScript powers much of their logic.

  • Performance improvements – Loading content dynamically via JS cuts down on initial bandwidth usage.

According to BuiltWith, nearly 50% of the top 10,000 websites already use JavaScript frameworks like React, and that share keeps growing.

While great for users, these trends pose challenges for scraping. Traditional tools like requests and BeautifulSoup only see the initial raw HTML the server sends. To extract content loaded asynchronously, we need a JavaScript rendering engine like Splash.
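
To see the problem concretely, here's a quick sketch using the quotes.toscrape.com/js demo page (which we'll scrape properly later in this guide). The page injects all of its quotes via JavaScript, so a plain HTTP client comes up empty:

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML, i.e. what the server sends before any JS runs
html = requests.get("http://quotes.toscrape.com/js").text
soup = BeautifulSoup(html, "html.parser")

# The quotes are inserted client-side, so none exist in the raw markup
print(len(soup.select("div.quote")))  # prints 0

The data is embedded in an inline script tag rather than in HTML elements, which is exactly the gap Splash fills.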

Why Splash Solves the JavaScript Scraping Problem

Splash is a lightweight JavaScript rendering service with an HTTP API. It offers a quick and convenient way to integrate JS rendering into Scrapy spiders.

Some key capabilities Splash provides:

  • Fetches pages and renders JavaScript – Splash executes the page's JS and returns the fully rendered DOM.
  • Waits for page updates – Configurable delays let AJAX-loaded content finish rendering before the result is returned.
  • Executes custom JavaScript – JS snippets can be injected into the page to interact with it or extract data.
  • Maintains stateful browsing sessions – Cookies and local storage persist across requests.

Overall, Splash combines the power of full JavaScript execution with the flexibility of an API. This makes it easy to incorporate into Python scraping workflows.
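
Because Splash is just an HTTP service, you can try it without Scrapy at all. Here's a minimal sketch (it assumes the local Splash instance we set up below, listening on port 8050):

import requests

# Ask Splash's render.html endpoint to load a page, execute its
# JavaScript, and return the final HTML
resp = requests.get(
  "http://localhost:8050/render.html",
  params={"url": "http://quotes.toscrape.com/js", "wait": 2},
)
print(resp.text[:500])  # the rendered HTML now includes JS-injected content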

[Figure: Splash architecture diagram, from the Splash docs]

Now let's walk through how to set up Splash and integrate it with Scrapy.

Installing System Requirements

Splash relies on several moving parts under the hood. We'll need to install these prerequisites first:

Docker

Splash runs neatly isolated inside a Docker container. Docker provides a lightweight, containerized environment that makes services like Splash easy to run anywhere.

The installation process depends on your operating system:

  • Linux – Use apt for Ubuntu/Debian or dnf for Fedora:

    sudo apt install docker.io   # Ubuntu/Debian
    sudo dnf install docker      # Fedora
  • Windows / macOS – Download and run the Docker Desktop app:

    https://www.docker.com/get-started

Overall, getting Docker up takes just a few minutes on any modern system, and containers make scraping infrastructure easy to reproduce and scale.

Scrapy

Of course, we'll need Scrapy itself as our web scraping framework. Scrapy provides a battle-tested platform for building complex crawling programs in Python.

Install Scrapy via pip:

pip install scrapy

Make sure pip points to your preferred Python environment.

scrapy-splash

Lastly, we need scrapy-splash, the library that integrates Scrapy with Splash.

scrapy-splash handles the communication between the two for us: it routes requests through Splash and converts the rendered results back into Scrapy responses.

Install it like so:

pip install scrapy-splash

And that's it for dependencies! Now we're ready to launch Splash.

Running The Splash Docker Container

With Docker installed, getting a local Splash instance running takes just two quick commands:

# Pull latest Splash image 
docker pull scrapinghub/splash  

# Run container on port 8050 
docker run -p 8050:8050 scrapinghub/splash

This launches the Splash Docker image in a neatly isolated container, ready for our spiders to connect to!

By default, Splash runs on port 8050. We can test this by visiting http://localhost:8050:

[Figure: the Splash demo page served at http://localhost:8050]
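
You can also verify the service from code. A small sketch using Splash's _ping health-check endpoint (assuming the default local setup above):

import requests

# Splash exposes a _ping endpoint for simple health checks
print(requests.get("http://localhost:8050/_ping").json())
# e.g. {'status': 'ok', 'maxrss': ...}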

With Splash up and running, let's look at integrating it into a Scrapy spider.

Configuring Scrapy Settings

All of our rendering requests will get routed through the containerized Splash instance. To set this up, we need to configure a few Scrapy settings:

# settings.py

SPLASH_URL = 'http://localhost:8050'  # our Dockerized Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

This tells Scrapy:

  • Where to find our Splash instance
  • To route requests through the Splash-aware middlewares
  • To use a duplicate filter that takes Splash arguments into account

And that's it for setup on the Scrapy side! Now we can start sending requests.

Making Splash Requests

To render pages via Splash, scrapy-splash provides a special SplashRequest class.

SplashRequest works just like Scrapy's regular Request, but renders pages through the Splash browser instead of fetching them directly:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):

  name = 'my_spider'

  # Initial requests
  def start_requests(self):
    yield SplashRequest(
      url="http://example.com",
      callback=self.parse,
      args={'wait': 2},  # let the page's JS run for 2 seconds
    )

  # Parse response   
  def parse(self, response):
    ...

This requests example.com, has Splash render the page with a two-second wait, and then passes the rendered response to the callback.

Just like that, SplashRequest allows applying browser rendering to any page!

Splash Request Options

SplashRequest offers many options to control rendering and JavaScript execution:

  • wait – Time in seconds to wait after the page loads, giving AJAX requests a chance to finish
  • timeout – Overall timeout for rendering the page
  • resource_timeout – Maximum time to wait for individual resource (image, script, etc.) downloads
  • js_source – Additional JS code to execute on the page
  • images – Whether to download images (1 by default; set to 0 to speed up rendering)

For example:

yield SplashRequest(
  url=url,
  endpoint='render.json',  # render.json can return HTML plus a screenshot
  args={
     'wait': 5,
     'js_source': 'document.getElementById("btn").click()',
     'html': 1,
     'png': 1,
  }
)

This injects JavaScript to click a button, waits five seconds for the resulting content changes, and returns both the rendered HTML and a PNG screenshot (base64-encoded in the JSON response).

These options enable interacting with pages via JavaScript to load data.
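
If you request a screenshot this way, the callback receives JSON. Here's a sketch of saving the image (field names follow the render.json response format; response.data is covered in the next section):

import base64

def parse(self, response):
  data = response.data  # decoded JSON from render.json
  html = data['html']   # rendered markup (because html=1)
  with open('page.png', 'wb') as f:
    f.write(base64.b64decode(data['png']))  # screenshot (because png=1)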

Handling Splash Responses

Once requests go through Splash and are rendered, our callbacks receive specialized response classes:

  • SplashJsonResponse – For JSON results (e.g. from render.json); the decoded data is available as response.data
  • SplashTextResponse – For HTML/text pages
  • SplashResponse – For binary content such as images

We can handle these just like regular Scrapy responses:

from scrapy_splash import SplashJsonResponse, SplashTextResponse

def parse(self, response):

  if isinstance(response, SplashJsonResponse):
    data = response.data  # decoded JSON, no json.loads() needed

  elif isinstance(response, SplashTextResponse):
    html = response.text

  # ... extract data!

Either way, the response contains the fully rendered page data for us to parse.

Putting It All Together: Sample Spider

Let's tie together everything we've covered by writing a sample Splash spider to scrape quotes from quotes.toscrape.com!

import scrapy
from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):

  name = 'quotes'
  allowed_domains = ['toscrape.com']

  def start_requests(self):
    yield SplashRequest(
      url='http://quotes.toscrape.com/js',
      callback=self.parse,
      endpoint='render.html',
      args={'wait': 1},  # give the inline JS a moment to inject the quotes
    )

  def parse(self, response):
    for quote in response.css('div.quote'):
      yield {
        'text': quote.css('span.text::text').get(),
        'author': quote.css('small.author::text').get(),
        'tags': quote.css('div.tags a.tag::text').getall()
      }

    next_page = response.css('li.next > a::attr(href)').get()
    if next_page is not None:
      url = response.urljoin(next_page)
      yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 1})

This demonstrates several Splash techniques:

  • Making the initial request with SplashRequest so the page's JS is rendered
  • Extracting content from the rendered response with CSS selectors
  • Recursively crawling pagination by yielding further SplashRequests
  • Building absolute URLs with response.urljoin()

And that's it! With just a few small tweaks, we can scrape complex sites powered by JavaScript; run the spider with scrapy crawl quotes to see it in action.

Splash opens up a whole new world of possibilities for your Scrapy spiders!

Debugging & Troubleshooting Splash

Of course, it takes some practice to get the hang of Splash. Here are some tips for debugging issues that arise:

  • Check the Splash logs – requests, responses, and errors all appear in the container output
  • Try different endpoints – render.json returns structured data (HTML, screenshots, HAR), while execute runs custom Lua scripts
  • Remember the Splash-aware dupefilter includes Splash arguments in request fingerprints, so identical URLs with different args are treated as distinct
  • Test renders interactively in the Splash UI at http://localhost:8050
  • Keep Scrapy's DOWNLOAD_FAIL_ON_DATALOSS enabled to surface truncated responses
  • Increase wait times to rule out timing problems before digging deeper

Overall, Splash provides visibility into its JavaScript rendering. Familiarity with browser developer tools also goes a long way for debugging!

Additional Tips and Tools

Here are some more tips for leveling up your Splash scraping skills:

  • Learn Splash scripting – Splash Lua scripts allow complex browser automation logic
  • Use Splash-Jupyter – This handy tool lets you control Splash via a notebook UI
  • Integrate proxies – Route traffic through proxies to avoid blocks (see the sketch after this list)
  • Use a remote Splash instance – For increased scale, reliability, and proxies
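
Splash's render endpoints accept a proxy argument, so per-request proxying is a one-liner. A minimal sketch inside a spider (the proxy URL here is a hypothetical placeholder):

from scrapy_splash import SplashRequest

def start_requests(self):
  yield SplashRequest(
    url='http://example.com',
    callback=self.parse,
    args={
      'wait': 2,
      'proxy': 'http://user:pass@proxy.example.com:8000',  # hypothetical proxy
    },
  )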

Beyond debugging, it's also worth familiarizing yourself with Splash's Lua scripting environment. Scripts let you express logic for common use cases like handling infinite scroll or navigating across pages; see the sketch below.
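
As a taste, here is a minimal sketch of driving a Lua script from Scrapy through the execute endpoint. The script scrolls to the bottom of the page before returning the HTML, a common infinite-scroll pattern (spider boilerplate omitted; the URL is a placeholder):

from scrapy_splash import SplashRequest

# Lua script run inside Splash: load the page, scroll down, wait, return HTML
scroll_script = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(1))
  splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
  assert(splash:wait(1))
  return {html = splash:html()}
end
"""

def start_requests(self):
  yield SplashRequest(
    url='http://example.com',
    callback=self.parse,
    endpoint='execute',  # run the Lua script above
    args={'lua_source': scroll_script},
  )

Because the script returns a Lua table, the callback receives a SplashJsonResponse, and the rendered markup is available as response.data['html'].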

Commercial options exist too: Zyte (formerly Scrapinghub, the company behind Splash) has offered hosted Splash instances, and rendering APIs such as ScrapingBee bundle headless browsers with proxy management. These can simplify managing scraping infrastructure.

Key Takeaways and Next Steps

And there we have it – an introduction to scraping JavaScript web apps with Scrapy Splash!

Here are some key points:

  • Modern sites increasingly rely on JavaScript to build complex UIs
  • Splash provides a headless browser service to execute JavaScript for scraping
  • SplashRequest integrates Splash into standard Scrapy spiders
  • Scripts and wait times enable interacting with pages
  • Scrape tricky pages that were previously inaccessible!

For further learning, be sure to check out the official Splash docs for additional examples and API reference.

Happy Splash scraping!
