Crawlee Tutorial: Easy Web Scraping and Browser Automation

Web scraping is an essential skill for collecting and analyzing data from the web. With the right tools and techniques, anyone can scrape websites to extract valuable information. This comprehensive tutorial will teach you how to leverage Crawlee, a powerful web scraping and automation framework, to easily scrape data from the web.

An Introduction to Crawlee

Crawlee is an open-source web scraping and browser automation tool built on Node.js. It provides a unified interface for scraping websites using simple HTTP requests or fully-featured headless browsers like Puppeteer and Playwright.

Some key features of Crawlee include:

  • Simple and flexible API for defining scrapers
  • Integrated queue for managing URLs to crawl
  • Support for headless and headful browser automation
  • Pluggable storage adapters for saving scraped data
  • Built-in proxy rotation for avoiding blocks
  • Customizable hooks and middlewares
  • Easy deployment with Docker

With Crawlee, you can quickly build robust web scrapers without worrying about the underlying complexities. The declarative API abstracts away the gritty details, allowing you to focus on your business logic. Next, let's see Crawlee in action with a hands-on web scraping tutorial.

Installing and Setting Up Crawlee

Crawlee offers a Node.js package that can be installed via NPM. Additionally, the Crawlee CLI provides scaffolding capabilities to bootstrap new projects.

Installing Crawlee Package

To install Crawlee in your project, run:

npm install crawlee

This will install the latest version of the crawlee package and save it to your package.json dependencies.
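Crawlee treats the underlying browser libraries as separate dependencies, so if you plan to use the Playwright- or Puppeteer-based crawlers shown later in this tutorial, install the one you need alongside Crawlee (shown here for Playwright):

npm install crawlee playwright

Projects scaffolded with the Crawlee CLI (next section) typically declare the browser dependency for you, so this extra step mainly matters when adding Crawlee to an existing project.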

Bootstrapping with Crawlee CLI

For quickly scaffolding a new project, you can use the Crawlee CLI:

npx crawlee create my-scraper

This will prompt you to select a template. Choose "Getting Started Example (JavaScript)", which contains sample code demonstrating Crawlee usage.

The CLI will generate a new my-scraper directory with an example crawler inside src/main.js. It will also initialize package.json with all the required dependencies.
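The exact files vary by template, but the generated project looks roughly like this (a sketch, not an exact listing):

my-scraper/
  package.json    - project metadata and Crawlee dependencies
  src/main.js     - the example crawler
  storage/        - created at run time to hold datasets, request queues and key-value stores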

With Crawlee set up, you're ready to write your first web scraper!

Creating a Basic Web Scraper with Crawlee

To illustrate Crawlee in action, we'll build a simple scraper for books.toscrape.com. The goal is to extract the names of all books listed on the site.

The boilerplate CLI project contains a basic crawler in main.js:

// Import Crawlee classes
const { PlaywrightCrawler } = require("crawlee");

// Create crawler instance 
const crawler = new PlaywrightCrawler({
  // Async request handler function
  requestHandler: async ({ page }) => {

    // Wait for content to load
    await page.waitForSelector("h3");

    // Extract data
    const titles = await page.$$eval("h3", els => els.map(e => e.textContent));

    // Log scraped data
    titles.forEach((title, index) => {
      console.log(`${index + 1}. ${title}`);
    });
  }
});

// Start crawling
crawler.run(["https://books.toscrape.com/"]); 

The key steps are:

  1. Import required Crawlee classes

  2. Create a PlaywrightCrawler instance

  3. Define requestHandler function that will extract data from each page

  4. Call crawler.run() to start crawling the site

The requestHandler is an async function that receives the page object, allowing you to interact with the browser. Here we wait for <h3> elements to load, extract their text content to get the book titles, and log them.

To run the scraper, use:

node src/main.js

This visits the start page and prints out the book names: our scraper works!
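Our start URL only covers the first page of the catalogue. To crawl every page, you can follow the pagination links with the enqueueLinks helper that Crawlee passes into the request handler. The sketch below assumes the site's "next" button matches the CSS selector li.next a:

const { PlaywrightCrawler } = require("crawlee");

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request, enqueueLinks }) => {
    // Wait for content to load
    await page.waitForSelector("h3");

    // Extract book titles on the current page
    const titles = await page.$$eval("h3", els => els.map(e => e.textContent));
    console.log(`${request.url}: found ${titles.length} books`);

    // Follow the "next" pagination link (selector is an assumption about the site's markup)
    await enqueueLinks({ selector: "li.next a" });
  }
});

crawler.run(["https://books.toscrape.com/"]);

Crawlee de-duplicates queued URLs automatically, so each results page is crawled only once.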

The full power of Crawlee lies in its flexible APIs that enable complex scraping logic tailored to any website. Next, let's look at some advanced features.

Using Crawlee with Headless Browsers

While simple HTTP requests work for basic scraping, many websites rely on JavaScript rendering and require a real browser. Crawlee makes it easy to leverage headless browsers like Puppeteer and Playwright.

Here is an example using Puppeteer:

// Import PuppeteerCrawler
const { PuppeteerCrawler } = require("crawlee");

const crawler = new PuppeteerCrawler({

  // Configure browser and page
  launchContext: {
    launchOptions: {
      headless: false // Change to true for headless crawling
    } 
  },

  // Request handler
  requestHandler: async ({ page, enqueueLinks }) => {

    // Browser logic here

  }

});

crawler.run(["https://example.com"]);

The launchContext property handles configuring the browser, while the request handler defines scraping logic using the page browser object.

Set headless to true for headless crawling. The same code works with Playwright: simply swap PuppeteerCrawler for PlaywrightCrawler, and the request handler still receives a page object.

Executing Browser Code

Inside the request handler, the page object exposes methods like page.evaluate() and page.$$eval() for executing code within the browser context and extracting data.

For example:

// Extract text from all <p> elements
const paragraphs = await page.$$eval("p", els => els.map(e => e.textContent));

Refer to the Puppeteer and Playwright docs for more methods, such as clicking elements and filling forms.
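As a quick illustration, here is a hedged sketch of such interactions inside a PlaywrightCrawler request handler; the selectors (#search, button[type=submit], .result) are placeholders that will differ per site:

requestHandler: async ({ page }) => {
  // Type a query into a search box (placeholder selector)
  await page.fill("#search", "web scraping");

  // Submit the form and wait for the page to settle
  await page.click("button[type=submit]");
  await page.waitForLoadState("networkidle");

  // Read the text of the first result (placeholder selector)
  const firstResult = await page.$eval(".result", el => el.textContent);
  console.log(firstResult);
}

With PuppeteerCrawler, the equivalent calls would be page.type() for input and page.waitForNavigation() for waiting on the next page.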

Managing Browser Sessions

To persist cookies and state across requests, Crawlee's browser crawlers use a session pool. Enabling cookie persistence makes every request handled by the same session reuse its cookies:

const crawler = new PlaywrightCrawler({
  // Reuse sessions (and their cookies) across requests
  useSessionPool: true,
  persistCookiesPerSession: true,

  requestHandler: async ({ page, session }) => {
    // Pages handled with the same session share cookies and state
  }
});

Using Proxies with Crawlee for Scraping

Websites often block scrapers based on IP, so using proxies is essential. Crawlee has first-class support for proxies via the ProxyConfiguration class.

To use proxies, pass a ProxyConfiguration instance to the crawler:

// Import ProxyConfiguration
const { ProxyConfiguration } = require("crawlee");

// Create proxy configuration
const proxyConfig = new ProxyConfiguration({
  proxyUrls: ["http://localhost:8000"] 
});

const crawler = new PlaywrightCrawler({
  // Pass proxy config to crawler
  proxyConfiguration: proxyConfig,

  // ...rest of crawler code
})

You can also embed credentials in the proxy URLs, list several URLs to rotate between, or supply a custom function that picks the proxy for each request.

With proxies, you can avoid blocks and scrape data seamlessly.
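For basic rotation, you can simply list several proxy URLs and Crawlee will cycle through them. The addresses and credentials below are placeholders; the proxyInfo object in the request handler tells you which proxy served each request:

const { PlaywrightCrawler, ProxyConfiguration } = require("crawlee");

// Placeholder proxies - replace with your own
const proxyConfig = new ProxyConfiguration({
  proxyUrls: [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000"
  ]
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration: proxyConfig,

  requestHandler: async ({ request, proxyInfo }) => {
    // Log which proxy was used for this request
    console.log(`Fetched ${request.url} via ${proxyInfo.url}`);
  }
});

crawler.run(["https://example.com"]);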

Other Handy Crawlee Features

Crawlee packs many other useful features:

  • Storage – Save scraped data to datasets and key-value stores, or export it as JSON, CSV etc. (a short example follows below).

  • Queue – Manage crawl queue with different traversal strategies.

  • Hooks – Execute custom logic during the crawler lifecycle via hooks.

  • Error handling – Gracefully handle errors using built-in middlewares.

  • Docker – Easily containerize and deploy crawlers at scale.

  • TypeScript support – Take advantage of TypeScript for auto-complete, type safety and more.

And much more! Crawlee is designed for flexibility, making it suitable for everything from small scripts to large enterprise scraping systems.
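For example, instead of logging the book titles, the earlier scraper could push them into Crawlee's default dataset, which is persisted to the local storage folder as JSON. A minimal sketch:

const { PlaywrightCrawler, Dataset } = require("crawlee");

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request }) => {
    const titles = await page.$$eval("h3", els => els.map(e => e.textContent));

    // One record per book, written to ./storage/datasets/default
    await Dataset.pushData(titles.map(title => ({ title, url: request.url })));
  }
});

crawler.run(["https://books.toscrape.com/"]);

The stored records end up as JSON files on disk, ready to be post-processed or exported however you like.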

Conclusion

Crawlee provides a robust and easy-to-use web scraping framework for Node.js. With its intuitive API and headless browser support, you can focus on writing scraping logic without reinventing the wheel.

This tutorial covered the basics of setting up Crawlee, building a simple scraper, integrating browsers like Puppeteer, managing proxies, and using other handy features. Crawlee simplifies and accelerates web scraping to help you extract data faster.

To learn more, refer to the Crawlee docs, API reference, and examples. The active GitHub project also contains helpful guides and discussions.

Happy scraping!
