Web scraping is an essential skill for collecting and analyzing data from the web. With the right tools and techniques, anyone can scrape websites to extract valuable information. This comprehensive tutorial will teach you how to leverage Crawlee, a powerful web scraping and automation framework, to easily scrape data from the web.
An Introduction to Crawlee
Crawlee is an open-source web scraping and browser automation library for Node.js. It provides a unified interface for scraping websites using either plain HTTP requests or full browsers, headless or headful, driven by Puppeteer or Playwright.
Some key features of Crawlee include:
- Simple and flexible API for defining scrapers
- Integrated queue for managing URLs to crawl
- Support for headless and headful browser automation
- Pluggable storage adapters for saving scraped data
- Built-in proxy rotation for avoiding blocks
- Customizable hooks and middlewares
- Easy deployment with Docker
With Crawlee, you can quickly build robust web scrapers without worrying about the underlying complexities. The declarative API abstracts away the gritty details, allowing you to focus on your business logic. Next, let's see Crawlee in action with a hands-on web scraping tutorial.
Installing and Setting Up Crawlee
Crawlee offers a Node.js package that can be installed via NPM. Additionally, the Crawlee CLI provides scaffolding capabilities to bootstrap new projects.
Installing Crawlee Package
To install Crawlee in your project, run:
npm install crawlee
This will install the latest version of the crawlee package and save it to your package.json dependencies.
Bootstrapping with Crawlee CLI
For quickly scaffolding a new project, you can use the Crawlee CLI:
npx crawlee create my-scraper
This will prompt you to select a template – choose "Getting Started Example (Javascript)" which contains sample code demonstrating Crawlee usage.
The CLI will generate a new my-scraper directory with an example crawler inside src/main.js. It will also initialize package.json with all the required dependencies.
With Crawlee set up, you're ready to write your first web scraper!
Creating a Basic Web Scraper with Crawlee
To illustrate Crawlee in action, we'll build a simple scraper for books.toscrape.com. The goal is to extract the names of all books on the site.
The boilerplate CLI project contains a basic crawler in main.js:
// Import Crawlee classes
const { PlaywrightCrawler } = require("crawlee");

// Create crawler instance
const crawler = new PlaywrightCrawler({
    // Async request handler function
    requestHandler: async ({ page }) => {
        // Wait for content to load
        await page.waitForSelector("h3");
        // Extract data
        const titles = await page.$$eval("h3", els => els.map(e => e.textContent));
        // Log scraped data
        titles.forEach((title, index) => {
            console.log(`${index + 1}. ${title}`);
        });
    }
});

// Start crawling
crawler.run(["https://books.toscrape.com/"]);
The key steps are:
- Import required Crawlee classes
- Create a PlaywrightCrawler instance
- Define a requestHandler function that will extract data from each page
- Call crawler.run() to start crawling the site
The requestHandler is an async function that receives the page object, allowing you to interact with the browser. Here we wait for <h3> elements to load, extract the text contents to get book titles, and log them.
To run the scraper, use:
node src/main.js
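This loads the start page and prints the book titles found on it, so our scraper works! Only the URL passed to crawler.run() is visited, which means books on later catalogue pages are not included yet.
The full power of Crawlee lies in its flexible APIs that enable complex scraping logic tailored to any website. Next, let's look at some advanced features.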
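To go beyond the start URL, the handler has to enqueue further links. The following is a minimal sketch rather than the template's code. It assumes that the site's "next" link matches the selector li.next a and that each h3 a element carries the full title in its title attribute; both are assumptions about the page markup. formatTitles and buildPaginatedCrawler are hypothetical helper names, and the request cap is an arbitrary safety value.

```javascript
// Pure helper: number and trim titles, continuing from a running offset.
function formatTitles(titles, offset = 0) {
    return titles.map((t, i) => `${offset + i + 1}. ${t.trim()}`);
}

// Build the crawler lazily so that loading this file has no side effects.
function buildPaginatedCrawler() {
    const { PlaywrightCrawler } = require("crawlee");
    let seen = 0; // number of titles logged so far

    return new PlaywrightCrawler({
        maxRequestsPerCrawl: 60, // safety cap (assumed value)
        requestHandler: async ({ page, enqueueLinks, log }) => {
            await page.waitForSelector("h3 a");
            // Assumption: the link's title attribute holds the full book title.
            const titles = await page.$$eval("h3 a", els =>
                els.map(e => e.getAttribute("title") || e.textContent));
            formatTitles(titles, seen).forEach(line => log.info(line));
            seen += titles.length;
            // Assumption: the pagination link matches this selector.
            await enqueueLinks({ selector: "li.next a" });
        },
    });
}

// Usage: buildPaginatedCrawler().run(["https://books.toscrape.com/"]);
```

Each page enqueues at most one follow-up link, so the crawl walks the catalogue page by page until no "next" link is found or the cap is reached.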
Using Crawlee with Headless Browsers
While simple HTTP requests work for basic scraping, many websites rely on JavaScript rendering and require a real browser. Crawlee makes it easy to drive headless browsers through automation libraries such as Puppeteer and Playwright.
Here is an example using Puppeteer:
// Import PuppeteerCrawler
const { PuppeteerCrawler } = require("crawlee");

const crawler = new PuppeteerCrawler({
    // Configure browser and page
    launchContext: {
        launchOptions: {
            headless: false // Change to true for headless crawling
        }
    },
    // Request handler
    requestHandler: async ({ page, enqueueLinks }) => {
        // Browser logic here
    }
});

crawler.run(["https://example.com"]);
The launchContext property handles configuring the browser, while the request handler defines scraping logic using the page object.
Set headless to true for headless crawling. The same structure works for Playwright: use PlaywrightCrawler instead of PuppeteerCrawler. The handler still receives a page object, which is then a Playwright Page rather than a Puppeteer one.
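For comparison, pages that do not need JavaScript rendering can be handled over plain HTTP, as mentioned at the start of this section. The sketch below uses Crawlee's CheerioCrawler, which passes a Cheerio handle ($) to the handler instead of a browser page. cleanText and buildHttpCrawler are hypothetical helper names introduced for illustration.

```javascript
// Pure helper: collapse the runs of whitespace that HTML formatting leaves behind.
function cleanText(text) {
    return text.replace(/\s+/g, " ").trim();
}

// Build the crawler lazily so that loading this file has no side effects.
function buildHttpCrawler() {
    const { CheerioCrawler } = require("crawlee");

    return new CheerioCrawler({
        requestHandler: async ({ $, request, log }) => {
            // $ is a Cheerio instance loaded with the response HTML; no browser is started.
            const headings = $("h3")
                .map((_, el) => cleanText($(el).text()))
                .get();
            log.info(`${request.url}: found ${headings.length} headings`);
        },
    });
}

// Usage: buildHttpCrawler().run(["https://example.com"]);
```

Skipping the browser is usually faster and lighter, at the cost of not seeing content that is rendered client-side.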
Executing Browser Code
The page object passed to the handler comes from Puppeteer or Playwright, so their methods, such as page.evaluate() and page.$$eval(), are available to execute code within the browser context and extract data.
For example:
// Extract text from all <p> elements
const paragraphs = await page.$$eval("p", els => els.map(e => e.textContent));
Refer to the Puppeteer and Playwright docs for more methods, such as clicking elements and filling forms.
Managing Browser Sessions
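To persist cookies and state across pages, reuse the same browser context. In a PlaywrightCrawler request handler, the context is reachable from the page object:
// Inside a PlaywrightCrawler request handler
// Get the browser context that owns the current page
const context = page.context();
// Open a second page in the same context
const secondPage = await context.newPage();
// secondPage now shares cookies and storage with page
// ...use secondPage, then release it
await secondPage.close();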
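Crawlee can also keep cookies across requests on its own through its session pool. The following is a minimal sketch, assuming the crawler options useSessionPool, persistCookiesPerSession, and sessionPoolOptions.maxPoolSize; the pool sizes are arbitrary, and sessionOptions and buildSessionCrawler are hypothetical helper names.

```javascript
// Pure helper: crawler options that turn on session-based cookie persistence.
function sessionOptions(maxPoolSize = 10) {
    return {
        useSessionPool: true,           // rotate requests across a pool of sessions
        persistCookiesPerSession: true, // each session keeps its own cookies
        sessionPoolOptions: { maxPoolSize },
    };
}

// Build the crawler lazily so that loading this file has no side effects.
function buildSessionCrawler() {
    const { PlaywrightCrawler } = require("crawlee");

    return new PlaywrightCrawler({
        ...sessionOptions(5),
        requestHandler: async ({ page, session, log }) => {
            // session identifies which cookie jar served this request.
            log.info(`session ${session.id} handled ${page.url()}`);
        },
    });
}

// Usage: buildSessionCrawler().run(["https://example.com"]);
```

Keeping the options in a plain function makes them easy to reuse across crawler types and to check without starting a browser.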
Using Proxies with Crawlee for Scraping
Websites often block scrapers based on IP address, so proxies are frequently needed for larger crawls. Crawlee has first-class support for proxies via the ProxyConfiguration class.
To use proxies, pass a ProxyConfiguration instance to the crawler:
// Import the crawler and ProxyConfiguration
const { PlaywrightCrawler, ProxyConfiguration } = require("crawlee");

// Create proxy configuration
const proxyConfig = new ProxyConfiguration({
    proxyUrls: ["http://localhost:8000"]
});

const crawler = new PlaywrightCrawler({
    // Pass proxy config to crawler
    proxyConfiguration: proxyConfig,
    // ...rest of crawler code
});
Credentials can be embedded in the proxy URL, and a custom newUrlFunction can replace the default rotation over proxyUrls when you need your own selection rules.
Routing requests through proxies reduces the chance of IP-based blocks, though it does not rule them out.
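Custom rotation can be sketched with the newUrlFunction option of ProxyConfiguration, which is called whenever a proxy URL is needed and is used instead of proxyUrls, not alongside it. The example below is an illustration under assumptions: the proxy hosts and credentials are placeholders, and makeRotator and buildProxyConfiguration are hypothetical helper names.

```javascript
// Pure helper: returns a function that cycles through the given proxy URLs.
function makeRotator(proxyUrls) {
    let next = 0;
    return () => {
        const url = proxyUrls[next % proxyUrls.length];
        next += 1;
        return url;
    };
}

// Build the configuration lazily so that loading this file has no side effects.
function buildProxyConfiguration(proxyUrls) {
    const { ProxyConfiguration } = require("crawlee");
    // newUrlFunction replaces proxyUrls; the session id argument is ignored here.
    return new ProxyConfiguration({ newUrlFunction: makeRotator(proxyUrls) });
}

// Usage (placeholder hosts and credentials):
// const proxyConfiguration = buildProxyConfiguration([
//     "http://user:password@proxy-1.example.com:8000",
//     "http://user:password@proxy-2.example.com:8000",
// ]);
// new PlaywrightCrawler({ proxyConfiguration, /* ... */ });
```

The rotator is a plain closure, so a different policy, for example skipping proxies that recently failed, can be swapped in without touching the crawler.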
Other Handy Crawlee Features
Crawlee packs many other useful features:
- Storage – Save scraped data to CSV, JSON, databases etc.
- Queue – Manage crawl queue with different traversal strategies.
- Hooks – Execute custom logic during the crawler lifecycle via hooks.
- Error handling – Gracefully handle errors using built-in middlewares.
- Docker – Easily containerize and deploy crawlers at scale.
- TypeScript support – Take advantage of TypeScript for auto-complete, type safety and more.
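Several of these features can be combined in one crawler. The sketch below is illustrative only: it stores one record per page with Dataset.pushData(), logs each navigation from a preNavigationHooks entry, and reports pages that still fail after retries in failedRequestHandler. The retry count is arbitrary, and toRecord and buildStoringCrawler are hypothetical helper names.

```javascript
// Pure helper: shape one scraped page into a record for the dataset.
function toRecord(url, title) {
    return { url, title: title.trim(), scrapedAt: new Date().toISOString() };
}

// Build the crawler lazily so that loading this file has no side effects.
function buildStoringCrawler() {
    const { PlaywrightCrawler, Dataset } = require("crawlee");

    return new PlaywrightCrawler({
        maxRequestRetries: 2, // retry a failing page twice before giving up (assumed value)
        // Hook: runs before each navigation.
        preNavigationHooks: [
            async ({ request, log }) => log.debug(`Navigating to ${request.url}`),
        ],
        requestHandler: async ({ page, request }) => {
            // Storage: append the record to the default dataset.
            await Dataset.pushData(toRecord(request.url, await page.title()));
        },
        // Error handling: called once a request has exhausted its retries.
        failedRequestHandler: async ({ request, log }, error) => {
            log.error(`${request.url} failed: ${error.message}`);
        },
    });
}

// Usage: buildStoringCrawler().run(["https://example.com"]);
```

With default settings, the records are written as JSON files under the project's storage directory.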
And much more! Crawlee is designed for flexibility, which makes it suitable for anything from small scripts to large enterprise scraping systems.
Conclusion
Crawlee provides a robust and easy-to-use web scraping framework for Node.js. With its intuitive API and headless browser support, you can focus on writing scraping logic without reinventing the wheel.
This tutorial covered the basics of setting up Crawlee, building a simple scraper, integrating browsers like Puppeteer, managing proxies, and using other handy features. Crawlee simplifies and accelerates web scraping to help you extract data faster.
To learn more, refer to the Crawlee docs, API reference, and examples. The active GitHub project also contains helpful guides and discussions.
Happy scraping!