Web Scraping with Rust

Rust is a relatively new systems programming language that has been gaining popularity in recent years, especially among developers who need to build high-performance applications. Thanks to its speed, safety, and concurrency, Rust is a great choice for building robust web scrapers. In this comprehensive guide, we will explore how to build a web scraper using Rust.

Why Use Rust for Web Scraping?

Here are some of the key advantages of using Rust for web scraping:

  • Speed: Rust programs compile to native code that achieves performance comparable to C and C++. This makes Rust web scrapers incredibly fast.

  • Memory Safety: Rust's ownership system ensures memory safety without needing a garbage collector. This results in reliable web scrapers that don't crash or leak memory.

  • Concurrency: Rust has excellent support for concurrency through threads, async/await, and more. This allows building highly parallel scrapers that fully utilize modern multi-core CPUs.

  • Portability: Rust compiles to self-contained binaries that run anywhere, no runtime required. Scrapers can be deployed easily across different platforms.

  • Community: Rust has an active and welcoming community that has produced many useful web scraping crates.

Overall, Rust provides the right blend of performance, safety, and productivity for building robust and maintainable web scrapers. The strong type system catches bugs early, while zero-cost abstractions allow the flexibility to handle complex scraping tasks.

Overview of a Rust Web Scraper

Let's start by outlining the key components of a Rust web scraper:

  • The reqwest crate for making HTTP requests to fetch web pages.

  • The select crate (select.rs) for parsing and querying HTML with composable predicates, similar in spirit to CSS selectors.

  • The serde crate for deserializing JSON data into Rust structs.

  • The csv crate for writing scraped data to CSV files.

  • The tokio crate for writing asynchronous scrapers using async/await.

  • The rayon crate for parallelizing scraping over multiple threads.

We'll cover most of these crates in this tutorial as we build an example Rust scraper for books.toscrape.com.

Project Setup

Let's start by creating a new Cargo project called book-scraper:

$ cargo new book-scraper

This generates a simple project with a Cargo.toml manifest file and src/main.rs source file.
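The generated layout looks like this:

book-scraper/
├── Cargo.toml    # project manifest: metadata and dependencies
└── src/
    └── main.rs   # entry point containing a hello-world main()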

Next, edit Cargo.toml to add the dependencies we'll need:

[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
select = "0.6"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
csv = "1.1"
tokio = { version = "1", features = ["full"] }
rayon = "1"

Now we're ready to start writing our scraper in main.rs!

Making HTTP Requests

The first thing our scraper needs to do is make HTTP requests to fetch web pages. For this, we'll use the reqwest crate, which provides both synchronous (blocking) and asynchronous HTTP clients.

Let's start with a simple synchronous example using reqwest's blocking API (enabled by the blocking feature we added above):

use reqwest;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::blocking::get("https://books.toscrape.com")?;
    println!("Status: {}", resp.status());
    println!("Headers:\n{:#?}", resp.headers());

    let body = resp.text()?;
    println!("Body:\n{}", body);

    Ok(())
}

We make a GET request using reqwest::blocking::get, check the status code and headers, then print the response body. The ? operator automatically propagates errors.
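In a real scraper you will usually want a reusable client with a custom User-Agent and a request timeout rather than the bare get helper. Here is a minimal sketch using reqwest's blocking client builder; the User-Agent string is just a placeholder:

use std::time::Duration;
use reqwest::blocking::Client;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a reusable client with a custom User-Agent and a 10-second timeout.
    let client = Client::builder()
        .user_agent("book-scraper/0.1")
        .timeout(Duration::from_secs(10))
        .build()?;

    // Reuse the same client for every request so connections can be pooled.
    let body = client.get("https://books.toscrape.com").send()?.text()?;
    println!("Fetched {} bytes", body.len());

    Ok(())
}

Reusing one Client also lets reqwest keep connections alive between requests, which matters when scraping many pages from the same site.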

Either way, fetching web pages with Rust is straightforward! Next, let's see how to parse the HTML.

Parsing HTML with select.rs

To query and extract data from HTML, we'll use the select.rs crate, which matches elements using composable predicates, playing a role similar to jQuery's CSS selectors.

First we need to parse the HTML string into a Document:

use select::document::Document;

let doc = Document::from(body.as_str());

Now we can build predicates to select elements from the document:

use select::predicate::{Class, Name, Predicate};

let book_titles = doc.find(Class("product_pod").descendant(Name("h3")).descendant(Name("a")));

for title in book_titles {
    println!("Title: {}", title.text());
}

This finds the title link inside the <h3> of every product card and prints its text.

select.rs has more features such as attribute access and boolean combinators (and, or, not) for predicates. This makes it easy to declaratively extract structured data from HTML in Rust.
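For example, here is a sketch showing attribute access and predicate combinators, continuing with the doc parsed above and assuming the books.toscrape.com markup is unchanged:

use select::predicate::{Class, Name, Predicate};

// Attribute access: the full title and the relative URL live on the <a> tag itself.
for link in doc.find(Class("product_pod").descendant(Name("a"))) {
    if let Some(title) = link.attr("title") {
        println!("{} -> {}", title, link.attr("href").unwrap_or_default());
    }
}

// Boolean logic: <p> elements that also carry the price_color class.
for price in doc.find(Name("p").and(Class("price_color"))) {
    println!("Price: {}", price.text());
}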

Scraping into Structs with Serde

For more complex sites, we often want to scrape data into Rust structs rather than just printing values. The serde crate provides powerful deserialization for converting JSON or other formats into structured objects.

Let's define a struct to hold information about a book:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Book {
    title: String,
    price: f32,
}

The derive attribute automatically generates the code needed to deserialize a Book from JSON, and Debug lets us pretty-print the result with {:#?}.

Now we can scrape book data into this struct:

use serde_json::json;

// `title` and `price` are assumed to hold values scraped from the page:
// a select.rs node for the title and a numeric string for the price.
let book_json = json!({
    "title": title.text(),
    "price": price.parse::<f32>().unwrap()
}).to_string();

let book: Book = serde_json::from_str(&book_json)?;
println!("{:#?}", book);

We build a JSON string containing the scraped values, then deserialize it into a Book. Rust's rich types let us capture structured data naturally.
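The same derive also handles collections: a JSON array deserializes straight into a Vec<Book>. A self-contained sketch with inline sample data:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Book {
    title: String,
    price: f32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Inline sample data standing in for scraped values.
    let json = r#"[
        {"title": "A Light in the Attic", "price": 51.77},
        {"title": "Tipping the Velvet", "price": 53.74}
    ]"#;

    // A JSON array maps directly onto Vec<Book>.
    let books: Vec<Book> = serde_json::from_str(json)?;
    println!("{:#?}", books);

    Ok(())
}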

Writing Scraped Data to CSV

A common need is persisting scraped data to files, like CSV or JSON. The csv crate provides a fast CSV writer suitable for large datasets.

Let's write the books to books.csv:

use csv::Writer;

let mut wtr = Writer::from_path("books.csv")?;

wtr.write_record(&["title", "price"])?; // header row

for book in books {
    // write_record expects string-like fields, so convert the f32 price to a String
    wtr.write_record(&[book.title, book.price.to_string()])?;
}

wtr.flush()?;

We create a Writer pointed at the CSV file, write the header row, then write one row per book. The ? operator handles errors automatically.

The csv crate also provides options such as a custom delimiter, quoting rules, and automatic serialization of structs via serde. This makes it easy to persist scraped data for analysis and storage.
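For instance, here is a sketch that writes a semicolon-delimited file with WriterBuilder and, assuming Book also derives Serialize, lets the csv crate emit the header row and records automatically:

use csv::WriterBuilder;
use serde::Serialize;

#[derive(Serialize)]
struct Book {
    title: String,
    price: f32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Use a semicolon delimiter instead of the default comma.
    let mut wtr = WriterBuilder::new()
        .delimiter(b';')
        .from_path("books.csv")?;

    // A sample record; in a real scraper this comes from the parsed page.
    let book = Book { title: "Sharp Objects".into(), price: 47.82 };

    // serialize() writes a header row from the field names on the first call,
    // then one record per struct.
    wtr.serialize(&book)?;
    wtr.flush()?;

    Ok(())
}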

Parallel Scraping with Rayon

One benefit of Rust is easy data parallelism thanks to the rayon crate (which we added to Cargo.toml earlier). To process our scraped books in parallel, we just need to add this:

use rayon::prelude::*;

books.par_iter().for_each(|book| {
    println!("{:?}", book);
});

The par_iter() method converts an iterator into a parallel iterator. Operations like for_each will now run concurrently across all CPU cores!

Rayon handles all the threading complexity for us, and Rust's ownership and Send/Sync guarantees rule out data races at compile time. With a few extra lines, our scraper can achieve dramatic speedups on multi-core machines.
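As a more concrete sketch, rayon can also parallelize the network-bound part of a scraper when paired with reqwest's blocking client (both are in our Cargo.toml above); the page URLs here are just illustrative:

use rayon::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let urls = vec![
        "https://books.toscrape.com/catalogue/page-1.html".to_string(),
        "https://books.toscrape.com/catalogue/page-2.html".to_string(),
    ];

    // Fetch every page on rayon's thread pool; collecting into a Result
    // stops at the first error.
    let bodies = urls
        .par_iter()
        .map(|url| reqwest::blocking::get(url.as_str())?.text())
        .collect::<Result<Vec<String>, reqwest::Error>>()?;

    println!("Fetched {} pages", bodies.len());
    Ok(())
}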

Building an Async Web Scraper

So far we've used synchronous requests, which block the thread until the response arrives. To overlap many requests, we can build an asynchronous scraper using async/await and tokio.

First we'll switch to using the async reqwest client:

use reqwest::Client;

let client = Client::new();

let resp = client.get("https://books.toscrape.com").send().await?;

The send() method returns a future rather than blocking. We await on the future to get the response.
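.await only works inside an async context, so this code needs to live in an async main driven by tokio. A minimal sketch:

use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // send() returns a future; .await suspends until the response arrives.
    let resp = client.get("https://books.toscrape.com").send().await?;
    let body = resp.text().await?;

    println!("Fetched {} bytes", body.len());
    Ok(())
}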

Under the hood, tokio drives these requests on an asynchronous runtime, so many in-flight requests can share a handful of threads. We can also await multiple futures at the same time to overlap I/O.

Here's how we could fetch two pages concurrently:

use tokio::join;

let books_resp = client.get("...books...").send();
let images_resp = client.get("...images...").send();

// join! polls both futures at once; each output is a Result<Response, reqwest::Error>.
let (books, images) = join!(books_resp, images_resp);

The join! macro awaits multiple futures concurrently, improving performance.

Async Rust takes a bit more practice, but unlocks huge scraping speedups by overlapping network requests and parallelizing work.
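When scraping many pages, a common pattern is to bound how many requests run at once using a stream. This sketch assumes the futures crate has been added to Cargo.toml alongside tokio (the page URLs are illustrative):

use futures::stream::{self, StreamExt};
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let urls: Vec<String> = (1..=5)
        .map(|n| format!("https://books.toscrape.com/catalogue/page-{}.html", n))
        .collect();

    // Turn the URLs into a stream of request futures and run at most 3 at a time.
    let bodies: Vec<_> = stream::iter(urls)
        .map(|url| {
            let client = client.clone();
            async move { client.get(url.as_str()).send().await?.text().await }
        })
        .buffer_unordered(3)
        .collect()
        .await;

    println!("Fetched {} pages", bodies.iter().filter(|r| r.is_ok()).count());
    Ok(())
}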

Comprehensive Example

Let's tie together everything we've covered by building a comprehensive scraper for the books website. It will:

  • Fetch the index page asynchronously
  • Scrape book URLs into a vector
  • Use rayon to scrape each book page concurrently
  • Write scraped books into a CSV file

Here's the code, with the per-book extraction left as a sketch:

use reqwest::Client;
use select::{document::Document, predicate::{Class, Name, Predicate}};
use rayon::prelude::*;
use serde::Deserialize;
use csv::Writer;

#[derive(Deserialize)]
struct Book {
    title: String,
    price: f32,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    let body = client.get("https://books.toscrape.com")
        .send()
        .await?
        .text()
        .await?;

    let doc = Document::from(body.as_str());

    // Collect the relative URL of each book page into owned Strings.
    let book_links: Vec<String> = doc
        .find(Class("image_container").descendant(Name("a")))
        .filter_map(|link| link.attr("href").map(str::to_string))
        .collect();

    // Scrape each book page on rayon's thread pool. These threads are separate from
    // the tokio runtime, so a blocking reqwest client can be used inside the closure.
    let books: Vec<Book> = book_links
        .par_iter()
        .map(|link| {
            // scrape book page: fetch `link` and extract the title and price
            todo!()
        })
        .collect();

    let mut wtr = Writer::from_path("books.csv")?;

    wtr.write_record(&["title", "price"])?;

    for book in books {
        wtr.write_record(&[book.title, book.price.to_string()])?;
    }

    wtr.flush()?;

    Ok(())
}

This demonstrates how Rust's ecosystem makes it straightforward to build robust, high-performance scrapers. The compiler catches mistakes at compile time, while abstractions like async/await and rayon make parallelism easy.

Summary

In this comprehensive tutorial, we built a full-featured web scraper using Rust and saw how:

  • Rust provides performance, safety and concurrency for building robust scrapers.

  • The reqwest crate makes sending HTTP requests easy and efficient.

  • select.rs enables declarative HTML scraping using composable predicates.

  • Serde provides automatic struct deserialization.

  • CSV files can be written for data storage.

  • Rayon powers easy multi-threading to speed up scraping.

  • Tokio and async/await enable asynchronous I/O and parallel requests.

Rust may have a learning curve, but mastering it allows building blazingly fast and reliable scrapers. The strong type system prevents bugs while zero-cost abstractions provide flexibility.

Thanks to Rust's growing ecosystem, all the tools you need for production-ready web scraping are readily available. Rust is especially well suited for large-scale scrapers where speed, safety and parallelism are critical.
