Web Scraping With R Using Rvest Tutorial

Web scraping is an essential skill for data scientists and analysts to extract valuable insights from the wealth of data available online. R, with its vast collection of statistical and data manipulation packages, is one of the most popular languages used for web scraping. In this comprehensive tutorial, we will learn the basics of web scraping in R using the rvest package.

Why Use R for Web Scraping?

Here are some of the key advantages of using R for web scraping:

  • R has a rich ecosystem of packages for data manipulation, analysis, and visualization. After scraping the data, you can seamlessly feed it into other R pipelines for further processing.

  • Rvest provides an easy-to-use interface for querying HTML and XML documents using CSS selectors and XPath expressions. This is similar to popular web scraping libraries like Beautiful Soup in Python.

  • R has excellent support for handling different data formats like CSV, JSON, XML, etc. which are commonly encountered when scraping the web.

  • RStudio provides a user-friendly IDE for developing and debugging scrapers interactively.

  • Advanced web scraping features like handling JavaScript-rendered pages, proxies, and browser automation are available via R packages like RSelenium and httr.

  • R has a flexible data structure in the form of data frames to store heterogeneous data scraped from websites.

  • Scaling up to large data volumes is easy as R integrates well with big data systems like Spark, Hadoop, etc.

So if you are already using R for data analysis, doing the scraping part in R as well keeps everything within one ecosystem.

Installing R and RStudio

Let's first install R and RStudio, which provide the environment to build R programs. Download R from CRAN (https://cran.r-project.org) and RStudio Desktop from https://posit.co/download/rstudio-desktop/, then run the installer for your operating system.

Install Rvest Package

The key package we need for web scraping is rvest. Install it by running this code in RStudio:

install.packages("rvest") 

Load the package:

library(rvest)

This loads all the rvest functions into our R session.
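Note that html_elements(), html_element(), and html_text2(), which we use later in this tutorial, were introduced in rvest 1.0.0, so it is worth confirming you have a recent version:

# Check the installed rvest version (should be 1.0.0 or newer)
packageVersion("rvest")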

HTML Basics

Before we start scraping, let's understand some HTML basics, as that will help in identifying what to extract from web pages.

HTML pages are made up of HTML elements, which are delimited by angle brackets like <p>, <div>, <span>, etc. Elements can be nested within other elements as well.

Important attributes of HTML elements are:

  • id – Unique identifier for the element within the page
  • class – One or more class names for styling purposes
  • href – Hyperlink reference
  • src – Source of image, script, etc.

Elements can also contain text content which is what is displayed to the user.

Here is a simple HTML snippet:

<div id="content">
  <p class="text">
    Hello World! 
  </p>
</div>

This div contains a p element with text "Hello World!". The div has an id "content" while the p has class "text".
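To see how this structure maps to rvest, here is a quick preview that parses the snippet above directly from a string; read_html() and the selector functions are covered in detail in the sections that follow:

# Parse the snippet above directly from a string
snippet <- '<div id="content"><p class="text">Hello World!</p></div>'
doc <- read_html(snippet)

# Select the paragraph by its class and pull out its text
doc %>% html_element("p.text") %>% html_text()
#> [1] "Hello World!"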

With this basic idea of HTML structure, let's start scraping!

Parsing a Web Page

The first step is to download a web page and parse it into an R object that rvest can query.

This is done using the read_html() function. Let's try it on a simple static page:

url <- "https://en.wikipedia.org/wiki/List_of_programming_languages"

page <- read_html(url)

This downloads the Wikipedia page and stores it in the page object.

Note: If you are behind a proxy, you can use the httr::set_config() function to configure the proxy before calling read_html().

library(httr)

set_config(use_proxy(url = "http://proxyserver.example.com", port = 8080, username = "user", password = "password"))

page <- read_html(url)

Querying Page Elements

With the page parsed, we can now query specific parts from it using CSS selectors or XPath expressions.

Using CSS Selectors

CSS selectors identify elements just like in styling web pages with CSS.

For example, to extract all paragraph elements:

paragraphs <- page %>% html_elements("p")

The html_elements() function returns all matching elements as an xml_nodeset (a list-like collection of nodes).

Some other useful CSS selectors:

  • #id – Elements with that specific ID
  • div.class – div elements with that class name
  • a[href] – a elements with an href attribute
  • ul > li – li elements that are direct children of ul

We can get the first matching element using html_element():

first_para <- page %>% html_element("p")
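To illustrate a few of the selectors from the list above, here is a short sketch against the parsed Wikipedia page. The specific selectors (such as the #firstHeading id) are assumptions about that page's current markup and may need adjusting:

# Element with a specific id (assuming Wikipedia's page title uses id="firstHeading")
title <- page %>% html_element("#firstHeading") %>% html_text()

# All anchors that have an href attribute
anchors <- page %>% html_elements("a[href]")

# li elements that are direct children of a ul
items <- page %>% html_elements("ul > li")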

Using XPath Expressions

XPath is a query language for selecting parts of an XML document, and a parsed HTML page can be queried the same way.

Some sample XPath expressions:

  • //div – All div elements anywhere in the document
  • //div[@class='header'] – div elements with class="header"
  • //div/p – p elements that are children of div
  • //a[@href='contact.html'] – a links pointing to the contact.html page

Let's use XPath to get headings from the Wikipedia page:

headings <- page %>% html_elements(xpath = "//h1|//h2|//h3")

This fetches all h1, h2 and h3 elements.
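XPath predicates on attributes work too. As a small sketch (the class name below is an assumption about how Wikipedia marks up external links):

# Anchors whose class attribute is exactly "external text"
ext_links <- page %>% html_elements(xpath = "//a[@class='external text']")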

Extracting Element Attributes

We can extract attributes of elements using the html_attr() function:

links <- page %>% 
  html_elements("a") %>%
  html_attr("href")

This will return all link URLs in the page.

Some other useful attributes that can be extracted are:

  • src of images
  • alt text of images
  • title of elements
  • class and id attributes, etc.
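As a sketch, here is how the image attributes listed above can be pulled from the same page, with relative URLs resolved via xml2 (the package rvest builds on):

# Extract image sources and their alt text
imgs    <- page %>% html_elements("img")
img_src <- imgs %>% html_attr("src")
img_alt <- imgs %>% html_attr("alt")

# Resolve relative URLs against the page URL
img_src_abs <- xml2::url_absolute(img_src, url)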

Extracting Element Text

The text content within an element can be extracted using html_text():

paras <- page %>%
  html_elements("p") %>%
  html_text()

This will extract text from all p elements.
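rvest 1.0 also provides html_text2(), which trims and collapses whitespace roughly the way a browser would, and usually gives cleaner text:

# html_text2() returns the text with whitespace normalized
paras_clean <- page %>%
  html_elements("p") %>%
  html_text2()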

Extracting HTML Tables

Structured data in HTML tables can be extracted into R data frames using html_table():

tables <- page %>% 
  html_elements(xpath="//table") %>%
  html_table()

This will return a list of data frames, one for each table.
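Each entry can then be inspected like any other data frame, for example:

# Look at the first extracted table
first_table <- tables[[1]]
head(first_table)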

Handling JavaScript Pages

Many modern web pages use JavaScript to dynamically render content.

Since read_html() only parses the initial HTML returned by the server, any content rendered by JavaScript will be missing.

To scrape such pages, we need to first render them in a browser so that all JavaScript executes.

This can be done using the RSelenium package, which provides R bindings to the Selenium browser automation tool.

Here is an example to scrape a page using RSelenium:

# Start Selenium driver 
library(RSelenium)
driver <- rsDriver(browser="chrome")
remDr <- driver[["client"]]

# Navigate to page
remDr$navigate("https://example.com") 

# Wait for the page's JavaScript to finish rendering (simple fixed pause; adjust as needed)
Sys.sleep(5)

# Get page source after JavaScript executes
page_source <- remDr$getPageSource()[[1]]

page <- read_html(page_source)

# Further scraping steps...

# Close the browser and stop the Selenium server
remDr$close()
driver$server$stop()

This provides a browser-like environment to execute JavaScript and get the fully rendered page source for scraping.

Selenium provides many other capabilities like clicking elements, entering input, taking screenshots, etc., which are useful for automated data extraction and browser testing.

Storing Scraped Data

For storing the scraped data, we can use R data structures like vectors, lists and data frames.

Data frames allow storing heterogeneous tabular data in a tidy format:

library(dplyr)

df <- tibble(
  title = headings %>% html_text() %>% head(5),
  link = links %>% head(5) 
)

This data can then be processed using dplyr functions or exported to CSV for sharing:

write.csv(df, "scraped_data.csv", row.names = FALSE)
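If you prefer JSON output, the jsonlite package (mentioned below) can write the same data frame out as JSON:

# Export the scraped data as JSON (requires the jsonlite package)
jsonlite::write_json(df, "scraped_data.json")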

With this we have covered the core concepts of web scraping in R with rvest!

The key ideas are:

  • Use read_html() to parse a web page
  • Query elements using CSS selectors or XPath
  • Extract attributes, text, HTML tables into R objects
  • Handle JavaScript pages using RSelenium + Selenium
  • Store scraped data in data frames

Some other helpful packages for web scraping are xml2, XML, httr, stringr, rlist, purrr, and jsonlite. Make sure to check out their documentation for more functionality.

Also brush up on your CSS selector and XPath skills, which are useful for identifying the right elements to extract.

With a little practice you will soon be scraping websites in R like a pro!
