Web scraping is an essential skill for data scientists and analysts to extract valuable insights from the wealth of data available online. R, with its vast collection of statistical and data manipulation packages, is one of the most popular languages used for web scraping. In this comprehensive tutorial, we will learn the basics of web scraping in R using the rvest package.
Why Use R for Web Scraping?
Here are some of the key advantages of using R for web scraping:
- R has a rich ecosystem of packages for data manipulation, analysis, and visualization. After scraping the data, you can seamlessly feed it into other R pipelines for further processing.
- rvest provides an easy-to-use interface for querying HTML and XML documents using CSS selectors and XPath expressions, similar to popular web scraping libraries like Beautiful Soup in Python.
- R has excellent support for handling data formats like CSV, JSON, and XML, which are commonly encountered when scraping the web.
- RStudio provides a user-friendly IDE for developing and debugging scrapers interactively.
- Advanced web scraping features like handling JavaScript-rendered pages, proxies, and browser automation are available via packages such as RSelenium and httr.
- R has a flexible data structure in the form of data frames to store heterogeneous data scraped from websites.
- Scaling up to large data volumes is straightforward, as R integrates well with big data systems like Spark and Hadoop.
So if you are already using R for data analysis, doing the scraping part in R as well keeps everything within one ecosystem.
Installing R and RStudio
Let's first install R and RStudio, which provide the environment to build R programs:
- Go to https://cran.r-project.org/ and download the latest version of R for your operating system.
- Next, install RStudio from https://rstudio.com/products/rstudio/download/. Choose the free Desktop version.
- Launch RStudio after installing it, which will provide the IDE to code R programs.
Installing the rvest Package
The key package we need for web scraping is rvest. Install it by running this code in RStudio:
install.packages("rvest")
Load the package:
library(rvest)
This makes all the rvest functions available in our R session.
HTML Basics
Before we start scraping, let's understand some HTML basics, as that will help in identifying what to extract from web pages.
HTML pages are made up of HTML elements delimited by angle brackets <>, like <p>, <div>, <span>, etc. Elements can be nested within other elements as well.
Important attributes of HTML elements are:
- id – Unique identifier for the element within the page
- class – One or more class names for styling purposes
- href – Hyperlink reference
- src – Source of an image, script, etc.
Elements can also contain text content, which is what is displayed to the user.
Here is a simple HTML snippet:
<div id="content">
  <p class="text">
    Hello World!
  </p>
</div>
This div contains a p element with the text "Hello World!". The div has an id of "content", while the p has the class "text".
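To connect this back to scraping, here is a small sketch that parses the snippet above with rvest and pulls out the text. read_html() also accepts a literal HTML string, and the selector syntax used here is explained in the sections below:
snippet <- read_html('<div id="content"><p class="text">Hello World!</p></div>')
snippet %>% html_element("#content p.text") %>% html_text()
#> [1] "Hello World!"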
With this basic idea of HTML structure, let's start scraping!
Parsing a Web Page
The first step is to download a web page and parse it into an R object that rvest can query.
This is done using the read_html() function. Let's try it on a simple static page:
url <- "https://en.wikipedia.org/wiki/List_of_programming_languages"
page <- read_html(url)
This downloads the Wikipedia page and stores it in the page object.
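As a quick sanity check that the download and parse worked, we can pull out the page title (the querying functions used here are introduced in the next section):
page %>% html_element("title") %>% html_text()
#> Should print something like "List of programming languages - Wikipedia"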
Note: If you are behind a proxy, you can use the httr::set_config() function to configure the proxy before calling read_html().
library(httr)
set_config(use_proxy(url = "http://proxy.example.com", port = 8080, username = "user", password = "pass"))
page <- read_html(url)
Querying Page Elements
With the page parsed, we can now query specific parts from it using CSS selectors or XPath expressions.
Using CSS Selectors
CSS selectors identify elements just like in styling web pages with CSS.
For example, to extract all paragraph elements:
paragraphs <- page %>% html_elements("p")
The html_elements() function fetches all matching elements and returns them as a node set (an xml_nodeset object).
Some other useful CSS selectors (illustrated in the sketch after this list):
- #id – The element with that specific ID
- div.class – div elements with that class name
- a[href] – a elements that have an href attribute
- ul > li – li elements that are direct children of a ul
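Here is a hedged sketch applying these selectors to our parsed page. The id and class names are hypothetical placeholders; substitute ones that actually exist on your target page:
# Illustrative selector queries; "#content" and "div.hatnote" are hypothetical
by_id       <- page %>% html_element("#content")
by_class    <- page %>% html_elements("div.hatnote")
with_href   <- page %>% html_elements("a[href]")
direct_kids <- page %>% html_elements("ul > li")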
We can get the first matching element using html_element():
first_para <- page %>% html_element("p")
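For comparison, the plural variant returns every match, which we can count:
# html_element() gives the first match; html_elements() gives them all
all_paras <- page %>% html_elements("p")
length(all_paras)  # number of matching <p> elements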
Using XPath Expressions
XPath is a query language for selecting parts of an XML document, which HTML essentially is.
Some sample XPath expressions:
- //div – All div elements anywhere in the document
- //div[@class='header'] – div elements with class='header'
- //div/p – p elements that are children of div
- //a[@href='contact.html'] – a links to contact.html
Let's use XPath to get the headings from the Wikipedia page:
headings <- page %>% html_elements(xpath = "//h1|//h2|//h3")
This fetches all h1, h2, and h3 elements.
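To peek at what came back (html_text() is covered in a later section):
headings %>% html_text() %>% head(3)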
Extracting Element Attributes
We can extract attributes of elements using the html_attr() function:
links <- page %>%
html_elements("a") %>%
html_attr("href")
This will return all link URLs in the page.
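Many of these hrefs are relative paths. One hedged way to normalize them is rvest's url_absolute(), dropping missing values first:
# Convert relative hrefs to absolute URLs against the page we scraped
full_links <- url_absolute(links[!is.na(links)], base = url)
head(full_links)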
Some other useful attributes that can be extracted are:
- src of images
- alt text of images
- title of elements
- class, id attributes, etc.
Extracting Element Text
The text content within an element can be extracted using html_text():
paras <- page %>%
html_elements("p") %>%
html_text()
This will extract the text from all p elements.
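If you are on rvest 1.0 or later, html_text2() is often the better choice, as it collapses whitespace much like a browser does:
paras_clean <- page %>%
  html_elements("p") %>%
  html_text2()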
Extracting HTML Tables
Structured data in HTML tables can be extracted into R data frames using html_table():
tables <- page %>%
html_elements(xpath="//table") %>%
html_table()
This will return a list of data frames, one for each table.
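Each element of the result is a data frame (a tibble on rvest 1.0+), which you can pull out by position:
# Inspect the first table on the page (which table you want is page-specific)
first_table <- tables[[1]]
head(first_table)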
Handling JavaScript Pages
Many modern web pages use JavaScript to dynamically render content.
Since read_html() only parses the initial HTML returned by the server, any content rendered by JavaScript will be missing.
To scrape such pages, we need to first render them in a browser so that all JavaScript executes.
This can be done using the RSelenium package, which provides R bindings to the Selenium browser automation tool.
Here is an example to scrape a page using RSelenium:
# Start Selenium driver (this also launches a local Selenium server)
library(RSelenium)
driver <- rsDriver(browser = "chrome")
remDr <- driver[["client"]]
# Navigate to page
remDr$navigate("https://example.com")
# Give the page time for JavaScript to render
# (a fixed sleep is crude but portable; see the polling sketch below)
Sys.sleep(5)
# Get page source after JavaScript executes
page_source <- remDr$getPageSource()[[1]]
page <- read_html(page_source)
# Further scraping steps...
# Close the browser and stop the Selenium server
remDr$close()
driver[["server"]]$stop()
This provides a browser-like environment to execute JavaScript and get the fully rendered page source for scraping.
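Rather than sleeping for a fixed time, a more robust pattern is to poll until a known element appears. A minimal sketch, assuming you know a CSS selector that only exists once the page has rendered (the "#content" selector below is a placeholder):
# Poll until an element matching `css` appears, or give up after `timeout` seconds
wait_for_element <- function(remDr, css, timeout = 10) {
  for (i in seq_len(timeout)) {
    found <- remDr$findElements(using = "css selector", value = css)
    if (length(found) > 0) return(invisible(TRUE))
    Sys.sleep(1)
  }
  stop("Timed out waiting for: ", css)
}
wait_for_element(remDr, "#content")  # "#content" is a hypothetical selector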
Selenium provides many other capabilities like clicking, input handling, screenshots, etc., which are useful for automated data extraction and browser testing.
Storing Scraped Data
For storing the scraped data, we can use R data structures like vectors, lists and data frames.
Data frames allow storing heterogeneous tabular data in a tidy format:
library(dplyr)
# tibble() supersedes the deprecated data_frame()
df <- tibble(
  title = headings %>% html_text() %>% head(5),
  link = links %>% head(5)
)
This data can then be processed using dplyr functions or exported to CSV for sharing:
write.csv(df, "scraped_data.csv", row.names = FALSE)
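The same data frame can also be exported as JSON via the jsonlite package (mentioned below), for example:
# Write the scraped data as a JSON file
jsonlite::write_json(df, "scraped_data.json")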
With this we have covered the core concepts of web scraping in R with rvest!
The key ideas are:
- Use read_html() to parse a web page
- Query elements using CSS selectors or XPath
- Extract attributes, text, and HTML tables into R objects
- Handle JavaScript pages using RSelenium + Selenium
- Store scraped data in data frames
Some other helpful packages for web scraping are XML, httr, stringr, rlist, purrr, jsonlite, etc. Make sure to check out their documentation for more functionality.
Also brush up on your CSS selector and XPath skills, which are useful for identifying the right elements to extract.
With a little practice you will soon be scraping websites in R like a pro!