Web scraping is the process of extracting data from websites through automated scripts or bots. With the rise of dynamic, JavaScript-heavy websites, web scraping has become more challenging. However, the Go language provides a fast and efficient way to scrape even complex sites.
In this comprehensive tutorial, we will walk through building a web scraper in Golang step-by-step. We will cover key concepts like sending requests, parsing HTML, handling pagination, proxies, and more. By the end, you will have the skills to build a production-ready scraper in Go.
Why Use Golang for Web Scraping?
Here are some of the key advantages of using Golang for web scraping:
- Speed: Go is a compiled language and runs very fast, making it well suited to performance-critical tasks like web scraping. It is typically much faster than interpreted languages like Python for CPU-bound work.
- Concurrency: Go has built-in concurrency via goroutines and channels, which makes it easy to scrape multiple pages in parallel (see the sketch after this list).
- Portability: Go compiles into a standalone binary that runs on any platform, so you don't need to worry about dependencies or environments.
- Scalability: Go was designed for networking and multiprocessing, so it's easy to distribute scraping over multiple machines.
- Simplicity: Go has a simple, clean syntax without magic methods or heavy functional-programming concepts, which keeps the code readable.
- Rich libraries: Go has excellent libraries like Colly, GoQuery, and gocron that simplify web scraping.
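To make the concurrency point concrete, here is a minimal sketch (the URLs are placeholders for illustration) that fetches several pages in parallel using goroutines and a sync.WaitGroup:

package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Placeholder URLs for illustration only.
	urls := []string{
		"https://example.com/page/1",
		"https://example.com/page/2",
		"https://example.com/page/3",
	}

	var wg sync.WaitGroup
	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println("error fetching", u, ":", err)
				return
			}
			defer resp.Body.Close()
			fmt.Println(u, "->", resp.Status)
		}(url)
	}
	wg.Wait() // wait for all fetches to finish
}

Each page is fetched in its own goroutine, so total runtime is roughly that of the slowest request rather than the sum of all of them.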
Prerequisites
To follow this Golang web scraping tutorial, you will need:
- Go installed on your machine (version 1.11+ recommended)
- A text editor like VS Code or Sublime Text
- Basic knowledge of Go syntax and standard library
- Basic understanding of web scraping concepts like HTTP, HTML, CSS selectors
We'll be scraping books.toscrape.com, a sample bookstore website for demo purposes.
Step 1 – Set up a Go Project
Let's start by creating a new directory for our project:
mkdir golang-scraper
cd golang-scraper
Next, initialize a new Go module which will manage dependencies:
go mod init scraper
This will create a go.mod file with the module name.
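For reference, the generated go.mod is tiny and will look roughly like this (the Go version line depends on your installed toolchain):

module scraper

go 1.21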
Step 2 – Import Packages
First, fetch the Colly dependency so the import below resolves:
go get github.com/gocolly/colly
Then open a new main.go file and import the following packages:
package main
import (
"encoding/csv"
"fmt"
"log"
"os"
"github.com/gocolly/colly"
)
- encoding/csv – for writing scraped data to a CSV file
- fmt – for printing output
- log – for logging errors
- os – for file operations
- github.com/gocolly/colly – the scraping framework
We'll discuss the usage of each package as we go along.
Step 3 – Instantiate Collector
The engine that sends HTTP requests and crawls pages in Colly is called the Collector. Let's create one:
c := colly.NewCollector()
This creates a Collector with default configuration. We can also pass different options to customize it.
For example, to restrict scraping to a single domain:
c := colly.NewCollector(
colly.AllowedDomains("books.toscrape.com"),
)
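Colly accepts other functional options as well. For example, to set a custom User-Agent string (the value below is just a placeholder) and cap how many links deep the crawl may go:

c := colly.NewCollector(
	colly.AllowedDomains("books.toscrape.com"),
	colly.UserAgent("my-scraper/1.0"), // placeholder UA string
	colly.MaxDepth(2),                 // follow links at most 2 levels deep
)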
Step 4 – Attach Callback Functions
Colly uses callbacks to trigger logic on certain events like requests and responses.
Let's print each URL visited:
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
Similarly, we can log each response:
c.OnResponse(func(r *colly.Response) {
fmt.Println(r.StatusCode)
})
Callbacks will be triggered automatically as the Collector visits URLs.
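It's also good practice to attach an OnError callback so failed requests don't pass silently:

c.OnError(func(r *colly.Response, err error) {
	log.Println("request to", r.Request.URL, "failed:", err)
})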
Step 5 – Parse HTML
To extract data, we need to traverse the HTML DOM and find elements. Colly's OnHTML callback hands us an HTMLElement backed by goquery, which we can query with CSS selectors. Let's try selecting the page title:
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Println(e.Text)
})
We can also use CSS class selectors like .product_pod (the class Books to Scrape uses for each book listing) to identify elements.
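When a plain selector isn't enough, each HTMLElement also exposes the underlying goquery selection via e.DOM, so the full goquery API is available. As a sketch, on books.toscrape.com each book link is an h3 a element carrying the title in its title attribute:

// requires: import "github.com/PuerkitoBio/goquery"
c.OnHTML("body", func(e *colly.HTMLElement) {
	// e.DOM is a *goquery.Selection, so Find/Each/AttrOr all work.
	e.DOM.Find("h3 a").Each(func(_ int, s *goquery.Selection) {
		fmt.Println(s.AttrOr("title", ""))
	})
})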
Step 6 – Extract Data
With HTML parsing set up, let's populate a struct with extracted data:
type Book struct {
Title string
Price string
}
c.OnHTML(".product-pod", func(e *colly.HTMLElement) {
book := Book{}
// Extract title
book.Title = e.ChildAttr("img", "alt")
// Extract price
book.Price = e.ChildText(".price_color")
fmt.Printf("Found book: %s (%s)\n", book.Title, book.Price)
})
We locate the title and price values and assign them to the book struct.
Step 7 – Store Scraped Data
To store the scraped data, we will write it to a CSV file.
First, create a file for writing:
file, err := os.Create("books.csv")
if err != nil {
	log.Fatal(err)
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
Next, write the header row:
headers := []string{"Title", "Price"}
writer.Write(headers)
Finally, in the OnHTML callback, convert the book struct to a slice and write it as a CSV row:
row := []string{book.Title, book.Price}
writer.Write(row)
This will continuously append books to the CSV as they are scraped.
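One caveat: csv.Writer is not safe for concurrent use. The scraper in this tutorial runs synchronously, so this is fine as-is, but if you later enable colly.Async(true), guard the writes with a sync.Mutex, for example:

var mu sync.Mutex // protects writer when callbacks run concurrently

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
	row := []string{e.ChildAttr("img", "alt"), e.ChildText(".price_color")}
	mu.Lock()
	writer.Write(row)
	mu.Unlock()
})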
Step 8 – Crawl Pages
We are ready to start crawling. The Visit method kicks off the scraping of a URL:
c.Visit("https://books.toscrape.com/")
This will crawl the books homepage and trigger our callbacks to extract data.
To scrape multiple pages, we can find the next-page link using a selector like .next > a and recursively call Visit:
c.OnHTML(".next > a", func(e *colly.HTMLElement) {
nextPage := e.Request.AbsoluteURL(e.Attr("href"))
c.Visit(nextPage)
})
This will automatically scrape through all pages.
Step 9 – Customize Configuration (Optional)
Some ways we can further customize the scraper:
Set a Timeout
c.SetRequestTimeout(30 * time.Second)
Randomize User-Agent
// from github.com/gocolly/colly/extensions
extensions.RandomUserAgent(c)
Use Proxies
rp, _ := proxy.RoundRobinProxySwitcher("http://IP:PORT") // from github.com/gocolly/colly/proxy
c.SetProxyFunc(rp)
Limit Scraping Rate
c.Limit(&colly.LimitRule{
DomainGlob: "*",
RandomDelay: 2 * time.Second,
})
Cache Responses
c.CacheDir = "./cache"
Refer to the docs for more details.
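Putting it all together, here is a minimal end-to-end version of the scraper built from the steps above (a sketch, with error handling kept brief):

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"

	"github.com/gocolly/colly"
)

type Book struct {
	Title string
	Price string
}

func main() {
	// Create the CSV output file and writer.
	file, err := os.Create("books.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()
	writer.Write([]string{"Title", "Price"})

	// Restrict the crawl to the demo bookstore.
	c := colly.NewCollector(
		colly.AllowedDomains("books.toscrape.com"),
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// Extract each book listing and append it to the CSV.
	c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
		book := Book{
			Title: e.ChildAttr("img", "alt"),
			Price: e.ChildText(".price_color"),
		}
		writer.Write([]string{book.Title, book.Price})
	})

	// Follow the "next" pagination link until the last page.
	c.OnHTML(".next > a", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.Visit("https://books.toscrape.com/")
}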
And that's it! We have built a complete web scraper in Go using Colly.
The full code is available on GitHub.
Bonus: Schedule the Scraper with Cron
To run the scraper automatically on a schedule, we can use the github.com/robfig/cron package, an in-process cron-style scheduler (no OS cron daemon required).
First install the package:
go get github.com/robfig/cron
Then move the scraper logic into its own function so the scheduler can invoke it:
func runScraper() {
	c := createCollector() // collector setup from the steps above
	c.Visit("https://books.toscrape.com/")
}
Finally, schedule it from main:
func main() {
	cr := cron.New()
	cr.AddFunc("@daily", runScraper)
	cr.Run() // blocks, firing runScraper once per day
}
This runs the scraper once per day; adjust the spec string (for example "@hourly", or a standard cron expression) for other schedules.
Conclusion
And that's a wrap! We went through all the steps to build a web scraper in Go using Colly – from sending requests to storing data.
The key takeaways are:
- Use Colly for easy scraping powered by Go's speed and concurrency
- Attach callbacks for parsing HTML and extracting data
- Handle pagination by recursively visiting URLs
- Customize configuration like timeouts and user-agents
- Store scraped data in CSV/JSON format
- Schedule periodic runs with cron
Golang is a versatile language for web scraping. With its fast performance and strong ecosystem, you can build production-grade crawlers.
Thanks for reading! Let me know if you have any other Go scraping topics you would like me to cover.