Golang Web Scraper Tutorial: A Complete Guide to Building a Fast Web Scraper in Go

Web scraping is the process of extracting data from websites through automated scripts or bots. With the rise of dynamic, JavaScript-heavy websites, web scraping has become more challenging. However, the Go language provides a fast and efficient way to scrape even complex sites.

In this comprehensive tutorial, we will walk through building a web scraper in Golang step-by-step. We will cover key concepts like sending requests, parsing HTML, handling pagination, proxies, and more. By the end, you will have the skills to build a production-ready scraper in Go.

Why Use Golang for Web Scraping?

Here are some of the key advantages of using Golang for web scraping:

  • Speed: Go is a compiled language and runs very fast, making it well suited to performance-critical tasks like web scraping. It's often cited as 10-40x faster than interpreted languages like Python for CPU-bound work.

  • Concurrency: Go has built-in concurrency using goroutines and channels. This makes it easy to scrape multiple pages asynchronously.

  • Portability: Go compiles into a standalone binary that runs without external dependencies, and you can cross-compile it for other platforms. You don't need to worry about runtimes or environments.

  • Scalability: Go was designed for networking and multiprocessing. It's easy to distribute scraping over multiple machines.

  • Simplicity: Go has a simple and clean syntax without too many magic methods or functional programming concepts. This makes the code readable.

  • Rich Libraries: Go has excellent libraries like Colly, GoQuery, gocron, etc. that simplify web scraping.

Prerequisites

To follow this Golang web scraping tutorial, you will need:

  • Go installed on your machine (version 1.11+ recommended)
  • A text editor like VS Code or Sublime Text
  • Basic knowledge of Go syntax and standard library
  • Basic understanding of web scraping concepts like HTTP, HTML, CSS selectors

We'll be scraping books.toscrape.com, a sample bookstore website for demo purposes.

Step 1 – Set up a Go Project

Let's start by creating a new directory for our project:

mkdir golang-scraper
cd golang-scraper

Next, initialize a new Go module which will manage dependencies:

go mod init scraper

This will create a go.mod file with the module name.
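
Colly is a third-party package, so it also needs to be added to the module before it can be imported (running go mod tidy after adding the import works as well):

go get github.com/gocolly/colly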

Step 2 – Import Packages

Open a new main.go file and import the following packages:

package main

import (
  "encoding/csv"
  "fmt"
  "log"
  "os"

  "github.com/gocolly/colly"
)
  • encoding/csv – for writing scraped data to CSV file
  • fmt – for printing output
  • log – for logging errors
  • os – for file operations
  • github.com/gocolly/colly – scraping framework

We'll discuss the usage of each package as we go along.

Step 3 – Instantiate Collector

The engine that sends HTTP requests and crawls pages in Colly is called the Collector. Let's create one:

c := colly.NewCollector()

This creates a Collector with default configuration. We can also pass different options to customize it.

For example, to restrict scraping to a single domain:

c := colly.NewCollector(
  colly.AllowedDomains("books.toscrape.com"),
)
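
A few other options can be passed the same way; for example, to set a fixed User-Agent string and cap how deep the crawler follows links (the values here are just illustrative):

c := colly.NewCollector(
  colly.AllowedDomains("books.toscrape.com"),
  colly.UserAgent("my-scraper/1.0"),
  colly.MaxDepth(2), // only follow links two levels deep
)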

Step 4 – Attach Callback Functions

Colly uses callbacks to trigger logic on certain events like requests and responses.

Let's print each URL visited:

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Visiting", r.URL)
})

Similarly, we can log each response:

c.OnResponse(func(r *colly.Response) {
  fmt.Println(r.StatusCode)
})

Callbacks will be triggered automatically as the Collector visits URLs.
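
It is also worth attaching an error callback so failed requests are not silently dropped; Colly's OnError hook receives the response and the error:

c.OnError(func(r *colly.Response, err error) {
  log.Println("Request to", r.Request.URL, "failed:", err)
})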

Step 5 – Parse HTML

To extract data, we need to traverse the HTML DOM and find elements. Colly's OnHTML callback fires for every element that matches a CSS selector and hands us an HTMLElement backed by goquery, which we can use to query the HTML.

Let's try selecting the page title:

c.OnHTML("title", func(e *colly.HTMLElement) {
  fmt.Println(e.Text) 
})

We can also use CSS selectors like ".product_pod" (the class each book card uses on books.toscrape.com) to identify elements.
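
When the helper methods are not enough, the element's DOM field exposes the underlying goquery selection directly. A minimal sketch (the star-rating markup is specific to books.toscrape.com):

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
  // e.DOM is a *goquery.Selection, so the full goquery API is available
  if classes, ok := e.DOM.Find("p.star-rating").Attr("class"); ok {
    fmt.Println("Rating classes:", classes)
  }
})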

Step 6 – Extract Data

With HTML parsing set up, let's populate a struct with extracted data:

type Book struct {
  Title string
  Price string 
}

c.OnHTML(".product-pod", func(e *colly.HTMLElement) {

  book := Book{}

  // Extract title
  book.Title = e.ChildAttr("img", "alt") 

  // Extract price
  book.Price = e.ChildText(".price_color")

  fmt.Printf("Found book: %s (%s)\n", book.Title, book.Price)

})

We locate the title and price values and assign them to the book struct.
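
If you prefer to collect results in memory first, a minimal sketch with the default (synchronous) Collector is to append to a slice; with async mode (shown later) you would need a mutex around the append:

var books []Book

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
  books = append(books, Book{
    Title: e.ChildAttr("img", "alt"),
    Price: e.ChildText(".price_color"),
  })
})

// once c.Visit(...) returns, books holds every scraped book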

Step 7 – Store Scraped Data

To store the scraped data, we will write it to a CSV file.

First, create a file for writing:

file, err := os.Create("books.csv")
if err != nil {
  log.Fatal("Could not create file:", err)
}
defer file.Close()

writer := csv.NewWriter(file)
defer writer.Flush()

Next, write the header row:

headers := []string{"Title", "Price"}
writer.Write(headers)

Finally, in the OnHTML callback, convert the book struct to a string slice and write it as a CSV row:

row := []string{book.Title, book.Price}
writer.Write(row) 

This will continuously append books to the CSV as they are scraped.
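
CSV is not the only format: the standard encoding/json package can dump the in-memory books slice from the earlier sketch to a JSON file instead (add "encoding/json" to the imports; os.WriteFile needs Go 1.16+):

// run after crawling has finished
data, err := json.MarshalIndent(books, "", "  ")
if err != nil {
  log.Fatal(err)
}
if err := os.WriteFile("books.json", data, 0644); err != nil {
  log.Fatal(err)
}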

Step 8 – Crawl Pages

We are ready to start crawling. The Visit method kicks off the scraping of a URL:

c.Visit("https://books.toscrape.com/")

This will crawl the books homepage and trigger our callbacks to extract data.

To scrape multiple pages, we can find the next-page link using a selector like .next > a and call Visit on it from inside the callback:

c.OnHTML(".next > a", func(e *colly.HTMLElement) {
  nextPage := e.Request.AbsoluteURL(e.Attr("href"))
  c.Visit(nextPage)
})

This will automatically scrape through all pages.
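
By default the Collector visits one page at a time. To lean on Go's concurrency, you can enable Colly's async mode and pair it with a rate limit; a minimal sketch (remember to import "time"):

c := colly.NewCollector(
  colly.AllowedDomains("books.toscrape.com"),
  colly.Async(true), // requests run in parallel goroutines
)

c.Limit(&colly.LimitRule{
  DomainGlob:  "books.toscrape.com",
  Parallelism: 2,               // at most two concurrent requests
  RandomDelay: 1 * time.Second, // plus a random pause between them
})

c.Visit("https://books.toscrape.com/")
c.Wait() // block until every queued request has finished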

Step 9 – Customize Configuration (Optional)

Some ways we can further customize the scraper:

Set a Timeout

c.SetRequestTimeout(30 * time.Second) // requires importing "time"

Randomize User-Agent

extensions.RandomUserAgent(c) // from "github.com/gocolly/colly/extensions"

Use Proxies

c.SetProxy("http://IP:PORT")

To rotate several proxies, combine proxy.RoundRobinProxySwitcher (from github.com/gocolly/colly/proxy) with c.SetProxyFunc.

Limit Scraping Rate

c.Limit(&colly.LimitRule{
  DomainGlob:  "*",
  RandomDelay: 2 * time.Second,
})

Cache Responses

c.CacheDir = "./cache"

Refer to the docs for more details.
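
Putting the main steps together, here is a minimal end-to-end sketch of main.go, with pagination and CSV output and only basic error handling:

package main

import (
  "encoding/csv"
  "fmt"
  "log"
  "os"

  "github.com/gocolly/colly"
)

type Book struct {
  Title string
  Price string
}

func main() {
  // prepare the CSV output
  file, err := os.Create("books.csv")
  if err != nil {
    log.Fatal("Could not create file:", err)
  }
  defer file.Close()

  writer := csv.NewWriter(file)
  defer writer.Flush()
  writer.Write([]string{"Title", "Price"})

  // collector restricted to the demo site
  c := colly.NewCollector(
    colly.AllowedDomains("books.toscrape.com"),
  )

  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  // one book per product card
  c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
    book := Book{
      Title: e.ChildAttr("img", "alt"),
      Price: e.ChildText(".price_color"),
    }
    writer.Write([]string{book.Title, book.Price})
  })

  // follow pagination links
  c.OnHTML(".next > a", func(e *colly.HTMLElement) {
    c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
  })

  c.Visit("https://books.toscrape.com/")
}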

And that's it! We have built a complete web scraper in Go using Colly.

The full code is available on GitHub.

Bonus: Schedule the Scraper with Cron

To run the scraper automatically on a schedule, we can use the github.com/robfig/cron package, an in-process scheduler that understands cron expressions (alternatively, the OS cron daemon can simply run the compiled binary).

First install the package:

go get github.com/robfig/cron

Then wrap the scraping logic in a function the scheduler can call:

func runScraper() {
  c := createCollector() // collector setup from the steps above
  c.Visit("https://books.toscrape.com/")
}

Finally, create the scheduler in main and register the job:

func main() {
  scheduler := cron.New()
  scheduler.AddFunc("@daily", runScraper) // run once per day
  scheduler.Run() // blocks and fires jobs on schedule
}

This will run the scraper once a day for as long as the process stays alive.

Conclusion

And that's a wrap! We went through all the steps to build a web scraper in Go using Colly – from sending requests to storing data.

The key takeaways are:

  • Use Colly for easy scraping powered by Go's speed and concurrency
  • Attach callbacks for parsing HTML and extracting data
  • Handle pagination by recursively visiting URLs
  • Customize configuration like timeouts and user-agents
  • Store scraped data in CSV/JSON format
  • Schedule periodic runs with cron

Golang is a versatile language for web scraping. With its fast performance and strong ecosystem, you can build production-grade crawlers.

Thanks for reading! Let me know if you have any other Go scraping topics you would like me to cover.
