Web scraping is the process of extracting data from websites through automated scripts or bots. With the rise of dynamic, JavaScript-heavy websites, web scraping has become more challenging. However, the Go language provides a fast and efficient way to scrape even complex sites.
In this comprehensive tutorial, we will walk through building a web scraper in Golang step-by-step. We will cover key concepts like sending requests, parsing HTML, handling pagination, proxies, and more. By the end, you will have the skills to build a production-ready scraper in Go.
Why Use Golang for Web Scraping?
Here are some of the key advantages of using Golang for web scraping:
- Speed: Go is a compiled language and runs very fast, making it well suited to performance-critical tasks like web scraping. It is typically much faster than interpreted languages like Python for CPU-bound work.
- Concurrency: Go has built-in concurrency via goroutines and channels, which makes it easy to scrape multiple pages in parallel (see the sketch after this list).
- Portability: Go compiles into a standalone binary that runs on any platform, so you don't need to worry about dependencies or environments.
- Scalability: Go was designed for networking and multiprocessing, so it's easy to distribute scraping over multiple machines.
- Simplicity: Go has a simple, clean syntax without magic methods or heavy functional-programming concepts, which keeps the code readable.
- Rich libraries: Go has excellent libraries like Colly, GoQuery, and gocron that simplify web scraping.
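To make the concurrency point concrete, here is a minimal sketch (the URLs are placeholders for illustration) that fetches several pages in parallel using goroutines and a sync.WaitGroup:

package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Placeholder URLs for illustration only.
	urls := []string{
		"https://example.com/page/1",
		"https://example.com/page/2",
		"https://example.com/page/3",
	}

	var wg sync.WaitGroup
	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println("error fetching", u, ":", err)
				return
			}
			defer resp.Body.Close()
			fmt.Println(u, "->", resp.Status)
		}(url)
	}
	wg.Wait() // wait for all fetches to finish
}

Each page is fetched in its own goroutine, so total runtime is roughly that of the slowest request rather than the sum of all of them.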
Prerequisites
To follow this Golang web scraping tutorial, you will need:
- Go installed on your machine (version 1.11+ recommended)
- A text editor like VS Code or Sublime Text
- Basic knowledge of Go syntax and standard library
- Basic understanding of web scraping concepts like HTTP, HTML, CSS selectors
We'll be scraping books.toscrape.com, a sample bookstore website for demo purposes.
Step 1 – Set up a Go Project
Let's start by creating a new directory for our project:
mkdir golang-scraper
cd golang-scraper
Next, initialize a new Go module which will manage dependencies:
go mod init scraper
This will create a go.mod file with the module name.
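For reference, the generated go.mod is tiny and will look roughly like this (the Go version line depends on your installed toolchain):

module scraper

go 1.21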
Step 2 – Import Packages
First, fetch the Colly dependency so the import below resolves:
go get github.com/gocolly/colly
Then open a new main.go file and import the following packages:
package main
import (
"encoding/csv"
"fmt"
"log"
"os"
"github.com/gocolly/colly"
)
- encoding/csv – for writing scraped data to a CSV file
- fmt – for printing output
- log – for logging errors
- os – for file operations
- github.com/gocolly/colly – the scraping framework
We'll discuss the usage of each package as we go along.
Step 3 – Instantiate Collector
The engine that sends HTTP requests and crawls pages in Colly is called the Collector. Let's create one:
c := colly.NewCollector()
This creates a Collector with default configuration. We can also pass different options to customize it.
For example, to restrict scraping to a single domain:
c := colly.NewCollector(
colly.AllowedDomains("books.toscrape.com"),
)
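Colly accepts other functional options as well. For example, to set a custom User-Agent string (the value below is just a placeholder) and cap how many links deep the crawl may go:

c := colly.NewCollector(
	colly.AllowedDomains("books.toscrape.com"),
	colly.UserAgent("my-scraper/1.0"), // placeholder UA string
	colly.MaxDepth(2),                 // follow links at most 2 levels deep
)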
Step 4 – Attach Callback Functions
Colly uses callbacks to trigger logic on certain events like requests and responses.
Let's print each URL visited:
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
Similarly, we can log each response:
c.OnResponse(func(r *colly.Response) {
fmt.Println(r.StatusCode)
})
Callbacks will be triggered automatically as the Collector visits URLs.
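It's also good practice to attach an OnError callback so failed requests don't pass silently:

c.OnError(func(r *colly.Response, err error) {
	log.Println("request to", r.Request.URL, "failed:", err)
})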
Step 5 – Parse HTML
To extract data, we need to traverse the HTML DOM and find elements. Colly's OnHTML callback hands us an HTMLElement backed by goquery, which we can query with CSS selectors. Let's try selecting the page title:
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Println(e.Text)
})
We can also use CSS class selectors like .product_pod (the class Books to Scrape uses for each book listing) to identify elements.
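When a plain selector isn't enough, each HTMLElement also exposes the underlying goquery selection via e.DOM, so the full goquery API is available. As a sketch, on books.toscrape.com each book link is an h3 a element carrying the title in its title attribute:

// requires: import "github.com/PuerkitoBio/goquery"
c.OnHTML("body", func(e *colly.HTMLElement) {
	// e.DOM is a *goquery.Selection, so Find/Each/AttrOr all work.
	e.DOM.Find("h3 a").Each(func(_ int, s *goquery.Selection) {
		fmt.Println(s.AttrOr("title", ""))
	})
})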
Step 6 – Extract Data
With HTML parsing set up, let's populate a struct with extracted data:
type Book struct {
Title string
Price string
}
c.OnHTML(".product-pod", func(e *colly.HTMLElement) {
book := Book{}
// Extract title
book.Title = e.ChildAttr("img", "alt")
// Extract price
book.Price = e.ChildText(".price_color")
fmt.Printf("Found book: %s (%s)\n", book.Title, book.Price)
})
We locate the title and price values and assign them to the book struct.
Step 7 – Store Scraped Data
To store the scraped data, we will write it to a CSV file.
First, create a file for writing:
file, err := os.Create("books.csv")
if err != nil {
	log.Fatal(err)
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
Next, write the header row:
headers := []string{"Title", "Price"}
writer.Write(headers)
Finally, in the OnHTML callback, convert the book struct to a slice and write it as a CSV row:
row := []string{book.Title, book.Price}
writer.Write(row)
This will continuously append books to the CSV as they are scraped.
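One caveat: csv.Writer is not safe for concurrent use. The scraper in this tutorial runs synchronously, so this is fine as-is, but if you later enable colly.Async(true), guard the writes with a sync.Mutex, for example:

var mu sync.Mutex // protects writer when callbacks run concurrently

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
	row := []string{e.ChildAttr("img", "alt"), e.ChildText(".price_color")}
	mu.Lock()
	writer.Write(row)
	mu.Unlock()
})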
Step 8 – Crawl Pages
We are ready to start crawling. The Visit method kicks off the scraping of a URL:
c.Visit("https://books.toscrape.com/")
This will crawl the books homepage and trigger our callbacks to extract data.
To scrape multiple pages, we can find the next-page link using a selector like .next > a and recursively call Visit:
c.OnHTML(".next > a", func(e *colly.HTMLElement) {
nextPage := e.Request.AbsoluteURL(e.Attr("href"))
c.Visit(nextPage)
})
This will automatically scrape through all pages.
Step 9 – Customize Configuration (Optional)
Some ways we can further customize the scraper:
Set a Timeout
c.SetRequestTimeout(30 * time.Second)
Randomize User-Agent
// from github.com/gocolly/colly/extensions
extensions.RandomUserAgent(c)
Use Proxies
rp, _ := proxy.RoundRobinProxySwitcher("http://IP:PORT") // from github.com/gocolly/colly/proxy
c.SetProxyFunc(rp)
Limit Scraping Rate
c.Limit(&colly.LimitRule{
DomainGlob: "*",
RandomDelay: 2 * time.Second,
})
Cache Responses
c.CacheDir = "./cache"
Refer to the docs for more details.
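Putting it all together, here is a minimal end-to-end version of the scraper built from the steps above (a sketch, with error handling kept brief):

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"

	"github.com/gocolly/colly"
)

type Book struct {
	Title string
	Price string
}

func main() {
	// Create the CSV output file and writer.
	file, err := os.Create("books.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()
	writer.Write([]string{"Title", "Price"})

	// Restrict the crawl to the demo bookstore.
	c := colly.NewCollector(
		colly.AllowedDomains("books.toscrape.com"),
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// Extract each book listing and append it to the CSV.
	c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
		book := Book{
			Title: e.ChildAttr("img", "alt"),
			Price: e.ChildText(".price_color"),
		}
		writer.Write([]string{book.Title, book.Price})
	})

	// Follow the "next" pagination link until the last page.
	c.OnHTML(".next > a", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.Visit("https://books.toscrape.com/")
}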
And that's it! We have built a complete web scraper in Go using Colly.
The full code is available on GitHub.
Bonus: Schedule the Scraper with Cron
To run the scraper automatically on a schedule, we can use the github.com/robfig/cron package, an in-process cron-style scheduler (no OS cron daemon required).
First install the package:
go get github.com/robfig/cron
Then move the scraper logic into its own function so the scheduler can invoke it:
func runScraper() {
	c := createCollector() // collector setup from the steps above
	c.Visit("https://books.toscrape.com/")
}
Finally, schedule it from main:
func main() {
	cr := cron.New()
	cr.AddFunc("@daily", runScraper)
	cr.Run() // blocks, firing runScraper once per day
}
This runs the scraper once per day; adjust the spec string (for example "@hourly", or a standard cron expression) for other schedules.
Conclusion
And that's a wrap! We went through all the steps to build a web scraper in Go using Colly – from sending requests to storing data.
The key takeaways are:
- Use Colly for easy scraping powered by Go's speed and concurrency
- Attach callbacks for parsing HTML and extracting data
- Handle pagination by recursively visiting URLs
- Customize configuration like timeouts and user-agents
- Store scraped data in CSV/JSON format
- Schedule periodic runs with cron
Golang is a versatile language for web scraping. With its fast performance and strong ecosystem, you can build production-grade crawlers.
Thanks for reading! Let me know if you have any other Go scraping topics you would like me to cover.