Wget is a powerful command-line utility for downloading files from the internet. With the ability to download entire websites, resume interrupted transfers, and utilize proxies, Wget is an indispensable tool for web scraping and automation.
In this comprehensive guide, we will go through everything you need to know to use Wget effectively with proxies for web scraping and automation.
What is Wget?
Wget is an open-source command-line utility used to download files from the internet via HTTP, HTTPS, and FTP protocols. Developed in 1996, Wget is considered one of the oldest internet "spider" utilities and is built into most Linux distributions.
Some of the key features of Wget include:
- Downloading files recursively from a web server
- Resuming interrupted downloads
- Password-protected downloading
- Mirroring entire websites
- Configurable with .wgetrc file
- Proxy support
Wget is extremely useful for web scraping because of its ability to recursively download websites and resume interrupted downloads – two essential capabilities when scraping large sites.
Installing Wget
Because it's included in most Linux distributions, chances are Wget is already installed on your system. To confirm, simply run:
wget --version
If Wget is not already installed, use your distribution's package manager to install it:
Debian/Ubuntu
sudo apt install wget
Fedora
sudo dnf install wget
CentOS/RHEL
sudo yum install wget
Arch Linux
sudo pacman -S wget
macOS
brew install wget
Wget is also available for Windows, either through Cygwin or as a standalone executable.
Wget Basics
The basic syntax for Wget is simple:
wget [options] [URL]
To download a file, simply provide the URL:
wget https://example.com/file.zip
This will download file.zip from example.com into the current working directory.
Some of the most common options you'll use with Wget include:
- -O filename – Save to filename instead of the remote filename
- -o logfile – Log messages to logfile
- -c – Resume a partially downloaded file
- -r – Download recursively
- -l depth – Maximum depth for recursive downloads
- -np – Don't traverse parent directories
- --limit-rate=speed – Limit download speed
- -4 – Force IPv4 addresses
- -6 – Force IPv6 addresses
For example, to recursively download a site to a depth of 3 folders:
wget -r -l 3 https://example.com
See wget --help for the full list of options.
Downloading Files
Beyond basic file downloads, Wget has several useful options for controlling how files are downloaded and saved.
Downloading Multiple Files
You can download multiple files by simply specifying all the URLs on the command line:
wget url1 url2 url3
Alternatively, you can place URLs in a text file (one per line) and pass the file to Wget using -i:
wget -i urls.txt
This is useful for downloading lists of URLs, such as those extracted from a webpage.
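As a sketch, building such a list file and handing it to Wget looks like this (the URLs are placeholders for illustration):

```shell
# Create a URL list file, one URL per line (placeholder URLs)
cat > urls.txt <<'EOF'
https://example.com/file1.zip
https://example.com/file2.zip
https://example.com/file3.zip
EOF

# Hand the whole list to Wget in a single run:
# wget -i urls.txt
```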
Resuming Downloads
One of Wget's best features is the ability to resume interrupted downloads using the -c option:
wget -c https://example.com/large-file.zip
This will attempt to resume from where the previous Wget download stopped. This is extremely useful when downloading large files that may time out.
Limiting Rate
When scraping sites, it's good practice to limit the download rate to avoid overloading servers. Use --limit-rate to restrict the download speed:
wget --limit-rate=200k https://example.com
This limits the download rate to 200 kilobytes per second.
Naming Files
By default, Wget saves files using the last path component of the URL as the filename. To specify a custom filename, use -O:
wget -O my-file.zip https://example.com/files/archive.zip
This will save archive.zip as my-file.zip instead.
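Wget's default naming rule (the last path component of the URL) can be reproduced in shell with parameter expansion, which is handy when scripting custom names; the URL below is illustrative:

```shell
# Extract the default filename wget would use: everything after the last '/'
url="https://example.com/files/archive.zip"
fname="${url##*/}"
echo "$fname"    # archive.zip

# Equivalent explicit form of the download:
# wget -O "$fname" "$url"
```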
Logging Results
To log download results, errors, and other output to a file, use -o:
wget -o wget.log https://example.com
This will save all output to wget.log for later review.
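Once output is captured with -o, the log can be scanned for failures afterwards; the log lines below are stand-ins for what a real run produces:

```shell
# Stand-in for a log produced by: wget -o wget.log <url>
printf 'ERROR 404: Not Found.\nsaved [1024/1024]\n' > wget.log

# Count error lines for a quick health check of the run
grep -c 'ERROR' wget.log
```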
Downloading Websites
One of Wget's most powerful features is the ability to recursively download entire websites. This functionality makes it invaluable for mirroring sites or collecting data for web scraping.
Mirroring Websites
To mirror an entire website, use the -r -N options:
wget -r -N http://example.com
This will download example.com and all subpages, recreating the website structure on your local machine.
- -r enables recursive downloading
- -N (timestamping) skips files that haven't changed since the last download
Use -np (no parent) to prevent Wget from ascending into parent directories:
wget -r -N -np http://example.com/folder/
This will mirror just the /folder/ path.
Crawling Depth
By default, Wget recurses to a maximum depth of 5 with -r. To set a different limit, add -l:
wget -r -l 2 http://example.com
This restricts the recursive download to a maximum depth of 2 levels below the initial URL.
Continuing Mirrors
If a mirror fails halfway through, you can resume it later using -c:
wget -c -r -N http://example.com
This will pick up where the previous mirror left off.
Mirror Logs
Mirroring generates a lot of output. To log it to a file, use -o:
wget -o wget-mirror.log -r -N http://example.com
You can then examine the log to see what files were downloaded and if any errors occurred.
Using Wget With Proxies
Proxies are essential for web scraping to bypass blocks and scale requests. Wget has built-in support for HTTP and HTTPS proxies, and can be routed through SOCKS proxies with the help of external tools.
HTTP Proxies
To use an HTTP proxy with Wget, set the http_proxy and https_proxy environment variables:
export http_proxy="http://user:[email protected]:8080"
export https_proxy="http://user:[email protected]:8080"
Or set them for a single run on the command line with -e, which passes .wgetrc-style commands:
wget -e use_proxy=yes -e http_proxy=http://user:[email protected]:8080 url
Now all Wget traffic will route through the proxy.
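Since these are ordinary environment variables, they can also be scoped to a single command by prefixing them, rather than exported for the whole session. A small sketch with placeholder credentials:

```shell
# A prefixed variable applies only to the one command and does not
# leak into the session (credentials and host are placeholders):
inside=$(http_proxy="http://user:[email protected]:8080" sh -c 'echo "$http_proxy"')
echo "inside the command: $inside"

# Real usage would be:
# http_proxy="http://user:[email protected]:8080" wget url
```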
SOCKS Proxies
Wget has no native SOCKS support: its proxy variables accept only HTTP-style proxies, and there is no SOCKS command-line option. To route Wget through a SOCKS proxy such as socks5://user:[email protected]:1080, wrap the command with an external tool like tsocks or ProxyChains (covered next).
Proxy Chains
To chain multiple proxies (useful for additional IP diversity), use ProxyChains:
proxychains wget url
ProxyChains reads its proxy list from proxychains.conf and routes Wget's traffic through each proxy in the chain.
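A minimal proxychains.conf might look like the following sketch; the hosts, ports, and credentials are placeholders to adapt:

```
# proxychains.conf (excerpt) -- hosts, ports, and credentials are placeholders
strict_chain          # use every proxy in the list, in order
proxy_dns             # resolve DNS through the chain too

[ProxyList]
# type  host         port  [user pass]
socks5  127.0.0.1    1080  user pass
http    203.0.113.7  8080
```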
Proxy Manager API
For automation at scale, use a proxy manager API to dynamically assign proxies from a pool using your language of choice.
For example, in Python (proxy_manager_api here is a stand-in for whatever client library your proxy provider offers):

import os
import proxy_manager_api  # hypothetical proxy manager client

# Retrieve a proxy from the pool
proxy = proxy_manager_api.get_proxy()

# Pass it to Wget via the http_proxy environment variable
os.system(f"http_proxy={proxy} wget url")

# Mark the proxy as used so it can be rotated
proxy_manager_api.mark_proxy_used(proxy)
This allows managing proxies at scale versus setting them manually.
Advanced Wget Configuration
Wget has many powerful advanced configuration options accessible through the .wgetrc file.
Create .wgetrc
Wget does not generate this file for you; simply create an empty one in your home directory and add settings to it:
touch ~/.wgetrc
Wget reads ~/.wgetrc automatically on every run. See the "Wgetrc Commands" section of the Wget manual for every recognized setting.
Common Options
Some common settings to adjust in .wgetrc include:
- limit_rate – Global speed limit
- reject – File types to ignore
- accept – File types to allow
- robots – Follow/ignore robots.txt
- user_agent – Set a custom User-Agent string
- logfile – Log output location
- timeout – Adjust timeout length
For example:
# .wgetrc
limit_rate=500k
reject=gif,flv,mp4
accept=pdf,docx
robots=off
user_agent=MyBot 1.0
logfile=/home/user/wget.log
timeout=10
This config will apply every time Wget runs.
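For per-project settings, Wget also honors the WGETRC environment variable, which points it at an alternative startup file; the settings below mirror the sample above:

```shell
# Write a project-local config
cat > project.wgetrc <<'EOF'
limit_rate = 200k
timeout = 10
EOF

# Use it for one run instead of ~/.wgetrc:
# WGETRC=./project.wgetrc wget https://example.com
```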
Automating Wget
Here are some examples of automating Wget using Bash or Python scripts.
Bash Script
Save the following as wget.sh:
#!/bin/bash
URL_LIST="urls.txt"

while read -r url; do
    # Derive a filesystem-safe folder name from the URL
    dir="downloads/$(printf '%s' "$url" | tr -c 'A-Za-z0-9._-' '_')"
    wget -P "$dir" --limit-rate=100k "$url"
done < "$URL_LIST"
This reads a list of URLs from a text file, downloads each one into a folder named after the URL, and limits the speed to 100 kilobytes per second.
First make it executable, then run it:
chmod +x wget.sh
./wget.sh
Python Script
Here is a simple Python script to download a list of URLs:
import subprocess

url_list = ['http://url1', 'http://url2']

for url in url_list:
    out_file = url.split('/')[-1]
    cmd = f'wget {url} -O {out_file}'
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
This uses Python's subprocess module to run a Wget command for each URL.
Wget vs cURL
cURL is another popular command line utility for transferring data with URLs. How does it compare to Wget?
Some key differences include:
- Wget specializes in downloading; cURL supports more protocols
- Wget offers recursive mirroring; cURL does not
- cURL has better SSL support
- cURL supports FTP uploads; Wget is download only
- cURL is more suited for APIs; Wget for web scraping
However, they share several similarities:
- Open source command line tools
- Support common internet protocols
- Can be scripted and automated
- Offer bandwidth throttling
- Available on all major platforms
In general:
- If you need to recursively crawl or mirror sites, Wget is better suited
- If you require more advanced upload/SSL capabilities, choose cURL
Many users install both since they excel in different use cases.
Conclusion
In summary, Wget is an indispensable web scraping tool thanks to its recursive downloading, proxy support, and resumption capabilities. It can be easily automated through scripts or configured via the .wgetrc file.
When coupled with proxies, Wget provides a simple yet powerful solution for scraping large websites or collecting datasets. Through careful use of bandwidth throttling, file naming, and other options, Wget can be tailored for most web scraping needs.
To leverage proxies at scale for web scraping, utilize a proxy manager API to dynamically assign IPs. For expert proxy solutions, check out the blog for more guides on how proxies can enable your web scraping and data mining projects.