Wget is a powerful command-line utility for downloading files from the internet. With the ability to download entire websites, resume interrupted transfers, and utilize proxies, Wget is an indispensable tool for web scraping and automation.
In this comprehensive guide, we will go through everything you need to know to use Wget effectively with proxies for web scraping and automation.
What is Wget?
Wget is an open-source command-line utility used to download files from the internet via HTTP, HTTPS, and FTP protocols. Developed in 1996, Wget is considered one of the oldest internet "spider" utilities and is built into most Linux distributions.
Some of the key features of Wget include:
- Downloading files recursively from a web server
- Resuming interrupted downloads
- Password-protected downloading
- Mirroring entire websites
- Configurable with .wgetrc file
- Proxy support
Wget is extremely useful for web scraping because of its ability to recursively download websites and resume interrupted downloads – two essential capabilities when scraping large sites.
Installing Wget
Because it's included in most Linux distributions, chances are Wget is already installed on your system. To confirm, simply run:
wget --version
If Wget is not already installed, use your distribution's package manager to install it:
Debian/Ubuntu
sudo apt install wget
Fedora
sudo dnf install wget
CentOS/RHEL
sudo yum install wget
Arch Linux
sudo pacman -S wget
macOS
brew install wget
Wget is also available for Windows, either through Cygwin or as a standalone executable.
Wget Basics
The basic syntax for Wget is simple:
wget [options] [URL]
To download a file, simply provide the URL:
wget https://example.com/file.zip
This will download file.zip from example.com into the current working directory.
Some of the most common options you'll use with Wget include:
- -O filename – Save to filename instead of the remote filename
- -o logfile – Log messages to logfile
- -c – Resume a partially downloaded file
- -r – Download recursively
- -l depth – Maximum depth for recursive downloads
- -np – Don't traverse parent directories
- --limit-rate=speed – Limit download speed
- -4 – Force IPv4 addresses
- -6 – Force IPv6 addresses
For example, to recursively download a site to a depth of 3 folders:
wget -r -l 3 https://example.com
See wget --help for the full list of options.
Downloading Files
Beyond basic file downloads, Wget has several useful options for controlling how files are downloaded and saved.
Downloading Multiple Files
You can download multiple files by simply specifying all the URLs on the command line:
wget url1 url2 url3
Alternatively, you can place URLs in a text file (one per line) and pass the file to Wget using -i:
wget -i urls.txt
This is useful for downloading lists of URLs, such as those extracted from a webpage.
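As a sketch, building such a list file and handing it to Wget looks like this (the URLs are placeholders for illustration):

```shell
# Create a URL list file, one URL per line (placeholder URLs)
cat > urls.txt <<'EOF'
https://example.com/file1.zip
https://example.com/file2.zip
https://example.com/file3.zip
EOF

# Hand the whole list to Wget in a single run:
# wget -i urls.txt
```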
Resuming Downloads
One of Wget's best features is the ability to resume interrupted downloads using the -c option:
wget -c https://example.com/large-file.zip
This will attempt to resume from where the previous Wget download stopped. This is extremely useful when downloading large files that may time out.
Limiting Rate
When scraping sites, it's good practice to limit the download rate to avoid overloading servers. Use --limit-rate to restrict the download speed:
wget --limit-rate=200k https://example.com
This limits the download rate to 200 kilobytes per second.
Naming Files
By default, Wget saves files using the last path component of the URL as the filename. To specify a custom filename, use -O:
wget -O my-file.zip https://example.com/files/archive.zip
This will save archive.zip as my-file.zip instead.
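Wget's default naming rule (the last path component of the URL) can be reproduced in shell with parameter expansion, which is handy when scripting custom names; the URL below is illustrative:

```shell
# Extract the default filename wget would use: everything after the last '/'
url="https://example.com/files/archive.zip"
fname="${url##*/}"
echo "$fname"    # archive.zip

# Equivalent explicit form of the download:
# wget -O "$fname" "$url"
```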
Logging Results
To log download results, errors, and other output to a file, use -o:
wget -o wget.log https://example.com
This will save all output to wget.log for later review.
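Once output is captured with -o, the log can be scanned for failures afterwards; the log lines below are stand-ins for what a real run produces:

```shell
# Stand-in for a log produced by: wget -o wget.log <url>
printf 'ERROR 404: Not Found.\nsaved [1024/1024]\n' > wget.log

# Count error lines for a quick health check of the run
grep -c 'ERROR' wget.log
```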
Downloading Websites
One of Wget's most powerful features is the ability to recursively download entire websites. This functionality makes it invaluable for mirroring sites or collecting data for web scraping.
Mirroring Websites
To mirror an entire website, use the -r -N options:
wget -r -N http://example.com
This will download example.com and all subpages, recreating the website structure on your local machine.
- -r enables recursive downloading
- -N (timestamping) skips files that haven't changed since the last download
Use -np (no parent) to prevent Wget from ascending into parent directories:
wget -r -N -np http://example.com/folder/
This will mirror just the /folder/ path.
Crawling Depth
By default, Wget recurses to a maximum depth of 5 with -r. To set a different limit, add -l:
wget -r -l 2 http://example.com
This restricts the recursive download to a maximum depth of 2 levels below the initial URL.
Continuing Mirrors
If a mirror fails halfway through, you can resume it later using -c:
wget -c -r -N http://example.com
This will pick up where the previous mirror left off.
Mirror Logs
Mirroring generates a lot of output. To log it to a file, use -o:
wget -o wget-mirror.log -r -N http://example.com
You can then examine the log to see what files were downloaded and if any errors occurred.
Using Wget With Proxies
Proxies are essential for web scraping to bypass blocks and scale requests. Wget has built-in support for HTTP and HTTPS proxies, and can be routed through SOCKS proxies with the help of external tools.
HTTP Proxies
To use an HTTP proxy with Wget, set the http_proxy and https_proxy environment variables:
export http_proxy="http://user:[email protected]:8080"
export https_proxy="http://user:[email protected]:8080"
Or set them for a single run on the command line with -e, which passes .wgetrc-style commands:
wget -e use_proxy=yes -e http_proxy=http://user:[email protected]:8080 url
Now all Wget traffic will route through the proxy.
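Since these are ordinary environment variables, they can also be scoped to a single command by prefixing them, rather than exported for the whole session. A small sketch with placeholder credentials:

```shell
# A prefixed variable applies only to the one command and does not
# leak into the session (credentials and host are placeholders):
inside=$(http_proxy="http://user:[email protected]:8080" sh -c 'echo "$http_proxy"')
echo "inside the command: $inside"

# Real usage would be:
# http_proxy="http://user:[email protected]:8080" wget url
```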
SOCKS Proxies
Wget has no native SOCKS support: its proxy variables accept only HTTP-style proxies, and there is no SOCKS command-line option. To route Wget through a SOCKS proxy such as socks5://user:[email protected]:1080, wrap the command with an external tool like tsocks or ProxyChains (covered next).
Proxy Chains
To chain multiple proxies (useful for additional IP diversity), use ProxyChains:
proxychains wget url
ProxyChains reads its proxy list from proxychains.conf and routes Wget's traffic through each proxy in the chain.
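A minimal proxychains.conf might look like the following sketch; the hosts, ports, and credentials are placeholders to adapt:

```
# proxychains.conf (excerpt) -- hosts, ports, and credentials are placeholders
strict_chain          # use every proxy in the list, in order
proxy_dns             # resolve DNS through the chain too

[ProxyList]
# type  host         port  [user pass]
socks5  127.0.0.1    1080  user pass
http    203.0.113.7  8080
```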
Proxy Manager API
For automation at scale, use a proxy manager API to dynamically assign proxies from a pool using your language of choice.
For example, in Python (proxy_manager_api here is a stand-in for whatever client library your proxy provider offers):

import os
import proxy_manager_api  # hypothetical proxy manager client

# Retrieve a proxy from the pool
proxy = proxy_manager_api.get_proxy()

# Pass it to Wget via the http_proxy environment variable
os.system(f"http_proxy={proxy} wget url")

# Mark the proxy as used so it can be rotated
proxy_manager_api.mark_proxy_used(proxy)
This allows managing proxies at scale versus setting them manually.
Advanced Wget Configuration
Wget has many powerful advanced configuration options accessible through the .wgetrc file.
Create .wgetrc
Wget does not generate this file for you; simply create an empty one in your home directory and add settings to it:
touch ~/.wgetrc
Wget reads ~/.wgetrc automatically on every run. See the "Wgetrc Commands" section of the Wget manual for every recognized setting.
Common Options
Some common settings to adjust in .wgetrc include:
- limit_rate – Global speed limit
- reject – File types to ignore
- accept – File types to allow
- robots – Follow/ignore robots.txt
- user_agent – Set a custom User-Agent string
- logfile – Log output location
- timeout – Adjust timeout length
For example:
# .wgetrc
limit_rate=500k
reject=gif,flv,mp4
accept=pdf,docx
robots=off
user_agent=MyBot 1.0
logfile=/home/user/wget.log
timeout=10
This config will apply every time Wget runs.
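For per-project settings, Wget also honors the WGETRC environment variable, which points it at an alternative startup file; the settings below mirror the sample above:

```shell
# Write a project-local config
cat > project.wgetrc <<'EOF'
limit_rate = 200k
timeout = 10
EOF

# Use it for one run instead of ~/.wgetrc:
# WGETRC=./project.wgetrc wget https://example.com
```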
Automating Wget
Here are some examples of automating Wget using Bash or Python scripts.
Bash Script
Save the following as wget.sh:
#!/bin/bash
URL_LIST="urls.txt"

while read -r url; do
    # Derive a filesystem-safe folder name from the URL
    dir="downloads/$(printf '%s' "$url" | tr -c 'A-Za-z0-9._-' '_')"
    wget -P "$dir" --limit-rate=100k "$url"
done < "$URL_LIST"
This reads a list of URLs from a text file, downloads each one into a folder named after the URL, and limits the speed to 100 kilobytes per second.
First make it executable, then run it:
chmod +x wget.sh
./wget.sh
Python Script
Here is a simple Python script to download a list of URLs:
import subprocess

url_list = ['http://url1', 'http://url2']

for url in url_list:
    out_file = url.split('/')[-1]
    cmd = f'wget {url} -O {out_file}'
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
This uses Python's subprocess module to run a Wget command for each URL.
Wget vs cURL
cURL is another popular command line utility for transferring data with URLs. How does it compare to Wget?
Some key differences include:
- Wget specializes in downloading; cURL supports more protocols
- Wget offers recursive mirroring; cURL does not
- cURL has better SSL support
- cURL supports FTP uploads; Wget is download only
- cURL is more suited for APIs; Wget for web scraping
However, they share several similarities:
- Open source command line tools
- Support common internet protocols
- Can be scripted and automated
- Offer bandwidth throttling
- Available on all major platforms
In general:
- If you need to recursively crawl or mirror sites, Wget is better suited
- If you require more advanced upload/SSL capabilities, choose cURL
Many users install both since they excel in different use cases.
Conclusion
In summary, Wget is an indispensable web scraping tool thanks to its recursive downloading, proxy support, and resumption capabilities. It can be easily automated through scripts or configured via the .wgetrc file.
When coupled with proxies, Wget provides a simple yet powerful solution for scraping large websites or collecting datasets. Through careful use of bandwidth throttling, file naming, and other options, Wget can be tailored for most web scraping needs.
To leverage proxies at scale for web scraping, utilize a proxy manager API to dynamically assign IPs. For expert proxy solutions, check out the blog for more guides on how proxies can enable your web scraping and data mining projects.