Downloading an Entire Web Site with wget

GNU wget is a free utility for non-interactive download of files from the Web, and it ships with most Linux distributions. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Because wget is non-interactive, it can work in the background while the user is not logged on: you can start a retrieval, disconnect from the system, and let wget finish the work. By contrast, most Web browsers require the user’s constant presence, which can be a great hindrance when transferring a lot of data.

One of wget’s most powerful features is its ability to retrieve a complete mirror of a website onto your local hard drive. By default it follows only links on the host you start from and does not traverse links to external sites. Though it would be a generous public service, you probably wouldn’t want to mirror the entire Internet!

Here’s how you do it.

From a Linux shell:

$ wget -mk -w 10 http://www.google.com/

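  • -m turns on mirroring mode: recursive retrieval with unlimited depth, plus time-stamping so that unchanged files are not fetched again on later runs
  • -k converts the links in the downloaded pages to point at your local copies, so the mirror can be browsed offline
  • -w 10 makes wget wait 10 seconds between requests. This is optional, but it is good net etiquette because it keeps the load on the server low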
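
The short flags above also have long spellings (--mirror, --convert-links, --wait), which are easier to read back later in a script. As a minimal sketch, the hypothetical helper below (mirror_cmd is my own name, not part of wget, and www.example.com is a placeholder) only prints the long-option form of the command, so you can check it before running anything:

```shell
#!/bin/sh
# mirror_cmd is a hypothetical helper, not part of wget.
# It prints the long-option form of the mirror command instead of running it.
#   $1 = URL to mirror
#   $2 = delay between requests in seconds (defaults to 10, as in the example above)
mirror_cmd() {
  url="$1"
  delay="${2:-10}"
  printf 'wget --mirror --convert-links --wait=%s %s\n' "$delay" "$url"
}

# Placeholder URL; substitute the site you want to mirror.
mirror_cmd http://www.example.com/
# prints: wget --mirror --convert-links --wait=10 http://www.example.com/
```

Once the printed line looks right, paste it into your shell to start the actual download.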

Not running Linux, or otherwise without access to wget? Don’t fret: wget has been ported to Windows. Wget is a powerful utility, and I’ve barely scratched the surface here. Find out more by reading its man page (man wget).
