[This was co-authored with Yashwant Keswani for my college’s magazine. You might find references to our college’s intranet, where our professors provide us with course material.]
One of our TAs once saw us experimenting with wget in the lab and mentioned in passing that “Wget is the best download manager,” or something to that effect. He had a way of making these succinct statements and then walking away like a badass. And badasses they both turned out to be: him and wget.
We have both been experimenting in some way or another with wget for quite some time now. And experimenting is the second-best way to learn how to use most things Linux. But of course, in the age of Google and, more importantly, StackOverflow, we end up finding easier and quicker ways out of our problems rather than the best ones.
What wget essentially does is download. That, in a nutshell, is its functionality; and like most things Linux, it does one thing but does it very, very well. Wget calls itself “the non-interactive network downloader.” It supports the HTTP, HTTPS, and FTP protocols, and it can also work through HTTP proxies. One of the best things about wget is that it is non-interactive, i.e., it doesn’t need your constant presence the way most browsers do. Further, robustness was a design goal, so wget can handle many hurdles (slow links, dropped connections) and still finish downloads correctly. Of course, all of this and more can be found in wget’s man page.
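As a taste of that, here is a hypothetical invocation (the URL is a placeholder, not a real file); the sketch just prints the command so you can inspect it before running anything:

```shell
#!/bin/sh
# -c resumes a partially-downloaded file, -t 0 retries until it succeeds,
# and -b drops wget into the background so no babysitting is needed.
# The URL is a placeholder.
cmd='wget -c -t 0 -b https://example.com/big-file.iso'
echo "$cmd"    # inspect first; actually run it with: eval "$cmd"
```

The `-c`/`-t` pair is the robustness in action: even over a flaky connection, wget keeps retrying and picks up where it left off.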
That brings us to the next important point. The best way to learn something in Linux is to read the man page. RTFM (Read The F**king Manual) is something you might end up seeing quite a lot if you lurk around on Linux forums that have experienced users prowling. An even better way to learn new commands, or even master known ones, is to read the man pages and then go experiment.
When we started out we didn’t use the best way (reading man pages is an acquired art), and we still don’t always do so (StackOverflow is an addictive shortcut), but we’re improving at it. We’ve definitely come a long way, from searching “how to download” to searching for “things to download”. While we continue to search for things to download, let us acquaint you with some of wget’s features, with examples.
We are both avid xkcd readers. Once, we decided to scrape the xkcd website and download all the comics. The good thing about the URLs of xkcd’s webpages is that they have a structure. If they didn’t, it would not have been too big an issue, as wget can function as a web spider too; the structure, however, made our task easier. This is what we did:
Step 1: Scraped all the xkcd webpages for the link of the image that was embedded on the webpage.[1a]
Step 2: Stored all these links in a text file.
Step 3: Ran the command: wget --input-file=xkcdLinks.txt
Bam!! We were done. It took about 5 minutes to write the script and 5 more minutes for the download to complete.
Step 1 could have been done in many ways: using the Requests library along with BeautifulSoup in Python, or the urllib2 library, or the mechanize library (which builds on urllib2). But we were fascinated with wget, so we used it for this too, and then used the grep command to extract the links to the images.
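For the curious, here is a sketch of those three steps as a shell script. The grep pattern is our assumption about xkcd’s markup (comic images are served from imgs.xkcd.com), and the network-touching lines are commented out so the sketch can be read without hammering xkcd’s servers:

```shell
#!/bin/sh
# Step 1: given a comic page's HTML on stdin, print the embedded image URL.
# The pattern assumes the image lives under imgs.xkcd.com/comics/.
extract_image_url() {
    grep -o 'https*://imgs\.xkcd\.com/comics/[^"]*' | head -n 1
}

# Steps 1 and 2: fetch pages 1..100 and collect one image link per page.
# Uncomment to actually crawl:
# for n in $(seq 1 100); do
#     wget -qO- "https://xkcd.com/$n/" | extract_image_url
# done > xkcdLinks.txt

# Step 3: hand the list of links to wget.
# wget --input-file=xkcdLinks.txt
```

`wget -qO-` writes the fetched page to stdout quietly, which is what lets wget stand in for Requests or urllib2 in the scraping step.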
This same procedure can be adapted to work on many, many websites. We made a similar scraper to download all the images from www.explosm.net, which hosts the Cyanide and Happiness comics. Most of you must be using DC, and would have seen folders containing photographs of the students of various batches being shared. We can use wget for downloading the photographs from the e-campus too.[1b]
Wget is such an amazing tool because it has options that let you configure so many details of a download. You can configure quite precisely what to download (filtering on the website’s end), how to download it (things like protocols), and where to download it (filtering based on download location, etc.). Most of you have probably had to download whole course folders from the lecture folder; wget’s recursive download option is just what you might be looking for. The first time we used those options, we had to scrutinize the man page for more than an hour to get them right, so that the folder downloaded exactly the way we wanted. But that hour wasn’t wasted: the insight into wget we gained, and the ease with which we can now download intranet folders, were definitely worth it.
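To give a flavour of it, here is a sketch of the kind of recursive invocation we mean (the intranet URL and file-type list are illustrative, not our exact command); again, it only prints the command:

```shell
#!/bin/sh
# -r recurses, -np (--no-parent) refuses to climb above the start folder,
# -l 3 caps the recursion depth, -A keeps only the listed suffixes,
# and -nH skips creating a local directory named after the host.
# The URL is a placeholder for your intranet's lecture folder.
cmd='wget -r -np -l 3 -A pdf,ppt,pptx -nH http://intranet.example.edu/courses/CS101/'
echo "$cmd"    # inspect first; actually run it with: eval "$cmd"
```

The `-np` flag is the one that took us the longest to appreciate: without it, a recursive wget happily wanders up and sideways through the whole site.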
There are some websites, like Amazon, which block your requests if they find that you are sending a lot of them, or if they suspect that you are not accessing the page from a web browser.
Wget has options that help you deal with even problems like these.
- The --user-agent option is an important one. It sets the User-Agent header of the request, fooling the web server into thinking that you are accessing the webpage from a web browser.
- The --random-wait option. If there are multiple links to be downloaded from the same server, wget waits a random amount of time between two successive requests. This makes it difficult for the server to tell that it is actually a bot accessing its webpages.
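Combining the two might look like this (the user-agent string is one we made up; xkcdLinks.txt is the link list from earlier). One detail worth knowing: --random-wait works off the --wait interval, varying it between 0.5 and 1.5 times its value, so the two are best used together:

```shell
#!/bin/sh
# Present ourselves as a browser and pace requests politely.
# The user-agent string is illustrative; any browser-like string works.
ua='Mozilla/5.0 (X11; Linux x86_64)'
cmd="wget --user-agent='$ua' --wait=2 --random-wait --input-file=xkcdLinks.txt"
echo "$cmd"    # inspect first; actually run it with: eval "$cmd"
```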
Wget has options that allow one to use proxies while accessing websites. There are also options for supplying username-and-password authentication, and it can even work some magic with cookies. There are just too many things to write about here. There is only one way to figure out all the things that wget can do (yes, we’re still rambling about man pages), but one thing we can guarantee: wget is a brilliant tool, with enough options to satisfy even the most grandiose wishes you might have.
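A sketch of those three features together; the proxy host, username, and cookie file are all placeholders:

```shell
#!/bin/sh
# wget honours the http_proxy environment variable (placeholder host here).
export http_proxy='http://proxy.example.edu:8080/'

# --http-user plus --ask-password handle authentication (the password is
# prompted for, so it never lands in your shell history); the cookie options
# persist session cookies across runs. All the names are placeholders.
cmd='wget --http-user=alice --ask-password --load-cookies=cookies.txt --save-cookies=cookies.txt --keep-session-cookies http://protected.example.edu/files/'
echo "$cmd"    # inspect first; actually run it with: eval "$cmd"
```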
While we would advise using it on Linux, the better to harness the power of shell scripting, wget has been ported to many different OSes, including Mac OS X and even Microsoft Windows. We recently found that there is a graphical interface to it called GWget, but since we are more interested in wget’s command-line prowess, we doubt we’ll ever end up using that a lot. There is another similar tool that we haven’t really tried out, cURL, which is more powerful (versatility-wise) than wget in many ways. The one major thing cURL lacks is recursive downloading, a really, really important feature that we use in wget. Interested people should try these other tools as well; they might be better for your purpose. Go take a look at the man pages!
We never intended this to be anything like a tutorial, so if you were disappointed, we apologize. (Not really: RTFM!) Also, this article came about because of the absolute joy we found in using this tool, and we just had to write something about it.
- The cURL utility has an interesting use: if you run curl www.xkcd.com/[1-100], it will fetch www.xkcd.com/1, www.xkcd.com/2, ..., www.xkcd.com/100 in turn.
- The same thing can be used to download photos from the e-campus. In this particular case cURL might even be a better choice.*
* We are not endorsing that you go and do this; we are just mentioning how cURL might be used.
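One more curl trick that pairs with the range syntax above: in the -o output template, #1 is replaced by the current value of the first bracketed range, so each page lands in its own file. A sketch (printed rather than run, to avoid firing off a hundred requests):

```shell
#!/bin/sh
# Fetch xkcd pages 1..100, saving each as xkcd_<n>.html.
# curl itself expands [1-100]; "#1" in -o echoes the current value.
cmd='curl "https://xkcd.com/[1-100]/" -o "xkcd_#1.html"'
echo "$cmd"    # inspect first; actually run it with: eval "$cmd"
```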