9 Appendices
This chapter contains some references I consider useful.
9.1 Robot Exclusion            Wget's support for RES.
9.2 Security Considerations    Security with Wget.
9.3 Contributors               People who helped.
9.1 Robot Exclusion
It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in the process. `wget -r site', and you're set. Great? Not for the server admin.
As long as Wget is only retrieving static pages, and doing it at a
reasonable rate (see the `--wait' option), there's not much of a
problem. The trouble is that Wget can't tell the difference between the
smallest static page and the most demanding CGI. A site I know has a
section handled by a CGI Perl script that converts Info files to HTML on
the fly. The script is slow, but works well enough for human users
viewing an occasional Info file. However, when someone's recursive Wget
download stumbles upon the index page that links to all the Info files
through the script, the system is brought to its knees without providing
anything useful to the user. (The conversion could just as well be done
locally; Info documentation for all the GNU software installed on a
system is in any case available through the info command.)
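For instance, a gentler recursive download might pause between retrievals. The invocation below is hypothetical, and the two-second wait is an arbitrary choice:

wget -r --wait=2 http://www.server.com/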
To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion was invented. The idea is that server administrators and document authors can specify which portions of the site they wish to protect from robots and which they will permit access to.
The most popular mechanism, and the de facto standard supported by all the major robots, is the "Robots Exclusion Standard" (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in `/robots.txt' in the server root, which the robots are expected to download and parse.
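For example, a `/robots.txt' along the following lines (a made-up sample; the paths are purely illustrative) asks all robots to stay out of the CGI area:

# Applies to every robot.
User-agent: *
# Keep robots away from these URL paths.
Disallow: /cgi-bin/
Disallow: /info2html/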
Although Wget is not a web robot in the strictest sense of the word, it can download large parts of the site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue:
wget -r http://www.server.com/
First the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt' and, if found, use it for further downloads. `robots.txt' is loaded only once per server.
Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft `<draft-koster-robots-00.txt>' titled "A Method for Web Robots Control". The draft, which has, as far as I know, never made it to an RFC, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.
This manual no longer includes the text of the Robot Exclusion Standard.
The second, less known mechanism enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot. This is achieved using the META tag, like this:
<meta name="robots" content="nofollow">
This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual `/robots.txt' exclusion.
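In context, the tag belongs in the document's HEAD section. A minimal, made-up page that asks robots not to follow its links could look like this:

<html>
<head>
<title>Index of Info files</title>
<!-- Ask robots not to follow any links on this page. -->
<meta name="robots" content="nofollow">
</head>
<body>
<p>Links to the converted Info files would go here.</p>
</body>
</html>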
If you know what you are doing and really, really wish to turn off the
robot exclusion, set the robots variable to `off' in your `.wgetrc'.
You can achieve the same effect from the command line using the -e
switch, e.g. `wget -e robots=off url...'.
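For instance, the relevant `.wgetrc' fragment would be just this one setting (the comment is mine):

# Ignore `/robots.txt' and the robots META tag.
robots = off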
9.2 Security Considerations
When using Wget, you must be aware that it sends unencrypted passwords through the network, which may present a security problem. Here are the main issues, and some solutions.
The passwords on the command line are visible using ps. The best way
around it is to use `wget -i -' and feed the URLs to Wget's standard
input, each on a separate line, terminated by C-d.
Another workaround is to use `.netrc' to store passwords; however,
storing unencrypted passwords is also considered a security risk.
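To illustrate, here is the standard-input workaround with a made-up host and credentials; because the URL is typed into Wget's input rather than onto the command line, it never appears in ps output:

$ wget -i -
http://luser:secret@www.example.com/restricted/report.html
^D

The `.netrc' alternative keeps the same made-up credentials in a file in your home directory instead:

machine www.example.com
login luser
password secret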
9.3 Contributors
GNU Wget was written by Hrvoje Niksic <hniksic@xemacs.org>. However, its development could never have gone as far as it has, were it not for the help of many people, either with bug reports, feature proposals, patches, or letters saying "Thanks!".
Special thanks goes to the following people (no particular order):
Kaveh R. Ghazi -- on-the-fly ansi2knr-ization. Lots of
portability fixes.
Junio Hamano -- donated support for Opie and HTTP Digest
authentication.
The following people have provided patches, bug/build reports, useful suggestions, beta testing services, fan mail and all the other things that make maintenance so much fun:
Ian Abbott, Tim Adam, Adrian Aichner, Martin Baehr, Dieter Baron, Roger Beeman, Dan Berger, T. Bharath, Christian Biere, Paul Bludov, Daniel Bodea, Mark Boyns, John Burden, Wanderlei Cavassin, Gilles Cedoc, Tim Charron, Noel Cragg, Kristijan Conkas, John Daily, Ahmon Dancy, Andrew Davison, Andrew Deryabin, Ulrich Drepper, Marc Duponcheel, Damir Dzeko, Alan Eldridge, Aleksandar Erkalovic, Andy Eskilsson, Christian Fraenkel, David Fritz, Masashi Fujita, Howard Gayle, Marcel Gerrits, Lemble Gregory, Hans Grobler, Mathieu Guillaume, Dan Harkless, Aaron Hawley, Herold Heiko, Jochen Hein, Karl Heuer, HIROSE Masaaki, Gregor Hoffleit, Erik Magnus Hulthen, Richard Huveneers, Jonas Jensen, Simon Josefsson, Mario Juric, Hack Kampbjorn, Const Kaplinsky, Goran Kezunovic, Robert Kleine, KOJIMA Haime, Fila Kolodny, Alexander Kourakos, Martin Kraemer, Simos KSenitellis, Hrvoje Lacko, Daniel S. Lewart, Nicolas Lichtmeier, Dave Love, Alexander V. Lukyanov, Thomas Lussnig, Aurelien Marchand, Jordan Mendelson, Lin Zhe Min, Tim Mooney, Simon Munton, Charlie Negyesi, R. K. Owen, Andrew Pollock, Steve Pothier, Jan Prikryl, Marin Purgar, Csaba Raduly, Keith Refson, Bill Richardson, Tyler Riddle, Tobias Ringstrom, Juan Jose Rodrigues, Maciej W. Rozycki, Edward J. Sabol, Heinz Salzmann, Robert Schmidt, Nicolas Schodet, Andreas Schwab, Chris Seawood, Toomas Soome, Tage Stabell-Kulo, Sven Sternberger, Markus Strasser, John Summerfield, Szakacsits Szabolcs, Mike Thomas, Philipp Thomas, Mauro Tortonesi, Dave Turner, Gisle Vanem, Russell Vincent, Charles G Waldman, Douglas E. Wegscheid, Jasmin Zainul, Bojan Zdrnja, Kristijan Zimmer.
Apologies to all whom I accidentally left out, and many thanks to all the subscribers of the Wget mailing list.