Website mirroring and archival
Update: I published an article on LWN based on this documentation, which is a more "narrative" form. The article, however, will not be updated any further, while these notes are a living document that might be updated again eventually.
For various reasons, I've played with website mirroring and archival. Particularly at Koumbit, when a project is over or abandoned, we have tried to keep a static copy of active websites. The koumbit procedure covers mostly Drupal websites, but might still be relevant.
This page aims at documenting my experience with some of those workflows.
TL;DR: wget
works for many sites, but not all. Some sites can't be
mirrored as just static copies of files, as HTTP headers matter. WARC
files come to the rescue. My last attempt at mirroring a complex site
was with crawl and it was very effective. Next tests of a
Javascript-heavy site should be done with wpull and its PhantomJS
support.
crawl
Autistici's crawl is "a very simple crawler" that only outputs a WARC file. Here is how it works:
crawl https://example.com/
It does say "very simple" in the README. There are some options, but most defaults are sane: it will fetch page requirements from other domains (unless the -exclude-related flag is used), but will not recurse out of the domain. By default, it fires up 10 parallel connections to the remote site, so you might want to tweak that down with the -c flag to avoid hammering servers too hard. Also use the -keep flag to keep a copy of the crawl database so the same site can be crawled again later.
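For example, to go easier on the remote server and keep the crawl database around for later runs, something like the following should work (a sketch using the flags described above; the concurrency value is only an illustration):
crawl -c 2 -keep https://example.com/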
The resulting WARC file must be loaded in some viewer, as explained below. pywb worked well in my tests.
wget
The short version is:
nice wget --mirror --execute robots=off --no-verbose --convert-links --backup-converted --page-requisites --adjust-extension --base=./ --directory-prefix=./ --span-hosts --domains=www.example.com,example.com http://www.example.com/
The explanation of each option is best found in the wget manpage, although some require extra clarification:
- --mirror: means -r -N -l inf --no-remove-listing, which means:
  - -r or --recursive: recurse into links found in the pages
  - -N or --timestamping: do not fetch content if older than local timestamps
  - -l inf or --level=inf: infinite recursion
  - --no-remove-listing: do not remove .listing files created when listing directories over FTP
- --execute robots=off: turn off robots.txt detection
- --no-verbose: only show one line per link. use --quiet to turn off all output
- --convert-links: fix links in saved pages to refer to the local mirror
- --backup-converted: keep a backup of the original file so that --timestamping (-N, implied by --mirror) works correctly with --convert-links
- --page-requisites: download all files necessary to load the page, including images, stylesheets, etc.
- --adjust-extension: add (for example) .html to saved filenames, if missing
- --base=./ and --directory-prefix=./: magic to make sure the links modified by --convert-links work correctly
- --span-hosts: says it's okay to jump to other hostnames, provided they are in the list of --domains
The following options might also be useful:
- --warc-file=<name>: will also record a WARC file of the crawl in <name>.warc.gz. --warc-cdx is also useful as it keeps a list of the visited sites, although that file can be recreated from the WARC file later on (see below)
- --wait 1 --random-wait and --limit-rate=20k will limit the download speed and artificially wait between requests to avoid overloading the server (and possibly evading detection)
- --reject-regex "(.*)\?(.*)": do not crawl URLs with a query string. Those might be infinite loops like calendars or extra parameters that generate the same page.
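A more polite variant of the command above that also records a WARC file might therefore look like this (a sketch: the rate limit, wait time and domains are placeholders to adjust, and wget may warn about or disable some options, like timestamping, when WARC output is enabled):
nice wget --mirror --execute robots=off --no-verbose --convert-links --backup-converted --page-requisites --adjust-extension --base=./ --directory-prefix=./ --span-hosts --domains=www.example.com,example.com --warc-file=example --warc-cdx --wait 1 --random-wait --limit-rate=20k http://www.example.com/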
The query strings problem
A key problem with crawling dynamic websites is that some CMS like to add strange query parameters in various places. For example, Wordpress might load jQuery like this:
http://example.com/wp-includes/js/jquery/jquery.js?ver=1.12.4
When that file gets saved locally, its filename ends up being:
./example.com/wp-includes/js/jquery/jquery.js?ver=1.12.4
This will break content-type detection in webservers, which rely on
the file extension to send the right Content-Type
. Because the
actual extension is really .4
in the above, no Content-Type
is
sent at all, which confuses web browsers. For example, Chromium will
complain with:
Refused to execute script from '<URL>' because its MIME type ('') is not executable, and strict MIME type checking is enabled
Normally, --adjust-extension
should do the right thing here, but it
did not work in my last experiment. The --reject-regex
proposed
above is ineffective, as it will completely skip those links which
means components will be missing. A pattern replacement on the URL
would be necessary to work around this problem, but that is not
supported by wget
(or wget2, for that matter) at the time of
writing. The solution for this is to use WARC files instead, but the
pywb viewer has trouble rendering those generated by wget (see
bug #294).
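If you are stuck with a static wget mirror anyway, a crude post-processing pass can at least restore usable filenames. This is only a sketch and a partial workaround: it assumes the mirror lives under ./example.com/, and it does not rewrite the links inside the HTML, which will still point to the old names.
# strip "?query" suffixes from saved filenames so the webserver can
# guess the Content-Type from the file extension again
find ./example.com -name '*\?*' -print0 |
while IFS= read -r -d '' f; do
    mv -n "$f" "${f%%\?*}"  # -n: do not clobber an existing file with the bare name
done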
See also the koumbit wiki for wget-related instructions.
httrack
The httrack program is explicitly designed to create offline copies of websites, so its use is slightly more intuitive than wget. For example, here's a sample interactive session:
$ httrack
Welcome to HTTrack Website Copier (Offline Browser) 3.49-2
Copyright (C) 1998-2017 Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help
Enter project name :Example website
Base path (return=/home/anarcat/websites/) :/home/anarcat/mirror/example/
Enter URLs (separated by commas or blank spaces) :https://example.com/
Action:
(enter) 1 Mirror Web Site(s)
2 Mirror Web Site(s) with Wizard
3 Just Get Files Indicated
4 Mirror ALL links in URLs (Multiple Mirror)
5 Test Links In URLs (Bookmark Test)
0 Quit
: 2
Proxy (return=none) :
You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :
You can define additional options, such as recurse level (-r<number>), separated by blank spaces
To see the option list, type help
Additional options (return=none) :
---> Wizard command line: httrack https://example.com/ -W -O "/home/anarcat/mirror/example/Example website" -%v
Ready to launch the mirror? (Y/n) :
Mirror launched on Wed, 29 Aug 2018 14:49:16 by HTTrack Website Copier/3.49-2 [XR&CO'2014]
mirroring https://example.com/ with the wizard help..
Other than the dialog, httrack is then silent as it logs into
~/mirror/example/Example website/hts-log.txt
, and even there, only
errors are logged.
Some options that might be important:
- --update: resume an interrupted run
- --verbose: start an interactive session which will show transfers in progress and ask questions about URLs it is unsure how to handle
- -s0: never follow robots.txt and related tags. This is important if the website explicitly blocks crawlers.
HTTrack has a nicer user interface than wget, but lacks WARC support which makes archiving more dynamic sites more difficult as it requires post-processing. See the query strings problem above for details.
WARC files
The Web ARChive (WARC) format "specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format[4] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web." (Wikipedia). The Autistici crawl and (optionally) wget both output WARC files.
It needs, however, a viewer like pywb (not packaged in Debian) to do its job which might make the format less convenient than a simple on-disk mirror, unless multiple snapshots are desired. Note that archive team says that it's important to keep HTTP headers when creating an archive, and I confirmed this in my tests of complex websites.
Displaying
The WARC file created by crawl from a very dynamic Wordpress site, however, worked perfectly. To load a WARC file, use the following commands:
wb-manager init example
wb-manager add example crawl.warc.gz
wayback
Note that, during my tests, I wasn't able to load a WARC file created with wget in pywb, probably because of bug #294.
I documented a sample Apache configuration to put a reverse proxy in front of pywb for authentication. A more elaborate configuration would probably involve starting the program using UWSGI, but unfortunately I wasn't able to make this work.
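For reference, here is a minimal sketch of such a configuration, assuming pywb's wayback process listens on its default port (8080) on localhost, that HTTP basic authentication is sufficient, and that the relevant Apache modules (mod_proxy, mod_proxy_http, mod_auth_basic) are enabled; hostnames and paths are placeholders:
# to be dropped into an existing VirtualHost serving the archive
<Location "/">
    AuthType Basic
    AuthName "Web archive"
    AuthUserFile /etc/apache2/htpasswd-archive
    Require valid-user
</Location>
ProxyPreserveHost On
ProxyPass "/" "http://localhost:8080/"
ProxyPassReverse "/" "http://localhost:8080/"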
Extracting
It is possible to extract a static copy of a website out of a WARC file. The warcat package can extract files from (but also concatenate, split, verify, and list the contents of) WARC files. For example, the following will explode a WARC file in the current directory:
python3 -m warcat extract crawl.warc.gz
Note that this might not be useable as a static site without severe
modifications. A URL like http://example.com/foo/
will translate
into example.com/foo/_index_da39a3
for example.
warcat depends on warcio, which provides a warcio index
command to inspect WARC files more closely and allows extraction of
individual files with warcio extract
. ArchiveTools also has a
warc-extractor.py
script to extract files from a WARC file.
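For example, to locate and pull out a single resource from a WARC file (a sketch; the offset in the second command is whatever the index reports for the record you are after):
warcio index crawl.warc.gz                    # list records with their offsets
warcio extract --payload crawl.warc.gz 1024   # dump the payload of the record at offset 1024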
The WARC standard also defines a format for CDX
files which are
an index of a WARC file. Each line represents a file in the archive,
which makes it easier to process. A CDX
file can be created with
the cdx-indexer
shipped with pywb.
More WARC resources are listed in the awesome web archiving list and the archive team resource page.
Future research
- wpull is an interesting crawler behind the ArchiveBot tool used by ArchiveTeam
- other viewers like OpenWayback (yuck, maven), Webrecorder Player (aaaauughh! electron again!), InterPlanetary Wayback (IPFS??)
- other crawlers and proxies: wasp, warcprox
References
Here are the various programs that can archive websites. Some were mentioned above, some not.
- Autistici crawl: a simple and fast WARC crawler
- crau: scrapy-based crawler, writes but also lists, extracts and replays WARC files; might be missing redirects and fails to preserve transfer encoding and headers
- Heritrix is the Internet Archive crawler
- httrack: old but basic tool that works. no WARC support.
- wget: a classic HTTP multipurpose tool.
- wget2: rewrite of wget from scratch aimed at supporting multi-threaded operation, which might make it faster. Missing some features from wget (most notably WARC, reject patterns and FTP) but also adds some (RSS, DNS caching, improved TLS support); see the wiki for a full comparison.
- wpull: web downloader and crawler with PhantomJS and Youtube-DL integration, designed as a drop-in replacement for wget, but for much larger crawls
The Koumbit wiki has many instructions specific to Drupal archival. In general, it is good practice to turn off dynamic elements (e.g. comment forms, login boxes, search boxes) in the website before archival, if possible, in order to keep the archived website as usable as possible.
Submitted issues:
- fix broken link to specification
- sample Apache configuration for pywb
- make job status less chatty in ArchiveBot
- Debian packaging of the ia commandline tool
- document the --large flag in ArchiveBot