Website mirroring and archival
Update: I published an article on LWN based on this documentation, which is a more "narrative" form. The article, however, will not be updated any further, while these notes are a living document that might be updated again eventually.
For various reasons, I've played with website mirroring and archival. Particularly at Koumbit, when a project is over or abandoned, we have tried to keep a static copy of active websites. The koumbit procedure covers mostly Drupal websites, but might still be relevant.
This page aims at documenting my experience with some of those workflows.
TL;DR: wget
works for many sites, but not all. Some sites can't be
mirrored as just static copies of files, as HTTP headers matter. WARC
files come to the rescue. My last attempt at mirroring a complex site
was with crawl and it was very effective. Next tests of a
Javascript-heavy site should be done with wpull and its PhantomJS
support.
crawl
Autistici's crawl is "a very simple crawler" that only outputs a WARC file. Here is how it works:
crawl https://example.com/
It does say "very simple" in the README. There are some options, but most defaults are sane: it will fetch page requirements from other domains (unless the -exclude-related flag is used), but will not recurse out of the domain. By default, it fires up 10 parallel connections to the remote site, so you might want to tweak that down with the -c flag to avoid hammering servers too hard. Also use the -keep flag to keep a copy of the crawl database so the same site can be crawled again later.
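For example, to go easier on the remote server and keep the crawl database around for later runs, something like the following should work (a sketch using the flags described above; the concurrency value is only an illustration):
crawl -c 2 -keep https://example.com/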
The resulting WARC file must be loaded in some viewer, as explained below. pywb worked well in my tests.
wget
The short version is:
nice wget --mirror --execute robots=off --no-verbose --convert-links --backup-converted --page-requisites --adjust-extension --base=./ --directory-prefix=./ --span-hosts --domains=www.example.com,example.com http://www.example.com/
The explanation of each option is best found in the wget manpage, although some require extra clarification:
- --mirror: means -r -N -l inf --no-remove-listing, which means:
  - -r or --recursive: recurse into links found in the pages
  - -N or --timestamping: do not fetch content if older than local timestamps
  - -l inf or --level=inf: infinite recursion
  - --no-remove-listing: do not remove .listing files created when listing directories over FTP
- --execute robots=off: turn off robots.txt detection
- --no-verbose: only show one line per link. use --quiet to turn off all output
- --convert-links: fix links in saved pages to refer to the local mirror
- --backup-converted: keep a backup of the original file so that --timestamping (-N, implied by --mirror) works correctly with --convert-links
- --page-requisites: download all files necessary to load the page, including images, stylesheets, etc.
- --adjust-extension: add (for example) .html to saved filenames, if missing
- --base=./ and --directory-prefix=./: magic to make sure the links modified by --convert-links work correctly
- --span-hosts: says it's okay to jump to other hostnames, provided they are in the list of --domains
The following options might also be useful:
- --warc-file=<name>: will also record a WARC file of the crawl in <name>.warc.gz. --warc-cdx is also useful as it keeps a list of the visited sites, although that file can be recreated from the WARC file later on (see below)
- --wait 1 --random-wait and --limit-rate=20k will limit the download speed and artificially wait between requests to avoid overloading the server (and possibly evading detection)
- --reject-regex "(.*)\?(.*)": do not crawl URLs with a query string. Those might be infinite loops like calendars or extra parameters that generate the same page.
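A more polite variant of the command above that also records a WARC file might therefore look like this (a sketch: the rate limit, wait time and domains are placeholders to adjust, and wget may warn about or disable some options, like timestamping, when WARC output is enabled):
nice wget --mirror --execute robots=off --no-verbose --convert-links --backup-converted --page-requisites --adjust-extension --base=./ --directory-prefix=./ --span-hosts --domains=www.example.com,example.com --warc-file=example --warc-cdx --wait 1 --random-wait --limit-rate=20k http://www.example.com/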
The query strings problem
A key problem with crawling dynamic websites is that some CMS like to add strange query parameters in various places. For example, Wordpress might load jQuery like this:
http://example.com/wp-includes/js/jquery/jquery.js?ver=1.12.4
When that file gets saved locally, its filename ends up being:
./example.com/wp-includes/js/jquery/jquery.js?ver=1.12.4
This will break content-type detection in webservers, which rely on
the file extension to send the right Content-Type
. Because the
actual extension is really .4
in the above, no Content-Type
is
sent at all, which confuses web browsers. For example, Chromium will
complain with:
Refused to execute script from '<URL>' because its MIME type ('') is not executable, and strict MIME type checking is enabled
Normally, --adjust-extension
should do the right thing here, but it
did not work in my last experiment. The --reject-regex
proposed
above is ineffective, as it will completely skip those links which
means components will be missing. A pattern replacement on the URL
would be necessary to work around this problem, but that is not
supported by wget
(or wget2, for that matter) at the time of
writing. The solution for this is to use WARC files instead, but the
pywb viewer has trouble rendering those generated by wget (see
bug #294).
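If you are stuck with a static wget mirror anyway, a crude post-processing pass can at least restore usable filenames. This is only a sketch and a partial workaround: it assumes the mirror lives under ./example.com/, and it does not rewrite the links inside the HTML, which will still point to the old names.
# strip "?query" suffixes from saved filenames so the webserver can
# guess the Content-Type from the file extension again
find ./example.com -name '*\?*' -print0 |
while IFS= read -r -d '' f; do
    mv -n "$f" "${f%%\?*}"  # -n: do not clobber an existing file with the bare name
done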
See also the koumbit wiki for wget-related instructions.
httrack
The httrack program is explicitly designed to create offline copies of websites, so its use is slightly more intuitive than wget. For example, here's a sample interactive session:
$ httrack
Welcome to HTTrack Website Copier (Offline Browser) 3.49-2
Copyright (C) 1998-2017 Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help
Enter project name :Example website
Base path (return=/home/anarcat/websites/) :/home/anarcat/mirror/example/
Enter URLs (separated by commas or blank spaces) :https://example.com/
Action:
(enter) 1 Mirror Web Site(s)
2 Mirror Web Site(s) with Wizard
3 Just Get Files Indicated
4 Mirror ALL links in URLs (Multiple Mirror)
5 Test Links In URLs (Bookmark Test)
0 Quit
: 2
Proxy (return=none) :
You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :
You can define additional options, such as recurse level (-r<number>), separated by blank spaces
To see the option list, type help
Additional options (return=none) :
---> Wizard command line: httrack https://example.com/ -W -O "/home/anarcat/mirror/example/Example website" -%v
Ready to launch the mirror? (Y/n) :
Mirror launched on Wed, 29 Aug 2018 14:49:16 by HTTrack Website Copier/3.49-2 [XR&CO'2014]
mirroring https://example.com/ with the wizard help..
Other than the dialog, httrack is then silent as it logs into
~/mirror/example/Example website/hts-log.txt
, and even there, only
errors are logged.
Some options that might be important:
- --update: resume an interrupted run
- --verbose: start an interactive session which will show transfers in progress and ask questions about URLs it is unsure how to handle
- -s0: never follow robots.txt and related tags. This is important if the website explicitly blocks crawlers.
HTTrack has a nicer user interface than wget, but lacks WARC support which makes archiving more dynamic sites more difficult as it requires post-processing. See the query strings problem above for details.
WARC files
The Web ARChive (WARC) format "specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format[4] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web." (Wikipedia). The Autistici crawl and (optionally) wget both output WARC files.
It needs, however, a viewer like pywb (not packaged in Debian) to do its job which might make the format less convenient than a simple on-disk mirror, unless multiple snapshots are desired. Note that archive team says that it's important to keep HTTP headers when creating an archive, and I confirmed this in my tests of complex websites.
Displaying
The WARC file created by crawl from a very dynamic Wordpress site, however, worked perfectly. To load a WARC file, use the following commands:
wb-manager init example
wb-manager add example crawl.warc.gz
wayback
Note that, during my tests, I wasn't able to load a WARC file created with wget in pywb, probably because of bug #294.
I documented a sample Apache configuration to put a reverse proxy in front of pywb for authentication. A more elaborate configuration would probably involve starting the program using UWSGI, but unfortunately I wasn't able to make this work.
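For reference, here is a minimal sketch of such a configuration, assuming pywb's wayback process listens on its default port (8080) on localhost, that HTTP basic authentication is sufficient, and that the relevant Apache modules (mod_proxy, mod_proxy_http, mod_auth_basic) are enabled; hostnames and paths are placeholders:
# to be dropped into an existing VirtualHost serving the archive
<Location "/">
    AuthType Basic
    AuthName "Web archive"
    AuthUserFile /etc/apache2/htpasswd-archive
    Require valid-user
</Location>
ProxyPreserveHost On
ProxyPass "/" "http://localhost:8080/"
ProxyPassReverse "/" "http://localhost:8080/"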
Extracting
It is possible to extract a static copy of a website out of a WARC file. The warcat package can extract files from (but also concatenate, split, verify, and list the contents of) WARC files. For example, the following will explode a WARC file in the current directory:
python3 -m warcat extract crawl.warc.gz
Note that this might not be useable as a static site without severe
modifications. A URL like http://example.com/foo/
will translate
into example.com/foo/_index_da39a3
for example.
warcat depends on warcio, which provides a warcio index
command to inspect WARC files more closely and allows extraction of
individual files with warcio extract
. ArchiveTools also has a
warc-extractor.py
script to extract files from a WARC file.
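For example, to locate and pull out a single resource from a WARC file (a sketch; the offset in the second command is whatever the index reports for the record you are after):
warcio index crawl.warc.gz                    # list records with their offsets
warcio extract --payload crawl.warc.gz 1024   # dump the payload of the record at offset 1024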
The WARC standard also defines a format for CDX
files which are
an index of a WARC file. Each line represents a file in the archive,
which makes it easier to process. A CDX
file can be created with
the cdx-indexer
shipped with pywb.
More WARC resources are listed in the awesome web archiving list and the archive team resource page.
Future research
- wpull is an interesting crawler behind the ArchiveBot tool used by ArchiveTeam
- other viewers like OpenWayback (yuck, maven), Webrecorder Player (aaaauughh! electron again!), InterPlanetary Wayback (IPFS??)
- other crawlers and proxies: wasp, warcprox
References
Here are the various programs that can archive websites. Some were mentioned above, some not.
- Autistici crawl: a simple and fast WARC crawler
- crau: scrapy-based crawler, writes but also lists, extracts and replays WARC files; might be missing redirects and fails to preserve transfer encoding and headers
- Heritrix is the Internet Archive crawler
- httrack: old but basic tool that works. no WARC support.
- wget: a classic HTTP multipurpose tool.
- wget2: rewrite of wget from scratch aimed at supporting multi-threaded operation, which might make it faster. Missing some features from wget (most notably WARC, reject patterns and FTP) but also adds some (RSS, DNS caching, improved TLS support); see the wiki for a full comparison.
- wpull: web downloader and crawler with PhantomJS and Youtube-DL integration, designed as a drop-in replacement for wget, but for much larger crawls
The Koumbit wiki has many instructions specific to Drupal archival. In general, it is good practice to turn off dynamic elements (e.g. comment forms, login boxes, search boxes) in the website before archival, if possible, in order to keep the archived website as usable as possible.
Submitted issues:
- fix broken link to specification
- sample Apache configuration for pywb
- make job status less chatty in ArchiveBot
- Debian packaging of the ia commandline tool
- document the --large flag in ArchiveBot