Update: I published an article on LWN based on this documentation, in a more "narrative" form. The article will not be updated further, however, while these notes are a living document that might be updated again eventually.

For various reasons, I've played with website mirroring and archival. Particularly at Koumbit, when a project is over or abandoned, we have tried to keep a static copy of the active websites. The Koumbit procedure mostly covers Drupal websites, but might still be relevant.

This page aims to document my experience with some of those workflows.

TL;DR: wget works for many sites, but not all. Some sites can't be mirrored as plain static copies of files, because HTTP headers matter; WARC files come to the rescue there. My last attempt at mirroring a complex site, with crawl, was very effective. The next test of a JavaScript-heavy site should be done with wpull and its PhantomJS support.

  1. crawl
  2. wget
    1. The query strings problem
  3. httrack
  4. WARC files
    1. Displaying
    2. Extracting
  5. Future research
  6. References

crawl

Autistici's crawl is "a very simple crawler" that only outputs a WARC file. Here is how it works:

crawl https://example.com/

It does say "very simple" in the README. There are some options, but most defaults are sane: it will fetch page requisites from other domains (unless the -exclude-related flag is used), but will not recurse out of the domain. By default, it fires up 10 parallel connections to the remote site, so you might want to tweak that down with the -c flag to avoid hammering servers too hard. Also, use the -keep flag to keep a copy of the crawl database so the same site can be crawled repeatedly.
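
For example, a gentler crawl that also keeps its database around for later runs might look something like this (a sketch based on the flags mentioned above; check the README for the exact syntax):

crawl -c 2 -keep https://example.com/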

The resulting WARC file must be loaded in some viewer, as explained below. pywb worked well in my tests.

wget

The short version is:

nice wget --mirror --execute robots=off --no-verbose --convert-links --backup-converted  --page-requisites --adjust-extension --base=./ --directory-prefix=./ --span-hosts --domains=www.example.com,example.com http://www.example.com/

The explanation of each option is best found in the wget manpage, although some require extra clarification.

A few other options might also be useful, depending on the site.

The query strings problem

A key problem with crawling dynamic websites is that some CMS like to add strange query parameters in various places. For example, Wordpress might load jQuery like this:

http://example.com/wp-includes/js/jquery/jquery.js?ver=1.12.4

When that file gets saved locally, its filename ends up being:

./example.com/wp-includes/js/jquery/jquery.js?ver=1.12.4

This will break content-type detection in web servers, which rely on the file extension to send the right Content-Type header. Because the actual extension in the example above is .4, no Content-Type is sent at all, which confuses web browsers. For example, Chromium will complain with:

Refused to execute script from '<URL>' because its MIME type ('') is not executable, and strict MIME type checking is enabled

Normally, --adjust-extension should do the right thing here, but it did not work in my last experiment. The --reject-regex proposed above is ineffective, as it would skip those links entirely, which means page components would be missing. A pattern replacement on the URL would be necessary to work around this problem, but that is not supported by wget (or wget2, for that matter) at the time of writing. The solution is to use WARC files instead, but the pywb viewer has trouble rendering those generated by wget (see bug #294).
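
For the record, wget can write such a WARC file alongside the mirror with its --warc-file option, which sidesteps the filename mangling entirely. A minimal sketch, reusing only part of the command shown earlier:

wget --mirror --page-requisites --adjust-extension --warc-file=example http://www.example.com/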

See also the koumbit wiki for wget-related instructions.

httrack

The httrack program is explicitly designed to create offline copies of websites, so its use is slightly more intuitive than wget's. For example, here's a sample interactive session:

$ httrack 

Welcome to HTTrack Website Copier (Offline Browser) 3.49-2
Copyright (C) 1998-2017 Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help

Enter project name :Example website

Base path (return=/home/anarcat/websites/) :/home/anarcat/mirror/example/

Enter URLs (separated by commas or blank spaces) :https://example.com/

Action:
(enter) 1   Mirror Web Site(s)
    2   Mirror Web Site(s) with Wizard
    3   Just Get Files Indicated
    4   Mirror ALL links in URLs (Multiple Mirror)
    5   Test Links In URLs (Bookmark Test)
    0   Quit
: 2     

Proxy (return=none) :

You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :

You can define additional options, such as recurse level (-r<number>), separated by blank spaces
To see the option list, type help
Additional options (return=none) :

---> Wizard command line: httrack https://example.com/ -W -O "/home/anarcat/mirror/example/Example website"  -%v  

Ready to launch the mirror? (Y/n) :

Mirror launched on Wed, 29 Aug 2018 14:49:16 by HTTrack Website Copier/3.49-2 [XR&CO'2014]
mirroring https://example.com/ with the wizard help..

Other than that dialog, httrack is then silent, logging to ~/mirror/example/Example website/hts-log.txt, and even there only errors are logged.

Some options might be important; the httrack --help output and manpage document them in detail.

HTTrack has a nicer user interface than wget, but it lacks WARC support, which makes archiving more dynamic sites harder because it requires post-processing. See the query strings problem above for details.

WARC files

The Web ARChive (WARC) format "specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web." (Wikipedia). The Autistici crawl and (optionally) wget both output WARC files.

A WARC file needs a viewer like pywb (not packaged in Debian) to be usable, however, which might make the format less convenient than a simple on-disk mirror, unless multiple snapshots are desired. Note that the Archive Team says it's important to keep HTTP headers when creating an archive, and I confirmed this in my tests of complex websites.

Displaying

A WARC file created by crawl from a very dynamic WordPress site, however, worked perfectly. To load a WARC file into pywb, use the following commands:

wb-manager init example
wb-manager add example crawl.warc.gz
wayback
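
The site should then be browsable through pywb's builtin web server, which the wayback command starts on port 8080 by default, so presumably at an address like http://localhost:8080/example/ for the collection above.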

Note that, during my tests, I wasn't able to load a WARC file created with wget in pywb, probably because of bug #294.

I documented a sample Apache configuration to put a reverse proxy in front of pywb for authentication. A more elaborate configuration would probably involve starting the program through uWSGI, but unfortunately I wasn't able to make that work.

Extracting

It is possible to extract a static copy of a website out of a WARC file. The warcat package can extract files from (but also concatenate, split, verify, and list the contents of) WARC files. For example, the following will explode a WARC file into the current directory:

python3 -m warcat extract crawl.warc.gz

Note that the result might not be usable as a static site without severe modifications. A URL like http://example.com/foo/ will translate into example.com/foo/_index_da39a3, for example.
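
The other warcat subcommands mentioned above work in a similar fashion; for example, something like this should list the archive's contents without extracting anything:

python3 -m warcat list crawl.warc.gz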

warcat depends on warcio, which provides a warcio index command to inspect WARC files more closely and allows extracting individual files with warcio extract. ArchiveTools also has a warc-extractor.py script to extract files from a WARC file.
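
For example, something like the following should list the records of the WARC file above and then pull out a single record by its byte offset (here 1234 is a placeholder taken from the index output; check warcio's help for the exact syntax):

warcio index crawl.warc.gz
warcio extract --payload crawl.warc.gz 1234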

The WARC standard also defines a format for CDX files which are an index of a WARC file. Each line represents a file in the archive, which makes it easier to process. A CDX file can be created with the cdx-indexer shipped with pywb.
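
For example, something like this should produce an index of the crawl above, although the exact flags may vary between pywb versions:

cdx-indexer -o crawl.cdx crawl.warc.gz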

More WARC resources are listed in the awesome web archiving list and the Archive Team resource page.

Future research

References

Various programs can archive websites; some were mentioned above, others were not.

The Koumbit wiki has many instructions specific to Drupal archival. In general, it is good practice to turn off dynamic elements (e.g. comment forms, login boxes, search boxes) in the website before archival, if possible, in order to keep the archived website as usable as possible.

Submitted issues:
