I am an amateur archivist. I keep an archive of audio (music and audiobooks), books (physical and electronic), video (films and TV episodes), and websites, as a hobby but also as a librarian: some material should be carefully preserved for future generations (and enjoyed by ours as well, of course).

Web archives

I specifically worked on archiving web sites and wrote an LWN article (local copy) on the topic. Detailed documentation is in web.

Data rescue

I have some experience in data recovery, mostly built as I dealt with various broken hardware: fake flash cards, old CD-ROMs, dead hard drives... My notes on this are in rescue.

Archive management

Mirroring and restoring data is only part of the problem. Once (re)created, the data needs to be properly indexed; otherwise it is an undecipherable pile of garbage where nothing can be found. Metadata needs to be created for the content and indexed as well. This can include, for each piece of content:

Determining that data is only one part; you also need to store the information in a meaningful way. Unfortunately, I have no good advice for this beyond making sure you name the created folders and files correctly. Various file formats support embedded metadata (MP3 tags, Exif tags for photos, etc.): use them. Otherwise, filenames or auxiliary text files can be used.
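For formats without embedded tags, the auxiliary text file approach could look something like the sketch below. This is only an illustration, not a tool I use: the `.meta.json` suffix, the function names and the metadata keys are all made up for the example.

```python
import json
from pathlib import Path

# Hypothetical convention: metadata for "foo.pdf" lives in "foo.pdf.meta.json".
SIDECAR_SUFFIX = ".meta.json"


def sidecar_path(content_path: Path) -> Path:
    """Return the path of the auxiliary metadata file for a content file."""
    return content_path.with_name(content_path.name + SIDECAR_SUFFIX)


def write_sidecar(content_path: Path, metadata: dict) -> Path:
    """Store metadata next to the content file as sorted, readable JSON."""
    target = sidecar_path(content_path)
    text = json.dumps(metadata, indent=2, sort_keys=True, ensure_ascii=False)
    target.write_text(text + "\n", encoding="utf-8")
    return target


def read_sidecar(content_path: Path) -> dict:
    """Load the metadata for a content file, or {} if no sidecar exists."""
    target = sidecar_path(content_path)
    if not target.exists():
        return {}
    return json.loads(target.read_text(encoding="utf-8"))
```

Keeping the sidecar right next to the content means it travels with the file when the folder is copied or mirrored, and plain JSON stays readable without any special software.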

I mostly use git-annex to manage my archives and make sure I have redundant copies. git-annex also supports "scrubbing" copies by verifying checksums on the content.
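To make the idea of "scrubbing" concrete, here is a minimal sketch of checksum verification in general. It is not how git-annex is implemented (there, the `git annex fsck` command does the checking); the manifest format and the function names below are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or fail their checksum.

    `manifest` is a hypothetical mapping of relative path -> expected
    SHA-256 hex digest, recorded when the content was first archived.
    """
    damaged = []
    for relative, expected in sorted(manifest.items()):
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative)
    return damaged
```

Running such a check periodically on every copy is what catches silent corruption: a file listed as damaged in one location can then be restored from another location where it still verifies.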

I also use the following software to import, index and browse contents:

All of those are stored in multiple locations with git-annex, except software, which is managed through git only, and web archives, which are not replicated and are usually stored directly on archive.org.

I do not have good mechanisms for the following:

I need to evaluate the following tools for archive management:

Those come from the awesome self-hosted list.
