1. ddrescue primer
  2. ddrescueview
  3. Flash memory
  4. CD-ROMs
    1. Data disks
    2. Audio disks
    3. Mixed-mode disks
    4. Identifying disks
    5. Remaining work
    6. Inventory
  5. References

A few general data recovery principles:

Also keep in mind that recovering data is only the first step: think about how you will archive what you restore. If it's live data, it's easier, as it replaces what is already there. But if it's old data, you need to keep track of metadata about the medium you import. See the parent archive page for a wider discussion of archive management.

ddrescue primer

Most recovery attempts should be performed with ddrescue: it's fast for quick restores but can also go very deep with multiple retries and checks to ensure a faithful copy.

The ddrescue manual has a nice examples section detailing general principles, but the TL;DR for disk drives is:

ddrescue -n /dev/sdb2 /srv/backup/sda2-media-20181005T135440.iso /srv/backup/sda2-media-20181005T135440.map

That does a first pass on the drive using a fast algorithm: the -n (--no-scrape) flag skips the scraping phase, so areas that have errors are set aside instead of being read sector by sector. If there are errors, you can do a more thorough pass by dropping -n and adding "direct I/O" mode (-d), keeping the other arguments the same:

ddrescue -d /dev/sdb2 /srv/backup/sda2-media-20181005T135440.iso /srv/backup/sda2-media-20181005T135440.map

The --retry-passes option (-r) can be used to specify how many times to force ddrescue to retry that process. The examples section has more details on those procedures. Special procedures should be followed for CD-ROMs, detailed below.
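The two passes above can be chained in a small wrapper. This is only a sketch under a few assumptions: the function name rescue_two_pass and the DDRESCUE override variable are made up for illustration, and the retry count of 3 is an arbitrary choice. Because both passes share the same map file, the second pass only revisits areas the first one could not read.

```shell
# Two-pass rescue sketch: fast pass first, then a thorough pass.
# DDRESCUE is a hypothetical override so the function can be exercised
# without touching a real device; it defaults to the real ddrescue.
rescue_two_pass() {
    dev=$1 image=$2 map=$3
    tool=${DDRESCUE:-ddrescue}
    # Pass 1: no scraping (-n), skip over problem areas quickly.
    "$tool" -n "$dev" "$image" "$map" || return
    # Pass 2: direct I/O (-d) with 3 retry passes (-r 3), same map file.
    "$tool" -d -r 3 "$dev" "$image" "$map"
}
```

It would be called with the same arguments as the commands above, for example: rescue_two_pass /dev/sdb2 media.iso media.map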

To copy to another device you need to pass an extra --force:

ddrescue --force --no-scrape /dev/sda /dev/sdb /tmp/sda.map

And then with the scrape phase, in direct I/O mode:

ddrescue --force --idirect /dev/sda /dev/sdb /tmp/sda.map

ddrescueview

The ddrescueview utility can be used to display ddrescue log files, which may give clues as to what is going on with a drive. With automatic refresh, it might show better progress information than the command-line output.

For example, this shows ddrescue running with the --no-scrape argument:

[Figure: screenshot of ddrescueview showing ddrescue in its trimming phase. A grid of mostly green blocks with, in the middle, stripes of blue blocks delimited by red blocks and stripes of yellow blocks.]

Here you see it skipped areas (in blue) that had read errors (in red). Those areas were "trimmed", that is: ddrescue tried to get as close to the error as possible to see where the faulty sectors are. In contrast, the "non-trimmed" areas (in yellow) indicate that a bulk read of that area failed, but ddrescue does not know exactly which part failed.

When we rerun ddrescue without the -n argument, ddrescue will retry the "non-scraped" area and try to restore what's inside of those trimmed blocks as well:

[Figure: screenshot of ddrescueview showing ddrescue after its scraping phase. A grid of mostly green blocks with, in the middle, scattered red blocks mostly aligned in columns.]

Here we see ddrescue was able to restore a lot of content, save a few sectors that were completely unreadable. Retrying again might eventually save those sectors.

Notice how both images show the "moiré" pattern typical of rotating media: a scratch will leave such a pattern on the data. Those results were obtained on a 16-year-old CD-R disk.

Also note that jmtd wrote a tool called badiso to evaluate, based on the ddrescue output, which files are actually recoverable.
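The same information ddrescueview draws can be tallied straight from the map file. Here is a minimal sketch, assuming the map file layout described in the ddrescue manual: comment lines start with #, the first other line holds the current position and status, and each following line is "position size status" in hexadecimal. The function name map_summary is made up for illustration.

```shell
# Sum the bytes in each block state of a ddrescue map file.
# Status characters: + finished, - bad sector, / non-scraped,
# * non-trimmed, ? non-tried.
map_summary() {
    finished=0 bad=0 nonscraped=0 nontrimmed=0 nontried=0
    header_seen=0
    while read -r pos size status; do
        case $pos in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        if [ "$header_seen" -eq 0 ]; then
            header_seen=1             # first data line is the current-status line
            continue
        fi
        bytes=$(($size))              # sizes are 0x-prefixed hexadecimal
        case $status in
            '+') finished=$((finished + bytes)) ;;
            '-') bad=$((bad + bytes)) ;;
            '/') nonscraped=$((nonscraped + bytes)) ;;
            '*') nontrimmed=$((nontrimmed + bytes)) ;;
            '?') nontried=$((nontried + bytes)) ;;
        esac
    done < "$1"
    printf 'finished=%d bad=%d non-scraped=%d non-trimmed=%d non-tried=%d\n' \
        "$finished" "$bad" "$nonscraped" "$nontrimmed" "$nontried"
}
```

Running map_summary cdimage.log after each pass gives a quick idea of whether another retry is worth the time.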

Flash memory

Flash memory is especially tricky to recover because SSD drives and SD cards are "smart": they have an embedded controller that hides the actual storage layer. It's the same reason why it's hard to reliably destroy data on those devices as well...

I have so far used ddrescue to restore data from hard drives and flash memory is no exception.

When problems occur with flash memory, it's worth testing the card with the Fight Fake Flash (f3) program (Debian package: f3). I have written documentation on those operations in the stressant manual.

CD-ROMs

Found a pile of CDs in the basement. Was looking for my old band but found much more: photos, samizdat, old backups, old games (Quake!), distro images (OpenBSD) and old Windows "ghosts". Most of this is junk of course, but a few key parts of that are interesting.

Data disks

CD-ROMs are ripped with ddrescue:

ddrescue -n -b 2048 /dev/cdrom cdimage.iso cdimage.log

ddrescue does not retry by default, so if we're desperate and think there's a chance to recover the rest, we enable scraping (remove the --no-scrape, -n flag) and retries (--retry-passes, -r) in direct I/O mode (--idirect, -d):

ddrescue -d -r 3 -b 2048 /dev/cdrom cdimage.iso cdimage.log

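If you are lucky and have two identical copies of the same data, you can also put the second copy in the drive and use the -r flag to retry against the existing ISO and map files, so that only the missing areas are read from it. This is best explained in the official manual.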

Replace cdimage with the label on the disk. If there's no label, write one! If there's already a filename with the same label, increment.
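The naming rule above is easy to script. This is a sketch only: the function name image_stem, the character set kept in file names, and the "-2", "-3" suffix style are all my own choices for illustration, not something ddrescue requires.

```shell
# Turn a disc label into a safe file stem, then pick the first stem
# that does not collide with an existing .iso in the target directory.
image_stem() {
    dir=$1 label=$2
    # Replace every run of characters outside A-Za-z0-9._ with one dash,
    # then strip leading and trailing dashes.
    stem=$(printf '%s' "$label" | tr -cs 'A-Za-z0-9._' '-' | sed 's/^-*//; s/-*$//')
    [ -n "$stem" ] || stem=unlabeled
    candidate=$stem
    n=1
    while [ -e "$dir/$candidate.iso" ]; do
        n=$((n + 1))
        candidate=$stem-$n
    done
    printf '%s\n' "$candidate"
}
```

The result can then be used for both the image and the log file, for example: name=$(image_stem . "my label"); ddrescue -n -b 2048 /dev/cdrom "$name.iso" "$name.log"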

Note that ddrescue does not support multi-session CD-ROMs. Those will have to be ripped with cdrdao with the --session argument, see the mixed-mode section below for examples.

Audio disks

It's unclear if (or how well) ddrescue works with audio disks. In my tests, it yields empty ISO images on audio CDs. Besides, there are other advanced techniques for those. I'm using whipper to do a faithful copy to FLAC files, using this command:

whipper cd rip --unknown --cdr

The flags are optional: --unknown allows for disks not present on MusicBrainz and --cdr allows for copied CDs.

Mixed-mode disks

Mixed-mode disks are CD-ROMs that contain both audio and data tracks. Those are particularly challenging to archive.

Whipper will fail on mixed-mode discs, especially if the data track is at the beginning, which was the case in all the disks I have found, including the original Quake CD-ROM.

ddrescue will extract the ISO part of the disk but the kernel will return errors for the audio part. The resulting file will be usable, but only for the ISO part of things.

According to this article, a good way to rip those is using cdrdao directly, for example:

cdrdao read-cd --read-raw --datafile data.bin data.toc

The problem there is that this creates only a data.bin file covering the entire disk, and does no error correction like ddrescue does.

The files created by cdrdao then need some post-processing to be readable as audio or ISO. The first step is to convert the .toc file to a .cue file:

toc2cue data.toc data.cue

toc2cue may fail with this error:

ERROR: Cannot convert: toc-file references multiple data files.

This can be corrected by forcing the same datafile to be used in all tracks of the toc file:

sed -i.orig 's/FILE "\([^"]*\)"/FILE "data.bin"/' data.toc
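To see what that substitution does, here is a small sketch that wraps it in a function and can be tried on a throwaway toc file. The function name unify_toc_datafile is made up, and the second argument (defaulting to data.bin) must not contain slashes or ampersands since it is pasted into the sed expression. Note that the pattern also matches the tail of DATAFILE "..." lines, so data and audio tracks both end up pointing at the same file.

```shell
# Point every FILE/DATAFILE entry of a cdrdao toc file at one data file.
# The original is kept as <tocfile>.orig, as in the command above.
unify_toc_datafile() {
    toc=$1
    bin=${2:-data.bin}
    sed -i.orig "s/FILE \"[^\"]*\"/FILE \"$bin\"/" "$toc"
}
```

After running unify_toc_datafile data.toc, a diff between data.toc.orig and data.toc shows exactly which entries were rewritten.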

Then the actual data needs to be rewritten. This is done with the bchunk package which can convert between cdrdao data files and ISO/WAV files. As explained in this blog post, the processing needs to be done separately between the audio and ISO parts. In the example, the data tracks were ripped in a different session than the audio tracks, which made it possible to use the --session argument to extract each separately. Unfortunately, that is generally not the case. What we're interested in, anyways, is probably more the audio files, as the ISO file can be extracted by ddrescue. So to extract the audio, you'll need:

bchunk -w data.bin data.cue data

This will convert all audio tracks to WAV files. Normally, it should also convert the ISO track, but in my experience those show up as unusable .ugh files and the ddrescue version needs to be used instead. Then the WAV files can be compressed to FLAC files using the flac command:

flac --delete-input-file data-*.wav

This usually reduces disk usage by about 30-50% at no loss in quality. You should end up with the following files:

data-01.iso
data-02.flac
data-03.flac
data-04.flac
data-05.flac
data-06.flac
data-07.flac
data-08.flac
data-09.flac
data-10.flac
data-11.flac
data.bin
data.cue
data.map
data.toc

The .bin file is a duplicate but can be used to regenerate the others (except the .iso file of course).

Identifying disks

cdrdao can be used to detect when the CD drive is ready. A good first command is disk-info, which gives general information about the disk but waits for the CD to be ready:

$ cdrdao disk-info
Cdrdao version 1.2.4 - (C) Andreas Mueller <andreas@daneb.de>
/dev/sr0: TSSTcorp CDDVDW TS-L633A  Rev: TO01
Using driver: Generic SCSI-3/MMC - Version 2.0 (options 0x0000)

WARNING: Unit not ready, still trying...
WARNING: Unit not ready, still trying...
WARNING: Unit not ready, still trying...
That data below may not reflect the real status of the inserted medium
if a simulation run was performed before. Reload the medium in this case.

CD-RW                : no
Total Capacity       : n/a
CD-R medium          : Prodisc Technology Inc.
                       Short Strategy Type, e.g. Phthalocyanine
Recording Speed      : n/a
CD-R empty           : no
Toc Type             : CD-DA or CD-ROM
Sessions             : 1
Last Track           : 27
Appendable           : no

The cdir command, from the cdtool package, can give a summary of the medium present (source):

$ cdir -d /dev/cdrom
unknown cd - 40:39 in 9 tracks
 16:46.13  1 [DATA] 
  3:46.73  2 
  5:34.12  3 
  3:05.41  4 
  3:06.36  5 
  2:02.72  6 
  2:13.67  7 
  0:34.67  8 
  3:26.03  9

Then the cdrdao discid command will try to analyze the disk to compute a CDDB disc identifier and look it up in FreeDB:

$ cdrdao discid
Cdrdao version 1.2.4 - (C) Andreas Mueller <andreas@daneb.de>
/dev/sr0: TSSTcorp CDDVDW TS-L633A  Rev: TO01
Using driver: Generic SCSI-3/MMC - Version 2.0 (options 0x0000)


Track   Mode    Flags  Start                Length
------------------------------------------------------------
 1      AUDIO   0      00:00:00(     0)     02:49:71( 12746)
 2      AUDIO   0      02:49:71( 12746)     04:20:43( 19543)
 3      AUDIO   0      07:10:39( 32289)     01:32:23(  6923)
 4      AUDIO   0      08:42:62( 39212)     00:54:16(  4066)
 5      AUDIO   0      09:37:03( 43278)     05:33:64( 25039)
 6      AUDIO   0      15:10:67( 68317)     06:08:05( 27605)
 7      AUDIO   0      21:18:72( 95922)     01:59:06(  8931)
 8      AUDIO   0      23:18:03(104853)     05:07:13( 23038)
 9      AUDIO   0      28:25:16(127891)     05:15:16( 23641)
10      AUDIO   0      33:40:32(151532)     04:00:38( 18038)
11      AUDIO   0      37:40:70(169570)     00:19:28(  1453)
12      AUDIO   0      38:00:23(171023)     00:06:02(   452)
13      AUDIO   0      38:06:25(171475)     00:06:02(   452)
14      AUDIO   0      38:12:27(171927)     00:06:02(   452)
15      AUDIO   0      38:18:29(172379)     00:06:02(   452)
16      AUDIO   0      38:24:31(172831)     00:06:02(   452)
17      AUDIO   0      38:30:33(173283)     00:53:52(  4027)
18      AUDIO   0      39:24:10(177310)     00:38:08(  2858)
19      AUDIO   0      40:02:18(180168)     00:46:41(  3491)
20      AUDIO   0      40:48:59(183659)     00:06:02(   452)
21      AUDIO   0      40:54:61(184111)     00:06:02(   452)
22      AUDIO   0      41:00:63(184563)     00:06:02(   452)
23      AUDIO   0      41:06:65(185015)     00:06:02(   452)
24      AUDIO   0      41:12:67(185467)     00:06:02(   452)
25      AUDIO   0      41:18:69(185919)     00:44:61(  3361)
26      AUDIO   0      42:03:55(189280)     00:38:51(  2901)
27      AUDIO   0      42:42:31(192181)     00:51:51(  3876)
Leadout AUDIO   0      43:34:07(196057)

PQ sub-channel reading (audio track) is supported, data format is BCD.
Raw P-W sub-channel reading (audio track) is supported.
Cooked R-W sub-channel reading (audio track) is supported.
Analyzing track 01 (AUDIO): start 00:00:00, length 02:49:71...
Analyzing track 02 (AUDIO): start 02:49:71, length 04:20:43...
Analyzing track 03 (AUDIO): start 07:10:39, length 01:32:23...
Analyzing track 04 (AUDIO): start 08:42:62, length 00:54:16...
Analyzing track 05 (AUDIO): start 09:37:03, length 05:33:64...
Analyzing track 06 (AUDIO): start 15:10:67, length 06:08:05...
Analyzing track 07 (AUDIO): start 21:18:72, length 01:59:06...
Analyzing track 08 (AUDIO): start 23:18:03, length 05:07:13...
Analyzing track 09 (AUDIO): start 28:25:16, length 05:15:16...
Analyzing track 10 (AUDIO): start 33:40:32, length 04:00:38...
Analyzing track 11 (AUDIO): start 37:40:70, length 00:19:28...
Analyzing track 12 (AUDIO): start 38:00:23, length 00:06:02...
Analyzing track 13 (AUDIO): start 38:06:25, length 00:06:02...
Analyzing track 14 (AUDIO): start 38:12:27, length 00:06:02...
Analyzing track 15 (AUDIO): start 38:18:29, length 00:06:02...
Analyzing track 16 (AUDIO): start 38:24:31, length 00:06:02...
Analyzing track 17 (AUDIO): start 38:30:33, length 00:53:52...
Analyzing track 18 (AUDIO): start 39:24:10, length 00:38:08...
Analyzing track 19 (AUDIO): start 40:02:18, length 00:46:41...
Analyzing track 20 (AUDIO): start 40:48:59, length 00:06:02...
Analyzing track 21 (AUDIO): start 40:54:61, length 00:06:02...
Analyzing track 22 (AUDIO): start 41:00:63, length 00:06:02...
Analyzing track 23 (AUDIO): start 41:06:65, length 00:06:02...
Analyzing track 24 (AUDIO): start 41:12:67, length 00:06:02...
Analyzing track 25 (AUDIO): start 41:18:69, length 00:44:61...
Analyzing track 26 (AUDIO): start 42:03:55, length 00:38:51...
Analyzing track 27 (AUDIO): start 42:42:31, length 00:51:51...

CDDB: Connecting to cddbp://freedb.freedb.org:888 ...
CDDB: Ok.
No CDDB record found for this toc-file.

The read-toc command will also write that data to a file. Note that the above does not show CD-TEXT information; the only way to extract that is with read-toc:

cdrdao read-toc --fast-toc tocfile

This is the command called by whipper to read the disk metadata. It then computes a discid and a MusicBrainz hash on its own. But at this point, all this information is shown when running whipper, so the disk-info command is probably all we need to run here. I still run the read-toc command to extract a TOC, as sometimes that's the only way to fetch the CD-TEXT on the disk. It's also useful for archival purposes. It will also tell us if the disk is a blank, like so:

$ cdrdao read-toc --fast-toc tocfile
Cdrdao version 1.2.4 - (C) Andreas Mueller <andreas@daneb.de>
/dev/sr0: TSSTcorp CDDVDW TS-L633A  Rev: TO01
Using driver: Generic SCSI-3/MMC - Version 2.0 (options 0x0000)

WARNING: Unit not ready, still trying...
WARNING: Unit not ready, still trying...
WARNING: Unit not ready, still trying...
WARNING: Unit not ready, still trying...
ERROR: Inserted disk is empty.
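When sorting a large pile, that error message can be turned into a yes/no check. A minimal sketch, assuming the message text stays exactly as shown above; the function name disc_is_blank is made up for illustration. It reads cdrdao output on stdin, so the caller decides how to capture it.

```shell
# Succeed (exit status 0) if the cdrdao output on stdin reports a blank disc.
disc_is_blank() {
    grep -q 'Inserted disk is empty'
}
```

For example: cdrdao read-toc --fast-toc tocfile 2>&1 | disc_is_blank && echo "blank disc". The 2>&1 is there because I have not checked which stream cdrdao uses for its messages.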

To extract disk identifiers however, cdrdao is rather slow. The cd-discid command is much faster:

$ cd-discid /dev/sr0
9e0af30c 12 150 76757 87524 95692 118024 130633 141869 165637 174714 182592 184870 189598 2805

This returns the old FreeDB-style CDDB disc identifier. A more modern version is the MusicBrainz-style checksum, which can be read with flactag's discid command, but it's slower than cd-discid:

$ discid /dev/cdrom
dL5EmwESIWTPowb192SkUw5S7p4-
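The cd-discid line shown earlier packs more than the identifier: the fields are the disc ID, the number of tracks, one frame offset per track, and the total length in seconds. A small sketch that unpacks it for an inventory (the function name discid_summary and the output format are made up for illustration):

```shell
# Summarize one line of cd-discid output passed as a single argument.
discid_summary() {
    set -- $1                                     # split the line into fields
    [ $# -ge 4 ] || return 1
    id=$1 ntracks=$2
    [ $# -eq $((ntracks + 3)) ] || return 1       # id + count + offsets + length
    shift $(($# - 1))                             # keep only the last field: seconds
    printf 'id=%s tracks=%d length=%d:%02d\n' "$id" "$ntracks" $(($1 / 60)) $(($1 % 60))
}
```

It would be called as discid_summary "$(cd-discid /dev/sr0)".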

The above is an audio CD and will not work for data disks. And unfortunately, just using disk-info does not suffice to identify data CDs. For this you need the full discid run. Here's an example of a home-made data CD:

$ cdrdao discid
Cdrdao version 1.2.4 - (C) Andreas Mueller <andreas@daneb.de>
/dev/sr0: TSSTcorp CDDVDW TS-L633A  Rev: TO01
Using driver: Generic SCSI-3/MMC - Version 2.0 (options 0x0000)


Track   Mode    Flags  Start                Length
------------------------------------------------------------
 1      DATA    4      00:00:00(     0)     42:53:34(193009)
Leadout DATA    4      42:53:34(193009)

PQ sub-channel reading (data track) is supported, data format is BCD.
Raw P-W sub-channel reading (data track) is supported.
Cooked R-W sub-channel reading (data track) is supported.
Analyzing track 01 (MODE1): start 00:00:00, length 42:53:34...

CDDB: Connecting to cddbp://freedb.freedb.org:888 ...
CDDB: Ok.
ERROR: CDDB: QUERY failed: 502 Already performed a query for disc ID: 00000000
ERROR: Querying of CDDB server failed.

Here the blkid command identifies ISO volumes fairly well:

$ blkid /dev/cdrom
/dev/cdrom: UUID="2008-10-22-16-06-25-00" LABEL="EntrevueDeb_fr, 22 oct 2008" TYPE="iso9660"

But it will fail on most audio CDs, except if they have a multimedia track at the beginning, which confuses things. Indeed, CDs can have different tracks of different types, so it can be difficult to tell things apart. For example, here's a mixed data/audio CD ("Burn your TV", from Orange Seeds):

$ cdrdao read-toc --fast-toc tocfile
Cdrdao version 1.2.4 - (C) Andreas Mueller <andreas@daneb.de>
/dev/sr0: TSSTcorp CDDVDW TS-L633A  Rev: TO01
Using driver: Generic SCSI-3/MMC - Version 2.0 (options 0x0000)

Reading toc data...

Track   Mode    Flags  Start                Length
------------------------------------------------------------
 1      DATA    4      00:00:00(     0)     17:01:32( 76607)
 2      AUDIO   2      17:01:32( 76607)     02:23:42( 10767)
 3      AUDIO   0      19:24:74( 87374)     01:48:68(  8168)
 4      AUDIO   0      21:13:67( 95542)     04:57:57( 22332)
 5      AUDIO   0      26:11:49(117874)     02:48:09( 12609)
 6      AUDIO   0      28:59:58(130483)     02:29:61( 11236)
 7      AUDIO   0      31:29:44(141719)     05:16:68( 23768)
 8      AUDIO   0      36:46:37(165487)     02:01:02(  9077)
 9      AUDIO   0      38:47:39(174564)     01:45:03(  7878)
10      AUDIO   0      40:32:42(182442)     00:30:28(  2278)
11      AUDIO   0      41:02:70(184720)     01:03:03(  4728)
12      AUDIO   0      42:05:73(189448)     04:37:69( 20844)
Leadout AUDIO   0      46:43:67(210292)

PQ sub-channel reading (data track) is supported, data format is BCD.
Raw P-W sub-channel reading (data track) is supported.
Cooked R-W sub-channel reading (data track) is supported.
PQ sub-channel reading (audio track) is supported, data format is BCD.
Raw P-W sub-channel reading (audio track) is supported.
Cooked R-W sub-channel reading (audio track) is supported.
Analyzing track 01 (MODE1): start 00:00:00, length 16:59:32...
Analyzing track 02 (AUDIO): start 17:01:32, length 02:23:42...
Found pre-gap: 00:02:00
Analyzing track 03 (AUDIO): start 19:24:74, length 01:48:68...
Analyzing track 04 (AUDIO): start 21:13:67, length 04:57:57...
Analyzing track 05 (AUDIO): start 26:11:49, length 02:48:09...
Analyzing track 06 (AUDIO): start 28:59:58, length 02:29:61...
Analyzing track 07 (AUDIO): start 31:29:44, length 05:16:68...
Analyzing track 08 (AUDIO): start 36:46:37, length 02:01:02...
Analyzing track 09 (AUDIO): start 38:47:39, length 01:45:03...
Analyzing track 10 (AUDIO): start 40:32:42, length 00:30:28...
Analyzing track 11 (AUDIO): start 41:02:70, length 01:03:03...
Analyzing track 12 (AUDIO): start 42:05:73, length 04:37:69...

Reading of toc data finished successfully.

Notice the first track is a DATA track (reported as MODE1 in the analysis lines further down). Extracting this disk will require first running ddrescue on the first track and whipper on the rest. We'll see how it goes...
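Telling data, audio and mixed-mode disks apart can be automated from that track table. This is a sketch, assuming the table keeps the "number, mode" column layout shown above; the function name classify_tracks and its output format are made up for illustration. It reads cdrdao output on stdin and counts only the numbered track rows, so the Leadout and "Analyzing track" lines are ignored.

```shell
# Classify a disc as data, audio, mixed or unknown from a cdrdao track table.
classify_tracks() {
    awk '
        $1 ~ /^[0-9]+$/ && $2 == "DATA"  { data++ }
        $1 ~ /^[0-9]+$/ && $2 == "AUDIO" { audio++ }
        END {
            if (data && audio) kind = "mixed"
            else if (data)     kind = "data"
            else if (audio)    kind = "audio"
            else               kind = "unknown"
            printf "%s data=%d audio=%d\n", kind, data, audio
        }'
}
```

For example: cdrdao read-toc --fast-toc tocfile 2>&1 | classify_tracks. A "mixed" answer means the disk needs the mixed-mode procedure rather than plain ddrescue or whipper.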

The isoinfo command, part of the genisoimage package, can provide extended information on data disks. For example, here's the information available in the "Burn Your TV" multimedia disk:

$ isoinfo -d -i /dev/sr0
CD-ROM is in ISO 9660 format
System id: FreeBSD
Volume id: Burn Your TV Multimedia
Volume set id: 
Publisher id: 
Data preparer id: 
Application id: MKISOFS ISO 9660/HFS FILESYSTEM BUILDER & CDRECORD CD-R/DVD CREATOR (C) 1993 E.YOUNGDALE (C) 1997 J.PEARSON/J.SCHILLING
Copyright File id: 
Abstract File id: 
Bibliographic File id: 
Volume set size is: 1
Volume set sequence number is: 1
Logical block size is: 2048
Volume size is: 76352
Joliet with UCS level 3 found
Rock Ridge signatures version 1 found
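The block size and volume size in that output give the expected size of the ISO filesystem, which is handy for checking a ripped image. A minimal sketch that reads isoinfo -d output on stdin (the function name iso_expected_bytes is made up for illustration):

```shell
# Print the expected ISO filesystem size in bytes:
# "Logical block size" multiplied by "Volume size" (a block count).
iso_expected_bytes() {
    awk -F': *' '
        /^Logical block size is:/ { bs = $2 }
        /^Volume size is:/        { n = $2 }
        END {
            if (bs && n) printf "%.0f\n", bs * n
            else exit 1
        }'
}
```

For the disk above this gives 76352 × 2048 = 156368896 bytes. An image ripped by ddrescue may be somewhat larger than that because of padding after the filesystem, but an image smaller than the expected size is a sign that the rip stopped short.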

Remaining work

All the archives created were dumped in the ~/archive or ~/mp3 directories on curie. Data needs to be deduplicated, replicated, and archived somewhere more logical.

Inventory

I have a bunch of piles:

All disks were eventually identified as trash, blanks, perfect, finished, defective, or not processed. A special "needs attention" stack served as the "to do" pile and was gradually sorted into the other piles. Each pile was labeled with a sticky note and taped together summarily.

This page was printed and included in the box, along with a post-it linking to the blog post announcing the work, for posterity.

Here is a summary of what's in the box.

Type           Count  Note
trash          13     non-recoverable: not detected by the Linux kernel at all, and no further attempt has been made to recover them
blanks         3      never written to, still usable
perfect        28     successfully archived, without errors
finished       4      almost perfect, but mixed-mode or multi-session
defective      21     found to have errors but not considered important enough to re-process
total          69
not processed  ~100   visual estimate

Note that this might be due to poor storage on my part: others have had more luck with their CDs; see this report from a fellow Debian developer, for example.

References

I'm following the path blazed by jmtd here and here. (Update: inspired by this very post, jmtd collected his notes in a new page.) The forensics wiki also has docs on ddrescue which were useful.

Tools used:

Other tools:
