Converting my old content to markdown

So I have just converted all the content on this blog to markdown. It was rather painful. I had really old content in here, going as far back as 2005, and I went through about 3 distinct markup filters over the years, most of which were irregular and changed according to the position of the sun, the drupal.org releases and wind speed. Now it's all markdown. This involved patience, drush and 3 hours of wasted time.

Now, the fact that Markdown picked up speed has always seemed a little strange to me. The syntax isn't particularly complete, which leads to non-standard extensions like markdown-extra popping up, with the inevitable variations according to the language. Github, for example, has its own flavor of the famous markup. Finally, Drupal's filters are kind of clunky: the usual <url> markup doesn't work.

So things are a little weird, but Markdown seems to be here to stay, or at least it's the only markup I have seen supported reliably across multiple CMSes and sites. One has to wonder why we are still stuck with plain old HTML on Drupal.org...

The actual conversion

The conversion was rather annoying. I had to track down all those formats, which mostly meant converting a wiki-like syntax from the freelinking module to markdown. (It's actually more complicated than that, because there was also the simplewiki filter, but let's ignore that: there were only a few of those and I just did them by hand.)

In the end, I arrived at the following script:

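<?php
# Callback for preg_replace_callback(): turn one [[target|label]] wiki link
# into a markdown link. $match[0] is the whole match, $match[1] the target
# and $match[2] the optional label.
function wiki2mdwn($match) {
  $orig = $match[0];
  if (count($match) > 2) {
    $mdwn = "[" . $match[2] . "](" . $match[1] . ")";
  } else {
    $mdwn = "[" . $match[1] . "](" . $match[1] . ")"; # hack: drupal fails on bare <url> links
  }
  print "$orig\t=>\t$mdwn\n";
  return $mdwn;
}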

$q = db_query("select node.nid, format, FROM_UNIXTIME(created) AS c, body, teaser, node.title from node_revisions inner join node on node.vid = node_revisions.vid where format = 1 AND ( teaser like '%[[%' OR body like '%[[%' ) order by created LIMIT 1;");

while ($row = db_fetch_object($q)) {
  print $row->nid . " | " . $row->format . " | " . $row->c . " | " . $row->title . "\n";
  $node = null;
  foreach (array('teaser', 'body') as $part) {
    print "checking $part... ";
    $newpart = preg_replace_callback('/\[\[([^]\|]*)(?:\|([^]]*))?\]\]/', 'wiki2mdwn', $row->$part);
    if ($newpart != $row->$part) {
      print "replacement... ";
      if (is_null($node)) {
        $node = node_load($row->nid);
        print "node loaded... ";
      }
      $node->$part = $newpart;
    }
  }
  if (!is_null($node)) {
    node_save($node);
    print "node {$node->nid} saved... ";
  }
  print "\n";
}

$q = db_query("SELECT nid, cid,FROM_UNIXTIME(timestamp),format, subject, comment FROM comments WHERE format = 1 AND comment LIKE '%[[%' ORDER BY cid LIMIT 1;");

while ($row = db_fetch_object($q)) {
  print "checking comment {$row->cid} in node {$row->nid} with subject {$row->subject}... ";
  $newcom = preg_replace_callback('/\[\[([^]\|]*)(?:\|([^]]*))?\]\]/', 'wiki2mdwn', $row->comment);
  print "\nsaving... ";
  db_query("UPDATE comments SET comment = '%s' WHERE cid = %d", $newcom, $row->cid);
  print "comment {$row->cid} in node {$row->nid} saved.\n";
}

Yes. This is clunky and ugly. But it works. If you have more than... say... 200 nodes or comments to convert, I would strongly recommend optimizing this into SQL directly, but I was worried I would break stuff, so I preferred going through preg_replace_callback() rather than plain SQL.
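Since the whole point of going through preg_replace_callback() is to avoid breaking stuff, the rewrite can also be tried on plain strings before it touches the database. What follows is only a minimal sketch, not the script above: the function names `wiki2mdwn_dryrun` and `convert_freelinks` and the sample strings are made up for illustration, and the pattern assumes freelinking links look like `[[target]]` or `[[target|label]]`.

```php
<?php
// Standalone dry run of the wiki-link rewrite; no Drupal or drush needed.
// Function names and sample strings are hypothetical.

function wiki2mdwn_dryrun($match) {
  // $match[1] is the link target. $match[2] (the label) only exists when
  // the link was written as [[target|label]]; otherwise reuse the target.
  $label = count($match) > 2 ? $match[2] : $match[1];
  return "[" . $label . "](" . $match[1] . ")";
}

function convert_freelinks($text) {
  // Target: anything up to "|" or "]". Optional label: anything up to "]".
  return preg_replace_callback('/\[\[([^]|]*)(?:\|([^]]*))?\]\]/', 'wiki2mdwn_dryrun', $text);
}

$samples = array(
  'See [[http://example.com/|the example site]] for details.',
  'Raw link: [[http://example.com/]]',
  'No wiki links here.',
);

foreach ($samples as $sample) {
  print $sample . "\n  => " . convert_freelinks($sample) . "\n";
}
```

Running this with plain `php` prints each sample next to its converted form, which makes it easy to eyeball odd cases before pointing the real script at the node and comment tables.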

Oh, and this is a drush snippet, for those who don't know about that (rather old) drush feature, by the way. :) To run this, you basically dump this in a file and run it:

drush @anarcat.koumbit.org wiki2mdwn.php

Notice how I use a drush alias there: this one is automatically created by the Aegir instance this site lives on. Time saver.
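For those not running Aegir, an alias can also be written by hand. This is a rough sketch, assuming a drush version that reads site aliases from `~/.drush/aliases.drushrc.php`; the alias name, hostname and path are invented for the example and are not what Aegir generates.

```php
<?php
// ~/.drush/aliases.drushrc.php
// Hypothetical hand-written alias; all values below are placeholders.
$aliases['myblog'] = array(
  'uri'  => 'blog.example.com',   // the site's URI, as given to drush --uri
  'root' => '/var/www/drupal',    // the Drupal root, as given to drush --root
);
```

With something like that in place, `@myblog` would stand in wherever the Aegir-generated alias is used above.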

So long and annoying, but at long last done!

I know right! It's awesome. At long last, standards! Or somehow standard.
Comment by anarcat late Saturday evening, December 15th, 2012
Why would you want to convert existing content to markdown, if the main advantage (to me) is the enhanced writing experience? Why not just skip that, write new content using markdown, and optionally convert legacy content to HTML if it wasn't already?
Comment by dasjo early Sunday morning, December 16th, 2012
That's a good point, I didn't actually think of that. I guess it was a tad harder to hook into the filter system than to just do simple regex replacements and node_save(), which I am more familiar with. The possibility of easily editing previous content is also attractive.

I was already writing new content in markdown, though...

Comment by anarcat Sunday afternoon, December 16th, 2012