Export a WordPress XML file to separate HTML files

I needed to quickly export all the articles in a WordPress install to separate HTML files.  There were over 400 posts, so copying and pasting was not an option.  The quickest way to do this was to export everything using the built-in export option, then process the resulting XML file with PHP.

Here is the quick and hacky code I wrote for this specific job.  It is not a good example of PHP code, but it did the job required.  The str_replace lines strip out the characters that caused filename problems on this particular site.  You will need to delete or modify these to suit your own file naming issues.  If you comment out the file_put_contents line, the script only prints the filenames it would write, so you can spot any filename issues before creating files.

This is a php-cli script so don’t try to run it in your browser.

Here is the code to convert the WordPress XML export file to separate HTML files.

#!/usr/bin/php -q
<?php
/* Really nasty, really fast hack to extract all WordPress articles to .html files.
 *	Test before using.  It worked for a particular site to save time copying and pasting.
 *	There is no guarantee it will work for you.
 */

$exportFile = "mycms.wordpress.2016-01-27.xml";
$file = file_get_contents($exportFile);
$xml = ($file === false) ? false : simplexml_load_string($file);
if ($xml === false) {
	// Stop early rather than looping over a failed parse.
	fwrite(STDERR, "Could not read or parse $exportFile\n");
	exit(1);
}

foreach ($xml->channel->item as $item) {
	// Build the filename from the post title.
	$filename = (string)$item->title;
	$filename = str_replace(" ", "-", $filename);

	// Characters that caused filename problems on this site; adjust for yours.
	$special_chars = array("/", "(", ")", ",", ";", ":", "'", ".");
	$filename = str_replace($special_chars, "", $filename);
	$filename = str_replace("---", "-", $filename);

	$filename = strtolower($filename) . ".html";

	echo "writing $filename\n";

	// The post body lives in the <content:encoded> element.
	$content = $item->children("content", true);
	$content = (string)$content->encoded;

	// Comment out the next line for a dry run that only prints filenames.
	file_put_contents($filename, $content);
}
?>
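
The str_replace chain above only covers the characters that turned up in one site's titles. A more general alternative, sketched here as an assumption and not what the script above does, is to turn every run of non-alphanumeric characters into a single hyphen. The function name title_to_filename and the "untitled" fallback are made up for illustration. Note that it hyphenates punctuation instead of deleting it (so "it's" becomes "it-s"), and that non-ASCII letters are treated as separators too.

```php
<?php
// Hypothetical helper (not part of the original script): turn a post title
// into a lowercase, hyphen-separated .html filename.
function title_to_filename(string $title): string
{
    $slug = strtolower($title);
    // Replace each run of characters other than a-z and 0-9 with one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    // Drop any hyphen left at the start or end.
    $slug = trim($slug, '-');
    // Assumed fallback name for titles with no usable characters.
    if ($slug === '') {
        $slug = 'untitled';
    }
    return $slug . '.html';
}

echo title_to_filename("Hello, World: A (Quick) Test!"), "\n";
// prints "hello-world-a-quick-test.html"
```

Two different titles can still collapse to the same filename, so a dry run that only prints the names is worth doing before writing any files.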

I then used the following to convert all the HTML files to Markdown using pandoc.  The markdown directory has to exist before you run it.

find ./ -iname "*.html" -type f -exec sh -c 'pandoc "${0}" -o "./markdown/$(basename "${0%.html}").md"' {} \;

You can get a cleaner version of this, with some file filtering, from my GitHub account.
https://gist.github.com/karlgray/c3ab17615b3c0f712cb4144a4734c25b
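
One filter worth considering, whether or not the gist does the same: a full WordPress export also contains an item for each page, attachment and menu entry, not just posts. Each item carries wp:post_type and wp:status elements, so a minimal sketch for keeping only published posts might look like the following. The helper name is_published_post and the two-item sample document are made up for illustration.

```php
<?php
// A tiny stand-in for a WordPress export (made-up data, two items).
$sample = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>First Post</title>
      <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
      <wp:post_type>post</wp:post_type>
      <wp:status>publish</wp:status>
    </item>
    <item>
      <title>header-image</title>
      <content:encoded><![CDATA[]]></content:encoded>
      <wp:post_type>attachment</wp:post_type>
      <wp:status>inherit</wp:status>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($sample);

// Hypothetical helper: true only for items that are published posts.
function is_published_post(SimpleXMLElement $item): bool
{
    // Read the child elements that use the "wp" namespace prefix.
    $wp = $item->children("wp", true);
    return (string)$wp->post_type === "post"
        && (string)$wp->status === "publish";
}

$kept = [];
foreach ($xml->channel->item as $item) {
    if (is_published_post($item)) {
        $kept[] = (string)$item->title;
    }
}

echo implode("\n", $kept), "\n";
// prints "First Post"
```

In the main script the same check would go at the top of the foreach loop, followed by a continue for anything that is not a published post.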
