RetroRoleplaying: The Blog Disappears and (Fortunately) Returns
Talysman

Started trying to fix the script so it will handle the huge archive-style Atom XML file you get via Takeout or the dashboard. So far, it's not going too well. I think that huge Blogger template is screwing things up. It's certainly behaving in unexpected ways.

Randall

Talysman: +++It's just a filter; I didn't try to use any of the Perl modules for XML handling or anything like that; it just loads lines one at a time, drops them if it matches certain Atom elements, and converts them to something usable if it's one of about 8-10 different elements.+++

Sometimes simple is best. I remember writing a quick awk filter in the late 1990s to allow Aladdin (the offline reader for the GEnie Online Service) to work with another service whose name I forget. It dropped everything from the text capture of the session that would mess up Aladdin's parsing of the file, and it made minor word changes and the like so topic and area name labels matched what Aladdin was expecting. Quick and dirty, but it did the job.

Talysman

I should add, though, that the reason I haven't tried the script on the big file is that I used an XSL stylesheet to transform it into an HTML version. I downloaded something called XML Wrench and then found an Atom 2 HTML XSLT … there's a menu option that lets you perform a conversion using that stylesheet.

Talysman

It should work with the huge Atom file, too, although I didn't test it. It's just a filter; I didn't try to use any of the Perl modules for XML handling or anything like that; it just loads lines one at a time, drops them if it matches certain Atom elements, and converts them to something usable if it's one of about 8-10 different elements. It will accept input from standard input or a glob pattern, and it creates one .html file for every .xml file.
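For anyone who wants to try the same idea, here is a minimal sketch of a line-at-a-time Atom filter. It is written in Python rather than Perl, and it is not the script described above: the element names in `DROP` and `CONVERT`, and the function names `filter_lines` and `convert_files`, are illustrative assumptions. Like any line-based filter, it only catches elements that open and close on the same line.

```python
import glob
import html
import re
from pathlib import Path

# Hypothetical rule tables: these element names are guesses for
# illustration, not the list the Perl script actually uses.
DROP = re.compile(r"^\s*<(id|updated|published|link|category|generator)\b")


def _content_to_div(match):
    # Atom usually stores post bodies as escaped HTML; unescape it.
    return "<div>" + html.unescape(match.group(1)) + "</div>"


CONVERT = [
    (re.compile(r"<title[^>]*>(.*?)</title>"), r"<h1>\1</h1>"),
    (re.compile(r"<content[^>]*>(.*?)</content>"), _content_to_div),
]


def filter_lines(lines):
    """Yield output lines, dropping or converting one input line at a time."""
    for line in lines:
        if DROP.match(line):
            continue  # an element we do not want in the HTML
        for pattern, replacement in CONVERT:
            line = pattern.sub(replacement, line)
        yield line


def convert_files(pattern):
    """Write one .html file next to every .xml file matching the glob."""
    for name in glob.glob(pattern):
        src = Path(name)
        dest = src.with_suffix(".html")  # replaces only the final extension
        with src.open(encoding="utf-8") as fin, \
                dest.open("w", encoding="utf-8") as fout:
            fout.writelines(filter_lines(fin))
```

Calling `convert_files("*.xml")` handles the glob case, and `sys.stdout.writelines(filter_lines(sys.stdin))` covers standard input.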

I should probably add a filter to get rid of Blogger's template information before trying it on an all-in-one Atom file. Also need to fix it so that it can handle both foo.xml and foo.comments.xml without overwriting the former.
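One way such a template filter could look, sketched in Python with the standard `xml.etree.ElementTree` parser instead of a line filter. The big assumption here is that every entry in the all-in-one file carries a `<category>` whose `term` ends in something like `kind#post`, `kind#comment`, or `kind#template`; check your own export before relying on that. The names `KEEP_KINDS` and `strip_non_posts` are made up for this example.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Assumption: entries are labelled by a category term ending in one of
# these strings; anything else (template, settings) gets dropped.
KEEP_KINDS = ("kind#post", "kind#comment")


def strip_non_posts(feed_xml):
    """Parse an Atom feed string and remove entries that are not posts or comments."""
    root = ET.fromstring(feed_xml)
    for entry in list(root.findall(ATOM + "entry")):
        terms = [c.get("term", "") for c in entry.findall(ATOM + "category")]
        if not any(term.endswith(KEEP_KINDS) for term in terms):
            root.remove(entry)
    return root
```

This loads the whole feed into memory, which may be a problem for a very large archive; a streaming parser would be the next step if so.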

Thinking about adding a couple other scripts for blog backup twiddling, like one that lets you merge or filter out posts with one or more tags, or one that merges comments with their parent post. Might also need one that fixes file names so they aren't as cumbersome; the backup utility spits out XML files that include date and time of post in the file name, like: 20140104T150000-RetroRoleplaying_ the Blog Disappears and (Fortunately) Returns .xml
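A rough sketch of what the file-name fixer might do, in Python rather than Perl. It assumes the backup names always start with a `YYYYMMDDTHHMMSS-` stamp like the example above and that comment files end in `.comments.xml`; the function name `short_name` and the five-word cutoff are arbitrary choices for illustration.

```python
import re
from pathlib import Path

# Date and time stamp followed by a dash and the post title.
NAME = re.compile(r"^(\d{4})(\d{2})(\d{2})T\d{6}-(.*)$")


def short_name(filename, max_words=5):
    """Return a shorter name: date, a few title words, original extension."""
    path = Path(filename)
    stem, extra = path.stem, ""
    # Keep a ".comments" marker so post and comment files stay distinct.
    if stem.endswith(".comments"):
        stem, extra = stem[: -len(".comments")], ".comments"
    match = NAME.match(stem)
    if not match:
        return path.name  # not a backup-style name; leave it alone
    year, month, day, title = match.groups()
    words = re.findall(r"[A-Za-z0-9]+", title.lower())
    slug = "-".join(words[:max_words])
    return f"{year}-{month}-{day}-{slug}{extra}{path.suffix}"
```

With the example name above, this gives `2014-01-04-retroroleplaying-the-blog-disappears-and.xml`. Two posts on the same day with the same first five title words would still collide, so a real version would need a check for that.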

Randall

Talysman: I can import that huge Atom file into Drupal if I have to (and probably WordPress as well).

Your program sounds interesting, however. Does it work with the huge Atom XML file or with the one-per-post version? Either way, it might be useful as I'm starting to play with Pandoc with the idea of writing things in some ASCII-only format, like one of the "improved" versions of Markdown, so I could use git to handle versioning instead of many Word files with timestamps in their names.

Talysman

Not sure if it's the same with Takeout, but using the dashboard backup feature gets you a huge Atom file, which I find mostly useless, since few things other than news readers let you do anything with it. Using the Blogger Backup utility also gets you Atom format backups, but you can specify one file per post, for example, which may give you more manageable file sizes.

I used the utility and then wrote a crude Perl script that converts Atom XML files to HTML files. Pandoc can then convert these to various other formats like Markdown, which might be easier to deal with. Let me know if that sounds useful.
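The Pandoc step could be scripted along these lines. This is a Python sketch, not part of the Perl script; it assumes `pandoc` is installed and on the PATH, and the helper names `pandoc_command` and `convert` are invented for the example. The `-f`, `-t`, and `-o` options are Pandoc's standard input-format, output-format, and output-file flags.

```python
import subprocess
from pathlib import Path


def pandoc_command(html_path, to_format="markdown"):
    """Build the pandoc argument list for converting one HTML file."""
    src = Path(html_path)
    dest = src.with_suffix(".md")  # assumes a Markdown-family target
    return ["pandoc", "-f", "html", "-t", to_format, "-o", str(dest), str(src)]


def convert(html_path, to_format="markdown"):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(html_path, to_format), check=True)
```

Looping `convert` over the .html files from the previous step would give one Markdown file per post, which should diff reasonably well under version control.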

Randall

Huzzahs certainly are in order.

Rachel Ghoul

Everything I know is right once again! Huzzahs are in order!