Hey, even more on character encoding

Submitted by Herb on Sun, 04/27/2008 - 03:05

OK, let's say I have to downgrade, if you will, a file that was pushed to me as valid XML with UTF-8 character encoding to ISO-8859-1 HTML. What follows is the only way I have found to do this that is both easily scriptable and uses common command-line tools in a *nix environment (including OS X!). If there's a better way, or if any of this frankencode can be improved on, please leave a note in the comments :)

The XML in question is coming from InDesign CS2, and I've seen to it that the XML tags used in the document are actually HTML tags, so that part is taken care of.

The next step, since there are special characters in the UTF-8 file that have no ISO-8859-1 equivalent, like fancy (curly) quotes and apostrophes, is to replace them with their ASCII equivalents *before* converting the rest of the file. HTML Tidy seems to be the best tool for the job, using the -b flag to strip the fanciness from those characters.

[After testing this a bit more, I've added the -wrap 0 flag and specified xml]

tidy -q -b -xml -wrap 0 -utf8 filename
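To make the "fancy quotes to ASCII" step concrete, here is a minimal sketch of the same idea done with sed alone. It is not what tidy -b does internally and it only covers the four curly quote characters; the function name ascii_quotes is made up for illustration.

```shell
# Hypothetical fallback: map the four curly quote characters to their
# straight ASCII equivalents. LC_ALL=C makes sed match the raw UTF-8 bytes.
ascii_quotes() {
    LC_ALL=C sed -e 's/“/"/g' -e 's/”/"/g' -e "s/‘/'/g" -e "s/’/'/g"
}

# Feed it a curly-quoted sample (octal escapes spell out the UTF-8 bytes).
printf '\342\200\234hi\342\200\235 it\342\200\231s\n' | ascii_quotes
```

Tidy handles more characters than this, so the sketch is only useful as an illustration or where tidy is not available.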

Tidy will also do a bunch more problem solving, like balancing tags and adding proper HTML head, title and body tags (I specified -xml so this doesn't happen). I don't need the XML declaration, so it gets stripped on the fly using sed.

tidy -q -b -xml -wrap 0 -utf8 filename | sed '1,1d'
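The sed expression '1,1d' simply deletes line 1, whatever it contains ('1d' is equivalent). A slightly more defensive sketch, assuming the declaration sits alone on the first line, deletes that line only when it actually starts with <?xml. The function name strip_xml_decl is made up for illustration.

```shell
# Delete line 1 only if it begins with an XML declaration;
# any other first line is passed through untouched.
strip_xml_decl() {
    sed '1{/^<?xml/d;}'
}

printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>' '<p>Hello</p>' | strip_xml_decl
```

This avoids silently eating the first line of real content if a file ever arrives without the declaration.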

Once the file's in good shape I can convert it with iconv:

/usr/bin/iconv -c -f UTF-8 -t iso-8859-1 filename >newfilename
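The -c flag tells iconv to drop characters it cannot convert instead of stopping, which is why the curly quotes need to be dealt with first. A small sketch of that behaviour, using a made-up sample string and helper names (sample, to_latin1):

```shell
# "café" plus a curly-quoted word, spelled out as UTF-8 bytes in octal.
sample() {
    printf 'caf\303\251 \342\200\234hi\342\200\235\n'
}

# Same conversion as above, reading stdin instead of a file.
to_latin1() {
    iconv -c -f UTF-8 -t ISO-8859-1
}

# Dump the converted bytes: the é should survive as a single byte,
# while the curly quotes, which have no ISO-8859-1 equivalent, are dropped.
sample | to_latin1 | od -An -c
```

Some iconv builds may still exit with a non-zero status when -c had to drop something, so a script that checks exit codes should allow for that.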

Two for loops in a bash script and we should have converted files sitting in a 'done' folder:

mkdir -p done

for f in *.xml; do
  tidy -q -b -xml -wrap 0 -utf8 "$f" 2>&1 | more | sed '1,1d' > "$f.tidy"
done

for i in *.xml.tidy; do
  /usr/bin/iconv -c -f UTF-8 -t iso-8859-1 "$i" > "done/$i.conv"
done

The more pipe is a workaround for suppressing tidy's messages; I found it on Dave Raggett's Tidy page.
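For comparison, the two passes can probably be collapsed into a single pipeline per file. This is only a sketch under a few assumptions: it discards tidy's messages with 2>/dev/null rather than using the more pipe, it writes directly to a shorter .html name, and the function name convert_dir and its optional output-folder argument are made up.

```shell
# One pass per file: tidy -> strip first line -> iconv.
# $1 is the output folder (defaults to 'done').
convert_dir() {
    out=${1:-done}
    mkdir -p "$out"
    for f in *.xml; do
        [ -e "$f" ] || continue   # no .xml files: the glob stays literal
        tidy -q -b -xml -wrap 0 -utf8 "$f" 2>/dev/null \
            | sed '1d' \
            | iconv -c -f UTF-8 -t ISO-8859-1 > "$out/${f%.xml}.html"
    done
}
```

Run convert_dir from the folder that holds the .xml files. Quoting "$f" and globbing directly, rather than looping over ls output, keeps filenames with spaces intact.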

All that's left to do now is rename the files to something a bit shorter.
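The rename can be scripted too. A minimal sketch using shell parameter expansion to swap one suffix for another; the function name shorten_names and the suffixes in the usage line are assumptions, so adjust them to whatever the loops actually produced.

```shell
# Rename every file in directory $1 that ends in suffix $2 so that it
# ends in suffix $3 instead.
shorten_names() {
    dir=$1 old=$2 new=$3
    for f in "$dir"/*"$old"; do
        [ -e "$f" ] || continue          # nothing matched the glob
        mv -- "$f" "${f%"$old"}$new"     # strip old suffix, append new one
    done
}
```

For example, shorten_names done .xml.tidy.conv .html would turn done/story.xml.tidy.conv into done/story.html.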