Friday, August 07, 2009

Makeshift XML beautifier in python

Lately I have had to deal with some XML dumps. It's a pain to analyse XML that isn't properly indented. Browsers do the best job of rendering XML, and those collapsible elements are very handy. But they only work if the file is served with the right MIME type, which is why an XML dump saved to a local file and opened in a browser doesn't get the same treatment.

I am sure there are other XML beautifiers, but I couldn't find one that worked for me. (I am sure someone will post better solutions in the comments.) In the end, the following simple Python script did the trick. I found it here and corrected it a little to take care of self-closing (<.../>) tags. It worked perfectly on the many XML dumps I tried it on.

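#!/usr/bin/python
import sys
import re

data = open(sys.argv[1],'r').read()

# Split into tags and the text between them; the group keeps the tags.
fields = re.split('(<.*?>)',data)
level = 0
for f in fields:
    if f.strip() == '': continue
    if f[0]=='<' and f[1] != '/':
        # Opening tag: print at the current depth, then go one level deeper.
        print(' '*(level*4) + f)
        level = level + 1
        if f[-2:] == '/>':
            # Self-closing tag: undo the increment.
            level = level - 1
    elif f[:2]=='</':
        # Closing tag: step back out before printing.
        level = level - 1
        print(' '*(level*4) + f)
    else:
        # Text between tags.
        print(' '*(level*4) + f)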

It's all about keeping track of depth.
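
The same depth-tracking idea can be wrapped in a function, which makes it easier to try out on small strings. What follows is only a minimal sketch, not a real parser: the name beautify and the indent parameter are made up for illustration, and it assumes that anything starting with <? or <! (declarations, comments, doctypes) should be printed without changing the depth. Unlike the script above, it strips the whitespace around text runs.

```python
import re

def beautify(data, indent=4):
    """Return data re-indented, one tag or text run per line."""
    lines = []
    level = 0
    # The capturing group makes re.split keep the tags in the result.
    # Note: '.' does not match newlines, so tags spanning lines are not handled.
    for field in re.split(r'(<.*?>)', data):
        field = field.strip()
        if not field:
            continue
        if field.startswith('</'):
            # Closing tag: step back out before emitting the line.
            level = max(level - 1, 0)
            lines.append(' ' * (level * indent) + field)
        elif field.startswith('<?') or field.startswith('<!'):
            # Assumption: declarations and comments leave the depth unchanged.
            lines.append(' ' * (level * indent) + field)
        elif field.startswith('<'):
            lines.append(' ' * (level * indent) + field)
            # Self-closing tags such as <c/> open nothing.
            if not field.endswith('/>'):
                level += 1
        else:
            # Plain text between tags.
            lines.append(' ' * (level * indent) + field)
    return '\n'.join(lines)

print(beautify('<?xml version="1.0"?><a><b>text</b><c/></a>'))
```

Here the declaration and <a> stay at depth 0, <b> and <c/> land at depth 1, and the text inside <b> at depth 2.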

4 comments:

Steve Taylor said...

Thanks! This works great.

Anonymous said...

Thanks for your useful, simple script.

I extended it to preserve CDATA sections that contain HTML tags:

#!c:/Python26/python26.exe
# XMLBeautifier.py
# Source: http://jyro.blogspot.com/2009/08/makeshift-xml-beautifier-in-python.html
# modified by Thomas Haeny (dev@haeny.de), 9.8.2010
import sys
import re

# settings:
preserveCDATA = 1
intendCols = 4

data = open(sys.argv[1],'r').read()

fields = re.split('(<.*?>)',data)
level = 0
cdataFlag=0
cdata=''

for f in fields:
    if f.strip() == '': continue

    if preserveCDATA :
        # rejoin split CDATA sections which contain HTML tags
        # (the start/end tests below were swallowed by the page markup;
        # they are a best-guess reconstruction, not the original lines)
        if f[:8] == '<![CDATA' :
            cdataFlag=1
            cdata=''
        if cdataFlag :
            cdata = cdata + f
            if f[-3:] == ']]>' :
                cdataFlag=0
                print(' '*(level*intendCols) + cdata)
            continue

    if f[0]=='<' and f[1] != '/' and f[1] != '!' :
        print(' '*(level*intendCols) + f)
        level = level + 1
        if f[-2:] == '/>':
            level = level - 1

    elif f[:2]=='</':
        level = level - 1
        print(' '*(level*intendCols) + f)

    else:
        print(' '*(level*intendCols) + f)

You can decide to publish it or not.

Jayesh said...

Thanks for the improvements.

Over time, I found out about xmllint. It's a nice little utility that can beautify XML.
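
If you would rather stay in Python, the standard library's xml.dom.minidom can re-indent a document too. This is just a rough sketch under a couple of assumptions: the helper name pretty is made up, the input must be well-formed (unlike the regex script, a real parser refuses broken dumps), and whitespace-only lines are dropped, which may not be what you want inside CDATA.

```python
from xml.dom import minidom

def pretty(xml_text, indent='    '):
    """Parse xml_text and return it re-indented by the standard library."""
    dom = minidom.parseString(xml_text)
    out = dom.toprettyxml(indent=indent)
    # Input that was already indented yields whitespace-only lines; drop them.
    return '\n'.join(line for line in out.splitlines() if line.strip())

print(pretty('<a><b>text</b><c/></a>'))
```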

Anonymous said...

Very cool - I know there are more elaborate tools out there, but I needed it for just a couple files and your script is perfect.