Friday 25 July 2008

Some of the SAX parser issues I ran into...

1) SAX Parser issue No 1 : SAX won’t release the file

When I need to parse XML files in Java projects, I use a SAX parser.
I like using SAX. It’s quite easy.
Of course, there’s nothing like parsing in PHP (at least for me), but SAX can also be interesting…
In one of my projects I had this situation:
I had a thread that needed to parse N .xml documents, and in case of a SAXException or IOException the thread needed to delete the current file or move it to some new location.
It looks like a really easy thing to do, and it is! Well, if you know about and do a few extra things…
Probably everyone who has been in this situation noticed very soon that the file can’t be deleted or moved after the SAX parser has failed on it!
Why is that so?
Well, I don’t know if this is a bug or if it just could have been done better, but it seems that the XMLReader used for parsing does not “release” the document when an exception occurs: the stream it opened on the file stays open, and (at least on Windows) a file that is still open can’t be deleted or moved.
I cannot believe this is the way SAX works, but it seems like it is!
So what do we need to do in case of an exception? We need to somehow close the file explicitly.
And how can this be done?

I recommend that you do not parse with the

void parse(String systemId) method.

Use

void parse(InputSource input) instead!

This way you open the input stream yourself, so you can always close it explicitly when you need to “release” the document after an exception.

Here is an example:

Let’s make a class called XmlDocumentParser that implements ContentHandler:

public class XmlDocumentParser implements ContentHandler

and let’s give it a method doParse() that returns an object representing the parsed document data.

If we write it like this:


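// SAXParser here is presumably Xerces' org.apache.xerces.parsers.SAXParser,
// which implements XMLReader
public class XmlDocumentParser implements ContentHandler
{
    private String XML_FILE;   // holds the name of the document
    WantedObject wantedobject; // object that holds the XML data

    // (ContentHandler callback methods omitted)

    public WantedObject doParse() throws IOException, SAXException
    {
        XMLReader parser = new SAXParser();
        parser.setContentHandler(this);
        parser.parse(XML_FILE); // the parser opens the file itself; we hold nothing we could close
        return wantedobject;
    }
}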

we have no way to close the file when an exception occurs.

But if we use void parse(InputSource input) instead, and keep the stream in a field of the class:

BufferedInputStream stream; // a field, so the caller can close it later

public WantedObject doParse() throws IOException, SAXException
{
    stream = new BufferedInputStream(new FileInputStream(XML_FILE));
    XMLReader parser = new SAXParser();
    parser.setContentHandler(this);
    parser.parse(new InputSource(stream));
    return wantedobject;
}



and somewhere in the caller method, catch the exception, close the stream and do whatever you want with the file:

// caller method:
...
XmlDocumentParser xmldocumentparser = null;
WantedObject wantedobject = null;
...
// start parsing:
try
{
    xmldocumentparser = new XmlDocumentParser("c:\\example.xml"); // set XML_FILE
    wantedobject = xmldocumentparser.doParse();
}
catch (SAXException ex)
{
    // release the file first (note that close() can itself throw IOException)
    if (xmldocumentparser.stream != null)
        xmldocumentparser.stream.close();
    // delete the document or move it, or something else…
}
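
The pieces above can also be pulled together into one self-contained sketch. This is not the code from my project: it assumes Java 7 or newer (try-with-resources and java.nio.file.Files), it gets the XMLReader from the JDK's JAXP factory instead of instantiating SAXParser directly, and the names SafeSaxParse, parseOrQuarantine and rejectedDir are made up for the illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class SafeSaxParse {

    /**
     * Parses xmlFile with the given handler. If parsing fails, the file is
     * moved into rejectedDir. Returns true when the document parsed cleanly.
     */
    static boolean parseOrQuarantine(Path xmlFile, Path rejectedDir, DefaultHandler handler)
            throws IOException, ParserConfigurationException {
        // We open the stream ourselves, so we decide when it gets closed.
        try (InputStream stream = new BufferedInputStream(Files.newInputStream(xmlFile))) {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(stream));
            return true;
        } catch (SAXException | IOException ex) {
            // try-with-resources has already closed the stream at this point,
            // so the file is no longer held open and can be moved.
            Files.createDirectories(rejectedDir);
            Files.move(xmlFile, rejectedDir.resolve(xmlFile.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // usage: java SafeSaxParse <file.xml> <rejected-dir>
        boolean ok = parseOrQuarantine(Paths.get(args[0]), Paths.get(args[1]), new DefaultHandler());
        System.out.println(ok ? "parsed" : "moved to " + args[1]);
    }
}
```

The point is the same as before: because the stream is ours, it is closed before we touch the file, whether parsing succeeded or not.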




2) SAX Parser issue No 2 : Problem with large text inside tags


I ran into this problem when parsing XML files with large text inside tags.
When a tag value is a short text or a number, SAX will parse your document without a problem, but when a tag value is some large text, it can look as if SAX returns only a part of that text: the last part.
Why is this so?
Never assume that a SAX parser will hand you the entire string it finds inside a tag in one call. A SAX parser (like a great number of other parsers) may deliver that value in several chunks, and it is your job to put the pieces together… If your handler simply stores whatever each call gives it, every chunk overwrites the previous one and you are left with only the last piece.

Let’s see how this works:
In SAX, in order to read tag values we need to implement, among other methods, the characters() method.


StringBuffer sb = new StringBuffer();
Person person = new Person();


public void characters(char[] arg0, int arg1, int arg2) throws SAXException
{
    sb.append(arg0, arg1, arg2);     // add this chunk to what we already have
    String s = sb.toString().trim(); // everything collected so far for the current tag
    if (!s.equals(""))
    {
        // currTag holds the name of the element being read
        if (currTag.equalsIgnoreCase("name"))
            person.setName(s);
        // ...
        // do all the things you want…
    }
}


So, do not just build a string from a single call:

String s = new String(arg0, arg1, arg2);

and assume that’s all there is inside the current tag.

Use a StringBuffer instead and append every chunk to it…

and, of course, call

sb.setLength(0);

at the start of the startElement() and endElement() methods.
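
To see the chunks being put back together, here is a small self-contained sketch. It is a common variant of the handler above, not the same code: it reads the buffer once in endElement() instead of on every characters() call, and it uses StringBuilder, the unsynchronized sibling of StringBuffer. ChunkCollector and its methods are hypothetical names, and how many chunks you actually get depends on the parser, so the sketch only counts them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of a <name> element.
class ChunkCollector extends DefaultHandler {
    private final StringBuilder sb = new StringBuilder();
    private int chunks;      // characters() calls since the last element boundary
    private String name;     // complete text of the last <name> element
    private int nameChunks;  // number of chunks that text arrived in

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        sb.setLength(0);     // a new element starts: forget earlier text
        chunks = 0;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        sb.append(ch, start, length); // may be called several times per element
        chunks++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("name")) {
            name = sb.toString().trim(); // only now is the text complete
            nameChunks = chunks;
        }
        sb.setLength(0);
        chunks = 0;
    }

    String getName() { return name; }
    int getNameChunks() { return nameChunks; }

    static ChunkCollector parse(String xml) throws Exception {
        ChunkCollector handler = new ChunkCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            big.append("lorem ipsum dolor sit amet &amp; ");
        }
        ChunkCollector h = parse("<person><name>" + big + "</name></person>");
        System.out.println("characters() calls: " + h.getNameChunks());
        System.out.println("collected length:   " + h.getName().length());
    }
}
```

Whatever the number of calls turns out to be on your parser, the collected text is the whole value, because no single chunk is ever treated as the complete string.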


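There is also another way to handle a tag whose text contains markup: put the text in a CDATA section:

<![CDATA[ ... ]]>

This is the way of saying to the parser “Do not parse what’s in it, just take it all as plain text…”. This is used when some other XML is inside some tag, for example. Keep appending the chunks in characters() here too, because the parser may still deliver a CDATA section in several pieces.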
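
Assuming the approach described here is a CDATA section, a small sketch shows what the handler actually receives. CdataDemo and textOf are hypothetical names, and the parser is the JDK's default JAXP one. The inner markup arrives through characters() as plain text, and I still append every call to a buffer, because nothing in SAX promises that a CDATA section comes in one piece.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class CdataDemo {

    /** Returns all character data found in the document, concatenated. */
    static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // CDATA content is reported here too
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The <inner> tag and the bare '<' sit inside CDATA, so they are not parsed as markup.
        String xml = "<msg><![CDATA[<inner>1 < 2</inner>]]></msg>";
        System.out.println(textOf(xml)); // prints <inner>1 < 2</inner>
    }
}
```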

3 comments:

Anonymous said...

you might want to look at vtd-xml as the best alternative for SAX

http://vtd-xml.sf.net

Darko Kalinic said...

Thank you, I never used vtd-xml. I'll probably give it a try.

Anonymous said...

Life saver