Showing posts with label SAX Parser. Show all posts

Tuesday, 16 September 2008

Parsing XML: PHP vs Java

Let's see how much time a PHP programmer spends parsing an XML file.
Here is our test XML file:


and this is how I would parse this document in PHP:


and the result would be:
 
Nice, don't you think?
And this is how I would parse the same file in Java:


Main.java:

package me.tcom.kdarko.blog;

import java.io.IOException;
import java.util.ArrayList;
import org.xml.sax.SAXException;

public class Main
{
   
    private ArrayList<Employee> arraylist = null;

    public static void main(String[] args)
    {
        try
        {
            Main m = new Main();
            m.arraylist = new EmployeesListParser("list_of_employees.xml").doParse();
            // Print every employee collected by the parser
            for (Employee e : m.arraylist)
                System.out.println("Employee: ID: " + e.getId() + " , " + e.getName() + "  " + e.getLast_name());
        }
        catch (IOException ex)
        {
            System.out.println(ex.getMessage());
        }
        catch (SAXException ex)
        {
            System.out.println(ex.getMessage());
        }
        catch (Exception ex)
        {
            System.out.println(ex.getMessage());
        }
    }

}
Employee.java:

package me.tcom.kdarko.blog;

public class Employee
{

    String id , name, last_name;

    public Employee() {
    }
  
    public Employee(String id, String name, String last_name) {
        this.id = id;
        this.name = name;
        this.last_name = last_name;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getLast_name() {
        return last_name;
    }

    public void setLast_name(String last_name) {
        this.last_name = last_name;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
       
}
EmployeesListParser.java:

package me.tcom.kdarko.blog;

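import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class EmployeesListParser implements ContentHandler
{
    private String XML_FILE_NAME;
    String temp = null;                 // name of the element currently being read
    Employee employee = new Employee();
    StringBuffer sb = new StringBuffer();
    private ArrayList<Employee> arraylist = new ArrayList<Employee>();

    public EmployeesListParser(String XML_FILE_NAME) {
        this.XML_FILE_NAME = XML_FILE_NAME;
    }

    public ArrayList<Employee> doParse() throws IOException, SAXException
    {
        try
        {
            // Standard JAXP factory instead of the internal com.sun Xerces class
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so startElement() receives the local name
            XMLReader parser = factory.newSAXParser().getXMLReader();
            parser.setContentHandler(this);
            parser.parse(XML_FILE_NAME);
            return arraylist;
        }
        catch (ParserConfigurationException ex)
        {
            throw new SAXException(ex);
        }
    }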

    public void startElement(String uri, String name, String qName, Attributes atts) throws SAXException
    {
        sb.setLength(0);
        if (name.equalsIgnoreCase("id"))
            temp = "id";
        else if (name.equalsIgnoreCase("name"))
            temp = "name";
        else if (name.equalsIgnoreCase("last_name"))
            temp = "last_name";
        else if (name.equalsIgnoreCase("employee"))
            employee = new Employee();
    }

    public void endElement(String uri, String name, String qName) throws SAXException
    {
        sb.setLength(0);
        temp = null; // no longer inside an id, name or last_name element
        if (name.equalsIgnoreCase("employee"))
        {
            arraylist.add(employee);
            employee = null;
        }
    }

    public void characters(char[] arg0, int arg1, int arg2) throws SAXException
    {
        // Text may arrive in several chunks, so append and re-read the buffer each time
        sb.append(arg0, arg1, arg2);
        String s = sb.toString().trim();
        if (!s.equals("") && temp != null && employee != null)
        {
            if (temp.equalsIgnoreCase("id"))
            {
                employee.setId(s);
            }
            else if (temp.equalsIgnoreCase("name"))
            {
                employee.setName(s);
            }
            else if (temp.equalsIgnoreCase("last_name"))
            {
                employee.setLast_name(s);
            }
        }
    }

    public void setDocumentLocator(Locator locator) {
    }

    public void startDocument() throws SAXException {
    }

    public void endDocument() throws SAXException {
    }

    public void startPrefixMapping(String prefix, String uri) throws SAXException {
    }

    public void endPrefixMapping(String prefix) throws SAXException {
    }

    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
    }

    public void processingInstruction(String target, String data) throws SAXException {
    }

    public void skippedEntity(String name) throws SAXException {
    }
   
}
 
and the result would be:



:-)
No comment...

Friday, 25 July 2008

Some of the SAX Parser issues I ran into...

1) SAX Parser issue No 1 : SAX won’t release the file

When I need to parse XML files in Java projects, I use the SAX parser.
I like using SAX. It’s quite easy.
Of course, there’s nothing like parsing in PHP (at least for me), but SAX can also be interesting…
In one of my projects, I had this situation:
I had a thread that needed to parse N .xml documents, and in case of a SAXException or IOException the thread needed to delete the current file or move it to a new location.
I was using the SAX parser.
It looks like a really easy thing to do, and it is! Well, if you know about and do some extra things…
Probably everyone who has been in this situation noticed very soon that the file cannot be deleted or moved after the SAX parser has failed on it!
Why is that so?
Well, I don’t know if this is a bug or if it just could have been done better, but it seems that the XMLReader used for parsing does not “release” the document when an Exception occurs!
I cannot believe this is the way SAX works, but it seems like it is!
So what do we need to do in case of an Exception? We need to somehow close this file explicitly.
And how can this be done?

I recommend that you do not use the

void parse(String systemId) method when parsing.

Use

void parse(InputSource input) instead!

This way you open the stream behind the InputSource yourself, so you can always close it explicitly when you need to “release” the document after an Exception occurs.

Here is an example:

Let’s make a class called XmlDocumentParser that implements ContentHandler:

public class XmlDocumentParser implements ContentHandler

and let’s make a method doParse() that returns an object representing the parsed document data:

If we make it like this:


public class XmlDocumentParser implements ContentHandler
{

    private String XML_FILE; // holds the name of the document
    WantedObject wantedobject; // object that holds xml data

    public WantedObject doParse() throws IOException, SAXException
    {
        XMLReader parser = new SAXParser();
        parser.setContentHandler(this);
        parser.parse(XML_FILE);
        return wantedobject;
    }

}

we won’t be able to release the file when an Exception occurs.

But, if we use void parse(InputSource input) instead:

InputStream stream; // field of XmlDocumentParser, visible to the caller so it can be closed

public WantedObject doParse() throws IOException, SAXException
{
    stream = new BufferedInputStream(new FileInputStream(XML_FILE));
    XMLReader parser = new SAXParser();
    parser.setContentHandler(this);
    parser.parse(new InputSource(stream));
    return wantedobject;
}



and somewhere in the caller-method, catch the Exception, close the stream and do whatever you want with the file:

// caller-method:
...
XmlDocumentParser xmldocumentparser = null;
WantedObject wantedobject = null;
...
// start parsing:
try
{
    xmldocumentparser = new XmlDocumentParser("c:\\example.xml"); // set XML_FILE
    wantedobject = xmldocumentparser.doParse();
}
catch (SAXException ex)
{
    // close() can throw IOException, so the caller-method has to declare or handle it
    if (xmldocumentparser.stream != null)
        xmldocumentparser.stream.close();
    // delete document or move it , or something else…
}
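
The same idea can be sketched with try-with-resources (Java 7 and later), which closes the stream automatically before the file is touched. This is only a minimal sketch under my own assumptions: the class name SafeParse, the method parseOrMove and the "rejected" directory are hypothetical and not part of the project described above, and the parser is obtained through the standard SAXParserFactory rather than the Xerces class used in the listings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not from the original project.
class SafeParse {

    // Parses xmlFile with the given handler. If parsing fails, the file is
    // moved into rejectedDir. Returns true when the document parsed cleanly.
    static boolean parseOrMove(Path xmlFile, Path rejectedDir, DefaultHandler handler)
            throws IOException {
        boolean ok = false;
        try (InputStream in = Files.newInputStream(xmlFile)) {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(in), handler);
            ok = true;
        } catch (SAXException | ParserConfigurationException | IOException ex) {
            // try-with-resources has already closed the stream at this point
            System.err.println("Parsing failed: " + ex.getMessage());
        }
        if (!ok) {
            // No open handle is left on the file, so it can be moved (or deleted)
            Files.createDirectories(rejectedDir);
            Files.move(xmlFile, rejectedDir.resolve(xmlFile.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return ok;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SafeParse <xml file> <directory for rejected files>
        if (args.length == 2) {
            boolean ok = parseOrMove(Paths.get(args[0]), Paths.get(args[1]), new DefaultHandler());
            System.out.println(ok ? "parsed" : "moved to " + args[1]);
        }
    }
}
```

The point of the design is simply that the code which opens the stream is also the code which closes it, so the move happens only after the handle is gone.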




2) SAX Parser issue No 2 : Problem with large text inside tags


I ran into this problem when parsing XML files with large text inside tags.
When a tag value is a short text or a number, SAX will parse your document without a problem, but when a tag value is a large text, you may end up with only a part of that text: the last part.
Why is this so?
Never assume that a SAX parser will hand you the entire string it finds inside a tag in one call. A SAX parser (like a great number of other parsers) may deliver that value in several chunks, and it is your job to put the pieces together…

Let’s see how this works:
In SAX, in order to read tag values, we need to implement, among other methods, the characters() method.


StringBuffer sb = new StringBuffer();
Person person = new Person();
String currTag; // name of the element being read, set in startElement()


public void characters(char[] arg0, int arg1, int arg2) throws SAXException
{
    sb.append(arg0, arg1, arg2); // add this chunk to whatever has arrived so far
    String s = sb.toString().trim();
    if (!s.equals(""))
    {
        if (currTag.equalsIgnoreCase("name"))
            person.setName(s);
        .
        .
        .
        // do all the things you want…

    }
}


So, do not just use a plain string:

String s = new String(arg0, arg1, arg2);

and assume that’s all there is inside the current tag.

Use a StringBuffer instead and append every chunk to it…

and, of course, call

sb.setLength(0);

at the start of the startElement() and endElement() methods.
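
Put together, the chunk handling might look like the sketch below. It is a variation on the snippet above rather than the exact same code: the class name NameCollector and the sample XML are made up for illustration, it extends DefaultHandler so only three callbacks are needed, it reads the buffer once in endElement() instead of on every chunk, and it uses StringBuilder in place of StringBuffer since no thread safety is needed here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <name> element.
class NameCollector extends DefaultHandler {

    final List<String> names = new ArrayList<String>();
    private final StringBuilder sb = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        sb.setLength(0); // a new element starts: forget the text collected so far
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so only append here
        sb.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            // Every chunk of this element has arrived by now
            names.add(sb.toString().trim());
        }
        sb.setLength(0);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, parsed from memory
        String xml = "<people><name>John Smith</name><name>Jane Doe</name></people>";
        NameCollector handler = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        System.out.println(handler.names);
    }
}
```

Calling characters() by hand with two halves of the same text gives the same result as one call with the whole text, which is exactly the property the single-String version lacks.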


There is also another way of solving this problem:



This is a way of telling the parser: “Do not parse what’s in it, just read it all at once…”. It is used, for example, when some other XML is nested inside a tag…