XML statistics with SAX - WikiJava
Tuesday, 2nd September 2014


XML statistics with SAX

From WikiJava



In this article I will show you how to create a very simple program that generates statistics about an XML file using a SAX parser. The proposed program implements a simple org.xml.sax.ContentHandler to manage the SAX events.

the article

You can download the complete code of this article from the Subversion repository at this link

Use the username "readonly" and the password "readonly".

See the using the SVN repository instructions page for more help about this.

SAX is an event-based XML parser: it reads an XML file sequentially and generates a specific event each time it encounters a particular portion of the document.

The strengths of the SAX parser are its speed and the fact that it doesn't need to load the whole document in memory. These characteristics often make SAX the most practical option when you have to parse very big XML documents (consider that, as a rough estimate, a DOM document such as org.w3c.dom.Document occupies in memory about 4 times the size of the XML source).

In order to parse a document with SAX you will need to write an event handler, implementing the interface org.xml.sax.ContentHandler, in which you will specify the actions to execute upon each event.

The events supported are:

   public void setDocumentLocator (Locator locator)
   public void startDocument ()
   public void endDocument()
   public void startPrefixMapping (String prefix, String uri)
   public void endPrefixMapping (String prefix)
   public void startElement (String uri, String localName, String qName, Attributes atts)
   public void endElement (String uri, String localName, String qName)
   public void characters (char ch[], int start, int length)
   public void ignorableWhitespace (char ch[], int start, int length)
   public void processingInstruction (String target, String data)
   public void skippedEntity (String name)

SAX will call these methods on your event handler when it finds the corresponding XML constructs. All you have to do is create a handler class that implements the org.xml.sax.ContentHandler interface and performs the specific operations you need.

To make life simpler for the programmer, SAX offers org.xml.sax.helpers.DefaultHandler, which already implements all the required methods with an empty body. You can extend this class and override only the methods for the events that you need in your program, without having to implement every single method of the org.xml.sax.ContentHandler interface.
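As an illustration of extending DefaultHandler, here is a minimal sketch that overrides only the two element events and records the order in which the parser reports them. The class name EventLogHandler and the helper logEvents are hypothetical and not part of the XMLStatistics program; the reader is obtained through javax.xml.parsers.SAXParserFactory just to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the XMLStatistics program.
class EventLogHandler extends DefaultHandler {

    // the events received so far, in the order the parser reported them
    private final List<String> events = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    public List<String> getEvents() {
        return events;
    }

    // parses an XML string and returns the list of logged events
    static List<String> logEvents(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser()
                .getXMLReader();
        EventLogHandler handler = new EventLogHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.getEvents();
    }

    public static void main(String[] args) throws Exception {
        // prints [start:a, start:b, end:b, end:a]
        System.out.println(logEvents("<a><b/></a>"));
    }
}
```

All the other events (characters, processing instructions and so on) fall through to the empty implementations inherited from DefaultHandler.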

The XMLStatistics program is composed of three classes:

XMLStatistics.java
containing the main method
StatisticsContentHandler.java
the core of the parser, extending DefaultHandler
XMLStatisticsBean.java
a POJO bean containing the statistics about the XML

XMLStatistics.java

The core of the main method is the following snippet:

	    XMLReader parser = XMLReaderFactory.createXMLReader();
	    // sets our contentHandler to be used by the parser
	    StatisticsContentHandler handler = new StatisticsContentHandler();
	    parser.setContentHandler(handler);
	    InputSource source = new InputSource(filename);
	    parser.parse(source);

With this snippet we obtain a parser object from the org.xml.sax.helpers.XMLReaderFactory, we set our handler in the parser and finally we just parse the document.
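As a side note, org.xml.sax.helpers.XMLReaderFactory has been marked as deprecated since Java 9. On recent JDKs a reader can be obtained through javax.xml.parsers.SAXParserFactory instead. The following is only a sketch of that alternative, not the code used in this article; the class name JaxpReaderExample and the helper newReader are hypothetical.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the XMLStatistics program.
class JaxpReaderExample {

    // builds an XMLReader through JAXP instead of XMLReaderFactory
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // JAXP factories are not namespace aware unless asked to be,
        // while SAX2 readers process namespaces by default
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = newReader();
        // DefaultHandler ignores every event; replace it with your own handler
        parser.setContentHandler(new DefaultHandler());
        parser.parse(new InputSource(new StringReader("<root><child/></root>")));
        System.out.println("document is well formed");
    }
}
```

The reader returned by newReader can be used exactly like the parser object in the snippet above.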

The rest of the XMLStatistics class prints out the statistics:

package org.wikijava.xml.sax.XMLstatistics;
 
import java.io.IOException;
 
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
 
/**
 * 
 * generates statistics about an XML document
 * 
 * @author Giulio
 */
public class XMLStatistics {
 
    public static void main(String[] args) {
 
	if (args.length < 1 ){
	    System.err.println("usage: XMLStatistics [filename]");
	    return;
	}
 
	String filename = args[0];
 
	try {
	    XMLReader parser = XMLReaderFactory.createXMLReader();
	    // sets our contentHandler to be used by the parser
	    StatisticsContentHandler handler = new StatisticsContentHandler();
	    parser.setContentHandler(handler);
	    InputSource source = new InputSource(filename);
	    parser.parse(source);
 
	    // the handler created above already holds the results
	    XMLStatisticsBean stats = handler.getStatistics();
 
	    System.out.println("document is well formed");
	    System.out.println("total number of elements: "
		    + stats.getNumberOfElements());
	    System.out.println("max Element depth: "
		    + stats.getMaxElementDepth());
	    System.out.println("total number of Attributes: "
		    + stats.getTotalNumberOfAttributes());
 
	} catch (SAXException e) {
	    System.out.println(filename + " is not well-formed.");
	} catch (IOException e) {
	    System.out
		    .println("Due to an IOException, the parser could not check "
			    + filename);
	}
 
    }
 
}
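The main method above reports every SAXException with the same message. When the problem is a well-formedness error, the parser normally throws the subclass org.xml.sax.SAXParseException, which also carries the position of the error. Here is a possible sketch of how that position could be reported; the class ParseErrorReport and its check method are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the XMLStatistics program.
class ParseErrorReport {

    // returns null when the document parses cleanly,
    // otherwise a description of the first fatal error
    static String check(String xml) throws IOException,
            ParserConfigurationException {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser()
                    .getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            // DefaultHandler rethrows fatal errors, so nothing is printed by the parser
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // the parser knows where in the document the problem was found
            return "line " + e.getLineNumber() + ", column "
                    + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException e) {
            // any other SAX problem, for example one thrown by a handler
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check("<a><b/></a>")); // prints null
        System.out.println(check("<a><b></a>"));  // prints the error position and message
    }
}
```

The catch block for SAXParseException has to come before the one for SAXException, since it is the more specific type.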


StatisticsContentHandler.java

This is where the real parsing takes place.

For this particular example only the start document, start element and end element events are considered, but you can implement the other methods of the interface for handling any other event.


package org.wikijava.xml.sax.XMLstatistics;
 
import java.util.Stack;
 
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
 
public class StatisticsContentHandler extends DefaultHandler {
 
    // contains the statistics
    protected XMLStatisticsBean statistics;
 
    //remembers the depth of the element currently found
    protected int curDepth;
 
    //remembers the elements opened, to verify the correct nesting
    protected Stack<String> elementsStack = new Stack<String>();
 
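    /**
     * called once at the start of the document; resets the state of the
     * handler, much like a constructor, so the handler can be reused
     */
    @Override
    public void startDocument() throws SAXException {
	statistics = new XMLStatisticsBean();
	curDepth = 0;
	elementsStack.clear();
    }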
 
    /**
     * called when the parser finds the start of an element
     */
    @Override
    public void startElement(String uri, String localName, String qName,
	    Attributes attributes) throws SAXException {
	statistics.setNumberOfElements(statistics.getNumberOfElements() + 1);
	curDepth++;
	elementsStack.push(qName);
	if (curDepth > statistics.getMaxElementDepth())
	    statistics.setMaxElementDepth(curDepth);
 
	statistics.setTotalNumberOfAttributes(statistics
		.getTotalNumberOfAttributes()
		+ attributes.getLength());
 
    }
 
    /**
     * called when the parser finds the end of an element
     */
    @Override
    public void endElement(String uri, String localName, String qName)
	    throws SAXException {
	curDepth--;
	if (!(elementsStack.pop().equals(qName))) {
	    throw new SAXException("Document elements not correctly nested");
	}
 
    }
 
    public XMLStatisticsBean getStatistics() {
	return statistics;
    }
 
}
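The handler above ignores the characters event. As an example of handling one more event, here is a sketch of a handler that counts the characters of text in the document and the largest amount of text directly contained in a single element. It is only one possible approach: the class CharacterCountHandler and its count helper are hypothetical, and the counters are kept in the handler itself instead of the bean to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the XMLStatistics program.
class CharacterCountHandler extends DefaultHandler {

    private int numberOfCharacters;
    private int maxNumberOfCharactersInAnElement;

    // one counter per open element, innermost element on top
    private final Deque<Integer> openCounts = new ArrayDeque<Integer>();

    @Override
    public void startDocument() {
        numberOfCharacters = 0;
        maxNumberOfCharactersInAnElement = 0;
        openCounts.clear();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) {
        openCounts.push(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // the parser may deliver one text node in several chunks,
        // so the lengths are accumulated (whitespace included)
        numberOfCharacters += length;
        if (!openCounts.isEmpty()) {
            openCounts.push(openCounts.pop() + length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        int count = openCounts.pop();
        if (count > maxNumberOfCharactersInAnElement) {
            maxNumberOfCharactersInAnElement = count;
        }
    }

    public int getNumberOfCharacters() {
        return numberOfCharacters;
    }

    public int getMaxNumberOfCharactersInAnElement() {
        return maxNumberOfCharactersInAnElement;
    }

    // parses an XML string and returns the handler holding the counts
    static CharacterCountHandler count(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser()
                .getXMLReader();
        CharacterCountHandler handler = new CharacterCountHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        CharacterCountHandler h = count("<a>hello<b>hi</b></a>");
        System.out.println(h.getNumberOfCharacters());               // prints 7
        System.out.println(h.getMaxNumberOfCharactersInAnElement()); // prints 5
    }
}
```

Note that the length passed to characters is measured in Java char values, so it counts whitespace and line breaks too.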


XMLStatisticsBean

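This is a POJO bean to carry around the statistics, nothing too interesting about it. Note that the handler shown above only fills in the number of elements, the maximum element depth and the total number of attributes; the other fields keep their default values.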

package org.wikijava.xml.sax.XMLstatistics;
 
public class XMLStatisticsBean {
 
    private int numberOfElements;
 
    private int totalNumberOfAttributes;
 
    private float averageNumberOfAttributes;
 
    private int maxElementDepth;
 
    private float averageElementDepth;
 
    private int numberOfCharacters;
 
    private int maxNumberOfcharactersInAnElement;
 
    public float getAverageElementDepth() {
	return this.averageElementDepth;
    }
 
    public void setAverageElementDepth(float averageElementDepth) {
	this.averageElementDepth = averageElementDepth;
    }
 
    public float getAverageNumberOfAttributes() {
	return this.averageNumberOfAttributes;
    }
 
    public void setAverageNumberOfAttributes(float averageNumberOfAttributes) {
	this.averageNumberOfAttributes = averageNumberOfAttributes;
    }
 
    public int getMaxElementDepth() {
	return this.maxElementDepth;
    }
 
    public void setMaxElementDepth(int maxElementDepth) {
	this.maxElementDepth = maxElementDepth;
    }
 
    public int getMaxNumberOfcharactersInAnElement() {
	return this.maxNumberOfcharactersInAnElement;
    }
 
    public void setMaxNumberOfcharactersInAnElement(
	    int maxNumberOfcharactersInAnElement) {
	this.maxNumberOfcharactersInAnElement = maxNumberOfcharactersInAnElement;
    }
 
    public int getNumberOfCharacters() {
	return this.numberOfCharacters;
    }
 
    public void setNumberOfCharacters(int numberOfCharacters) {
	this.numberOfCharacters = numberOfCharacters;
    }
 
    public int getNumberOfElements() {
	return this.numberOfElements;
    }
 
    public void setNumberOfElements(int numberOfElements) {
	this.numberOfElements = numberOfElements;
    }
 
    public int getTotalNumberOfAttributes() {
	return this.totalNumberOfAttributes;
    }
 
    public void setTotalNumberOfAttributes(int totalNumberOfAttributes) {
	this.totalNumberOfAttributes = totalNumberOfAttributes;
    }
}
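The bean declares averageNumberOfAttributes and averageElementDepth, but the handler in this article never sets them. One way to compute them, sketched here under the assumption that the averages are taken over all the elements of the document, is to add up the depth of every element while parsing and do the divisions in endDocument. The class AverageStatsHandler and its analyze helper are hypothetical and keep their own counters to stay self-contained.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of the XMLStatistics program.
class AverageStatsHandler extends DefaultHandler {

    private int numberOfElements;
    private int totalNumberOfAttributes;
    private long totalDepth; // sum of the depth of every element
    private int curDepth;
    private float averageNumberOfAttributes;
    private float averageElementDepth;

    @Override
    public void startDocument() {
        numberOfElements = 0;
        totalNumberOfAttributes = 0;
        totalDepth = 0;
        curDepth = 0;
        averageNumberOfAttributes = 0;
        averageElementDepth = 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) {
        numberOfElements++;
        curDepth++;
        totalDepth += curDepth;
        totalNumberOfAttributes += atts.getLength();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        curDepth--;
    }

    @Override
    public void endDocument() {
        // a well-formed document has at least one element,
        // the check only protects against a division by zero
        if (numberOfElements > 0) {
            averageNumberOfAttributes =
                    (float) totalNumberOfAttributes / numberOfElements;
            averageElementDepth = (float) totalDepth / numberOfElements;
        }
    }

    public float getAverageNumberOfAttributes() {
        return averageNumberOfAttributes;
    }

    public float getAverageElementDepth() {
        return averageElementDepth;
    }

    // parses an XML string and returns the handler holding the averages
    static AverageStatsHandler analyze(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser()
                .getXMLReader();
        AverageStatsHandler handler = new AverageStatsHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // 4 elements, 3 attributes, depths 1 + 2 + 2 + 3 = 8
        AverageStatsHandler h =
                analyze("<a x='1' y='2'><b/><c z='3'><d/></c></a>");
        System.out.println(h.getAverageNumberOfAttributes()); // prints 0.75
        System.out.println(h.getAverageElementDepth());       // prints 2.0
    }
}
```

In the XMLStatistics program the same values could be stored with setAverageNumberOfAttributes and setAverageElementDepth on the bean.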
