Monday, September 7, 2009

Parsing Text on an Android Phone

Introduction
I have wiki markup text that I need to parse. My first version, which I use in the DroidWiki application for Android, is a custom wiki parser. I wrote my own because I couldn't find parsing code on the internet that was light enough to run on an Android phone. That code does a regular-expression match for each wiki tag, so every line is matched repeatedly. Recently, I decided to try my hand at fancier parsing: just parse the text into a sequence of tokens. Below are the results of my research on this topic.

Using One-Character-At-A-Time Parsing
One way to solve the problem is one-pass parsing: the Java code looks at each character exactly once and isolates the tokens as it goes. Using a character iterator, it goes like this:
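import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

// markup is the String holding the wiki text
StringCharacterIterator iter = new StringCharacterIterator(markup);

for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
    // process the char: is it one of the characters that starts any of the syntax tokens?
}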

This could be fast, but the code would be complicated. I decided on a different approach.
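
For illustration, here is a rough sketch of how the loop body could be filled in for a simple HTML-like case, where everything between a less-than and a greater-than character becomes one token. The class name OnePassScanner, the method names, and the choice of syntax are my assumptions for the example, not the DroidWiki code. Even for this one kind of tag the loop needs a state flag, which hints at how the code grows once more syntax is added.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-pass scanner: splits markup into text runs and <...> tags.
class OnePassScanner {

    static List<String> scan(String markup) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false; // state: are we between '<' and '>'?

        CharacterIterator iter = new StringCharacterIterator(markup);
        for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
            if (c == '<' && !inTag) {
                flush(tokens, current);   // close the preceding text run
                current.append(c);
                inTag = true;
            } else if (c == '>' && inTag) {
                current.append(c);
                flush(tokens, current);   // the tag is complete
                inTag = false;
            } else {
                current.append(c);
            }
        }
        flush(tokens, current); // trailing text (or an unterminated tag, kept as-is)
        return tokens;
    }

    // Emits the buffered characters as one token and resets the buffer.
    private static void flush(List<String> tokens, StringBuilder current) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Calling OnePassScanner.scan("a <b>bold</b> c") yields five tokens: the three text runs and the two tags.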

Using the Interpreter Design Pattern

The idea:
  1. Parse the text and convert it to a sequence of basic tokens:
    • every contiguous run of regular text is a token
    • every piece of syntax is a token (for example, when parsing HTML text, the less-than character would be a single token by itself)
  2. Process the list of these basic tokens and aggregate them into complete syntactical elements (for example, every complete HTML tag would be a single token).
Optionally, create:
  • a SimpleToken class for item 1 above,
  • a base Token class, specialized for the more significant/complex syntactical elements, for item 2 above.
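
The two steps above can be sketched roughly as follows, using the HTML example from the list. Only the names SimpleToken and Token come from the text; TextToken, TagToken, TwoStageParser, SYNTAX_CHARS, and the rule that a tag is exactly "<", one text run, ">" are assumptions I made to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Item 1: a run of regular text, or a single syntax character.
class SimpleToken {
    final String text;
    final boolean syntax;
    SimpleToken(String text, boolean syntax) { this.text = text; this.syntax = syntax; }
}

// Item 2: base class, specialized for each complete syntactical element.
abstract class Token {
    final String source;
    Token(String source) { this.source = source; }
}
class TextToken extends Token { TextToken(String s) { super(s); } } // hypothetical
class TagToken extends Token { TagToken(String s) { super(s); } }   // hypothetical

class TwoStageParser {
    static final String SYNTAX_CHARS = "<>"; // assumed syntax characters

    // Step 1: split the markup into text runs and single syntax characters.
    static List<SimpleToken> tokenize(String markup) {
        List<SimpleToken> out = new ArrayList<SimpleToken>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (SYNTAX_CHARS.indexOf(c) >= 0) {
                if (run.length() > 0) {
                    out.add(new SimpleToken(run.toString(), false));
                    run.setLength(0);
                }
                out.add(new SimpleToken(String.valueOf(c), true));
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) out.add(new SimpleToken(run.toString(), false));
        return out;
    }

    // Step 2: aggregate "<", text, ">" into one TagToken; anything else stays text.
    static List<Token> aggregate(List<SimpleToken> simple) {
        List<Token> out = new ArrayList<Token>();
        int i = 0;
        while (i < simple.size()) {
            SimpleToken t = simple.get(i);
            boolean isTag = t.syntax && t.text.equals("<")
                    && i + 2 < simple.size()
                    && !simple.get(i + 1).syntax
                    && simple.get(i + 2).syntax && simple.get(i + 2).text.equals(">");
            if (isTag) {
                out.add(new TagToken("<" + simple.get(i + 1).text + ">"));
                i += 3;
            } else {
                out.add(new TextToken(t.text)); // stray syntax chars degrade to text
                i++;
            }
        }
        return out;
    }
}
```

For example, TwoStageParser.aggregate(TwoStageParser.tokenize("a <b>bold</b> c")) turns nine simple tokens into five: three TextToken and two TagToken instances. Supporting another syntactical element would then mean adding a Token subclass and a matching rule in aggregate, while the tokenizer stays unchanged.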
