I have wiki markup text that I need to parse. My first version, which I use in the DroidWiki application for Android, is a custom wiki parser. I wrote my own because I couldn't find parsing code on the internet that was light enough to run on an Android phone. That code runs a regular-expression match for each wiki tag, so every line ends up being matched multiple times. Recently, I decided to try my hand at fancier parsing: parse the text into a sequence of tokens. Below are the results of my research on this topic.
Using One-Character-At-A-Time Parsing
One way to solve the problem is one-pass parsing: the Java code looks at each character exactly once and isolates the tokens as it goes. With a character iterator, it goes like this:
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

StringCharacterIterator iter = new StringCharacterIterator(markup);
for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
    // process the char: is it one of the characters that starts any of the syntax tokens?
}
This could be fast, but the code would be complicated. I decided on a different approach.
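To make the loop a bit more concrete, here is a minimal sketch that only records where a syntax token could start. The class name OnePassScanner, the method findSyntaxStarts, and the set of starter characters are my assumptions for illustration, not actual DroidWiki code.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;

class OnePassScanner {
    // Assumed (hypothetical) set of characters that can start a syntax token.
    static final String SYNTAX_STARTERS = "[]'=*";

    // Visits each character once and returns the index of every character
    // that could start a syntax token.
    static List<Integer> findSyntaxStarts(String markup) {
        List<Integer> starts = new ArrayList<>();
        CharacterIterator iter = new StringCharacterIterator(markup);
        for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
            if (SYNTAX_STARTERS.indexOf(c) >= 0) {
                starts.add(iter.getIndex());
            }
        }
        return starts;
    }
}
```

Finding the candidate positions is the easy part. Deciding what each starter character means in context (and how far the token extends) is where the one-pass code would probably get complicated.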
Using Interpreter Design Pattern
The idea:
1. Parse the text and convert it to a sequence of basic tokens:
  - every continuous piece of regular text is a token
  - every piece of syntax is a token (for example, when parsing HTML text, the less-than character would be a single token by itself)
2. Process the list of these basic tokens and aggregate them into complete syntactical elements (for example, every complete HTML tag would be a single token).
This calls for:
- a SimpleToken class for item 1 above,
- a base Token class, specialized for the more significant/complex syntactical elements, for item 2 above.
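The two steps and the classes described above could be sketched roughly as follows, using the HTML example from the list. This is only a sketch under assumptions: that SimpleToken extends Token, that the only syntax characters are the angle brackets, and the names TagToken, TwoStageParser, tokenize and aggregate are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Base class for every token; richer syntactical elements specialize it.
class Token {
    final String text;
    Token(String text) { this.text = text; }
}

// Basic token: a run of regular text or a single syntax character.
// (Assumption: SimpleToken is itself a kind of Token.)
class SimpleToken extends Token {
    final boolean syntax;
    SimpleToken(String text, boolean syntax) { super(text); this.syntax = syntax; }
}

// Hypothetical complete element: an HTML-like tag such as <b>.
class TagToken extends Token {
    TagToken(String text) { super(text); }
}

class TwoStageParser {
    static final String SYNTAX_CHARS = "<>"; // assumed syntax characters

    // Step 1: split the markup into text runs and single syntax characters.
    static List<SimpleToken> tokenize(String markup) {
        List<SimpleToken> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (SYNTAX_CHARS.indexOf(c) >= 0) {
                if (run.length() > 0) {
                    out.add(new SimpleToken(run.toString(), false));
                    run.setLength(0);
                }
                out.add(new SimpleToken(String.valueOf(c), true));
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) out.add(new SimpleToken(run.toString(), false));
        return out;
    }

    // Step 2: merge each "<", text, ">" triple into a single TagToken;
    // everything else is passed through unchanged.
    static List<Token> aggregate(List<SimpleToken> basic) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < basic.size()) {
            if (i + 2 < basic.size()
                    && isSyntax(basic.get(i), "<")
                    && !basic.get(i + 1).syntax
                    && isSyntax(basic.get(i + 2), ">")) {
                out.add(new TagToken("<" + basic.get(i + 1).text + ">"));
                i += 3;
            } else {
                out.add(basic.get(i));
                i++;
            }
        }
        return out;
    }

    private static boolean isSyntax(SimpleToken t, String s) {
        return t.syntax && t.text.equals(s);
    }
}
```

For example, TwoStageParser.aggregate(TwoStageParser.tokenize("a<b>c")) turns the five basic tokens into three: the text "a", a TagToken for "<b>", and the text "c". Keeping the two steps separate means the first step stays trivial, and the knowledge about each syntactical element lives in the second step and in the Token subclasses.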