Parse an InputStream for multiple patterns
I am parsing an InputStream for certain patterns to extract values from it, e.g. I would have something like
<span class="filename"><a href="http://example.com/foo">foo</a>
I don't want to use a full fledged html parser as I am not interested in the document structure but only in some well defined bits of information. (Only their order is important)
Currently I am using a very simple approach, I have an Object for each Pattern that contains a char[] of the opening and closing 'tag' (in the example opening would be<span class="filename"><a href="
and closing "
to get the url) and a position marker. For each character read by of the InputStream, I iterate over all Patterns and call the match(char)
function that returns true once the opening pattern does match, from then on I collect the following chars in a StringBuilder until the now active pattern does match()开发者_如何转开发 again. I then call a function with the ID of the Pattern and the String read, to process it further.
While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like
<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>
At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.
Unfortunately theScanner
class only matches one pattern, not a list of patterns, what alternatives could I use? It should not be heavy and work with Android.You mean you want to match any <span>
element with a given class
attribute, irrespective of other attributes it may have? That's easy enough:
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
Pattern p = Pattern.compile(
"<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
while (sc.findWithinHorizon(p, 0) != null)
{
MatchResult m = sc.match();
System.out.println(m.group(1));
}
The file "test.txt" contains the text of your question, and the output is:
http://example.com/foo and closing http://example.com/foo
the Scanner.useDelimiter(Pattern) API seems to be what you're looking for. You would have to use an OR (|) separated pattern string.
This pattern can get really complicated really quickly though.
You are right to think this has all been done before :) What you are talking about is a problem of tokenizing and parsing and I therefore suggest you consider JavaCC.
There is something of a learning curve with JavaCC as you learn to understand it's grammar, so below is an implementation to get you started.
The grammar is a chopped down version of the standard JavaCC grammar for HTML. You can add more productions for matching other patterns.
options {
JDK_VERSION = "1.5";
static = false;
}
PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
private String currentTag;
private String currentSpanClass;
private String currentHref;
public static void main(String args []) throws ParseException {
System.out.println("Starting parse");
eg1 parser = new eg1(System.in);
parser.parse();
System.out.println("Finishing parse");
}
}
PARSER_END(eg1)
SKIP :
{
< ( " " | "\t" | "\n" | "\r" )+ >
| < "<!" ( ~[">"] )* ">" >
}
TOKEN :
{
<STAGO: "<" > : TAG
| <ETAGO: "</" > : TAG
| <PCDATA: ( ~["<"] )+ >
}
<TAG> TOKEN [IGNORE_CASE] :
{
<A: "a" > : ATTLIST
| <SPAN: "span" > : ATTLIST
| <DONT_CARE: (["a"-"z"] | ["0"-"9"])+ > : ATTLIST
}
<ATTLIST> SKIP :
{
< " " | "\t" | "\n" | "\r" >
| < "--" > : ATTCOMM
}
<ATTLIST> TOKEN :
{
<TAGC: ">" > : DEFAULT
| <A_EQ: "=" > : ATTRVAL
| <#ALPHA: ["a"-"z","A"-"Z","_","-","."] >
| <#NUM: ["0"-"9"] >
| <#ALPHANUM: <ALPHA> | <NUM> >
| <A_NAME: <ALPHA> ( <ALPHANUM> )* >
}
<ATTRVAL> TOKEN :
{
<CDATA: "'" ( ~["'"] )* "'"
| "\"" ( ~["\""] )* "\""
| ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
> : ATTLIST
}
<ATTCOMM> SKIP :
{
< ( ~["-"] )+ >
| < "-" ( ~["-"] )+ >
| < "--" > : ATTLIST
}
void attribute(Map<String,String> attrs) :
{
Token n, v = null;
}
{
n=<A_NAME> [ <A_EQ> v=<CDATA> ]
{
String attval;
if (v == null) {
attval = "#DEFAULT";
} else {
attval = v.image;
if( attval.startsWith("\"") && attval.endsWith("\"") ) {
attval = attval.substring(1,attval.length()-1);
} else if( attval.startsWith("'") && attval.endsWith("'") ) {
attval = attval.substring(1,attval.length()-1);
}
}
if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
}
}
void attList(Map<String,String> attrs) : {}
{
( attribute(attrs) )+
}
void tagAStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <A> [ attList(attrs) ] <TAGC>
{
currentHref=attrs.get("href");
if( currentHref != null && "filename".equals(currentSpanClass) )
{
System.out.println("Found URL: "+currentHref);
}
}
}
void tagAEnd() : {}
{
<ETAGO> <A> <TAGC>
{
currentHref=null;
}
}
void tagSpanStart() : {
Map<String,String> attrs = new HashMap<String,String>();
}
{
<STAGO> <SPAN> [ attList(attrs) ] <TAGC>
{
currentSpanClass=attrs.get("class");
}
}
void tagSpanEnd() : {}
{
<ETAGO> <SPAN> <TAGC>
{
currentSpanClass=null;
}
}
void tagDontCareStart() : {}
{
<STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}
void tagDontCareEnd() : {}
{
<ETAGO> <DONT_CARE> <TAGC>
}
void parse() : {}
{
(
LOOKAHEAD(2) tagAStart() |
LOOKAHEAD(2) tagAEnd() |
LOOKAHEAD(2) tagSpanStart() |
LOOKAHEAD(2) tagSpanEnd() |
LOOKAHEAD(2) tagDontCareStart() |
LOOKAHEAD(2) tagDontCareEnd() |
<PCDATA>
)*
}
精彩评论