Parse an InputStream for multiple patterns

2023-02-25 19:14

I am parsing an InputStream for certain patterns to extract values from it, e.g. I would have something like

<span class="filename"><a href="http://example.com/foo">foo</a>

I don't want to use a full-fledged HTML parser, as I am not interested in the document structure but only in some well-defined bits of information. (Only their order is important.)

Currently I am using a very simple approach: I have an object for each pattern that contains a char[] of the opening and closing 'tag' (in the example opening would be <span class="filename"><a href="and closing " to get the url) and a position marker. For each character read from the InputStream, I iterate over all patterns and call the match(char) function, which returns true once the opening pattern matches. From then on I collect the following chars in a StringBuilder until the now-active pattern's match() returns true again. I then call a function with the ID of the pattern and the string read, to process it further.

While this works fine in most cases, I wanted to include regular expressions in the pattern, so I could also match something like

<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>

At this point I was sure I would reinvent the wheel as this most certainly would have been done before, and I don't really want to write my own regex parser to begin with. However, I could not find anything that would do what I was looking for.

Unfortunately, the Scanner class only matches one pattern, not a list of patterns. What alternatives could I use? It should be lightweight and work on Android.

You mean you want to match any <span> element with a given class attribute, irrespective of other attributes it may have? That's easy enough:

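// Needs java.io.File, java.util.Scanner, java.util.regex.MatchResult and java.util.regex.Pattern.
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
// Group 1 captures the href value of the first <a> that follows a <span> with class="filename".
Pattern p = Pattern.compile(
    "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
// A horizon of 0 means the search is not bounded; it continues to the end of the input.
while (sc.findWithinHorizon(p, 0) != null)
{
  MatchResult m = sc.match();
  System.out.println(m.group(1));
}
sc.close();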

The file "test.txt" contains the text of your question, and the output is:

http://example.com/foo
and closing
http://example.com/foo
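Since the question is about an InputStream rather than a file, here is a minimal sketch of the same idea reading from a stream. The class name HrefExtractor, the method extract and the sample input are made up for illustration; the regex is the one shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class HrefExtractor {
    // Same expression as above: group 1 is the href inside a span with class="filename".
    private static final Pattern FILENAME_LINK = Pattern.compile(
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\"");

    // Collects every captured href in the order it appears in the stream.
    static List<String> extract(InputStream in) {
        List<String> urls = new ArrayList<>();
        // Closing the Scanner also closes the underlying stream.
        try (Scanner sc = new Scanner(in, "UTF-8")) {
            while (sc.findWithinHorizon(FILENAME_LINK, 0) != null) {
                urls.add(sc.match().group(1));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed sample input; the extra id attribute does not prevent the match.
        String html = "<span class=\"filename\" id=\"234217\">"
                    + "<a href=\"http://example.com/foo\">foo</a></span>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        for (String url : extract(in)) {
            System.out.println(url); // prints http://example.com/foo
        }
    }
}
```

Note that with a horizon of 0 the Scanner may buffer a large part of the input while searching, so on very large streams a positive horizon might be worth considering.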

The Scanner.useDelimiter(Pattern) API seems to be what you're looking for. You would have to use an OR (|) separated pattern string.

This pattern can get really complicated really quickly though.
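One way to keep an OR-separated pattern manageable is to build it from a list of sub-patterns and use the capturing group that matched as the pattern ID. The sketch below does that, but it uses Scanner.findWithinHorizon instead of useDelimiter, so it is a variation on the suggestion rather than the same technique. MultiPatternScanner, Handler and the "size" pattern are hypothetical names for illustration, and the sketch assumes every sub-pattern contains exactly one capturing group.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class MultiPatternScanner {
    /** Receives the index of the sub-pattern that matched and its captured text. */
    public interface Handler {
        void onMatch(int patternId, String value);
    }

    private final Pattern combined;
    private final int count;

    // Assumption: each sub-pattern has exactly one capturing group,
    // so sub-pattern i owns group i + 1 of the combined expression.
    public MultiPatternScanner(String... subPatterns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < subPatterns.length; i++) {
            if (i > 0) sb.append('|');
            sb.append("(?:").append(subPatterns[i]).append(')');
        }
        combined = Pattern.compile(sb.toString());
        count = subPatterns.length;
    }

    // Reports matches in stream order. The caller remains responsible for closing the stream.
    public void scan(InputStream in, Handler handler) {
        Scanner sc = new Scanner(in, "UTF-8");
        while (sc.findWithinHorizon(combined, 0) != null) {
            MatchResult m = sc.match();
            for (int g = 1; g <= count; g++) {
                // Groups of alternatives that did not take part in the match are null.
                if (m.group(g) != null) {
                    handler.onMatch(g - 1, m.group(g));
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        MultiPatternScanner scanner = new MultiPatternScanner(
            "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\"",
            "<span[^>]*class=\"size\"[^>]*>([^<]*)<"); // made-up second pattern
        String html = "<span class=\"filename\" id=\"234217\">"
                    + "<a href=\"http://example.com/foo\">foo</a></span>"
                    + "<span class=\"size\">12 KB</span>";
        scanner.scan(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)),
            (id, value) -> System.out.println(id + ": " + value));
        // prints "0: http://example.com/foo" and then "1: 12 KB"
    }
}
```

Numbered groups are used here instead of named groups because named groups may not be available on older Android versions.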

You are right to think this has all been done before :) What you are talking about is a problem of tokenizing and parsing and I therefore suggest you consider JavaCC.

There is something of a learning curve with JavaCC as you learn to understand its grammar, so below is an implementation to get you started.

The grammar is a chopped down version of the standard JavaCC grammar for HTML. You can add more productions for matching other patterns.

options {
  JDK_VERSION = "1.5";
  static = false;
}

PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
  private String currentTag;
  private String currentSpanClass;
  private String currentHref;

  public static void main(String args []) throws ParseException {
    System.out.println("Starting parse");
    eg1 parser = new eg1(System.in);
    parser.parse();
    System.out.println("Finishing parse");
  }
}

PARSER_END(eg1)

SKIP :
{
    <       ( " " | "\t" | "\n" | "\r" )+   >
|   <       "<!" ( ~[">"] )* ">"            >
}

TOKEN :
{
    <STAGO:     "<"                 >   : TAG
|   <ETAGO:     "</"                >   : TAG
|   <PCDATA:    ( ~["<"] )+         >
}

<TAG> TOKEN [IGNORE_CASE] :
{
    <A:      "a"              >   : ATTLIST
|   <SPAN:   "span"           >   : ATTLIST
|   <DONT_CARE: (["a"-"z"] | ["0"-"9"])+  >   : ATTLIST
}

<ATTLIST> SKIP :
{
    <       " " | "\t" | "\n" | "\r"    >
|   <       "--"                        >   : ATTCOMM
}

<ATTLIST> TOKEN :
{
    <TAGC:      ">"             >   : DEFAULT
|   <A_EQ:      "="             >   : ATTRVAL

|   <#ALPHA:    ["a"-"z","A"-"Z","_","-","."]   >
|   <#NUM:      ["0"-"9"]                       >
|   <#ALPHANUM: <ALPHA> | <NUM>                 >
|   <A_NAME:    <ALPHA> ( <ALPHANUM> )*         >

}

<ATTRVAL> TOKEN :
{
    <CDATA:     "'"  ( ~["'"] )* "'"
        |       "\"" ( ~["\""] )* "\""
        | ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
                            >   : ATTLIST
}

<ATTCOMM> SKIP :
{
    <       ( ~["-"] )+         >
|   <       "-" ( ~["-"] )+         >
|   <       "--"                >   : ATTLIST
}



void attribute(Map<String,String> attrs) :
{
    Token n, v = null;
}
{
    n=<A_NAME> [ <A_EQ> v=<CDATA> ]
    {
        String attval;
        if (v == null) {
            attval = "#DEFAULT";
        } else {
            attval = v.image;
            if( attval.startsWith("\"") && attval.endsWith("\"") ) {
              attval = attval.substring(1,attval.length()-1);
            } else if( attval.startsWith("'") && attval.endsWith("'") ) {
              attval = attval.substring(1,attval.length()-1);
            }
        }
        if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
    }
}

void attList(Map<String,String> attrs) : {}
{
    ( attribute(attrs) )+
}


void tagAStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <A> [ attList(attrs) ] <TAGC>
    {
      currentHref=attrs.get("href");    
      if( currentHref != null && "filename".equals(currentSpanClass) )
      {
        System.out.println("Found URL: "+currentHref);
      }
    }
}

void tagAEnd() : {}
{
    <ETAGO> <A> <TAGC>
    {
      currentHref=null;
    }
}

void tagSpanStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <SPAN> [ attList(attrs) ] <TAGC>
    {
      currentSpanClass=attrs.get("class");
    }
}

void tagSpanEnd() : {}
{
    <ETAGO> <SPAN> <TAGC>
    {
      currentSpanClass=null;
    }
}

void tagDontCareStart() : {}
{
   <STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}

void tagDontCareEnd() : {}
{
   <ETAGO> <DONT_CARE> <TAGC>
}

void parse() : {}
{
    (
      LOOKAHEAD(2) tagAStart() |
      LOOKAHEAD(2) tagAEnd() |
      LOOKAHEAD(2) tagSpanStart() |
      LOOKAHEAD(2) tagSpanEnd() |
      LOOKAHEAD(2) tagDontCareStart() |
      LOOKAHEAD(2) tagDontCareEnd() |
      <PCDATA>
    )*
}
