How to get attributes and values from badly formatted string in Java

Posted 2023-03-03 05:07

I need to get the attributes and values from multiple strings such as these:

<img src = "the source" class=class01 />
<img class=class02 src=folder/img.jpg />
<img class= "class01" / >

Spaces and slashes are accepted in values, and some values are enclosed in quotes, while not all are. Some equal signs are spaced.

I'm new to this, so the code is messy and probably not foolproof.

My attempt:

//remove unnecessary spacing and "<img" and "/>"
str = str.replaceAll("/ >", "/>");
str = str.substring(4, str.length()-1);
str = str.replaceAll(" =", "=");
str = str.replaceAll("= ", "=");

//remove quotes
str = str.replaceAll("\"", "");

//creating a matcher and compiling the regex pattern is omitted, because I know how to do that using matcher.group();
regexSrc = "src=(.*?)($| class=)";
String srcString = matcherSrc.group(1);

regexClass = "class=(.*?)($| src=)";
String classString = matcherClass.group(1);

System.out.println("the source is: " + srcString);
System.out.println("the class is: " + classString);

Any suggestions on how to do this in a better way are appreciated.

If it is poorly formatted HTML, use JTidy to clean it up first, and then use a simpler regular expression or an HTML parser.

You say you've already extracted the <img> tag and you're working on it as a standalone string. That makes the job simpler, but there's still a great deal of complexity to deal with. For example, how would you handle this tag?

<img  foosrc="whatever" barclass=noclass src =
folder/img.jpg class   ='ho hum' ></img>

Here you've got:

more than one space following the tag name
attributes whose names only end with src and class
a linefeed instead of a space after the second =
more than one space between an attribute name and the =
single-quotes instead of double-quotes around an attribute value
no final / because the author used an old HTML-style image tag with a closing tag, not an XML-style self-closing tag.

...and it's all just as valid as the sample tags you provided. Maybe you know you'll never have to deal with any of those issues, but we don't. If we supply you with a regex tailored to your sample data without even mentioning these other issues, are we really helping you? Or helping the others with similar problems who happen to find this page?
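To make the second point in that list concrete, here is a minimal sketch of how a pattern that just looks for `src=` latches onto `foosrc`, while one that requires whitespace before the name skips it. The class name, method name, and both patterns are made up for illustration; they are simplified stand-ins, not the regexes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeNameDemo {
    // Hypothetical pattern: matches "src" anywhere, even at the end of "foosrc".
    static final Pattern UNANCHORED =
        Pattern.compile("src\\s*=\\s*[\"']?([^\"'\\s>]+)");

    // Same pattern, but the attribute name must be preceded by whitespace.
    static final Pattern ANCHORED =
        Pattern.compile("\\ssrc\\s*=\\s*[\"']?([^\"'\\s>]+)");

    // Returns the first captured value, or null if the pattern never matches.
    static String firstValue(Pattern p, String tag) {
        Matcher m = p.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<img  foosrc=\"whatever\" barclass=noclass src =\n"
                   + "folder/img.jpg class   ='ho hum' ></img>";
        System.out.println(firstValue(UNANCHORED, tag)); // whatever
        System.out.println(firstValue(ANCHORED, tag));   // folder/img.jpg
    }
}
```

Note that the anchored version still copes with the linefeed after the `=`, because `\s` matches any whitespace, not only spaces.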

Here you go then:

String[] tags = { "<img src = \"the source\" class=class01 />",
                  "<img class=class02 src=folder/img02.jpg />",
                  "<img class= \"class03\" / >", 
                  "<img  foosrc=\"whatever\" barclass=noclass" +
                  "    class='class04' src =\nfolder/img04.jpg></img>" };

String regex = 
  "(?i)\\s+(src|class)\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)'|(\\S+?)(?=\\s|/?\\s*>))";
Pattern p = Pattern.compile(regex);
int n = 1;
for (String tag : tags)
{
  System.out.printf("%ntag %d: %s%n", n++, tag);
  Matcher m = p.matcher(tag);
  while (m.find())
  {
    System.out.printf("%8s: %s%n", m.group(1),
        m.start(2) != -1 ? m.group(2) :
        m.start(3) != -1 ? m.group(3) :
        m.group(4));
  }
}

output:

tag 1: <img src = "the source" class=class01 />
     src: the source
   class: class01

tag 2: <img class=class02 src=folder/img02.jpg />
   class: class02
     src: folder/img02.jpg

tag 3: <img class= "class03" / >
   class: class03

tag 4: <img  foosrc="whatever" barclass=noclass    class='class04' src =
folder/img04.jpg></img>
   class: class04
     src: folder/img04.jpg

Here's a more readable form of the regex:

(?ix)   # ignore-case and free-spacing modes
\s+           # leading \s+ ensures we match the whole name
(src|class)   # the attribute name is stored in group 1
\s*=\s*       # \s* = any number of any whitespace
(?:           # the attribute value, which may be...
   "([^"]+)"              # double-quoted (group 2)
 | '([^']+)'              # single-quoted (group 3)
 | (\S+?)(?=\s|/?\s*>)    # or not quoted (group 4)
)
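If you would rather keep the commented form in your source, Java should accept it as written, because the embedded `(?x)` flag turns on free-spacing mode (the same as `Pattern.COMMENTS`). Below is a rough sketch of that; the class name `ImgAttributes` and the `parse` helper are my own inventions for illustration, and collecting the results into a map is just one possible choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgAttributes {
    // Each line ends in \n so that a # comment stops at the end of its line.
    private static final Pattern ATTR = Pattern.compile(
        "(?ix)                       # ignore-case and free-spacing modes\n" +
        "\\s+ (src|class)            # attribute name (group 1)\n" +
        "\\s* = \\s*                 # equals sign, optional whitespace\n" +
        "(?: \"([^\"]+)\"            # double-quoted value (group 2)\n" +
        "  | '([^']+)'               # single-quoted value (group 3)\n" +
        "  | (\\S+?)(?=\\s|/?\\s*>)  # unquoted value (group 4)\n" +
        ")");

    // Returns the src/class attributes found in one tag, in order of appearance.
    static Map<String, String> parse(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            // Exactly one of the three value groups takes part in each match.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            attrs.put(m.group(1).toLowerCase(), value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(parse("<img src = \"the source\" class=class01 />"));
        // prints {src=the source, class=class01}
    }
}
```

Lower-casing the name is an assumption on my part, so that `SRC` and `src` end up under the same key; drop it if you need the original spelling.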

A lot of people think it is a bad idea to use regexes to parse HTML:

Regular Expressions - Where Angels Fear to Tread
Regex for Specific Tag
what is Regex pattern for html tag in java or android?

and to top them all off ...

RegEx match open tags except XHTML self-contained tags

(though this guy seems to disagree - RegEx match open tags except XHTML self-contained tags)

As Stephen C answered, using a regex for this is generally not very safe, and it might get you into trouble.

But here is something that might do what you need, at least for the given example:

 ([a-z]+) *= *"?((?:(?! [a-z]+ *=|/? *>|").)+)

See it on Rubular.

You may have to test it against more possible inputs, and it may need adjustments.

Here it is in Java code:

Pattern p = Pattern.compile("([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);
Matcher m = p.matcher(input);
while (m.find()){
    String key = m.group(1);
    String value = m.group(2);
    System.out.printf("%s: %s%n", key, value);
}
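
As a starting point for that kind of testing, here is a small sketch that runs the same pattern over the three sample tags from the question. The class name and the `parse` helper are hypothetical. As far as I can tell, an unquoted value that sits directly before ` />` comes back with a trailing space, so the sketch trims each value; treat that as one of the adjustments mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedAttributeDemo {
    // The pattern from this answer, unchanged.
    private static final Pattern ATTR = Pattern.compile(
        "([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);

    // Collects name/value pairs in order of appearance, trimming each value.
    static Map<String, String> parse(String input) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(input);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2).trim());
        }
        return attrs;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<img src = \"the source\" class=class01 />",
            "<img class=class02 src=folder/img.jpg />",
            "<img class= \"class01\" / >"
        };
        for (String s : samples) {
            System.out.println(parse(s));
        }
        // prints:
        // {src=the source, class=class01}
        // {class=class02, src=folder/img.jpg}
        // {class=class01}
    }
}
```

Single-quoted values and linefeeds around the `=` are not covered by this pattern, so those would be the next inputs to try.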
