How to get attributes and values from a badly formatted string in Java
I need to get the attributes and values from multiple strings such as these:
<img src = "the source" class=class01 />
<img class=class02 src=folder/img.jpg />
<img class= "class01" / >
Spaces and slashes are accepted in values, and some values are enclosed in quotes, while not all are. Some equal signs are spaced.
I'm new to this, so the code is messy and probably not foolproof.
My attempt:
//remove unnecessary spacing and "<img" and "/>"
str = str.replaceAll("/ >", "/>");
str = str.substring(4, str.length()-1);
str = str.replaceAll(" =", "=");
str = str.replaceAll("= ", "=");
//remove quotes
str = str.replaceAll("\"", "");
//creating a matcher and compiling the regex pattern is omitted, because I know how to do that using matcher.group();
regexSrc = "src=(.*?)($| class=)";
String srcString = matcherSrc.group(1);
regexClass = "class=(.*?)($| src=)";
String classString = matcherClass.group(1);
System.out.println("the source is: " + srcString);
System.out.println("the class is: " + classString);
Any suggestions on how to do this in a better way are appreciated.
If it is poorly formatted HTML, use JTidy to clean it up first, and then use a simpler regular expression or an HTML parser.
You say you've already extracted the <img> tag and you're working on it as a standalone string. That makes the job simpler, but there's still a great deal of complexity to deal with. For example, how would you handle this tag?
<img foosrc="whatever" barclass=noclass src =
folder/img.jpg class ='ho hum' ></img>
Here you've got:
- more than one space following the tag name
- attributes whose names only end with src and class
- a linefeed instead of a space after the second =
- more than one space between an attribute name and the =
- single-quotes instead of double-quotes around an attribute value
- no final / because the author used an old HTML-style image tag with a closing tag, not an XML-style self-closing tag.
...and it's all just as valid as the sample tags you provided. Maybe you know you'll never have to deal with any of those issues, but we don't. If we supply you with a regex tailored to your sample data without even mentioning these other issues, are we really helping you? Or helping the others with similar problems who happen to find this page?
Here you go then:
String[] tags = { "<img src = \"the source\" class=class01 />",
                  "<img class=class02 src=folder/img02.jpg />",
                  "<img class= \"class03\" / >",
                  "<img foosrc=\"whatever\" barclass=noclass" +
                  " class='class04' src =\nfolder/img04.jpg></img>" };
String regex =
    "(?i)\\s+(src|class)\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)'|(\\S+?)(?=\\s|/?\\s*>))";
Pattern p = Pattern.compile(regex);
int n = 1;
for (String tag : tags)
{
    System.out.printf("%ntag %d: %s%n", n++, tag);
    Matcher m = p.matcher(tag);
    while (m.find())
    {
        System.out.printf("%8s: %s%n", m.group(1),
            m.start(2) != -1 ? m.group(2) :
            m.start(3) != -1 ? m.group(3) :
            m.group(4));
    }
}
output:
tag 1: <img src = "the source" class=class01 />
     src: the source
   class: class01
tag 2: <img class=class02 src=folder/img02.jpg />
   class: class02
     src: folder/img02.jpg
tag 3: <img class= "class03" / >
   class: class03
tag 4: <img foosrc="whatever" barclass=noclass class='class04' src =
folder/img04.jpg></img>
   class: class04
     src: folder/img04.jpg
Here's a more readable form of the regex:
(?ix)                        # ignore-case and free-spacing modes
\s+                          # leading \s+ ensures we match the whole name
(src|class)                  # the attribute name is stored in group 1
\s*=\s*                      # \s* = any number of any whitespace
(?:                          # the attribute value, which may be...
    "([^"]+)"                # double-quoted (group 2)
  | '([^']+)'                # single-quoted (group 3)
  | (\S+?)(?=\s|/?\s*>)      # or not quoted (group 4)
)
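The free-spacing form can also be compiled as-is in Java, because the inline (?x) flag tells the engine to skip the layout whitespace and the # comments. Below is a minimal sketch of that idea; the class name ImgAttributes and the parse helper are hypothetical names chosen for illustration, and returning the result as a map (with the attribute name lowercased) is an assumption about what the caller wants, not part of the answer above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ImgAttributes {
    // Same regex as in the answer, kept in its commented, free-spacing form.
    // Each line ends in \n so that a # comment stops at the end of its line.
    private static final Pattern ATTR = Pattern.compile(
        "(?ix)                      # ignore-case and free-spacing modes\n" +
        "\\s+                       # whitespace before the attribute name\n" +
        "(src|class)                # attribute name (group 1)\n" +
        "\\s*=\\s*                  # equals sign, optionally spaced\n" +
        "(?:\n" +
        "    \"([^\"]+)\"           # double-quoted value (group 2)\n" +
        "  | '([^']+)'              # single-quoted value (group 3)\n" +
        "  | (\\S+?)(?=\\s|/?\\s*>) # unquoted value (group 4)\n" +
        ")");

    // Collects the src and class attributes of one tag, in order of appearance.
    static Map<String, String> parse(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            // Exactly one of groups 2, 3 and 4 takes part in each match.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            attrs.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(parse("<img src = \"the source\" class=class01 />"));
    }
}
```

Keeping the comments inside the pattern means the readable form and the compiled form cannot drift apart, at the cost of a slightly noisier string literal.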
A lot of people think it is a bad idea to use regexes to parse HTML:
- Regular Expressions - Where Angels Fear to Tread
- Regex for Specific Tag
- what is Regex pattern for html tag in java or android?
and to top them all off ...
- RegEx match open tags except XHTML self-contained tags
(though this guy seems to disagree - RegEx match open tags except XHTML self-contained tags)
As Stephen C answered, it is generally not safe to use a regex for this, and it might get you into trouble.
But here is something that might do what you need, at least for the given example:
([a-z]+) *= *"?((?:(?! [a-z]+ *=|/? *>|").)+)
See it on Rubular.
You may have to test it against more possible inputs, and it may need adjustments.
Here it is in Java code:
Pattern p = Pattern.compile("([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);
Matcher m = p.matcher(input);
while (m.find()) {
    String key = m.group(1);
    // trim: an unquoted value followed by " />" is captured with its trailing space
    String value = m.group(2).trim();
    System.out.printf("%s: %s%n", key, value);
}
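To act on the advice about testing against more inputs, a small harness along these lines may help. It is only a sketch: the class name SimpleAttrRegex and the parse helper are hypothetical, the first three inputs are the question's sample tags, and the fourth is an extra case added here to show one limit of this regex, namely that single quotes are not stripped from the value. Values are trimmed because an unquoted value followed by " />" is captured together with the space before the slash.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness, for illustration only.
class SimpleAttrRegex {
    // The regex from the answer above, unchanged.
    private static final Pattern ATTR = Pattern.compile(
        "([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);

    // Returns every name/value pair the regex finds, in order of appearance.
    static Map<String, String> parse(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            // trim() drops the trailing space captured before " />"
            attrs.put(m.group(1), m.group(2).trim());
        }
        return attrs;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "<img src = \"the source\" class=class01 />",
            "<img class=class02 src=folder/img.jpg />",
            "<img class= \"class01\" / >",
            "<img class='ho hum' >"  // single quotes stay in the value
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + parse(input));
        }
    }
}
```

Note that this pattern only accepts lowercase attribute names and picks up every attribute, not just src and class, so the caller has to filter the map if only those two are wanted.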