Help with Java regex
Hey, I've been struggling with this regex and I'm out of ideas. I have strings of the following two types (not all of them are listed here, but they all follow one of these two forms), and I have to extract the text between the th tags.
<th class="tip" title='manje'>manje</th>
<th class="tip" title='ne d.'>ne d.</th>
<th class="tip" title='manje'>manje</th>
<th class="tip" title='točno'>točno</th>
<th class="tip" title='više'>više</th>
<th class="tip" title='m./t.'>m./t.</th>
<th class="tip" title='v./t.'>v./t.</th>
<th class="tip">daje</th>
<th class="tip">X2</th>
<th class="tip">12</th>
I've tried some combinations, but I only get the value when the th tag has no "title" attribute.
This pattern only extracts the content if there is no "title" attribute in the th tag:
Pattern pattern = Pattern.compile("<th class=\"tip\"[\\s*|[.]{0,20}]>(.*?)\\s*</th>");
This one also:
Pattern patternType = Pattern.compile("<th class=\"tip\"[\\s*|[.]{0,20}]>(.*?)\\s*</th>");
Any suggestions? Tnx
Regular expressions are not suitable in all cases. Use Jsoup instead:
package so6235727;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PrintContent {

    private static final String html = //
            "<th class=\"tip\" title='manje'>manje</th>\r\n" + //
            "<th class=\"tip\" title='ne d.'>ne d.</th>\r\n" + //
            "<th class=\"tip\" title='manje'>manje</th>\r\n" + //
            "<th class=\"tip\" title='točno'>točno</th>\r\n" + //
            "<th class=\"tip\" title='više'>više</th>\r\n" + //
            "<th class=\"tip\" title='m./t.'>m./t.</th>\r\n" + //
            "<th class=\"tip\" title='v./t.'>v./t.</th>\r\n" + //
            "<th class=\"tip\">daje</th>\r\n" + //
            "<th class=\"tip\">X2</th>\r\n" + //
            "<th class=\"tip\">12</th>\r\n";

    public static void main(String[] args) {
        Document jsoup = Jsoup.parse(html);
        Elements headings = jsoup.select("th.tip");
        for (Element element : headings) {
            System.out.println(element.text());
        }
    }
}
See how easy this is?
Try this one:
Pattern pattern = Pattern.compile("<th class=\"tip\"[^>]*>(.*)</th>");
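A likely reason the pattern in the question fails, as far as I can tell: Java parses `[\s*|[.]{0,20}]` as a single character class (whitespace, `*`, `|`, `.`, `{`, `0`, `,`, `2`, `}`), so it consumes exactly one character instead of "optional whitespace or up to 20 characters". `[^>]*` skips everything up to the closing `>` instead. A minimal sketch of the difference; the class name `CharClassDemo`, the helper `extract` and the sample strings are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Pattern from the question: the bracket expression is ONE character class,
    // so exactly one character from that set must sit between "tip" and '>'.
    public static final Pattern ORIGINAL =
            Pattern.compile("<th class=\"tip\"[\\s*|[.]{0,20}]>(.*?)\\s*</th>");

    // Suggested replacement: [^>]* skips any attributes before the closing '>'.
    public static final Pattern FIXED =
            Pattern.compile("<th class=\"tip\"[^>]*>(.*)</th>");

    // Returns group 1 of the first match, or null if the pattern is not found.
    public static String extract(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String withTitle = "<th class=\"tip\" title='manje'>manje</th>";
        String noTitle = "<th class=\"tip\">daje</th>";
        String oneSpace = "<th class=\"tip\" >daje</th>"; // hypothetical variant

        System.out.println(extract(ORIGINAL, withTitle)); // null
        System.out.println(extract(ORIGINAL, noTitle));   // null
        System.out.println(extract(ORIGINAL, oneSpace));  // daje
        System.out.println(extract(FIXED, withTitle));    // manje
        System.out.println(extract(FIXED, noTitle));      // daje
    }
}
```

If that reading is right, the original pattern only matches when a single character from that set (for example one space) sits right before the `>`.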
Try this:
Pattern pattern = Pattern.compile("<th[^>]*>(.*?)\\s*</th>");
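The difference between `(.*)` and `(.*?)` only shows up when several cells sit on the same line: the greedy form runs to the last `</th>` on the line, while the reluctant one stops at the first. Here is a rough sketch that collects every cell with a `find()` loop. The names `ThExtractor` and `extractAll` are invented here, and the single-line `row` string is an assumption (the question shows one cell per line):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThExtractor {
    // Reluctant group: stops at the first closing tag.
    public static final Pattern RELUCTANT = Pattern.compile("<th[^>]*>(.*?)\\s*</th>");
    // Greedy group: runs to the last closing tag on the same line.
    public static final Pattern GREEDY = Pattern.compile("<th[^>]*>(.*)</th>");

    // Collects group 1 of every match found in the input.
    public static List<String> extractAll(Pattern pattern, String html) {
        List<String> values = new ArrayList<String>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Two cells on one line (assumed input, not taken from the question).
        String row = "<th class=\"tip\" title='manje'>manje</th><th class=\"tip\">X2</th>";
        System.out.println(extractAll(RELUCTANT, row)); // [manje, X2]
        System.out.println(extractAll(GREEDY, row));    // one element spanning both cells
    }
}
```

With one cell per line, as in the question, both forms should give the same result, because `.` does not match line terminators by default.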
What the heck, one more Pattern answer attempt, this one with lookahead and lookbehind:
Pattern pattern = Pattern.compile("(?<=<th .{0,100}>).*(?=</th>)");
EDIT 1
Regarding "I tried it and it doesn't work in any case": perhaps your harness is different from mine:
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Foo1 {
    private static final String FOO_TXT = "Foo1.txt";

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(?<=<th .{0,100}>).*(?=</th>)");
        Scanner scan = new Scanner(Foo1.class.getResourceAsStream(FOO_TXT));
        while (scan.hasNextLine()) {
            String line = scan.nextLine();
            System.out.println("Line: " + line);
            Matcher match = pattern.matcher(line);
            if (match.find()) {
                System.out.println("Match: " + match.group());
            } else {
                System.out.println("No match found");
            }
        }
    }
}
This assumes that the text file is named Foo1.txt and that it is located with the class files.
I'm including my test code because it seems I have positive/negative matches when others have negative/positive matches.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    public static void test(String patternString) {
        System.out.println("Test with pattern: " + patternString);
        Pattern pattern = Pattern.compile(patternString);
        String[] testStrings = {"<th class=\"tip\" title='manje'>manje</th>", "<th class=\"tip\">daje</th>"};
        for (String testString : testStrings) {
            System.out.println("> Test on " + testString);
            Matcher matcher = pattern.matcher(testString);
            if (matcher.matches()) {
                System.out.println(">> number of matches in group = " + matcher.groupCount());
                for (int i = 1; i <= matcher.groupCount(); i++) {
                    System.out.println(">>group " + i + " is " + matcher.group(i));
                }
            } else {
                System.out.println(">> no match");
            }
        }
        System.out.println("");
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        test("<th class=\"tip\"[\\s*|[.]{0,20}]>(.*?)\\s*</th>"); // op
        test("<th[^>]*>(.*?)\\s*</th>"); // Billy Moon
        test("<th class=\"tip\"[^>]*>(.*)</th>"); // stuken.yuri
        test("(?<=<th .{0,100}>).*(?=</th>)"); // Hovercraft full of Eels
        test("(?:<th .{0,100}>).*(?:</th>)");
    }
}
My output is that I get a match for Billy Moon and stuken.yuri, but no match for the OP or Hovercraft. I would be curious to see if others get the same. I am using Java 7 beta with Windows 7.
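One possible explanation for the differing results, offered as a guess rather than a verified diagnosis: `matcher.matches()` requires the pattern to cover the whole input, while `matcher.find()` looks for a matching substring. A lookaround pattern never consumes the tags themselves, so it cannot cover the whole line and `matches()` returns false, even though `find()` succeeds. A small sketch of that difference (the class name `MatchesVsFind` and its helper methods are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Lookbehind/lookahead pattern from the answer above: only the cell text is consumed.
    public static final Pattern LOOKAROUND =
            Pattern.compile("(?<=<th .{0,100}>).*(?=</th>)");

    // true only if the pattern consumes the entire line.
    public static boolean wholeInputMatches(String line) {
        return LOOKAROUND.matcher(line).matches();
    }

    // The first matching substring, or null if there is none.
    public static String firstFind(String line) {
        Matcher m = LOOKAROUND.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "<th class=\"tip\" title='manje'>manje</th>";
        System.out.println(wholeInputMatches(line)); // false
        System.out.println(firstFind(line));         // manje
    }
}
```

If this is what is going on, switching the test harness from `matches()` to `find()` should make the lookaround pattern report a match as well.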