Help extracting text from html tag with Java and Regex
I would like to extract some text from an HTML file using regex. I am learning regex and I still have trouble understanding it all. I have code which extracts all the text included between <body>
and </body>
Here it is:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {
    public static void main(String[] args) throws IOException {
        String toMatch = readFile();
        //Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);
        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream("user.html")))) {
            StringBuilder content = new StringBuilder();
            String strLine;
            // Read the file line by line, calling readLine() only once per
            // iteration so that no line is skipped
            while ((strLine = br.readLine()) != null) {
                // readLine() strips the line terminator, so the lines are
                // joined here without any line breaks
                content.append(strLine);
            }
            return content.toString();
        } catch (IOException e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}
Well, it works fine like this, but now I would like to extract the text between the tags:
<table class="claroTable">
and </table>
So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"
I have also tried
".*?<table class=\"claroTable\">(.*?)</table>.*?"
but it doesn't work and I don't understand why. There is only one table in the HTML file, but there is an occurrence of "table" in some JavaScript code: "...dataTables.js...". Could that be the reason for the mistake?
Thank you in advance for helping me,
EDIT: the HTML text to extract is something like:
<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>
What I would like to extract is anything between <table class="claroTable">
and </table>
Here's how you can do it with the JSoup parser:
File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();
Yes, you can also somehow do it with regex, but it will never be this easy.
Update: The main problem with your regex pattern is that you are missing the DOTALL flag. Without it, . does not match line terminators, so the pattern cannot span several lines:
Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
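To see the effect of the flag in isolation, here is a minimal sketch. The class name DotAllDemo, the helper matchesWith and the sample HTML string are made up for illustration; only the regex is taken from the question. It compiles the same pattern with and without DOTALL and checks whether matches() succeeds on multi-line input.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical sample input: the body spans several lines
    static final String HTML = "<html>\n<body>\nhello\n</body>\n</html>";

    // The body regex from the question
    static final String REGEX = ".*?<body.*?>(.*?)</body>.*?";

    // Returns true if the whole input matches the regex compiled with the given flags
    static boolean matchesWith(int flags) {
        return Pattern.compile(REGEX, flags).matcher(HTML).matches();
    }

    public static void main(String[] args) {
        // Without DOTALL, '.' stops at every line terminator
        System.out.println("without DOTALL: " + matchesWith(0));
        // With DOTALL, '.' also matches '\n'
        System.out.println("with DOTALL:    " + matchesWith(Pattern.DOTALL));
    }
}
```

Note that matches() requires the pattern to cover the entire input, which is why the leading and trailing .*? are needed in the first place.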
And if you just want the specified table tag with contents, you can do something like this:
String tableTag =
Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*?",Pattern.DOTALL)
.matcher(html)
.replaceFirst("$1");
(Updated: now returns the contents of the table tag only, not the table tag itself)
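One caveat with replaceFirst: if the pattern does not match, it returns the input unchanged, so a failed match looks like a very long "table". A possible alternative, sketched here under my own assumptions, is to use find() and read group(1). The class name TableExtractor, the method extractTable and the sample HTML are hypothetical, and the pattern is a slightly stricter variant of the one above, not the original.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableExtractor {
    // DOTALL lets '.' cross line breaks; the reluctant (.*?) stops at the first </table>
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>", Pattern.DOTALL);

    // Returns the inner HTML of the first claroTable table, or empty if there is none
    static Optional<String> extractTable(String html) {
        Matcher m = TABLE.matcher(html);
        // find() searches anywhere in the input, so no leading or trailing .*? is needed
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample input, including a script name that contains "Table"
        String html = "<html><head><script src=\"dataTables.js\"></script></head>\n"
                + "<body>\n<table class=\"claroTable\">\n"
                + "<tr><td>some data</td></tr>\n</table>\n</body></html>";
        System.out.println(extractTable(html).orElse("(no match)"));
    }
}
```

The "dataTables.js" occurrence is harmless here because the pattern only matches a literal <table opening tag. This sketch still breaks on nested tables, since the reluctant group ends at the first </table>, which is one more reason to prefer a real parser.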
As stated, this is a bad place to use regex. Only use regex when you actually need to, so basically try to stay away from it if you can. Take a look at this post though for parsers:
How to parse and modify HTML file in Java