Extract some content from a URL using regular expressions in Java
I want to extract content from this URL: http://www.xyz.com/default.aspx
Below is the content I want to extract using a regular expression.
String expr = ""; // What regular expression should I use here?
Pattern patt = Pattern.compile(expr, Pattern.DOTALL | Pattern.UNIX_LINES);
URL url4 = null;
try {
url4 = new URL("http://www.xyz.com/default.aspx");
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Text" +url4);
Matcher m = null;
try {
m = patt.matcher(getURLContent(url4));
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Match" +m);
while (m.find()) {
String stateURL = m.group(1);
System.out.println("Some Data" +stateURL);
}
public static CharSequence getURLContent(URL url8) throws IOException {
URLConnection conn = url8.openConnection();
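// Caution: getContentEncoding() returns the Content-Encoding header (for example "gzip"), not the character set.
String encoding = conn.getContentEncoding();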
if (encoding == null) {
encoding = "ISO-8859-1";
}
BufferedReader br = new BufferedReader(new
InputStreamReader(conn.getInputStream(), encoding));
StringBuilder sb = new StringBuilder(16384);
try {
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
System.out.println(line);
sb.append('\n');
}
} finally {
br.close();
}
return sb;
}
As @bkent314 mentioned, jsoup is a better and cleaner approach than regular expressions.
If you inspect the source of that page, the content you want is basically in this snippet:
<div class="smallHd_contentTd">
<div class="breadcrumb">...</div>
<h2>Services</h2>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
With jsoup, your code might look something like this:
Document doc = Jsoup.connect("http://www.ferotech.com/Services/default.aspx").get();
Element content = doc.select("div.smallHd_contentTd").first();
String header = content.select("h2").first().text();
System.out.println(header);
for (Element pTag : content.select("p")) {
System.out.println(pTag.text());
}
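If you would still rather stay with regular expressions, here is a rough sketch of how the heading and paragraphs could be pulled out of markup shaped like the snippet above. It is only an illustration: the class name RegexExtractSketch, the extract helper, and the sample HTML string are made up for this example, and the patterns assume simple, non-nested h2 and p tags. Real pages often break those assumptions, which is why a parser is usually the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractSketch {
    // DOTALL lets '.' match line breaks, so a tag body may span several lines.
    // The optional (?:\s[^>]*)? part allows attributes without also matching tags such as <pre>.
    static final Pattern HEADING = Pattern.compile(
            "<h2(?:\\s[^>]*)?>(.*?)</h2>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Collects the first capture group of every match, trimmed of surrounding whitespace.
    static List<String> extract(Pattern pattern, String html) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // Hypothetical sample input modelled on the snippet above; not the real page content.
        String html =
                "<div class=\"smallHd_contentTd\">\n"
              + "<div class=\"breadcrumb\">Home</div>\n"
              + "<h2>Services</h2>\n"
              + "<p>First paragraph.</p>\n"
              + "<p class=\"note\">Second paragraph.</p>\n"
              + "</div>";

        for (String heading : extract(HEADING, html)) {
            System.out.println("Heading: " + heading);
        }
        for (String paragraph : extract(PARAGRAPH, html)) {
            System.out.println("Paragraph: " + paragraph);
        }
    }
}
```

In your own code you would pass the result of getURLContent(url4).toString() to extract instead of the sample string. Note that the reluctant (.*?) stops at the first closing tag, so nested tags of the same name would be cut short.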
Hope this helps.