How to use HTML Parser to get complete information about all tags in the HTML page

2022-12-21 00:20 问答作者：

I am using HTML Parser to develop an application. The code below is not able to get the entire set of tags in the page. There are some tags which are missed out and the attributes and text body of them are also missed out. Please help me to explain why is this happening.....or suggest me other way....

 URL url = new URL("...");
 PrintWriter pw=new PrintWriter(new FileWriter("HTMLElements.txt"));

 URLConnection connection = url.openConnection();
 InputStream is = connection.getInputStream();
 InputStreamReader isr = new InputStreamReader(is);
 BufferedReader br = new BufferedReader(isr);

 HTMLEditorKit htmlKit = new HTMLEditorKit();
 HTMLDocument htmlDoc = (HTMLDocument)htmlKit.createDefaultDocument();
 HTMLEditorKit.Parser parser = new ParserDelegator();
 HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
 parser.parse(br, callback, true);

 ElementIterator iterator = new ElementIterator(htmlDoc);
 Element element;
   while ((element = iterator.next()) != null) 
   {
     AttributeSet attributes = element.getAttributes();
     Enumeration e=attributes.getAttributeNames();

     pw.println("Element Name :"+element.getName());
     while(e.hasMoreElements())
     {
      Object key=e.nextElement();
      Object val=attributes.getAttribute(key);
      int startOffset = element.getStartOffset();
   int endOffset = element.getEndOffset();
   int length = endOffset - startOffset;
   String text=htmlDoc.getText(startOffset, length);

      pw.println("Key :"+key.toS开发者_Go百科tring()+" Value :"+val.toString()+"\r\n"+"Text :"+text+"\r\n");

     }
   }

}

I am doing this fairly reliably with HTML Parser, (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.

General idea:

You first have to know in what tags (div, meta, span, etc) the information you want are in, and know the attributes to identify those tags. Example :

 <span class="price"> $7.95</span>

if you are looking for this "price", then you are interested in span tags with class "price".

HTML Parser has a filter-by-attribute functionality.

filter = new HasAttributeFilter("class", "price");

When you parse using a filter, you will get a list of Nodes that you can do a instanceof operation on them to determine if they are of the type you are interested in, for span you'd do something like

if (node instanceof Span) // or any other supported element.

See list of supported tags here.

An example with HTML Parser to grab the meta tag that has description about a site:

Tag Sample :

<meta name="description" content="Amazon.com: frankenstein: Books"/>

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        //<meta name="description" content="Some texte about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}

As per the comments:

actually i want to extract information such as product name,price etc of all products listed in an online shopping site such as amazon.com How should i go about it???

Step 1: read their robots file. It's usually found on the root of the site, for example http://amazon.com/robots.txt. If the URL you're trying to access is covered by a Disallow on an User-Agent of *, then stop here. Contact them, explain them in detail what you're trying to do and ask them for ways/alternatives/webservices which can provide you the information you need. Else you're violating the laws and you may risk to get blacklisted by the site and/or by your ISP or worse. If not, then proceed to step 2.

Step 2: check if the site in question hasn't already a public webservice available which is much more easy to use than parsing a whole HTML page. Using a webservice, you'll get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any webservices. If there's no way, proceed to step 3.

Step 3: learn how HTML/CSS/JS work, learn how to work with webdeveloper tools like Firebug, learn how to interpret the HTML/CSS/JS source you see by rightclick > View Page Source. My bet that the site in question uses JS/Ajax to load/populate the information you'd like to gather. In that case, you'll need to use a HTML parser which is capable of parsing and executing JS as well (the one you're using namely doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve and if that is allowed and if there aren't more-easy-to-use webservices available.

You seemed to use the Swing HtmlDocument. It may not be the smartest idea ever. I believe you would have better results using, as an example, NekoHtml.

Or another simple library you can use is jtidy that can clean up your html before parsing it. Hope this helps.

http://sourceforge.net/projects/jtidy/

Ciao!

继续阅读：screen-scraping

How to use HTML Parser to get complete information about all tags in the HTML page

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？