Jsoup: How to get all html between 2 header tags

2023-03-16 05:36 问答作者：

I am trying to开发者_JS百科 get all html between 2 h1 tags. Actual task is to break the html into frames(chapters) based of the h1(heading 1) tags.

Appreciate any help.

Thanks Sunil

If you want to get and process all elements between two consecutive h1 tags you can work on siblings. Here's some example code:

public static void h1s() {
  String html = "<html>" +
  "<head></head>" +
  "<body>" +
  "  <h1>title 1</h1>" +
  "  <p>hello 1</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>1</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 2</h1>" +
  "  <p>hello 2</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>2</td>" +
  "    </tr>" +
  "  </table>" +
  "  <h1>title 3</h1>" +
  "  <p>hello 3</p>" +
  "  <table>" +
  "    <tr>" +
  "      <td>hello</td>" +
  "      <td>world</td>" +
  "      <td>3</td>" +
  "    </tr>" +
  "  </table>" +    
  "</body>" +
  "</html>";
  Document doc = Jsoup.parse(html);
  Element firstH1 = doc.select("h1").first();
  Elements siblings = firstH1.siblingElements();
  List<Element> elementsBetween = new ArrayList<Element>();
  for (int i = 1; i < siblings.size(); i++) {
    Element sibling = siblings.get(i);
    if (! "h1".equals(sibling.tagName()))
      elementsBetween.add(sibling);
    else {
      processElementsBetween(elementsBetween);
      elementsBetween.clear();
    }
  }
  if (! elementsBetween.isEmpty())
    processElementsBetween(elementsBetween);
}

private static void processElementsBetween(
    List<Element> elementsBetween) {
  System.out.println("---");
  for (Element element : elementsBetween) {
    System.out.println(element);
  }
}

I don't know Jsoup that good, but a straight forward approach could look like this:

public class Test {

    public static void main(String[] args){

        Document document = Jsoup.parse("<html><body>" +
            "<h1>First</h1><p>text text text</p>" +
            "<h1>Second</h1>more text" +
            "</body></html>");

        List<List<Node>> articles = new ArrayList<List<Node>>();
        List<Node> currentArticle = null;

        for(Node node : document.getElementsByTag("body").get(0).childNodes()){
            if(node.outerHtml().startsWith("<h1>")){
                currentArticle = new ArrayList<Node>();
                articles.add(currentArticle);
            }

            currentArticle.add(node);
        }

        for(List<Node> article : articles){
            for(Node node : article){
                System.out.println(node);
            }
            System.out.println("------- new page ---------");
        }

    }

}

Do you know the structure of the articles and is it always the same? What do you want to do with the articles? Have you considered splitting them on the client side? This would be an easy jQuery Job.

Iterating over the elements between consecutive <h> elements seems to be fine, except one thing. Text not belonging to any tag, like in <h1/>this<h1/>. To workaround this I implemented splitElemText function to get this text. First split whole parent element using this method. Then except the element, process the suitable entry from the splitted text. Remove calls to htmlToText if you want raw html.

/** Splits the text of the element <code>elem</code> by the children
  * tags.
  * @return An array of size <code>c+1</code>, where <copde>c</code>
  * is the number of child elements.
  * <p>Text after <code>n</code>th element is found in <code>[n+1]</code>.
  */
public static String[] splitElemText(Element elem)
{
  int c = elem.children().size();
  String as[] = new String[c + 1];
  String sAll = elem.html();
  int iBeg = 0;
  int iChild = 0;
  for (Element ch : elem.children()) {
    String sChild = ch.outerHtml();
    int iEnd = sAll.indexOf(sChild, iBeg);
    if (iEnd < 0) { throw new RuntimeException("Tag " + sChild
                    +" not found in its parent: " + sAll);
    }
    as[iChild] = htmlToText(sAll.substring(iBeg, iEnd));
    iBeg = iEnd + sChild.length();
    iChild += 1;
  }
  as[iChild] = htmlToText(sAll.substring(iBeg));
  assert(iChild == c);
  return as;
}

public static String htmlToText(String sHtml)
{
  Document doc = Jsoup.parse(sHtml);
  return doc.text();
}

继续阅读：jsoup

Jsoup: How to get all html between 2 header tags

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？