Page content is loaded with JavaScript and Jsoup doesn't see it

2023-04-06 09:57 问答作者：

One block on the page is filled with content by JavaScript and after loading page with Jsoup there is none of that inforamtion. Is there a way to get also JavaScript generated content when parsing page with Jsoup?

Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw

Here's element which content I need: <div id='tags_list'></div>

I need to get this information in Java. Preferably using Jsoup. Element is field with help of JavaScript:

<div id="tags_list">
    <a href="/tagsc0t20099.html" style="font-size:14;">разведчик</a>
    <a href=开发者_如何学C"/tagsc0t1879.html" style="font-size:14;">Sr</a>
    <a href="/tagsc0t3140.html" style="font-size:14;">стратегический</a>
</div>

Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Test
{
    public static void main( String[] args )
    {
        try
        {
            Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
            Elements Tags = Doc.select( "#tags_list a" );

            for ( Element Tag : Tags )
            {
                System.out.println( Tag.text() );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }
}

JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.

To get access to that type of content you will need an embedded browser component, there are a number of discussions on SO regarding that kind of component, eg Is there a way to embed a browser in Java?

Solved in my case with com.codeborne.phantomjsdriver NOTE: it is groovy code.

pom.xml

        <dependency>
          <groupId>com.codeborne</groupId>
          <artifactId>phantomjsdriver</artifactId>
          <version> <here goes last version> </version>
        </dependency>

PhantomJsUtils.groovy

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

class PhantomJsUtils {
    private static String filePath = 'data/temp/';

    public static Document renderPage(String filePath) {
        System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
        WebDriver ghostDriver = new PhantomJSDriver();
        try {
            ghostDriver.get(filePath);
            return Jsoup.parse(ghostDriver.getPageSource());
        } finally {
            ghostDriver.quit();
        }
    }

    public static Document renderPage(Document doc) {
        String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
        FileUtils.writeToFile(tmpFileName, doc.toString());
        return renderPage(tmpFileName);
    }
}

ClassInProject.groovy

Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))

You need to understand what is happening :

When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.

The way to understand this is the following : parsing HTML code is easy. Executing Javascript code and updating corresponding HTML code is a lot more complex, and is the work of a browser.

Here are some solutions for this kind of problems:

If you can find what are the Ajax calls that Javascript code is making, that is loading content, you might be able to use the URL of these calls with Jsoup. In order to do that, use Developer Tools from your browser. But this is not guaranteed to work:
- it might be that the url is dynamic, and depends on what is on the page at that time
- if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.

You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.

pom.xml

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>3.35</version>
</dependency>

Simple Example From file https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit

// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());

// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());

// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
    for (Element col : row.select("td"))
        // print results
        System.out.println(col.ownText());

// clean up resources        
webClient2.close();

A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)

import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();

String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");

HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");

WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
            new URL("https://url"),
            "https://referrer");

// Add other cookies/ Session ...

webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);

URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);

requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);

// Wait
synchronized (page) {
    try {
        page.wait(15000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}

// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());

I fact there is a "way"! Maybe it is more "a workaround" than a "way... The code below checks both for meta attribute "REFRESH" and javascript redirects... If either of them exists RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...

    String RedirectedUrl=null;
    Elements meta = page.select("html head meta");
    if (meta.attr("http-equiv").contains("REFRESH")) {
        RedirectedUrl = meta.attr("content").split("=")[1];
    } else {
        if (page.toString().contains("window.location.href")) {
            meta = page.select("script");
            for (Element script:meta) {
                String s = script.data();
                if (!s.isEmpty() && s.startsWith("window.location.href")) {
                    int start = s.indexOf("=");
                    int end = s.indexOf(";");
                    if (start>0 && end >start) {
                        s = s.substring(start+1,end);
                        s =s.replace("'", "").replace("\"", "");        
                        RedirectedUrl = s.trim();
                        break;
                    }
                }
            }
        }
    }

... now retrieve the redirected page again...

It is possible by combining JSoup with another framework to interpret the webpage, in my example here I'm using HtmlUnit.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

...

WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);

Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");

After specifying user agent, my problem is solved.

https://github.com/jhy/jsoup/issues/287#issuecomment-12769155

Try:

Document Doc = Jsoup.connect(url)
    .header("Accept-Encoding", "gzip, deflate")
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
    .maxBodySize(0)
    .timeout(600000)
    .get();

继续阅读：jsoup web-scraping

Page content is loaded with JavaScript and Jsoup doesn't see it

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？