Groovy htmlunit getFirstByXPath returning null + OCR Question

2023-02-03 00:09 问答作者：

I have had a few issues with HtmlUnit returning nulls lately and am looking for guidance. each of my results for grabbing the first row of a website have returned null. I am wondering if someone can

A) explain why they might be returning null

B) explain better ways (if there are some) to go about getting the information

Here is my current code (URL is in the source):

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

def url = "http://www.hidemyass.com/proxy-list/"

page = client.getPage(url)

IpAddress = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[2]").getValue()
println "IP Address is: $data"          //returns null

//Port_Number is an Image

Country = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[4][@class='country']/@rel").getValue()
println "Country abbreviation is: $Country"

//differentiate speed and connection by name of gif?

Type = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[7]").getValue()
println "Proxy type is: $Type"

Anonymity = page.getFirstByXPath("//html/body/div/div/form/table/tbody/tr/td[8]").getValue()
println "Anonymity Level is: $Anonymity"

client.closeAllWindows()

Right now all of my XPaths return null and .getValue() obviously doesn't work on null.

I also have questions as to what I should do about the PORT since it is an image? Is there a better alternative than downloading it and attempting t开发者_JAVA百科o solve it by OCR?

Side Note

There is no significance in this site, I was just looking for a site that I could practice scraping on (the last one I ran into issues of fragment identities and couldn't get an answer to: HtmlUnit getByXpath returns null and HtmlUnit and Fragment Identities )

It looks like your xpath query is incorrect. Based on the url provided in the code sample the form element should be removed from the search path.

Groovy htmlunit getFirstByXPath returning null + OCR Question

Here is an xpath query that will be less prone to breaking when the layout of the page changes.

//table[@id='proxylist-table']/tbody/tr/td[2]

As far as the port number goes The author of that page must have wanted that portion of the data to not be scraped for some reason. Doing OCR might be your best option.

However, one thing you could do is look at the size of the image that is returned to guess the port number. For example I've noticed that images that display port 80 all have a content length of 406 or 411. Port 8080 are either 402 or 409. There are two different sizes to the images to blend in with the row color. If the Url ends in a 1 it will have a white back ground if it ends in 0 it will have a light grey back ground and always be a few bytes larger. There are obvious drawbacks to this approach but it may work.

继续阅读：groovy htmlunit screen-scraping

Groovy htmlunit getFirstByXPath returning null + OCR Question

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？