
How can I crawl PDF files served over HTTP with Nutch-1.0?

I want to know how to crawl PDF files that are served on the internet over the http protocol using Nutch-1.0.

I am able to do it on the local file system using the file:// protocol, but not over the http protocol.


Add this property to your nutch-site.xml file. It includes the parse-pdf plugin, so Nutch will then be able to crawl the PDF files:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>protocol-httpclient|urlfilter-regex|parse-(html|text|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</description>
</property>
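
If PDFs are fetched but still fail to parse, one further setting may be worth checking. This is an assumption about the setup, not something stated in the question: Nutch's http.content.limit defaults to 65536 bytes, and content longer than that is truncated, which can leave larger PDFs unparseable. A minimal sketch for the same nutch-site.xml that lifts the limit (the value -1 is chosen here for illustration; a large positive byte count would also work):

```xml
<!-- Assumption: PDFs larger than the default 65536-byte limit are being truncated. -->
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>Maximum number of bytes downloaded per document over http; a negative value disables truncation.</description>
</property>
```

Removing the limit means very large files are downloaded in full, so a generous positive value may be the safer choice on a broad crawl.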