
How to parse (text only) web sites while crawling

I can successfully run the crawl command via Cygwin on Windows XP, and I can also run web searches using Tomcat.

But I also want to save the parsed pages during the crawl.

So when I start crawling like this:

bin/nutch crawl urls -dir crawled -depth 3

I also want the parsed HTML saved to text files. I mean that during the crawl started with the command above, whenever Nutch fetches a page it should also automatically save the parsed content of that page (text only) to a text file, with the fetched URL as the file name.

I really need help with this. It will be used in my university's language-detection project.

Thanks.


The crawled pages are stored in the segments. You can access them by dumping the segment content:

nutch readseg -dump crawl/segments/20100104113507/ dump

You will have to do this for each segment.
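Since you want only the extracted text, you can dump every segment in a loop and suppress the non-text parts of each segment. This is a sketch assuming the `crawled` output directory from the crawl command above; the `-no...` flags come from Nutch's `SegmentReader` and should be checked against your Nutch version's usage output:

```shell
# Dump only the parsed text (ParseText) of every segment under
# crawled/segments/, one output directory per segment.
for segment in crawled/segments/*; do
  bin/nutch readseg -dump "$segment" "dump_$(basename "$segment")" \
    -nocontent -nofetch -nogenerate -noparse -noparsedata
done
```

Each `dump_<timestamp>` directory then contains a plain-text dump keyed by URL, which you can post-process into one file per page for your language-detection corpus.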

