Which metadata should I save when downloading web pages?

I'm going to download several thousand web pages for future language processing, and I'm deciding which metadata to save with each one. I have looked into this, but I don't want to overlook anything important. So far I have:

<title>
<link>
<publish_date>
<date_downloaded>
<source>  // to this page
<keyword> // for Solr indexing
<text>    // cleaned body of page

Is there anything important that I might miss later?


There are some others that you might find useful:

  • Document type (article, advertisement, landing page, etc.)
  • Subtitle/Headline/Abstract
  • Image location (URLs of the images, if you want to display them in your web app)
  • Author
  • Section (so you can use fq in your Solr queries to restrict results to specific sections)
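
To make the combined field list concrete, here is a minimal sketch of one record per page, plus a helper that builds the query string for a section-restricted Solr search. The class name PageRecord, the Python field names, and the helper section_filter_params are illustrative assumptions, not part of Solr or of the question. The only Solr detail relied on is that the q and fq request parameters exist, and the section field is assumed to be indexed under that name.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import urlencode


@dataclass
class PageRecord:
    """Hypothetical per-page record; field names are illustrative."""
    # Fields from the question
    title: str
    link: str
    source: str
    text: str                                  # cleaned body of the page
    publish_date: Optional[str] = None         # often missing on real pages
    date_downloaded: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    keywords: List[str] = field(default_factory=list)
    # Fields suggested in the answer
    doc_type: Optional[str] = None             # e.g. "article", "landing page"
    subtitle: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)
    author: Optional[str] = None
    section: Optional[str] = None


def section_filter_params(query: str, section: str) -> str:
    """Build a URL query string that uses fq to restrict results to one section.

    Assumes the Solr schema has a field literally named "section".
    """
    return urlencode({"q": query, "fq": f'section:"{section}"'})


if __name__ == "__main__":
    record = PageRecord(
        title="Example title",
        link="http://example.com/a",
        source="example.com",
        text="Cleaned body text.",
        section="news",
    )
    print(json.dumps(asdict(record), indent=2))
    print(section_filter_params("text:economy", "news"))
```

Keeping optional fields as None rather than empty strings makes it easier to tell later which pages genuinely lacked an author or publish date.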