Which metadata should I save when downloading web pages?
I'm going to download a few thousand web pages for later language processing, and I'm now deciding which metadata to save. I've looked into this, but I don't want to overlook anything important.
<title>
<link>
<publish_date>
<date_downloaded>
<source> // to this page
<keywords> // for Solr indexing
<text> // cleaned body of page
Is there anything important that I might miss later?
There are some others that you might find interesting:
- Document type (is it an article, an advertisement, a landing page, etc.)
- Subtitle/Headline/Abstract
- Image location (URLs of images, if you want to display them in your web app)
- Author
- Section (so you could use fq in your Solr queries to restrict results to specific sections)
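
To make the combined list concrete, here is a minimal sketch of one record per downloaded page, written in Python. The class name `PageRecord`, the field names, the defaults, and the example values are assumptions for illustration only; they are not a required Solr schema, and you would adapt them to your own.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List, Optional


def _utc_now() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class PageRecord:
    """Hypothetical per-page record; names are illustrative assumptions."""

    # Fields listed in the question
    title: str
    link: str
    text: str                      # cleaned body of the page
    source: str
    publish_date: Optional[str] = None
    date_downloaded: str = field(default_factory=_utc_now)
    keywords: List[str] = field(default_factory=list)

    # Fields suggested in the answer
    doc_type: Optional[str] = None     # article, advertisement, landing page, ...
    subtitle: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)
    author: Optional[str] = None
    section: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the record to a JSON string, keeping non-ASCII text."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Example values are made up for the sketch.
record = PageRecord(
    title="Example title",
    link="https://example.com/news/1",
    text="Cleaned body text.",
    source="example.com",
    section="news",
)
print(record.to_json())
```

If `section` is indexed as its own field, a Solr filter query such as `fq=section:news` could then restrict results to that section. Keeping optional fields as `None` or empty lists means a page with missing metadata still produces a complete record.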