开发者

Effective way of getting an article's published date/author dynamically?

I'm working on a referencing webapp as part of a cou开发者_如何学Pythonrse I am studying, the aim of which is to allow students to quickly and easily reference the materials they find information in and I'm running into a couple of issues with things.

The first is getting an article/site's published date. When dealing with static HTML sites this is easy, as I can simply use document.lastModified to pull in the time it was last modified. Issues arise when dealing with the much more common CMS powered website, as pages are dynamically generated which causes document.lastModified to always return the equivalent of 'now'... which isn't accurate at all.

There are steps that developers of sites can take to make this a bit easier with the implementation of HTML5, namely with the addition of the element, which can have additional attributes set to define it as the time a post was published. Sites like these are fine, but the vast majority of sites aren't using HTML5 and I don't really see this changing any time soon. Anyone out there got some ideas on how to accurately identify when a post was created?

The second is accurately identifying the author of a post or page. There are a couple of ways to identify this. The first is if a site has used the hAtom microformat to identify elements of the site, which makes things easy... but as with post dates isn't common.

The next is looking at the meta data of a site, and identifying the author based on content stored there. This is both uncommon and also generally the owner of the site, or another person not responsible for the post, which leaves it somewhat unreliable for use as a resource.


If the website has an RSS feed and the article is recent enough to be included in it you could extract metadata about the article from it.


Sounds like a pretty tough thing to make, only because there is absolutely no standardization for this information that I know of. Some sites might put it in their keywords, others not.

I did some scraping as part of a media criticism class, and I find that pretty much each cms has to be processed individually. Overall, making something that would find the author info on a random web pages sounds very difficult.

You might be able to make something specifically for capturing this info from WordPress blogs, since those have so many commonalities. But something designed to just hit up any site and grab specific pieces of info, that's pretty tough.

Not trying to discourage you at all, just saying that you've set a pretty high goal, imho.


Sorry I can't help very much, but what about using regex to scan the page for 'By ___' or 'Source: ___' to get the author / source of the information?

As for the date last modified, as far as I know there's no easy way to grab this, as regex'ing for a date would return recent articles in sidebars, links, etc. And yeah, as you said document.lastmodified wouldn't work. You could consider replacing this with "date added" to your referencer, or similar.

Hope this helps you at least a little bit, and if not, gives you an idea or two.

Of course, if there's any API / RSS available, you could scan it for the last updated / posted date, and use that?

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜