Effective way of getting an article's published date/author dynamically?

2023-02-10 12:17 问答作者：

I'm working on a referencing webapp as part of a cou开发者_如何学Pythonrse I am studying, the aim of which is to allow students to quickly and easily reference the materials they find information in and I'm running into a couple of issues with things.

The first is getting an article/site's published date. When dealing with static HTML sites this is easy, as I can simply use document.lastModified to pull in the time it was last modified. Issues arise when dealing with the much more common CMS powered website, as pages are dynamically generated which causes document.lastModified to always return the equivalent of 'now'... which isn't accurate at all.

There are steps that developers of sites can take to make this a bit easier with the implementation of HTML5, namely with the addition of the element, which can have additional attributes set to define it as the time a post was published. Sites like these are fine, but the vast majority of sites aren't using HTML5 and I don't really see this changing any time soon. Anyone out there got some ideas on how to accurately identify when a post was created?

The second is accurately identifying the author of a post or page. There are a couple of ways to identify this. The first is if a site has used the hAtom microformat to identify elements of the site, which makes things easy... but as with post dates isn't common.

The next is looking at the meta data of a site, and identifying the author based on content stored there. This is both uncommon and also generally the owner of the site, or another person not responsible for the post, which leaves it somewhat unreliable for use as a resource.

If the website has an RSS feed and the article is recent enough to be included in it you could extract metadata about the article from it.

Sounds like a pretty tough thing to make, only because there is absolutely no standardization for this information that I know of. Some sites might put it in their keywords, others not.

I did some scraping as part of a media criticism class, and I find that pretty much each cms has to be processed individually. Overall, making something that would find the author info on a random web pages sounds very difficult.

You might be able to make something specifically for capturing this info from WordPress blogs, since those have so many commonalities. But something designed to just hit up any site and grab specific pieces of info, that's pretty tough.

Not trying to discourage you at all, just saying that you've set a pretty high goal, imho.

Sorry I can't help very much, but what about using regex to scan the page for 'By ___' or 'Source: ___' to get the author / source of the information?

As for the date last modified, as far as I know there's no easy way to grab this, as regex'ing for a date would return recent articles in sidebars, links, etc. And yeah, as you said document.lastmodified wouldn't work. You could consider replacing this with "date added" to your referencer, or similar.

Hope this helps you at least a little bit, and if not, gives you an idea or two.

Of course, if there's any API / RSS available, you could scan it for the last updated / posted date, and use that?

继续阅读：external javascript jquery php

Effective way of getting an article's published date/author dynamically?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？