PHP: Detect a Page's Genre/Category

I was wondering if there is any way to detect a page's genre/category.

Possibly there is a way to find keywords or something?

Unfortunately I don't have any idea so far, so I don't have any code to show you.

But if anybody has any ideas at all, let me know.

Thanks!

EDIT @Nican

Perhaps there is a way to set up, let's say, 10 categories (Entertainment, Funny, Tech).

Then create keywords for these categories (Funny = Laughter, Funny, Joke, etc.).

Then search through a webpage (maybe fetched with cURL) for these keywords and assign it to the right category.

Hope that makes sense.


What you are talking about is basically what Google AdSense and similar services do: they analyze the content of a page and match it to topics. Generally, this goes beyond simple programming and would require significant resources to get it to work "right".

A basic system might work along the following lines:

  • Get page content
  • Get the X most commonly used words (omitting stop words like "and", "or", etc.)
  • Get words used in headings
  • Assign weights to different words according to a set of factors (is used in a heading, is used in more than one paragraph, is used in link anchors)
  • Match the filtered words against a database of words related to a specific "category"
  • If the cumulative score > threshold, classify the site as belonging to that category
  • Rinse and repeat
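
The steps above can be sketched roughly as follows. This is only a minimal illustration, written in Python for brevity (the same structure carries over to PHP). The stop-word list, the keyword sets, the weights, and the threshold are all made-up values, and the names `TextExtractor`, `score_page`, and `classify` are hypothetical. It scores every non-stop word rather than only the top X, and it only weights headings; the "more than one paragraph" and "link anchor" factors are left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed values for illustration only: tune these against real pages.
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is", "it", "for", "on"}
CATEGORY_KEYWORDS = {
    "funny": {"joke", "jokes", "laughter", "funny", "humor"},
    "tech": {"software", "php", "computer", "code", "server"},
}
HEADING_WEIGHT = 3.0  # each occurrence inside <h1>..<h6>
BODY_WEIGHT = 1.0     # each occurrence anywhere else
THRESHOLD = 4.0       # minimum cumulative score to assign a category


class TextExtractor(HTMLParser):
    """Collects heading words and body words separately, skipping script/style."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.heading_depth = 0
        self.skip_depth = 0
        self.body_words = []
        self.heading_words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.heading_depth += 1
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self.heading_depth:
            self.heading_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        words = [w for w in re.findall(r"[a-z']+", data.lower())
                 if w not in STOP_WORDS]
        target = self.heading_words if self.heading_depth else self.body_words
        target.extend(words)


def score_page(html):
    """Return {category: cumulative weighted score} for one HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    weights = Counter()
    for word in parser.body_words:
        weights[word] += BODY_WEIGHT
    for word in parser.heading_words:
        weights[word] += HEADING_WEIGHT
    return {category: sum(weights[w] for w in keywords)
            for category, keywords in CATEGORY_KEYWORDS.items()}


def classify(html):
    """Return every category whose score exceeds THRESHOLD, sorted by name."""
    return sorted(c for c, score in score_page(html).items() if score > THRESHOLD)
```

Calling `classify(page_html)` on a fetched page returns a list of matching category names, which may be empty or contain more than one entry. How well it works depends almost entirely on the keyword lists and weights.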


Folksonomy may be a way of accomplishing what you're looking for:

http://en.wikipedia.org/wiki/Folksonomy

For instance, in Drupal they have a Folksonomy module:

http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)

Couple that with a tag cloud generator, and you may get somewhere:

http://drupal.org/project/searchcloud

With a little more complexity, you may also be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options:

http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html

EDIT

In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...

  1. Get unique word values from your content (index values or create a bot to crawl your site)
  2. Remove all words and symbols you can't use (at, the, or, and, etc...)
  3. Count the number of times the unique words appear on the page
  4. Add them to some type of datastore so you can call them based on the relationships you're mapping
  5. If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
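
As a rough sketch of those five steps, here is one possible shape, written in Python with an in-memory SQLite database standing in for the datastore. The table names (`page_words`, `labels`), the function names, and the stop-word list are assumptions made up for this example, not part of any particular module or product.

```python
import re
import sqlite3
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"at", "the", "or", "and", "a", "an", "of", "to", "in", "is"}


def word_counts(text):
    """Steps 1-3: extract words, drop stop words, count occurrences."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def build_store():
    """Step 4: create a datastore (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE page_words (page TEXT, word TEXT, n INTEGER)")
    db.execute("CREATE TABLE labels (word TEXT, label TEXT)")
    return db


def index_page(db, page, text):
    """Step 4: store one row per unique word on the page."""
    rows = [(page, word, n) for word, n in word_counts(text).items()]
    db.executemany("INSERT INTO page_words VALUES (?, ?, ?)", rows)


def label_totals(db, page):
    """Step 5: join word counts against the root labels for one page."""
    query = """
        SELECT l.label, SUM(p.n) AS total
        FROM page_words AS p
        JOIN labels AS l ON l.word = p.word
        WHERE p.page = ?
        GROUP BY l.label
        ORDER BY total DESC, l.label
    """
    return db.execute(query, (page,)).fetchall()
```

After filling `labels` with word-to-label pairs and calling `index_page` for each page, `label_totals(db, page)` returns `(label, total)` tuples with the strongest label first. In a folksonomy, the `labels` table would be fed by user-contributed tags rather than a hand-written list.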

This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.
