
Can I identify intranet page content using Named Entity Recognition?

I am new to Natural Language Processing and want to learn more by building a simple project. NLTK was recommended to me as a popular NLP library, so I will use it in my project.

Here is what I would like to do:

  • I want to scan our company's intranet pages (approximately 3,000 pages)
  • I would like to parse and categorize the content of these pages by category, such as HR, Engineering, Corporate Pages, etc.

From what I have read so far, I can do this with Named Entity Recognition: describe the entities for each category of pages, train an NLTK model, and run each page through it to determine the category.

Is this the right approach? I would appreciate any direction or ideas.

Thanks


It looks like you want to do text/document classification, which is not quite the same as Named Entity Recognition. In NER, the goal is to recognize the named entities (proper names, places, institutions, etc.) that appear in a text, not to assign the text as a whole to a category. However, proper names can be very good features for text classification in a limited domain. For example, a page that contains the name of the head engineer is likely to belong in Engineering.

The NLTK book has a chapter on basic text classification.
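
To make the idea concrete, here is a minimal sketch of bag-of-words classification with a small Naive Bayes classifier written in plain Python. It is only an illustration, not the NLTK book's code: the function names (`tokenize`, `train`, `classify`), the sample pages, and the person "Alice Smith" are all made up for this example. In a real project you would label a few hundred of your own pages and would probably use a library classifier such as `nltk.NaiveBayesClassifier` instead of writing your own.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a page's text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_pages):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    page_counts = Counter()             # category -> number of pages
    for text, category in labeled_pages:
        page_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, page_counts, vocabulary


def classify(text, model):
    """Return the category with the highest Naive Bayes log-probability."""
    word_counts, page_counts, vocabulary = model
    total_pages = sum(page_counts.values())
    best_category, best_score = None, -math.inf
    for category in page_counts:
        total_words = sum(word_counts[category].values())
        # Start from the prior: how common the category is in training data.
        score = math.log(page_counts[category] / total_pages)
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing so an unseen word does not zero out a category.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical hand-labeled pages; real training data would be your own pages.
training_pages = [
    ("vacation policy benefits payroll enrollment", "HR"),
    ("performance review and onboarding for new employees", "HR"),
    ("build server deployment pipeline and code review", "Engineering"),
    ("api design document written by alice smith head engineer", "Engineering"),
    ("quarterly earnings press release and company mission", "Corporate"),
    ("board of directors and investor relations news", "Corporate"),
]

model = train(training_pages)
print(classify("Alice Smith updated the deployment pipeline", model))
```

Note that the proper name ends up as just another feature here: the tokens "alice" and "smith" only occur in Engineering training pages, so they push a new page that mentions her toward that category. A named entity recognizer could be used to extract such names as features, but the classification step itself is what assigns the category.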
