How do you scrape fields for auto-tagging?

We have a form with a large textarea and a couple of text fields. We also have a list of 1500 tags (some contain spaces), categorized into 5 types. What is the best way to scan the text users enter and extract any tags it contains?

We do not want to give them a tag field - it needs to happen automatically.

Any ideas?


Front-end wise:

I would suggest using one of the available jQuery autocompletion plugins (there are many; just search around) that makes an AJAX request for each tag being typed and receives a JSON object with the similar tags. For this you'll need a route that can be queried, for example http://mysite.com/tags?s=%s, which returns JSON.
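As a rough illustration of what such a route could compute, here is a minimal sketch. The names suggestTags and suggestTagsJson, the substring matching, and the limit of 10 are all assumptions made for the example; the answer does not say how "similar" tags should be chosen.

```javascript
// Hypothetical helper: returns up to `limit` tags that contain the query text.
function suggestTags(query, allTags, limit = 10) {
  const needle = query.trim().toLowerCase();
  if (needle === "") return [];
  return allTags
    .filter((tag) => tag.toLowerCase().includes(needle))
    .slice(0, limit);
}

// Hypothetical helper: the JSON body a /tags?s=... route might send back.
function suggestTagsJson(query, allTags, limit = 10) {
  return JSON.stringify(suggestTags(query, allTags, limit));
}
```

With 1500 tags a linear scan like this is cheap enough that no index is needed.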

The other way, the lazy way, which is doable given the number of tags you have (and depending on whether the tag list is something users are allowed to see), is to output the whole tag array as a JSON object embedded in the document. I don't recommend this unless you're in a real hurry to solve the problem and don't mind loading the extra data.

The tags should be separated by commas.

Back-end wise:

Once the form is submitted, you'll need an extra step to parse the given tags. Just do tags.split(',') and you'll get an array of tags that you can then iterate over to insert the data into the database.
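A bare split(',') leaves surrounding spaces and empty entries in the array, so a slightly more defensive version might look like the sketch below. The function name parseTagList and the case-insensitive de-duplication are assumptions added for the example.

```javascript
// Hypothetical helper: turns "bikes, road bike ,,Bikes" into a clean tag array.
function parseTagList(raw) {
  const seen = new Set();
  const result = [];
  for (const piece of raw.split(",")) {
    const tag = piece.trim();
    const key = tag.toLowerCase();
    // Skip empty entries and tags already seen (ignoring case).
    if (tag !== "" && !seen.has(key)) {
      seen.add(key);
      result.push(tag);
    }
  }
  return result;
}
```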


If I understand your problem correctly, one solution could be this:

  1. On application load, build a Set with all the tags.
  2. When a user posts some text, iterate through all the words and check them against the Set.

This would be pretty fast for your purpose, since a lookup in a hash-based Set takes constant time on average.

If a word is included in your tag-set, add the word to a new Set. When done iterating through all the words, do the database queries to associate the new tags with the uploaded text.
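The steps above could be sketched roughly as follows. Because the question mentions tags that contain spaces, this version also checks short runs of consecutive words, not just single words; that extension is my assumption, not part of the answer. It uses a Map instead of a Set only so the original spelling of each tag can be returned, and it assumes tags are made of letters, digits, and spaces (a tag such as "node.js" would not be found by this tokenizer). All function names are hypothetical.

```javascript
// Lowercase and collapse whitespace so "Road  Bike" and "road bike" compare equal.
function normalize(text) {
  return text.toLowerCase().trim().split(/\s+/).join(" ");
}

// Built once on application load: normalized tag -> original tag.
function buildTagIndex(tags) {
  const index = new Map();
  let maxWords = 1;
  for (const tag of tags) {
    const key = normalize(tag);
    if (key === "") continue;
    index.set(key, tag);
    maxWords = Math.max(maxWords, key.split(" ").length);
  }
  return { index, maxWords };
}

// Called per submission: returns the distinct tags that occur in the text.
function findTags(text, { index, maxWords }) {
  // Split on anything that is not a letter, digit, or apostrophe.
  const words = text.toLowerCase().split(/[^\p{L}\p{N}']+/u).filter(Boolean);
  const found = new Set();
  for (let start = 0; start < words.length; start++) {
    // Try every run of 1..maxWords words beginning at this position.
    for (let len = 1; len <= maxWords && start + len <= words.length; len++) {
      const candidate = words.slice(start, start + len).join(" ");
      if (index.has(candidate)) found.add(index.get(candidate));
    }
  }
  return [...found];
}
```

Each candidate costs one hash lookup, so the work grows with the length of the text times the longest tag's word count, not with the number of tags.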


If I understand this right, you could use a regex, though I am not sure how efficient it is with 1500 matchable alternatives (it would be good if you can cover several tags with a single regex).

// tags holds the alternatives in the form "tag1|tag2|tag3";
// each tagN can itself be a regex that matches several tags in your list.
var pattern = new RegExp(tags, "g");

for (var index = 0; index < textAreas.length; index++)
{
    // Use .value to read what the user typed into the textarea.
    // match() returns an array of the tags found, or null if there are none.
    var found = textAreas[index].value.match(pattern);
}
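If the pattern is built from the tag list itself, a sketch along these lines may help. It escapes regex metacharacters in each tag, puts longer tags first so "bike rack" is preferred over "bike" when both start at the same position, and adds word boundaries. The function names are hypothetical, and the word boundaries assume every tag begins and ends with a letter or digit and that the list is not empty.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive, global pattern out of all tags.
function buildTagPattern(tags) {
  // Longest first: alternation picks the first alternative that matches.
  const alternatives = [...tags]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp);
  return new RegExp("\\b(?:" + alternatives.join("|") + ")\\b", "gi");
}

// Return the distinct tags (lowercased) that the pattern finds in the text.
function matchTags(text, pattern) {
  const found = new Set();
  for (const match of text.matchAll(pattern)) {
    found.add(match[0].toLowerCase());
  }
  return [...found];
}
```

Regex matches do not overlap, so a "bike" that is part of a matched "bike rack" is not reported separately.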


I won't edit my previous answer, since this is a completely different approach from the one proposed there; editing it would mean rewriting it, which is a bad idea because that answer may still be useful to someone.

One way to do "auto tagging", in the sense that users never have to write a single keyword themselves, is to parse the content with awareness of its context (for instance, if your users write about bikes, make sure bike-related words are not discarded).

To begin with the content:

  • Remove Pronouns
  • Remove Common Names (non-related)
  • Remove Conjunctions
  • Remove Prepositions
  • Remove link addresses (but keep the word that is linked)
  • Split the remaining text into words, and weight them by how often they appear.
  • Give more weight to words that are linked or that appear on the title tag.

This should be done on the back-end, since odds are you'll be doing a lot of preparation: removing HTML at specific points, iterating through arrays, weighting the words and sanitizing them.
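A very small sketch of the weighting idea, under several assumptions: the stop-word list is a tiny placeholder for the pronoun, conjunction, and preposition lists described above, HTML is stripped with a naive regex that is not a real HTML parser, and title words simply count three times as much as body words. Extra weight for linked words is left out. The names STOP_WORDS and weightWords are hypothetical.

```javascript
// Tiny illustrative stop-word list; a real one would be far longer.
const STOP_WORDS = new Set([
  "a", "an", "and", "the", "of", "to", "in", "on", "for", "with",
  "i", "it", "is", "my",
]);

// Returns [word, weight] pairs, highest weight first.
function weightWords(title, body, titleBoost = 3) {
  const weights = new Map();
  const addWords = (text, weight) => {
    // Drop HTML tags (naively), then split on anything that is not a letter or digit.
    const plain = text.replace(/<[^>]*>/g, " ").toLowerCase();
    for (const word of plain.split(/[^\p{L}\p{N}]+/u)) {
      if (word === "" || STOP_WORDS.has(word)) continue;
      weights.set(word, (weights.get(word) || 0) + weight);
    }
  };
  addWords(title, titleBoost); // title words count more
  addWords(body, 1);
  return [...weights.entries()].sort((a, b) => b[1] - a[1]);
}
```

The top few entries of the result are the candidate tags; they could then be checked against the known tag list before anything is stored.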
