How do you scrape fields for auto-tagging?

2023-03-08 19:48 Question author:

We have a form with a large textarea and a couple of text fields. We also have a list of 1,500 tags (some contain spaces) categorized into 5 types. What is the best way to scan the text entered by users and extract any tags they may have entered?

We do not want to give them a tag field - it needs to happen automatically.

Any ideas?

Front-end wise:

I would suggest using one of the available jQuery autocompletion plugins (there are many, just google around) that makes an AJAX request as the user types a tag and gets back a JSON object with the similar tags. To do this you'll need a route that you can query, for example http://mysite.com/tags?s=%s, which returns JSON.
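As a rough sketch of what such a route could compute, the matching step might look like the following. It is written in plain JavaScript for illustration; the sample tags, the function names suggestTags and suggestTagsJson, and the shape of the JSON response are all assumptions, not part of any particular plugin or framework.

```javascript
// Hypothetical sample data; the real list would hold all 1,500 tags.
const TAGS = ["mountain bike", "road bike", "bike repair", "helmet", "ruby on rails"];

// Return up to `limit` tags that contain the query, prefix matches first.
function suggestTags(query, tags = TAGS, limit = 10) {
  const q = query.trim().toLowerCase();
  if (q === "") return [];
  const prefix = [];
  const inner = [];
  for (const tag of tags) {
    const pos = tag.toLowerCase().indexOf(q);
    if (pos === 0) prefix.push(tag);
    else if (pos > 0) inner.push(tag);
  }
  return prefix.concat(inner).slice(0, limit);
}

// The body a /tags?s=... route could send back (assumed response shape).
function suggestTagsJson(query) {
  return JSON.stringify({ query: query, tags: suggestTags(query) });
}
```

With only 1,500 tags a linear scan per request like this is likely fast enough; an index would only be worth adding if the list grows a lot.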

The other way to do it, the lazy way, which is doable considering the number of tags you have (and, of course, depending on whether this is something users are allowed to see), is to output the whole tag array as a JSON object embedded in the document. I don't recommend this unless you urgently need to solve the problem and you don't mind loading the extra data.

The tags should be separated by commas.

Back-end wise:

Once the form is submitted, you'll need an extra step to parse the given tags. Just do a tags.split(',') and you'll get an array of tags that you can then iterate over to insert the data into the database.
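A bare split(',') leaves stray whitespace, empty entries, and duplicates in the array, so a slightly more defensive version may be worth having. This is only a sketch; parseTags is a hypothetical helper name, and treating tags as case-insensitive duplicates is an assumption.

```javascript
// Split a comma-separated tag string into a clean list of unique tags.
function parseTags(input) {
  const seen = new Set();
  const result = [];
  for (const piece of input.split(",")) {
    // Trim the ends and collapse inner runs of whitespace ("Ruby on  Rails").
    const tag = piece.trim().replace(/\s+/g, " ");
    const key = tag.toLowerCase(); // assumption: "Ruby" and "ruby" are the same tag
    if (tag !== "" && !seen.has(key)) {
      seen.add(key);
      result.push(tag);
    }
  }
  return result;
}
```

The returned array keeps the first spelling of each tag in the order the user typed them, which is the list you would then iterate over for the database inserts.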

If I understand your problem correctly, one solution could be this:

On application load, build a Set with all the tags.
When a user posts a text, iterate through all the words and check them against the Set.

This would be pretty fast for your purpose, since a lookup in a Set takes constant time on average.

If a word is included in your tag-set, add the word to a new Set. When done iterating through all the words, do the database queries to associate the new tags with the uploaded text.
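The Set approach described above could be sketched as follows. Because some of the tags contain spaces, this version also checks short runs of adjacent words, not just single words; that extension, the sample tags, and the names buildTagIndex and findTags are assumptions added for illustration.

```javascript
// Hypothetical sample tags; the real Set would be built once at application load.
const TAG_LIST = ["ruby", "ruby on rails", "jquery", "mountain bike"];

// Build the lookup Set and remember how many words the longest tag has.
function buildTagIndex(tagList) {
  const tagSet = new Set();
  let maxWords = 1;
  for (const tag of tagList) {
    const normalized = tag.toLowerCase().trim().split(/\s+/).join(" ");
    tagSet.add(normalized);
    maxWords = Math.max(maxWords, normalized.split(" ").length);
  }
  return { tagSet, maxWords };
}

// Check every run of 1..maxWords adjacent words against the Set.
// Note: this tokenizer keeps only letters and digits, so a tag such as "c++"
// would need a different tokenization rule.
function findTags(text, index) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const found = new Set();
  for (let start = 0; start < words.length; start++) {
    for (let len = 1; len <= index.maxWords && start + len <= words.length; len++) {
      const candidate = words.slice(start, start + len).join(" ");
      if (index.tagSet.has(candidate)) found.add(candidate);
    }
  }
  return found;
}
```

The work per submission grows with the number of words in the text times the longest tag length in words, independent of how many tags are in the list.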

Well, if I understand this right:

You could use a regex, but I am not sure how efficient it is with 1,500 possible matches (it would help if a single pattern could cover several tags).

// tags is an alternation in the format tag1|tag2|tag3,
// where each tagN can itself be a regex that matches several tags in your list.
var pattern = new RegExp(tags, "g");

for (var index = 0; index < textAreas.length; index++)
{
    // Read .value (the text the user typed); match() returns an array
    // of the found tags, or null if there were none.
    var found = textAreas[index].value.match(pattern);
}
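One way to build that alternation from a plain list of tags is sketched below. The helper names are hypothetical. It escapes regex metacharacters, puts longer tags first so that "ruby on rails" wins over "ruby", and wraps the pattern in word boundaries so "ruby" does not match inside "rubyists". The \b boundaries assume tags start and end with a letter or digit; a tag like "c++" would need different handling.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive, global pattern from the whole tag list.
function buildTagPattern(tagList) {
  const alternatives = tagList
    .slice()                               // do not reorder the caller's array
    .sort((a, b) => b.length - a.length)   // longest tag first
    .map(escapeRegExp);
  return new RegExp("\\b(?:" + alternatives.join("|") + ")\\b", "gi");
}

// Return the distinct tags found in the text, lowercased.
function matchTags(text, pattern) {
  const matches = text.match(pattern) || [];
  return Array.from(new Set(matches.map((m) => m.toLowerCase())));
}
```

Building the pattern once and reusing it for every submission avoids recompiling a 1,500-way alternation on each request.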

I won't edit my previous answer, since this is a completely different approach from the one proposed there; editing it would mean rewriting it, which is a bad idea considering that answer may still be useful to someone.

One way to do "auto tagging", in the sense that you never ask your users to write a single keyword, is to parse the content while being aware of the context (for instance, if your users write about bikes, you need to make sure bike-related words are not filtered out).

To begin with the content:

Remove Pronouns
Remove Common Names (non-related)
Remove Conjunctions
Remove Prepositions
Remove link addresses (but keep the word that is linked)
Split the remaining words, and weight them based on how often they appear.
Give more weight to words that are linked or that appear on the title tag.

This should be done on the back end, since the odds are you're going to be doing a lot of preparation: removing HTML at specific points, iterating through arrays, weighting the words, and sanitizing them.
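A minimal sketch of the filtering and weighting steps might look like this. It is shown in JavaScript for consistency with the snippet earlier in the thread; the tiny stop-word list, the title boost of 3, and the function names are assumptions, and HTML stripping and link handling are left out.

```javascript
// Tiny illustrative stop-word list (an assumption); a real one would be far
// longer and must not contain words that matter in your domain.
const STOP_WORDS = new Set(["i", "my", "the", "a", "and", "or", "but", "to", "on", "in", "of", "it", "is"]);

// Lowercase the text and split it into runs of letters and digits.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Count each remaining word once per appearance in the body and
// `titleBoost` times per appearance in the title.
function weightWords(title, body, titleBoost = 3) {
  const weights = new Map();
  const add = (word, amount) => {
    if (STOP_WORDS.has(word)) return;
    weights.set(word, (weights.get(word) || 0) + amount);
  };
  tokenize(body).forEach((w) => add(w, 1));
  tokenize(title).forEach((w) => add(w, titleBoost));
  // Heaviest words first; ties are broken alphabetically.
  return Array.from(weights.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
}
```

The top few entries of the returned list are the candidate tags, which could then be checked against the existing list of 1,500 tags before anything is stored.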

