Text mining on a large database (data mining)
I have a large database of resumes (CVs), and a table skills grouping all users' skills.
Inside that table there is a field skill_text that describes the skill in free text.
I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table of standardized skills.
Here are some example skills extracted from the DB:
- Sectoral and competitive analysis
- Business Development (incl. in international settings)
- Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
- Creative work (Photoshop, In-Design, Illustrator)
- checking and reporting back on campaign progress
- organising and attending events and exhibitions
- Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
- Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.
The output should be something like:
- Sectoral and competitive analysis
- Business Development
- Specific structure and road design software -
- Macao
- AutoCAD
- Photoshop
- In-Design
- Illustrator
- organising events
- Development
- Aptana Studio
- PHP
- HTML
- CSS
- JavaScript
- SQL
- AJAX
- Mix marketing
- Viral Marketing
- Social network marketing
- emailing
- SEO
- One to one marketing
As you can see, only the skills remain, no other descriptive text.
I know this is possible using text mining techniques, but how do I do it? The database is really large, which is a good thing because we can compute term frequencies and decide whether something is a real skill or just meaningless text. The big problem is: how do you determine that "blablabla" is a skill?
Edit: please don't tell me to use standard things like a tokenizer or regexes, because users input skills in a very arbitrary way!
thanks
If I were doing this programmatically, I would:
Extract all punctuation-delimited data (or perhaps just brackets and commas) into a new table (with no primary key, just skill), so that Creative work (Photoshop, In-Design, Illustrator)
becomes (a rough splitting sketch in PHP follows the example table):
Skill
-------------
Creative work
Photoshop
In-Design
Illustrator
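A rough sketch of that first pass, assuming the source table is called skills with a skill_text column, the target is newTable(skill), and a PDO/MySQL connection; all names and connection details here are placeholders:
<?php
// First pass sketch: split each skill_text on brackets and commas and store
// every trimmed fragment in newTable(skill).
$db = new PDO('mysql:host=localhost;dbname=cvdb;charset=utf8', 'user', 'pass');
$insert = $db->prepare('INSERT INTO newTable (skill) VALUES (?)');
foreach ($db->query('SELECT skill_text FROM skills') as $row) {
    // Splitting only on ( ) and , avoids breaking hyphenated names like In-Design.
    foreach (preg_split('/[(),]+/', $row['skill_text']) as $part) {
        $part = trim($part);
        if ($part !== '') {
            $insert->execute([$part]);
        }
    }
}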
Then, after you've processed all CVs, query for the most common skills (this is MySQL):
SELECT skill, COUNT(1) cnt FROM newTable GROUP BY skill ORDER BY cnt DESC;
This might look like the following contrived example:
Skill Cnt
---------------------
Photoshop 3293
Illustrator 2134
Creative work 932
In-Design 123
Then you decide, from the top X skills, which ones you want to capture, which must map to other skills (Indesign and In-design should map to the same skill, for example), and which to discard; then script the process using a data map.
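The data map itself can be as simple as an associative array; the entries below are hypothetical examples, with lowercased raw variants on the left and the standardized skill you decided to keep on the right:
<?php
// Hypothetical data map: raw variant (lowercased) => canonical skill.
// An empty string means "discard this fragment".
$skillMap = [
    'photoshop'     => 'Photoshop',
    'indesign'      => 'In-Design',
    'in-design'     => 'In-Design',
    'creative work' => '',   // judged too vague, discard
];
// Returns the canonical skill, or null if the fragment is unknown or discarded.
function canonicalSkill(array $map, string $raw): ?string
{
    $key = strtolower(trim($raw));
    return ($map[$key] ?? '') === '' ? null : $map[$key];
}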
Use the data map to write a new word-frequency table (this time skill_id, skill, frequency), and on that second parsing pass also write to a lookup table (cv_id, skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can then query for the most popular skills, CVs matching certain criteria, and so on.
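A sketch of that second pass, reusing $skillMap and canonicalSkill() from the data map sketch above; the skill and cv_skill tables, their keys, and the connection details are assumptions:
<?php
// Second pass sketch. Assumes skill(skill_id AUTO_INCREMENT, skill UNIQUE, frequency)
// and cv_skill(cv_id, skill_id) with a primary key on (cv_id, skill_id).
$db = new PDO('mysql:host=localhost;dbname=cvdb;charset=utf8', 'user', 'pass');
$upsert = $db->prepare(
    'INSERT INTO skill (skill, frequency) VALUES (?, 1)
     ON DUPLICATE KEY UPDATE frequency = frequency + 1,
                             skill_id = LAST_INSERT_ID(skill_id)'
);
$link = $db->prepare('INSERT IGNORE INTO cv_skill (cv_id, skill_id) VALUES (?, ?)');
foreach ($db->query('SELECT cv_id, skill_text FROM skills') as $row) {
    foreach (preg_split('/[(),]+/', $row['skill_text']) as $part) {
        $canonical = canonicalSkill($skillMap, $part);
        if ($canonical === null) {
            continue;   // unknown or deliberately discarded fragment
        }
        $upsert->execute([$canonical]);
        // LAST_INSERT_ID(skill_id) makes lastInsertId() return the existing
        // row's id on a duplicate, so the link always gets the right skill_id.
        $link->execute([$row['cv_id'], $db->lastInsertId()]);
    }
}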
Many databases will do this for you via their full-text search functionality. I know that PostgreSQL's full-text search would be able to do this easily with the aid of a custom dictionary.
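For illustration, here is a sketch of the frequency step using PostgreSQL's built-in full-text machinery (ts_stat over to_tsvector); a custom dictionary would sit on top of this to fold variants together. The connection details and the skills table name are placeholders:
<?php
// ts_stat() aggregates word counts over the tsvectors built from skill_text;
// ndoc is the number of rows each word appears in.
$pg = new PDO('pgsql:host=localhost;dbname=cvdb', 'user', 'pass');
$sql = "SELECT word, ndoc
          FROM ts_stat('SELECT to_tsvector(''simple'', skill_text) FROM skills')
         ORDER BY ndoc DESC
         LIMIT 50";
foreach ($pg->query($sql) as $row) {
    printf("%-30s %d\n", $row['word'], $row['ndoc']);
}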
Alternatively, you can use PHP's strtok or an equivalent to index your text. Once indexed, you can compare against a dictionary, or simply use the occurrence counts to build a sheet for yourself. Word clouds are made in a similar fashion.
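A minimal sketch of that, assuming the raw skill_text strings have already been fetched into $rows:
<?php
// Count single-word occurrences across all skill texts with strtok.
$counts = [];
foreach ($rows as $text) {
    $word = strtok($text, " \t\n,;:()/-");
    while ($word !== false) {
        $key = strtolower($word);
        $counts[$key] = ($counts[$key] ?? 0) + 1;
        $word = strtok(" \t\n,;:()/-");
    }
}
arsort($counts);                               // most frequent words first, word-cloud style
print_r(array_slice($counts, 0, 30, true));    // top 30 candidate terms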
Doing this well requires knowledge; otherwise, what's to say that "organising events" is a 'skill' while "creative work" isn't? But a naive program can take a first cut at it by analysing the statistics of collocations: see the answers to How to extract common / significant phrases from a series of text entries and Algorithms to detect phrases and keywords from text.
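As a crude version of those collocation statistics, a naive bigram count (again assuming the raw strings are in $rows); word pairs that recur across many CVs, like "viral marketing" or "business development", are candidate multi-word skills:
<?php
// Count adjacent word pairs (bigrams) across all skill texts.
$bigrams = [];
foreach ($rows as $text) {
    $words = preg_split('/[^a-z0-9+#]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        $bigrams[$pair] = ($bigrams[$pair] ?? 0) + 1;
    }
}
arsort($bigrams);
print_r(array_slice($bigrams, 0, 30, true));   // frequent pairs = phrase candidates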