开发者

Text mining on large database (data mining)

I have a large database of resumes (CV), and a certain table skills grouping all users skills.

inside that table there's a field skill_text that describes the skill in full text.

I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standarized skills..

Here are some examples skills extracted from the DB :

  • Sectoral and competitive analysis
  • Business Development (incl. in international settings)
  • Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
  • Creative work (Photoshop, In-Design, Illustrator)
  • checking and reporting back on campaign progress
  • organising and attending events and exhibitions
  • Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
  • Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.

The output shoud be something like :

  • Sectoral and competitive analysis
  • Business Development
  • Specific structure and road design software -
  • Macao
  • AutoCAD
  • Photoshop
  • In-Design
  • Illustrator
  • orga开发者_运维知识库nising events
  • Development
  • Aptana Studio
  • PHP
  • HTML
  • CSS
  • JavaScript
  • SQL
  • AJAX
  • Mix marketing
  • Viral Marketing
  • Social network marketing
  • emailing
  • SEO
  • One to one marketing

As you see only skills remains no other representation text.

I know this is possible using text mining technics but how to do it ? the database is realy large.. it's a good thing because we can calculate text frequency and decide if it's a real skill or just meaningless text... The big problem is .. how to determin that "blablabla" is a skill ?

Edit : please don't tell me to use standard things like a text tokinzer, or regex .. because users input skills in a very arbitrary way !!

thanks


If I was doing this programmatically I would:

Extract all punctuation delimited data (or perhaps just brackets and commas) into a new table (with no primary key, just skill) so Creative work (Photoshop, In-Design, Illustrator) becomes

 Skill            
 -------------
 Creative work    
 Photoshop        
 In-Design        
 Illustrator      

Then, after you've proceed all CVs, query for the most common skills (this is MySQL)

SELECT skill, COUNT(1) cnt FROM newTable GROUP BY skill ORDER BY cnt DESC;

Which may look like this contrived example

 Skill            Cnt
 ---------------------
 Photoshop        3293
 Illustrator      2134
 Creative work     932
 In-Design         123

Then you decide, from the top X skills, which you want to capture, which must map to other skills (Indesign and In-design should map to the same skill, for example) and which to discard, then script the process using a data map.

Use the data map to write a new word frequency table (this time skill_id, skill, frequency) and the second time when parsing the data also write to a lookup table (cv_id,skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can query for the most popular skills, CVs matching certain criteria etc.


Many databases will do this for you via their full-text search functionality. I know that PostgreSQL's full-text search would be able to do this easily with the aid of a custom dictionary.

Alternatively, you can use PHP's strtok or equivalent to index your text. Once indexed you can compare to dictionary, or simply use occurrences to create a sheet for yourself. Word clouds are made in a similar fashion.


Doing this well requires knowledge; otherwise what's to tell "organising events" is a 'skill' while "creative work" isn't? But a stupid program can take a first cut at it by analyzing statistics of collocations: see the answers to How to extract common / significant phrases from a series of text entries and Algorithms to detect phrases and keywords from text.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜