Text mining on a large database (data mining)
I have a large database of resumes (CVs), and a table skills grouping all users' skills.
Inside that table there is a field skill_text that describes the skill in free text.
I'm looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table of standardized skills.
Here are some example skills extracted from the DB:
- Sectoral and competitive analysis
- Business Development (incl. in international settings)
- Specific structure and road design software - Microstation, Macao, AutoCAD (basic knowledge)
- Creative work (Photoshop, In-Design, Illustrator)
- checking and reporting back on campaign progress
- organising and attending events and exhibitions
- Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
- Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.
The output should be something like:
- Sectoral and competitive analysis
- Business Development
- Specific structure and road design software -
- Macao
- AutoCAD
- Photoshop
- In-Design
- Illustrator
- organising events
- Development
- Aptana Studio
- PHP
- HTML
- CSS
- JavaScript
- SQL
- AJAX
- Mix marketing
- Viral Marketing
- Social network marketing
- emailing
- SEO
- One to one marketing
As you can see, only the skills remain, no other descriptive text.
I know this is possible using text mining techniques, but how do I do it? The database is really large, which is a good thing because we can compute term frequencies and decide whether something is a real skill or just meaningless text. The big problem is: how do you determine that "blablabla" is a skill?
Edit: please don't tell me to use standard things like a tokenizer or regexes, because users input skills in a very arbitrary way!
thanks
If I were doing this programmatically, I would:
Extract all punctuation-delimited data (or perhaps just brackets and commas) into a new table (with no primary key, just skill), so that Creative work (Photoshop, In-Design, Illustrator)
becomes (a rough splitting sketch in PHP follows the example table):
Skill
-------------
Creative work
Photoshop
In-Design
Illustrator
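A rough sketch of that first pass, assuming the source table is called skills with a skill_text column, the target is newTable(skill), and a PDO/MySQL connection; all names and connection details here are placeholders:
<?php
// First pass sketch: split each skill_text on brackets and commas and store
// every trimmed fragment in newTable(skill).
$db = new PDO('mysql:host=localhost;dbname=cvdb;charset=utf8', 'user', 'pass');
$insert = $db->prepare('INSERT INTO newTable (skill) VALUES (?)');
foreach ($db->query('SELECT skill_text FROM skills') as $row) {
    // Splitting only on ( ) and , avoids breaking hyphenated names like In-Design.
    foreach (preg_split('/[(),]+/', $row['skill_text']) as $part) {
        $part = trim($part);
        if ($part !== '') {
            $insert->execute([$part]);
        }
    }
}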
Then, after you've processed all CVs, query for the most common skills (this is MySQL):
SELECT skill, COUNT(1) cnt FROM newTable GROUP BY skill ORDER BY cnt DESC;
This might look like the following contrived example:
Skill Cnt
---------------------
Photoshop 3293
Illustrator 2134
Creative work 932
In-Design 123
Then you decide, from the top X skills, which ones you want to capture, which must map to other skills (Indesign and In-design should map to the same skill, for example), and which to discard; then script the process using a data map.
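The data map itself can be as simple as an associative array; the entries below are hypothetical examples, with lowercased raw variants on the left and the standardized skill you decided to keep on the right:
<?php
// Hypothetical data map: raw variant (lowercased) => canonical skill.
// An empty string means "discard this fragment".
$skillMap = [
    'photoshop'     => 'Photoshop',
    'indesign'      => 'In-Design',
    'in-design'     => 'In-Design',
    'creative work' => '',   // judged too vague, discard
];
// Returns the canonical skill, or null if the fragment is unknown or discarded.
function canonicalSkill(array $map, string $raw): ?string
{
    $key = strtolower(trim($raw));
    return ($map[$key] ?? '') === '' ? null : $map[$key];
}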
Use the data map to write a new word-frequency table (this time skill_id, skill, frequency), and on that second parsing pass also write to a lookup table (cv_id, skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can then query for the most popular skills, CVs matching certain criteria, and so on.
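A sketch of that second pass, reusing $skillMap and canonicalSkill() from the data map sketch above; the skill and cv_skill tables, their keys, and the connection details are assumptions:
<?php
// Second pass sketch. Assumes skill(skill_id AUTO_INCREMENT, skill UNIQUE, frequency)
// and cv_skill(cv_id, skill_id) with a primary key on (cv_id, skill_id).
$db = new PDO('mysql:host=localhost;dbname=cvdb;charset=utf8', 'user', 'pass');
$upsert = $db->prepare(
    'INSERT INTO skill (skill, frequency) VALUES (?, 1)
     ON DUPLICATE KEY UPDATE frequency = frequency + 1,
                             skill_id = LAST_INSERT_ID(skill_id)'
);
$link = $db->prepare('INSERT IGNORE INTO cv_skill (cv_id, skill_id) VALUES (?, ?)');
foreach ($db->query('SELECT cv_id, skill_text FROM skills') as $row) {
    foreach (preg_split('/[(),]+/', $row['skill_text']) as $part) {
        $canonical = canonicalSkill($skillMap, $part);
        if ($canonical === null) {
            continue;   // unknown or deliberately discarded fragment
        }
        $upsert->execute([$canonical]);
        // LAST_INSERT_ID(skill_id) makes lastInsertId() return the existing
        // row's id on a duplicate, so the link always gets the right skill_id.
        $link->execute([$row['cv_id'], $db->lastInsertId()]);
    }
}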
Many databases will do this for you via their full-text search functionality. I know that PostgreSQL's full-text search would be able to do this easily with the aid of a custom dictionary.
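For illustration, here is a sketch of the frequency step using PostgreSQL's built-in full-text machinery (ts_stat over to_tsvector); a custom dictionary would sit on top of this to fold variants together. The connection details and the skills table name are placeholders:
<?php
// ts_stat() aggregates word counts over the tsvectors built from skill_text;
// ndoc is the number of rows each word appears in.
$pg = new PDO('pgsql:host=localhost;dbname=cvdb', 'user', 'pass');
$sql = "SELECT word, ndoc
          FROM ts_stat('SELECT to_tsvector(''simple'', skill_text) FROM skills')
         ORDER BY ndoc DESC
         LIMIT 50";
foreach ($pg->query($sql) as $row) {
    printf("%-30s %d\n", $row['word'], $row['ndoc']);
}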
Alternatively, you can use PHP's strtok or an equivalent to index your text. Once indexed, you can compare against a dictionary, or simply use the occurrence counts to build a sheet for yourself. Word clouds are made in a similar fashion.
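A minimal sketch of that, assuming the raw skill_text strings have already been fetched into $rows:
<?php
// Count single-word occurrences across all skill texts with strtok.
$counts = [];
foreach ($rows as $text) {
    $word = strtok($text, " \t\n,;:()/-");
    while ($word !== false) {
        $key = strtolower($word);
        $counts[$key] = ($counts[$key] ?? 0) + 1;
        $word = strtok(" \t\n,;:()/-");
    }
}
arsort($counts);                               // most frequent words first, word-cloud style
print_r(array_slice($counts, 0, 30, true));    // top 30 candidate terms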
Doing this well requires knowledge; otherwise, what's to say that "organising events" is a 'skill' while "creative work" isn't? But a naive program can take a first cut at it by analysing the statistics of collocations: see the answers to How to extract common / significant phrases from a series of text entries and Algorithms to detect phrases and keywords from text.
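As a crude version of those collocation statistics, a naive bigram count (again assuming the raw strings are in $rows); word pairs that recur across many CVs, like "viral marketing" or "business development", are candidate multi-word skills:
<?php
// Count adjacent word pairs (bigrams) across all skill texts.
$bigrams = [];
foreach ($rows as $text) {
    $words = preg_split('/[^a-z0-9+#]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        $bigrams[$pair] = ($bigrams[$pair] ?? 0) + 1;
    }
}
arsort($bigrams);
print_r(array_slice($bigrams, 0, 30, true));   // frequent pairs = phrase candidates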