Parsing a CV file [closed]
Closed 3 years ago.
I want to write code in either Java or PHP (CodeIgniter) to extract information such as the email address and phone number of a user who uploads their resume or CV to the site. Basically, I want to build a CV parser.
I need help with this.
Thanks
EDIT: The CV will be in .doc format.
Since there is no standard CV format, parsing will be next to impossible.
Instead, consider collecting contact information in an HTML form when they upload.
I'd suggest building it with a set of regular expressions. If you just want to extract the phone number and email address, the parser is very simple. It will work almost 100% of the time for emails and (I believe) about 98% for phone numbers.
If you wish to extract other information, it gets more complicated because there is no standard format for CVs; the same information may be laid out in many different ways. Anyway, good luck!
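A minimal sketch of that regex approach, written in Python for brevity (the patterns should port to Java's `java.util.regex` or PHP's `preg_match_all` with little change). The patterns, the function names, and the 9 to 15 digit window are my own assumptions rather than any standard, and would need tuning against real CVs:

```python
import re

# Assumed pattern: local part, "@", one or more domain labels, alphabetic TLD.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

# Assumed pattern: a run of digits, spaces, dots, dashes and parentheses,
# optionally starting with "+", that begins and ends on a digit.
PHONE_CANDIDATE_RE = re.compile(r"(?<![\w.])\+?\(?\d[\d ().-]{6,}\d(?!\w)")


def extract_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)


def extract_phones(text, min_digits=9, max_digits=15):
    """Return phone-like candidates reduced to digits (keeping a leading '+').

    The digit window is a heuristic: it rejects year ranges such as
    "2015-2019" (8 digits), but it also drops short local numbers.
    """
    phones = []
    for match in PHONE_CANDIDATE_RE.finditer(text):
        raw = match.group()
        digits = re.sub(r"\D", "", raw)
        if min_digits <= len(digits) <= max_digits:
            phones.append(("+" if raw.startswith("+") else "") + digits)
    return phones
```

For a CV containing "Email: jane.doe@example.com", "Phone: +1 (555) 123-4567" and "Worked 2015-2019 at Acme.", this picks up the email and the phone number and skips the year range.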
You should use Python and write your own scraper. It's easy, and in your case it can be done really quickly with modules like Beautiful Soup, urllib2 ...
What is this all about
Beautiful Soup documentation
Ditto AlexR. If ALL you want to find is email address and phone number, you could scan for strings of characters in the appropriate format. A couple of simple regular expressions could do that fairly reliably. Even that wouldn't be 100%. If someone included, "Learned Java@Technocorp. US citizen." etc, you might easily be fooled into thinking that's an email address of "java@technocorp.us". Okay, that's a strained example, but it's the sort of thing that shoots down natural language parsing.
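To make that false positive concrete, here is a small Python sketch. Both patterns are illustrative assumptions of mine: the strict one requires the dot and the next domain label to be adjacent, while the loose one tolerates whitespace after the dot, which is the kind of "helpful" leniency that gets fooled:

```python
import re

# Strict: every dot in the domain must be followed directly by a label.
STRICT_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Loose: allows optional whitespace between the dot and the final label.
LOOSE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+\.\s*[A-Za-z]{2,}")

SAMPLE = "Learned Java@Technocorp. US citizen."

strict_hits = STRICT_EMAIL.findall(SAMPLE)  # no match: a space follows the dot
loose_hits = LOOSE_EMAIL.findall(SAMPLE)    # matches "Java@Technocorp. US"
```

So how strict the pattern is decides whether this sentence slips through, and any cleanup step that collapses whitespace before matching would make even the strict pattern vulnerable.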
If you want more than that, there is no easy answer. You could search for keywords: to find where the candidate went to school, for example, you could look for the words "college" or "university". But even then, someone might put "Graduate of Foobar College" or "College: Foobar" or "BA from Foobar" or any of many other possible formats.
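A quick Python sketch of that keyword search shows where it breaks down. The keyword list and the function name are assumptions for illustration only:

```python
import re

# Assumed keyword list; a real parser would need many more terms.
EDUCATION_KEYWORDS = re.compile(r"\b(?:college|university)\b", re.IGNORECASE)


def find_education_lines(text):
    """Return the lines that mention one of the education keywords."""
    return [
        line.strip()
        for line in text.splitlines()
        if EDUCATION_KEYWORDS.search(line)
    ]


samples = "Graduate of Foobar College\nCollege: Foobar\nBA from Foobar\n"
hits = find_education_lines(samples)
```

The first two lines are found, but "BA from Foobar" is silently missed because it contains neither keyword, and even the lines that are found still need format-specific logic to pull out the school name.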
As @Corbin said, there is no standard CV format. It will be quite difficult to parse with 100% accuracy.
That said, you can try Apache Tika (a content analysis toolkit) to parse resumes in doc/docx format. Tika also supports many other document formats, including PDF, TXT, XML, and ODF.
By the way, extracting the email address and phone number from a resume can be done in a few lines of regex code once you have pulled the full text out of the CV with Apache Tika.
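A rough sketch of that two-step pipeline in Python. It assumes the third-party `tika` package (tika-python) is installed and a Java runtime is available, since that package talks to a Tika server behind the scenes; in Java you would call Tika directly. The regex patterns are deliberately minimal assumptions, and `parse_contacts` and `extract_text_with_tika` are hypothetical helper names:

```python
import re

# Deliberately minimal patterns (assumptions, not a standard).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def extract_text_with_tika(path):
    """Step 1: pull plain text out of a .doc/.docx/.pdf file via Tika."""
    # Imported lazily so the regex step can be used without Tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    # "content" may be None when no text could be extracted.
    return parsed.get("content") or ""


def parse_contacts(path, extract_text=extract_text_with_tika):
    """Step 2: run the regexes over the extracted text and de-duplicate."""
    text = extract_text(path)
    emails = list(dict.fromkeys(EMAIL_RE.findall(text)))
    phones = list(dict.fromkeys(m.strip() for m in PHONE_RE.findall(text)))
    return {"emails": emails, "phones": phones}
```

Calling `parse_contacts("resume.doc")` (a placeholder filename) would return a dict with `emails` and `phones` lists. The `extract_text` parameter lets you swap in another extractor, or a stub when testing the regex step on its own.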
Let me know if you get stuck.
Hope this helps!
Note: I am working on a resume summarizer.