Is it possible to create a search engine that indexes new jobs from many company websites?
I don't think it's possible without some XML feed or API provided by the employer websites.
Basically, can I extract and identify information from an HTML page?
You can in theory, but scraping employer websites for job adverts is a futile, futile endeavour requiring awfully complex programming, pattern recognition, manual post-processing for the (many) times the system will get it wrong, and constant updating.
Also, there are legal issues. While the process of scraping is often allowed, most web sites forbid automatic processing of their data, so you may be in for a lot of trouble when you re-publish any job offers fetched this way.
You need to go for XML or other kinds of structured, standardized, legal data.
If you can't get that, I'd say forget it and do something more joyful with your time.
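To give an idea of why a structured feed is so much easier to work with, here is a minimal sketch using only Python's standard library. The feed layout (`<jobs>`, `<job>`, `<title>`, `<location>`, `<url>`) and the function name `parse_job_feed` are assumptions made up for illustration; a real employer feed would have its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed layout, invented for this example.
SAMPLE_FEED = """<jobs>
  <job>
    <title>Backend Engineer</title>
    <location>Berlin</location>
    <url>https://example.com/careers/123</url>
  </job>
  <job>
    <title>Data Analyst</title>
    <location>Remote</location>
    <url>https://example.com/careers/124</url>
  </job>
</jobs>"""


def parse_job_feed(xml_text):
    """Return one dict per <job> element in the (assumed) feed layout."""
    root = ET.fromstring(xml_text)
    jobs = []
    for job in root.iter("job"):
        jobs.append({
            # findtext returns None when the child is missing, hence the "or".
            "title": (job.findtext("title") or "").strip(),
            "location": (job.findtext("location") or "").strip(),
            "url": (job.findtext("url") or "").strip(),
        })
    return jobs


if __name__ == "__main__":
    for job in parse_job_feed(SAMPLE_FEED):
        print(job["title"], "|", job["location"], "|", job["url"])
```

Every field has a name, so there is no guessing about which piece of text is the title and which is the location.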
Some people would attempt screen-scraping: literally fetching the text and trying to parse out the information, based on knowledge of the (X)HTML structure. This is highly frowned upon, as the assumption is that if the owner of the target site wanted to share data, the data would be made available as a feed or web service.
Maybe ask them?
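For what it's worth, the screen-scraping approach described above looks roughly like this. It is a minimal sketch with Python's standard `html.parser`, and it assumes a made-up page layout where each job is an `<a>` inside `<li class="job">`; the class name, the sample markup, and the names `JobLinkParser` and `extract_jobs` are all hypothetical. Each real site would need its own rules, and those rules break whenever the site changes its markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real career pages all differ.
SAMPLE_HTML = """
<ul>
  <li class="job"><a href="/careers/123">Backend Engineer</a></li>
  <li class="news"><a href="/news/1">We moved offices</a></li>
  <li class="job"><a href="/careers/124">Data Analyst</a></li>
</ul>
"""


class JobLinkParser(HTMLParser):
    """Collects (title, href) pairs from <a> tags inside <li class="job">."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._in_job = False   # inside an <li class="job">
        self._in_link = False  # inside an <a> within such an <li>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "job" in (attrs.get("class") or "").split():
            self._in_job = True
        elif tag == "a" and self._in_job:
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.jobs.append(("".join(self._text).strip(), self._href))
            self._in_link = False
        elif tag == "li":
            self._in_job = False


def extract_jobs(html):
    """Return a list of (title, href) tuples found in the given HTML."""
    parser = JobLinkParser()
    parser.feed(html)
    parser.close()
    return parser.jobs


if __name__ == "__main__":
    for title, href in extract_jobs(SAMPLE_HTML):
        print(title, "->", href)
```

Note that the parser only works because it hard-codes knowledge of one page's structure, which is exactly the fragility the answers here are warning about.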
It might be possible, but I suspect it isn't legal, or at least it's very shady. I would go for a better solution, such as asking the companies for an XML feed or something similar.