Index xml files from a outside website
Using python django I would like to access this site http://www.reta-vortaro.de/revo/ It is a dictionary site for a language called esperanto, I need to be able to search for a word, and get back its definition, it looks like Each Esperanto root word has an xml开发者_JAVA技巧 file,
- I need to index each xml file
- store the name of each xml file in a database.
- On my website I need to $_GET the word.
- I need to search for combinations of these root words with a xml file named after it.
Most programming languages have access to both some sort of XML parser as well as some persistent embedded key-value store. Once you've decided on a programming language, just find one of each that you can feel comfortable with.
Wonder, if you have access to WSDL. You might be able to access the data that way. What exactly is the problem you are encountering?
As soon as you need indexing and fast search, it might worth looking for XML database for storing your dictionary (especially for complex queries and big dictionaries). You can easily access most XML databases from PHP.
I would consider such workflow for you:
- Download all the files
- Load their contents and filenames into database (any database will fit)
- Setup sphinx search tool ( http://sphinx.pocoo.org/ )
- Run sphinx to build an index for xml_contents
- Design your application to use sphinx for search in index
- Delete all file contains, leaving only filename and sphinx index in a database
- When using sphinx search, you'll get a filename, operate with it as you supposed before
I'm not very familiar with sphinx and don't know is it capable to use files, to build it's index, that why I offer you to load all the information into database
Have you tried asking the site's administrator for the data? Or perhaps he could set up a web service for you?
Well, you can fetch each XML file with file_get_contents()
, curl, wget or the tool that pleases you the most.
Then, You can save the XML files on the file system, or even better use Oracle's Berkeley DBXML, with it you can actually save the XML in the Database and query it, kinda like if it was SQL. It has PHP bindings and lets you query with XQuery. I used it to replace a XML web service and works like a charm, blazing fast.
For PHP XML parsing I used to use Keith Devens' XML to Array parser, which is EASY, but it's now oldish Now I use CakePHP's own, you may want to use PHP's SimpleXML. There are also JavaScript based parsers you could use on the client side of the app like jParse (jQuery).
This is the Page for PHP + dbXML but seems to be down: http://phpdbxml.4641.org/ but you can download it from here: http://www.oracle.com/technology/software/products/berkeley-db/index.html (there's the license too).
I hope it helps.
精彩评论