Processing xml files with Hadoop

I'm new to Hadoop. I know very little about it. My case is as follows: I have a set of xml files (700GB+) with the same schema.

    <article>
     <title>some title</title>
     <abstract>some abstract</abstract>
     <year>2000</year>
     <id>E123456</id>
     <authors>
  <author id="1">
       <firstName>some name1</firstName>
       <lastName>some name1</lastName>
       <email>email1@domain.com</email>
       <affiliations affid="123">
        <org>some organization1</org> 
        <org>some organization2</org>
       </affiliations>
      </author>
      <author id="2">
       <firstName>some name2</firstName>
       <lastName>some name2</lastName>
       <email>email2@domain.com</email>
       <affiliations affid="123">
        <org>some organization1</org> 
        <org>some organization2</org>
       </affiliations>
      </author>
      <tags>
       <tag>medicin</tag>
       <tag>inheritance</tag>
      </tags>
     </authors>
     <references>
      <reference>some reference text1</reference>
      <reference>some reference text2</reference>
     </references>
    </article>

I convert the data within the xml files into a relational database containing the following tables:

  • Articles
  • Authors
  • Tags
  • References

I have a set of tools that work on these tables to generate statistical reports and do some other stuff. Because one of the tools performs a full-text search on the References table, I also store that table in a Lucene/Solr index.

My question is: can I use Hadoop for:

  1. Storing the data that is in the xml files
  2. Making some kind of separation between the entities listed above (Articles, Authors, Tags, and References)
  3. Running my tools, which perform a very complex set of queries on the data, and if that can be done using Hadoop, will it perform well?

If Hadoop is not a good candidate for this case, would another NoSQL database such as MongoDB or Cassandra be a better solution? (My big problem with the relational system is the very poor performance of the complex algorithms I'm using to do my job.)


What you are asking for sounds very similar to what Google, Yahoo, Bing etc. do with the web: suck in documents as some form of markup, store them, process them to extract the relevant information, and provide a query interface on top of that. I'd suggest looking into how these search engines leverage MapReduce and BigTable implementations (like HBase and Cassandra) to do exactly that.
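To make the "process them to extract the relevant information" step concrete, here is a minimal sketch of a mapper in the Hadoop Streaming style, written in Python. It assumes the input format delivers one whole `<article>` record per line (in practice you would need an XML-aware input format or a pre-splitting step for 700GB of files); the element names follow the schema in the question, and the function/table names are illustrative, not a fixed API.

```python
import sys
import xml.etree.ElementTree as ET

def map_article(xml_record):
    """Split one <article> record into (table, row) pairs, the way a
    Hadoop Streaming mapper would emit tab-separated key/value lines.
    Tables mirror the relational schema: Articles, Authors, Tags, References."""
    root = ET.fromstring(xml_record)
    article_id = root.findtext("id")
    rows = [("Articles", (article_id,
                          root.findtext("title"),
                          root.findtext("abstract"),
                          root.findtext("year")))]
    for author in root.iter("author"):
        rows.append(("Authors", (article_id,
                                 author.get("id"),
                                 author.findtext("firstName"),
                                 author.findtext("lastName"),
                                 author.findtext("email"))))
    for tag in root.iter("tag"):
        rows.append(("Tags", (article_id, tag.text)))
    for ref in root.iter("reference"):
        rows.append(("References", (article_id, ref.text)))
    return rows

def main():
    # Hadoop Streaming hands the mapper records on stdin, one per line
    # (assuming an input format that yields a complete <article> per line).
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        for table, row in map_article(line):
            print(table + "\t" + "\t".join(row))

if __name__ == "__main__":
    main()
```

A reducer (or a second job) could then group the emitted rows by table name and bulk-load each group into HBase, Cassandra, or the Solr index, keeping the entity separation you already have in the relational model.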
