Processing xml files with Hadoop

I'm new to Hadoop. I know very little about it. My case is as follows: I have a set of xml files (700GB+) with the same schema.

    <article>
     <title>some title</title>
     <abstract>some abstract</abstract>
     <year>2000</year>
     <id>E123456</id>
     <authors>
  <author id="1">
       <firstName>some name1</firstName>
       <lastName>some name1</lastName>
       <email>email1@domain.com</email>
       <affiliations affid="123">
        <org>some organization1</org> 
        <org>some organization2</org>
       </affiliations>
      </author>
      <author id="2">
       <firstName>some name2</firstName>
       <lastName>some name2</lastName>
       <email>email2@domain.com</email>
       <affiliations affid="123">
        <org>some organization1</org> 
        <org>some organization2</org>
       </affiliations>
      </author>
      <tags>
       <tag>medicin</tag>
       <tag>inheritance</tag>
      </tags>
     </authors>
     <references>
      <reference>some reference text1</reference>
      <reference>some reference text2</reference>
     </references>
    </article>

I convert the data within the xml files into a relational database containing the following tables:

  • Articles
  • Authors
  • Tags
  • References

I have a set of tools that work on these tables to generate statistical reports and do some other stuff. Because one of the tools performs a full-text search on the References table, I also store that table in a Lucene/Solr index.

My question is: can I use Hadoop for:

  1. Storing the data that is in the xml files
  2. Making some kind of separation between the entities listed above (Articles, Authors, Tags, and References)
  3. Running my tools, which perform a very complex set of queries on the data, and if that can be done using Hadoop, will it perform well?

If Hadoop is not a good candidate for this case, would another NoSQL database such as MongoDB or Cassandra be a better solution? (My big problem with the relational system is the very poor performance of the complex algorithms I'm using to do my job.)


What you are asking for sounds very similar to what Google, Yahoo, Bing etc. do with the web: suck in documents as some form of markup, store them, process them to extract the relevant information, and provide a query interface on top of that. I'd suggest looking into how these search engines leverage MapReduce and BigTable implementations (like HBase and Cassandra) to do exactly that.
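To make the "process them to extract the relevant information" step concrete, here is a minimal sketch of a mapper in the Hadoop Streaming style, written in Python. It assumes the input format delivers one whole `<article>` record per line (in practice you would need an XML-aware input format or a pre-splitting step for 700GB of files); the element names follow the schema in the question, and the function/table names are illustrative, not a fixed API.

```python
import sys
import xml.etree.ElementTree as ET

def map_article(xml_record):
    """Split one <article> record into (table, row) pairs, the way a
    Hadoop Streaming mapper would emit tab-separated key/value lines.
    Tables mirror the relational schema: Articles, Authors, Tags, References."""
    root = ET.fromstring(xml_record)
    article_id = root.findtext("id")
    rows = [("Articles", (article_id,
                          root.findtext("title"),
                          root.findtext("abstract"),
                          root.findtext("year")))]
    for author in root.iter("author"):
        rows.append(("Authors", (article_id,
                                 author.get("id"),
                                 author.findtext("firstName"),
                                 author.findtext("lastName"),
                                 author.findtext("email"))))
    for tag in root.iter("tag"):
        rows.append(("Tags", (article_id, tag.text)))
    for ref in root.iter("reference"):
        rows.append(("References", (article_id, ref.text)))
    return rows

def main():
    # Hadoop Streaming hands the mapper records on stdin, one per line
    # (assuming an input format that yields a complete <article> per line).
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        for table, row in map_article(line):
            print(table + "\t" + "\t".join(row))

if __name__ == "__main__":
    main()
```

A reducer (or a second job) could then group the emitted rows by table name and bulk-load each group into HBase, Cassandra, or the Solr index, keeping the entity separation you already have in the relational model.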
