Converting a legacy EAV schema to Mongo or Couch

Let's say I have a legacy application whose previous developers decided, for various reasons, that it needed an arbitrarily flexible schema, so they reinvented the Entity-Attribute-Value model yet again. What they were really building was a document repository, something tools like Mongo or Couch would be a much better fit for today, but those tools either weren't available or weren't known to the earlier teams.

To stay competitive, let's say we need to build more powerful methods for querying and analyzing information in our system. Based on the sheer number and variety of attributes, it seems like map/reduce is a better fit for our set of problems than gradually refactoring the system into a more relational schema.

The original source database has millions of documents, but only a small number of distinct document types. There are some commonalities across the distinct document types.

What's an effective strategy for doing a migration from a massive EAV implementation in, say, MySQL, to a document-oriented store like Mongo or Couch?

I can certainly imagine a naive approach to this, sketched below, but I'd really like to see a tutorial or war story from someone who has already tackled this type of problem.
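Roughly: stream the EAV rows ordered by entity, fold each entity's attributes into one JSON document, and bulk-load the documents in batches. This is only a minimal sketch, targeting CouchDB's `_bulk_docs` endpoint; the table names (`entities`, `attributes`, `entity_attribute_values`), the batch size, and the `_id` scheme are all placeholders, and a Mongo version would be analogous.

```ruby
require 'mysql2'
require 'json'
require 'net/http'
require 'uri'

mysql = Mysql2::Client.new(host: 'localhost', username: 'app',
                           password: 'secret', database: 'legacy_eav')
bulk_docs = URI('http://localhost:5984/documents/_bulk_docs')

batch = []
doc = nil
last_entity_id = nil

# Send the accumulated documents to CouchDB in one bulk request.
flush = lambda do
  next if batch.empty?
  Net::HTTP.post(bulk_docs, { docs: batch }.to_json,
                 'Content-Type' => 'application/json')
  batch.clear
end

# Stream the EAV triples ordered by entity so each document can be
# assembled in a single pass, without one query per entity.
rows = mysql.query(<<~SQL, stream: true, cache_rows: false)
  SELECT e.id, e.doc_type, a.name, v.value
  FROM entities e
  JOIN entity_attribute_values v ON v.entity_id = e.id
  JOIN attributes a              ON a.id = v.attribute_id
  ORDER BY e.id
SQL

rows.each do |row|
  if row['id'] != last_entity_id
    batch << doc if doc
    flush.call if batch.size >= 1_000
    doc = { '_id' => "entity-#{row['id']}", 'type' => row['doc_type'] }
    last_entity_id = row['id']
  end
  doc[row['name']] = row['value']
end
batch << doc if doc
flush.call
```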

What were some strategies for doing this kind of conversion that worked well? What lessons did you learn? What pitfalls should I avoid? How did you deal with legacy apps that still expect to be able to interact with the existing database?


My first usage of Couch was after I had written a Ruby and Postgres web crawler (directed crawl of mp3 blogs to build a recommendation engine).

The relational schema got deeply gnarly as I tried to record ID3 metadata, audio signatures, etc., and then to detect overlaps and otherwise deduplicate. It worked, but it was slow. So slow that I started caching my JSON API rows onto the corresponding primary ActiveRecord objects as blob fields.
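That hack looked something like this; a from-memory sketch rather than the original code, with the model, column, and helper names made up for illustration:

```ruby
# app/models/track.rb (inside the Rails app)
# Assumes a `cached_payload` TEXT column added in a migration.
class Track < ActiveRecord::Base
  # Build the API payload once, then stash its JSON on the record so
  # subsequent reads skip the join-heavy relational work.
  def api_payload
    if cached_payload
      JSON.parse(cached_payload)
    else
      data = build_payload_from_relations   # the slow, join-heavy path
      update_column(:cached_payload, JSON.generate(data))
      data
    end
  end
end
```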

I had a choice: dig in and learn Postgres performance tuning, or move to a horizontal approach. So I used Nutch and Hadoop to spider the web, with the PipeMapper to parse pages with Ruby and Hpricot. That let me reuse all my parser code and just change it from saving into a normalized database to saving JSON. I wrote a little library, called CouchRest, to handle the JSON and the REST URL endpoints, and used it to save the Hpricot results into CouchDB.
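The conversion boiled down to something like the following. This is a minimal sketch, not the real crawler code: it talks straight to CouchDB's HTTP API rather than going through CouchRest, and the database name and extracted fields are made up. CouchRest is essentially a thin wrapper around this kind of call.

```ruby
require 'hpricot'
require 'json'
require 'net/http'
require 'time'
require 'uri'

COUCH_DB = URI('http://localhost:5984/pages')

# Parse a fetched page with Hpricot and save the result as one JSON
# document instead of a set of normalized rows.
def save_page(url, html)
  page  = Hpricot(html)
  title = page.at('title')

  doc = {
    'url'        => url,
    'title'      => title && title.inner_text,
    'mp3_links'  => (page / 'a').map { |a| a['href'] }.compact
                                .map(&:to_s)
                                .grep(/\.mp3\z/i),
    'fetched_at' => Time.now.utc.iso8601
  }

  # POST /db creates the document and lets CouchDB assign the revision.
  Net::HTTP.post(COUCH_DB, doc.to_json, 'Content-Type' => 'application/json')
end
```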

For that project I just ran Couch on a single EC2 node, with a small six-node Hadoop cluster populating it. It was only when I got around to building the browsing interface for the spidered data that I really got a good feel for the query capabilities.
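For flavor, the kind of map/reduce view that drove that interface looks roughly like this. It's purely illustrative: the database name, the design-document name, and the assumption that every document carries a type field are mine, not the original project's.

```ruby
require 'json'
require 'net/http'
require 'uri'

# A design document whose view counts documents per "type" field,
# which is handy for browsing a heterogeneous document set.
design_doc = {
  '_id'   => '_design/stats',
  'views' => {
    'by_type' => {
      'map'    => 'function(doc) { if (doc.type) { emit(doc.type, 1); } }',
      'reduce' => '_count'
    }
  }
}

db = URI('http://localhost:5984/pages')
Net::HTTP.post(db, design_doc.to_json, 'Content-Type' => 'application/json')

# Then GET /pages/_design/stats/_view/by_type?group=true returns one
# row per document type with its count.
```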

It turned out to be flexible and especially well suited to OLTP applications. I quickly started using it in all my projects, and eventually founded a company around the technology with two of its creators.
