Wikipedia Graph Database Insertion

I am trying to build a database from DBpedia RDF triples. I have a table, Categories, which contains all the categories in Wikipedia. To store categorizations, I created a table, CatToCat, with child and parent fields, both foreign keys to the Categories table. To load categories from N-Triples, I am using the following SQL query:

INSERT INTO CatToCat (`child`, `parent`)
values((SELECT id FROM Categories WHERE BINARY identifier='Bar'),
       (SELECT id FROM Categories WHERE BINARY identifier='Bar'));

But the insertion is very slow; inserting 2.5 million relationships would take a very long time. Is there a better way to optimize the query or the schema?


You could try a graph database such as Neo4j with an RDF layer on top. For instance, there is the TinkerPop SAIL implementation; see https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation

That should work a bit better than an RDBMS, at least with Neo4j.

/peter


  1. Consider loading SELECT id, identifier FROM Categories into a hash table (or trie) on the client side, and using that to fill CatToCat. On a database the size of Wikipedia, I'd expect a huge performance difference between hash lookups (constant time) or trie lookups (constant with respect to the number of distinct data items) on the one hand, and O(log n) B-tree lookups on the other. (Of course, you need to have the memory available.)

  2. Consider using a single PreparedStatement with parameter binding, so that MySQL doesn't have to re-parse and re-optimize the query for every insertion.

You'll have to benchmark these to figure out how much of an improvement they actually are.
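
Both suggestions can be sketched together as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, the helper names load_category_ids and insert_relationships are made up for this example, and the sample identifiers are invented. With a MySQL driver the placeholder style would typically be %s rather than ?.

```python
import sqlite3


def load_category_ids(conn):
    """Read the whole identifier -> id mapping into a dict with one query."""
    rows = conn.execute("SELECT id, identifier FROM Categories")
    return {identifier: cat_id for cat_id, identifier in rows}


def insert_relationships(conn, pairs, category_ids):
    """Resolve (child, parent) identifiers locally, then insert in one batch.

    Pairs that mention an unknown identifier are skipped.
    """
    rows = [
        (category_ids[child], category_ids[parent])
        for child, parent in pairs
        if child in category_ids and parent in category_ids
    ]
    # One parameterized statement, reused for every row.
    conn.executemany("INSERT INTO CatToCat (child, parent) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Tiny demo schema and data (assumed, not the real DBpedia dump).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Categories (id INTEGER PRIMARY KEY, identifier TEXT UNIQUE);
    CREATE TABLE CatToCat (
        child  INTEGER REFERENCES Categories(id),
        parent INTEGER REFERENCES Categories(id)
    );
    """
)
conn.executemany(
    "INSERT INTO Categories (identifier) VALUES (?)",
    [("Foo",), ("Bar",), ("Baz",)],
)

category_ids = load_category_ids(conn)
inserted = insert_relationships(
    conn,
    [("Bar", "Foo"), ("Baz", "Foo"), ("Missing", "Foo")],
    category_ids,
)
print(inserted, "relationships inserted")
```

The point of the design is that the two subselects per row disappear: the lookups happen in a dict on the client, and the server only sees a single parameterized INSERT executed many times inside one transaction.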


I solved the problem. It was an indexing issue: I made identifier in Categories unique and binary. I guess that sped up the two SELECTs.
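
A rough way to see the effect of such an index is to ask the database for its query plan. The sketch below again assumes Python's sqlite3 as a stand-in for MySQL, and the index name idx_categories_identifier is made up. SQLite compares text bytewise by default, which plays the role of the binary column here; the exact MySQL DDL is not shown because the original schema is not given.

```python
import sqlite3

index_conn = sqlite3.connect(":memory:")
index_conn.execute(
    "CREATE TABLE Categories (id INTEGER PRIMARY KEY, identifier TEXT)"
)
# Unique index on the lookup column, so each subselect becomes an index search
# instead of a full table scan.
index_conn.execute(
    "CREATE UNIQUE INDEX idx_categories_identifier ON Categories (identifier)"
)

plan_rows = index_conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM Categories WHERE identifier = ?",
    ("Bar",),
).fetchall()
# The last column of each plan row is the human-readable detail string.
plan_text = " ".join(row[-1] for row in plan_rows)
print(plan_text)
```

If the plan mentions the index, the lookup is no longer scanning the whole table. In MySQL the analogous check would be EXPLAIN on the SELECT.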
