MongoDB storage along with MySQL XPath features
I have: a large dataset of complex documents with a tree-like data structure (each document has its own data tree, which may differ from document to document). The backend is implemented using Django 1.3 and MySQL.
I need:
- store these documents in scalable and fast storage
- filter documents by some predefined queries
- search against the data within a limited subset of each document's fields
- extra feature: search documents using any possible query and extract any possible information from the data tree. This extra feature is a customer requirement, and it should be comfortable for non-technical users. At the moment we have agreed that XPath is enough.
Note: It is not required to change documents frequently; 90% of the workload will be read operations.
Note: I rarely need all possible fields from the data tree. The data needed about 90% of the time is about 10% of the whole data tree. The only case where all data is needed is the extra feature described above, and in practice it is not a popular feature of the system.
Note: The data tree that comes with each document is a representation of a custom format that cannot be changed. I can only pick the needed pieces of data from the tree and convert them to a readable form (as well as write them back using this custom format).
Currently, I use:
- MySQL to store each document's data tree as XML
- some pre-picked data from the XML as extra columns in the same table, to speed up search
- all other actually needed fields are extracted from the XML on the fly during each query using MySQL's extractvalue() function (more than 10 extractvalue() calls per SQL query; see the sketch after this list)
- all searches and filterings are performed using XPath queries and extractvalue() against the stored XML
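For illustration, here is a minimal sketch of what one such query looks like from Python. The table name, column names and XPath expressions are hypothetical, not from the real system:

# Hypothetical illustration of the current approach; all names are made up.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="app")
cursor = conn.cursor()

# Each ExtractValue() call re-parses the stored XML for every row, so 10+
# calls per query over a 100k+ row table is what makes this take minutes.
cursor.execute("""
    SELECT id,
           ExtractValue(xml_data, '/doc/title')        AS title,
           ExtractValue(xml_data, '/doc/author/name')  AS author,
           ExtractValue(xml_data, '/doc/meta/created') AS created
    FROM documents
    WHERE ExtractValue(xml_data, '/doc/status') = %s
""", ("published",))
rows = cursor.fetchall()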
The problem is that this temporary solution performs very poorly: on a 100k+ record dataset I have to wait minutes for a single query with ~10 extractvalue() calls to execute.
The solutions I see at the moment:
1) Continue using the approach with pre-picked fields in extra columns. These fields are extracted once, when the document enters the system.
Pros:
- Use of proven technology (MySQL)
- Most searches will be performed against the pre-picked fields, without the very slow extractvalue() function
- I have tested XPath searches with this approach on a 100k+ record dataset, and a single extractvalue() call doesn't seem too slow (<1s per query) in comparison with the 10+ simultaneous extractvalue() calls in the current temporary approach
Cons:
- Because each document can have its own data tree, as well as a different set of pre-picked fields, I have to create a bunch of tables to store these sets of data and join them depending on the document type (see the schema sketch after this list)
- Pre-picked fields can be lists, not only single values, so each list needs a separate table
- The pre-picked fields and the XML must be kept in sync
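To make the first con concrete, here is a minimal sketch of the kind of schema this implies, written as Django models. All model and field names are hypothetical:

# Hypothetical Django models for solution 1: one extra table per document
# type for its pre-picked scalar fields, plus one table per list-valued
# field. All names are made up for illustration.
from django.db import models

class Document(models.Model):
    doc_type = models.CharField(max_length=50)
    xml_data = models.TextField()  # the full data tree, stored as XML

class InvoiceFields(models.Model):
    # pre-picked scalar fields for one particular document type
    document = models.OneToOneField(Document)
    customer = models.CharField(max_length=200, db_index=True)
    total = models.DecimalField(max_digits=12, decimal_places=2)

class InvoiceTag(models.Model):
    # a list-valued pre-picked field needs its own table
    document = models.ForeignKey(Document)
    tag = models.CharField(max_length=100, db_index=True)

Every new document type adds more such tables, and every search that crosses types needs joins.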
2) Use MySQL for custom XPath searches (i.e. the extra feature), and MongoDB for all other searches. MongoDB stores all the needed pre-picked fields, MySQL stores just the XML.
Note: I don't think it is worth storing all of the system's data in MongoDB (user accounts, sessions, etc.). MongoDB just for the documents will be enough.
Pros:
- 90% of needed queries should be fast
- I could store any nested data against each document. The data trees can be different from document to document. No need to join a lot of tables.
- MongoDB seems to have very convenient tools for working with it from Python
Cons:
- Unproven technology (at least for me). I have no experience with MongoDB; however, I have consulted some programmers who use it, and it looks promising.
- MongoDB has no XPath-like features (and it seems it will not have them in the near future), so I have to continue using MySQL for XPath (as in solution 1). As a developer, I want to avoid inventing a new custom query language, so XPath looks like a good compromise.
- Thus, synchronization between MySQL and MongoDB is required (a sketch of this split follows)
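Here is a minimal sketch of what the write path of solution 2 could look like, using PyMongo and MySQLdb. The table, collection and field names are hypothetical:

# Hypothetical sketch of solution 2: the XML goes to MySQL (for the XPath
# extra feature), the pre-picked fields go to MongoDB (for the 90% case).
# All table, collection and field names are made up.
import MySQLdb
from pymongo import Connection  # pymongo's API of that era; newer versions use MongoClient

mysql = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="app")
mongo = Connection("localhost")["app"]

def store_document(doc_id, xml_data, picked_fields):
    """Keep MySQL (XML) and MongoDB (pre-picked fields) in sync."""
    cursor = mysql.cursor()
    cursor.execute("REPLACE INTO documents (id, xml_data) VALUES (%s, %s)",
                   (doc_id, xml_data))
    mysql.commit()
    # picked_fields may be arbitrarily nested; MongoDB stores it as-is,
    # so different document types need no extra tables or joins.
    mongo.documents.save(dict(picked_fields, _id=doc_id))

The 90% of fast queries then go straight to the MongoDB collection, while the XPath extra feature still runs against the XML stored in MySQL.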
My questions:
- Could there be hidden or unobvious problems with solution 2?
- I am still worried about extractvalue() performance. From my customer's point of view, the XPath approach could be replaced with a similar one, but I don't know of any such alternative among the tools available for MySQL or MongoDB. Are there any solutions similar to XPath?
I will be grateful for any feedback, thoughts and comments on the above.
So if I understand the question right, you want to:
- find a given node in a tree, given some path through a portion of the tree to that node plus extra query expressions.
- then get back that node and everything below it.
With a materialized paths approach you can do the above. The main thing that needs tweaking: if there is a path "a..b..c..d..e" to a document and you want to find documents with a path "..b..c..d..", how do you make that fast? If we were starting from the very top it would be easy, but here we aren't. It may make sense to use a combination approach, where a node's document holds both its materialized path and an array of its ancestors, something like:
{
    path : ",a,b,c,d,e,",
    ancestor : ['a','b','c','d','e']
}
We could index on ancestor, which will create a multikey index (since the field is an array). Then we would run a query like the following to find nodes on path "...b,c,d..." with some efficiency:
find( { path : /,b,c,d,/, ancestor : 'd', <more_query_expressions_optionally> } )
In the above, the index on ancestor would be used, and only docs from 'd' down need to be inspected. The following could also be tried, which might be even better depending on how smart the query optimizer is:
find( { path : /,b,c,d,/, ancestor : { $all : ['a','d'] }, ... } )
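For concreteness, here is a rough PyMongo version of the same idea; the collection name and sample data are made up:

# Hypothetical PyMongo sketch of the materialized-paths pattern above.
import re
from pymongo import Connection

nodes = Connection("localhost")["app"].nodes

# A node stores its full path plus an array of its ancestors.
nodes.save({"path": ",a,b,c,d,e,", "ancestor": ["a", "b", "c", "d", "e"]})

# Indexing the array field creates a multikey index.
nodes.ensure_index("ancestor")

# The ancestor clause lets the index narrow the candidates before the
# (unanchored, hence slow) path regex is checked.
matches = nodes.find({
    "path": re.compile(r",b,c,d,"),
    "ancestor": {"$all": ["a", "d"]},
})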
This is a very broad question, but some things I would consider are XML datastores like MarkLogic and eXist; they are very good at optimizing queries on tree-structured data.
You might also consider rolling your own on top of a basic search index like MySQL, or possibly Lucene/Solr if you want better full-text search capabilities (phrases, synonyms, near queries, etc.).
I would do something like index the text content of every named element and attribute (this is more or less the approach taken by the XML datastores I referred to above), use that to retrieve a candidate list of documents, and then evaluate the XPath expressions on the candidates to weed out the false positives (a rough sketch of that last step follows). This is not a small project, though.
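A rough sketch of that filtering step in Python, using lxml to evaluate the XPath on each candidate; the function and variable names are hypothetical:

# Hypothetical sketch: a cheap search index yields candidate XML documents,
# then the exact XPath weeds out false positives.
from lxml import etree

def filter_candidates(candidate_xml_strings, xpath_expr):
    """Yield only the candidates for which the XPath actually matches."""
    xpath = etree.XPath(xpath_expr)  # compile once, reuse per document
    for xml in candidate_xml_strings:
        tree = etree.fromstring(xml)
        if xpath(tree):  # non-empty result means a true match
            yield xml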