How to write an XML database file efficiently?

I want to build an XML file as a datastore. It should look something like this:

<datastore>
    <item>
        <subitem></subitem>
        ...
        <subitem></subitem>
    </item>
    ....
    <item>
        <subitem></subitem>
        ...
        <subitem></subitem>
    </item>
</datastore>

At runtime I may need to add items to the datastore. The number of items may be high, so I don't want to hold the whole document in memory and can't use DOM. I just want to write the part where a change occurs. Or does DOM support this?

I had a first look at StAX, but I am not sure if it does what I want.

Wouldn't it be best to remember a cursor position at the end of the file, just before the root element is closed? That is always the position where new items will be added. So if I remember that position and keep it up to date during changes, I could add a new item at the end without iterating through the whole file.

Maybe a second cursor could be used independently of the first one to iterate over the document just for reading purposes.

I can't see that StAX supports any of this, does it?

Isn't there a block-based API for files instead of a stream-based one? Aren't files and filesystems typical examples of block "devices"? And if there is such an API, does it help me with my problem?

Thanks in advance.


Updating XML in place is basically impossible because there's no "cheap" way to insert data in the middle of a file.

Appending XML is not so bad. All you need to do there is seek to the end of the file, then GO BACK over the "end tag" (</datastore> in this case), and then just start writing. This is a cheap operation all told, but none of the frameworks really support this as they're all mostly designed to work with well formed, full boat XML documents, as a whole, not in pieces.
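The seek-back-and-append step can be sketched with `java.io.RandomAccessFile`. This is a minimal sketch, not a framework feature: the class and method names (`DatastoreAppender`, `appendItem`) are hypothetical, and it assumes a UTF-8 file whose closing `</datastore>` tag sits within the last 256 bytes, followed at most by whitespace, with a single writer at a time.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: appends an item by overwriting the closing root tag. */
class DatastoreAppender {
    private static final byte[] END_TAG =
            "</datastore>".getBytes(StandardCharsets.UTF_8);
    private static final int TAIL_SIZE = 256; // assumption: end tag is this close to EOF

    /** Writes itemXml where the end tag was, then writes the end tag again. */
    static void appendItem(String path, String itemXml) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            long insertAt = findEndTag(file);
            file.seek(insertAt);
            file.write(itemXml.getBytes(StandardCharsets.UTF_8));
            file.write('\n');
            file.write(END_TAG);
            file.write('\n');
            // Drop any leftover bytes from the old tail.
            file.setLength(file.getFilePointer());
        }
    }

    /** Returns the byte offset of the last end tag found in the file's tail. */
    static long findEndTag(RandomAccessFile file) throws IOException {
        int tailSize = (int) Math.min(file.length(), TAIL_SIZE);
        byte[] tail = new byte[tailSize];
        long tailStart = file.length() - tailSize;
        file.seek(tailStart);
        file.readFully(tail);
        // Scan backwards so the last occurrence wins.
        for (int i = tailSize - END_TAG.length; i >= 0; i--) {
            if (matches(tail, i)) {
                return tailStart + i;
            }
        }
        throw new IOException("closing </datastore> tag not found near end of file");
    }

    private static boolean matches(byte[] buf, int offset) {
        for (int j = 0; j < END_TAG.length; j++) {
            if (buf[offset + j] != END_TAG[j]) {
                return false;
            }
        }
        return true;
    }
}
```

A call such as `DatastoreAppender.appendItem("datastore.xml", "<item><subitem>b</subitem></item>")` only touches the tail of the file, so the cost does not depend on how many items are already stored. The caller is responsible for passing well-formed, properly escaped item XML.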

You could use something StAX-like for the writing, but in this case StAX isn't aware of the <datastore> tag; it's only aware of the <item> tags and their child elements. You create items and keep writing them, over and over, to the same OutputStream that you have set up.

That's the best way to do this.

But if you need to delete or change data, then you get to rewrite stuff, or do hacks, such as marking elements as "inactive", hunting them down in the XML file, seeking to the 'active="Y"' attribute, and then changing the Y to N in place. It can be done, and it will be mostly efficient, but it's far and away outside what the normal XML processing frameworks let you do. If I were to do that, I'd read the entire file once, keep track of those entries, and note their locations within it so that later I could seek to them and change them efficiently.

Then when you update something, you "inactivate" the old one and "append" the new one. Eventually you get to GC the file by rewriting it all and throwing out the old, "inactive" entries.
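The "note the locations, then flip the flag in place" hack could look roughly like the sketch below. It is an illustration under assumptions: the names `ActiveFlagIndex`, `scan` and `deactivate` are hypothetical, and every item is assumed to carry the attribute written exactly as `active="Y"` in an ASCII-compatible encoding such as UTF-8. Because Y and N are both one byte long, the overwrite does not shift any other data.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical index of the byte offsets of each 'Y' in active="Y". */
class ActiveFlagIndex {
    private static final byte[] MARKER =
            "active=\"Y\"".getBytes(StandardCharsets.US_ASCII);
    private static final int FLAG_OFFSET = 8; // position of 'Y' inside MARKER

    /** Streams through the file once and records where each flag byte lives. */
    static List<Long> scan(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;    // offset of the byte just read
            int matched = 0; // number of MARKER bytes matched so far
            int b;
            while ((b = in.read()) != -1) {
                if (b == MARKER[matched]) {
                    matched++;
                    if (matched == MARKER.length) {
                        long markerStart = pos - MARKER.length + 1;
                        offsets.add(markerStart + FLAG_OFFSET);
                        matched = 0;
                    }
                } else {
                    // 'a' occurs only at the start of MARKER, so this restart is enough.
                    matched = (b == MARKER[0]) ? 1 : 0;
                }
                pos++;
            }
        }
        return offsets;
    }

    /** Overwrites the single flag byte at flagOffset, turning Y into N. */
    static void deactivate(String path, long flagOffset) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek(flagOffset);
            if (file.read() != 'Y') {
                throw new IOException("no active flag at offset " + flagOffset);
            }
            file.seek(flagOffset);
            file.write('N');
        }
    }
}
```

In a real store the index would more likely map an item key to its offset rather than rely on list position, and it would have to be rebuilt after the file is compacted.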


As a rule of thumb, XML files aren't very efficient as datastores, not for the record-based data you seem to want to use them for.

But if you've already got the file and absolutely can't do anything about it, you can use StAX XMLEventReaders and XMLEventWriters to read through a file quickly and insert/modify elements in it.

But when I say quickly, what I mean is more quickly than DOM would be, but nowhere near as effective as any relational DB.
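As a rough sketch of that StAX approach, the following copies a document event by event and injects one new item just before the closing root tag. The class and method names are hypothetical, and the element names `datastore`, `item` and `subitem` are taken from the question. Note that it streams from an input to a separate output, so the memory use stays flat but the whole file is still read and rewritten.

```java
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical example of a streaming copy-and-insert with StAX events. */
class StaxAppend {

    /** Copies in to out, adding one item with a single subitem at the end. */
    static void copyWithNewItem(InputStream in, OutputStream out, String subitemText)
            throws XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            boolean closingRoot = event.isEndElement()
                    && "datastore".equals(event.asEndElement().getName().getLocalPart());
            if (closingRoot) {
                // Emit the new item before the root end tag is copied.
                writer.add(events.createStartElement("", "", "item"));
                writer.add(events.createStartElement("", "", "subitem"));
                writer.add(events.createCharacters(subitemText));
                writer.add(events.createEndElement("", "", "subitem"));
                writer.add(events.createEndElement("", "", "item"));
            }
            writer.add(event);
        }
        writer.close();
        reader.close();
    }

    /** Counts item elements without building a tree in memory. */
    static int countItems(InputStream in) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        int count = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "item".equals(event.asStartElement().getName().getLocalPart())) {
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```

In practice the output would go to a temporary file that replaces the original once the copy succeeds, since the same file cannot safely be read and overwritten in one pass.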

Update: Another option you can consider is VTD-XML. I haven't tried it in real projects, but it looks pretty decent.


If you always want to append items at the end, then the best way to handle this is to have two XML files. The outer one, datastore.xml, is simply a wrapper and looks like this:

<!DOCTYPE datastore [
  <!ENTITY e SYSTEM "items.xml">
]>
<datastore>&e;</datastore>

The file items.xml looks like this:

<item>....</item>
<item>....</item>
<item>....</item>

with no wrapper element.

When you want to append data, you can open items.xml and write to the end of it. When you want to read data, open datastore.xml with an XML parser.
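In Java, the two halves of this scheme might be sketched as follows. The class and method names are hypothetical. The sketch assumes datastore.xml and items.xml sit in the same directory, that items.xml is UTF-8, and that the parser is allowed to resolve external entities, which is the JDK default but is often switched off in hardened configurations.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for the wrapper-plus-entity layout. */
class EntityDatastore {

    /** Appends one item to the unwrapped items file; no XML parsing involved. */
    static void appendItem(Path itemsFile, String itemXml) throws IOException {
        Files.write(itemsFile,
                (itemXml + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Parses the wrapper file; the parser pulls in items.xml through the entity. */
    static int countItems(File wrapperFile) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(wrapperFile,
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        if ("item".equals(qName)) {
                            count[0]++;
                        }
                    }
                });
        return count[0];
    }
}
```

Passing a `File` to the parser gives it a base location, which is what lets the relative system identifier `items.xml` resolve next to the wrapper. SAX is used here only to keep reading memory-light; any parser that expands external entities should behave the same way.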

Of course, once your data grows beyond 20 MB or so, it may well be better to use an XML database. But I've been using this approach for years for records of Saxon orders, with files that are currently about 8 MB, and it works fine.


It's not very easy or efficient to partially update an XML file, so you won't find much support for it as a use case.

Really, it sounds like you need a proper database, perhaps with a tool to export the data as XML.

If you don't want to use a DB and insist on storing the data purely as XML you might consider keeping all your items in memory as objects. Whenever a new one is added you can write all of them out to XML. It might seem inefficient, but depending on your data size might still be good enough.

If you choose this path, you might want to check out the XStream library to make this quite easy; see the XStream tutorial for a quick example.
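For comparison, here is a minimal sketch of the same keep-everything-in-memory idea using only the JDK's `XMLStreamWriter` instead of XStream. The class name and the `List<List<String>>` data model are assumptions made for illustration: each inner list stands for one item and each string for the text of one subitem.

```java
import java.io.OutputStream;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical writer that dumps the whole in-memory store on every change. */
class InMemoryDatastore {

    /** Writes all items as one well-formed datastore document. */
    static void writeAll(List<List<String>> items, OutputStream out)
            throws XMLStreamException {
        XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement("datastore");
        for (List<String> item : items) {
            writer.writeStartElement("item");
            for (String subitem : item) {
                writer.writeStartElement("subitem");
                writer.writeCharacters(subitem); // escapes <, > and & for us
                writer.writeEndElement();
            }
            writer.writeEndElement();
        }
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
    }
}
```

Every save costs time proportional to the total number of items, which is the trade-off this answer accepts in exchange for simplicity.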