
How do I efficiently refresh the results of an expensive query?

I have an application that executes an expensive query to populate its UI. At regular intervals, the UI needs to refresh and display new data that qualifies for the original query. I would like the refresh query to be as efficient as possible and only return the new data that qualifies. How can I do this?

  For example, my first query is the following:  

select * from some_really_big_table

  The most obvious thought is to add a new qualification to the original query. Suppose I had a column called update_timestamp that stores the timestamp of when the row was created or last updated. To simplify this question, let's assume that there is only a single database server and this timestamp is populated from the database server operating system time when the insert/update statement is executed. Also assume the clock has enough resolution such that no two timestamps would ever be the same. Add another assumption that the clock value will never decrease. I know these are somewhat unrealistic assumptions.
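For concreteness, the schema change I have in mind is something like this (a rough sketch; the exact DDL syntax varies by database, and the index name is just a placeholder):

-- add the tracking column and index it so the refresh query can use it
alter table some_really_big_table add column update_timestamp timestamp;
create index idx_update_timestamp on some_really_big_table (update_timestamp);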

  With the aid of this new column (and an index on that column), I perform the refresh with the following query (where X is the highest update_timestamp returned on any result in the original query):  

select * from some_really_big_table where update_timestamp > X

I still have a problem. Since update_timestamp is assigned when the statement is executed, it's possible that two transactions writing records at the same time will execute their insert/update statements in one order but commit in the opposite order. As a result, a row with a greater update_timestamp can become visible before a row with a lesser update_timestamp. If the application's refresh query executes between these two commits, it will never see the data from the later commit: that row is missed in the current refresh, and the next refresh will not pick it up either, because its update_timestamp is below the new X. I cannot accept this possibility.

Serialized Writes

  To solve this problem it seems I need to serialize the writes to some_really_big_table by synchronizing on some lock before selecting the update_timestamp. Then, free the lock as soon as the insert/update is committed. With this approach I can guarantee a record will never be written out of order with respect to update_timestamp.
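Concretely, the serialization could look something like this (a sketch in PostgreSQL-flavored SQL; the one-row write_lock table, the some_value column, and the id value are all made up for illustration):

-- hypothetical one-row table used purely as a mutex for writers
create table write_lock (id int primary key);
insert into write_lock values (1);

-- writer transaction: hold the lock from timestamp assignment through commit
begin;
select id from write_lock where id = 1 for update;  -- blocks other writers
update some_really_big_table
   set some_value = 'new value',
       update_timestamp = current_timestamp
 where id = 42;
commit;  -- lock released here, so rows become visible in timestamp order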

  If database writes were infrequent or already single-threaded maybe I would not be concerned with this, but let’s assume writes to some_really_big_table are frequent and parallel.

  Is there any solution that does not require me to serialize the writes?  

Alternate Solution

  I suppose I could use the following query instead (where Y is some fudge factor for how far apart I think update_timestamp values could possibly be out of order):  

select * from some_really_big_table where update_timestamp > (X - Y)

I could do that, but I don't feel comfortable taking the risk that I got Y wrong and my application silently missed data because of it. Maybe I would make Y pretty large to make that extremely unlikely, but my application's performance would suffer because of it, and I still might have gotten it wrong. For example, somebody might change the clock on the database server, and my fudge factor is suddenly way off.

I could use a logical clock instead of an operating system clock, but I still have to serialize before selecting the next clock value; otherwise I am back to the same problem of clock-value selection and database commits happening out of sequence. I could use a fudge factor on the logical clock too, but there is still room for things to go wrong.
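For the logical-clock variant, a sequence would look roughly like this (PostgreSQL-style syntax; the revision and some_value columns and the id value are illustrative). The sequence value is still drawn before the commit, so the same reordering window exists unless the writes are serialized as above:

-- hypothetical sequence acting as a logical clock
create sequence revision_seq;

-- the value is fetched before commit, so a smaller revision can still
-- become visible after a larger one if commits happen out of order
update some_really_big_table
   set some_value = 'new value',
       revision = nextval('revision_seq')
 where id = 42;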

Partitioned Serialized Writes

  I have also considered partitioning the data such that I don't have to synchronize all writes to some_really_big_table. Say, for example, I have a column some_attribute that is a natural way to partition some_really_big_table such that concurrent writes for a given value of some_attribute are relatively rare. In this case I only guarantee the insert/update ordering of update_timestamp within a partition. Then my refresh query becomes the following:  

select * from some_really_big_table where
(some_attribute = A1 and update_timestamp > X1)
or (some_attribute = A2 and update_timestamp > X2)
or (some_attribute = A3 and update_timestamp > X3)
...

  X1, X2, X3… are the highest update_timestamp values I saw the last time I queried. A1, A2, A3… are the values of some_attribute that my query cares about. In reality, my original query would have had this some_attribute qualification in it as well but I excluded it from this discussion to make things simpler at the beginning. So really the very first query would have been this:  

select * from some_really_big_table where some_attribute in (A1, A2, A3, ...)

It seems this partitioned serialization technique is the best I can come up with. I know others must have tackled this same problem before; it seems likely to come up when implementing caches for certain types of data. I've done some Googling, but it is difficult to pick search terms that lead to discussions of this problem. Does anyone have experience with a similar problem that they can share?

As I was doing a little research on this topic I stumbled across Oracle's Flashback Query functionality [1]. It seemed like that might make sense for this use case with a query based on SCN, but I can't necessarily count on access to this feature since my system won't necessarily be running on Oracle.

[1] http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14251/adfns_flashback.htm


Wondering why no one has replied so far, but anyway, maybe this is an option for you: instead of updating all your records, create a secondary table called 'update_groups', link your records to one or more groups, and set a timestamp on the group instead of on the separate records. You could also use a boolean value such as 'IsUpdated' instead of a timestamp. This only works if the records can be put into groups, though.

If all you have is one big fat list of records, then you can just create a table with one or more rows called 'UpdateList' and update that table with the IDs of the changed records. You can just comma-separate them and you'll know exactly what to query next. It's very lightweight. The only thing you should be aware of is how to update this record: I think whenever a record updates you can just append its ID to the current value, and then when you're updating your UI you can clear all the values and start filling them again. That's basically how it works. Good luck!
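A minimal sketch of the 'UpdateList' idea, using one row per changed ID instead of a comma-separated string (all names here are made up; :refresh_started_at is a bind variable captured when the refresh begins):

-- hypothetical change-log table: one row per changed record
create table update_list (
    changed_id  bigint,
    changed_at  timestamp default current_timestamp
);

-- writer: log the id of each changed row in the same transaction as the change
insert into update_list (changed_id) values (42);

-- refresh: fetch only the rows whose ids appear in the change log
select t.* from some_really_big_table t
 where t.id in (select changed_id from update_list);

-- then clear only the entries already processed, so ids logged in the
-- meantime are not lost
delete from update_list where changed_at <= :refresh_started_at;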


You could use the ora_rowscn pseudocolumn to track the changes.
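If the system does end up on Oracle, the refresh could look something like this (a sketch; :last_seen_scn is a bind variable holding the highest SCN seen on the previous refresh). Note that ora_rowscn is tracked per block unless the table was created with ROWDEPENDENCIES, so that option matters for row-level accuracy:

-- Oracle only: ora_rowscn exposes the SCN of the last change to each row
select t.*, ora_rowscn
  from some_really_big_table t
 where ora_rowscn > :last_seen_scn;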


Use a global revision number and the read committed isolation level in your database (other clients can only read data after the transaction has been committed, which is pretty much the standard). The trick is to not let the reader learn of the newest revision until all rows belonging to it have been committed. Of course, a transaction can also consist of just one updated row.

You can use a separate lightweight table to keep track of revision numbers and join on that table when reading. This allows you to manage revision numbers per group of rows (if you want that) by linking the large table to the revision table N-to-1.

Writing process:

Initialize a global revision number X = 1.

For each update:

  1. Increment X.
  2. Start a transaction (likely implicit).
  3. Update all rows you want to update and update their revision numbers to X.
  4. Commit.

Reading process:

Initialize a last known revision number Y = 0. Select new data as follows:

For each refresh:

  1. SELECT * FROM some_really_big_table WHERE revision > Y
  2. Update Y with the highest known revision number in your new set.
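One way this could look in SQL (a sketch with illustrative names; here the increment is folded into the writer's transaction so that the row lock on the counter keeps revisions committing in order, which is what prevents the reader from seeing a newer revision before an older one has landed):

-- hypothetical single-row table holding the current global revision
create table revision_counter (current_revision bigint);
insert into revision_counter values (0);

-- writing process: bump the revision and stamp the changed rows in one transaction
begin;
update revision_counter set current_revision = current_revision + 1;
update some_really_big_table
   set some_value = 'new value',
       revision = (select current_revision from revision_counter)
 where id = 42;
commit;

-- reading process: :Y is the highest revision seen on the previous refresh
select * from some_really_big_table where revision > :Y;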