Optimizing a query - using a field or using another table
I have a query which takes a long time and I want to optimize it. I'm looking for the most efficient way to do it.
I'm working on Hibernate/JPA with Postgresql DB but any solution should be a generic JPA one.
Terminology
- User: A user in the system.
- Friend: A friend of the user. A user will have N friends.
- Session: A session of using the system. Can be open or closed.
- Context: A context of the session. A user may have one open session per context in any given time, and may have many past closed sessions per context.
The query
I need to implement a query that, given a user name, gives me the following:
- Fetch all the friends of that user
- For ea开发者_如何学编程ch friend:
- If the friend has any open sessions, fetch all the open sessions (for all the contexts)
- Otherwise, get the friend's latest session out of all the contexts.
Note that the friendships are stored in a different DB so I cannot incorporate that into one big query in any case.
Example
User A has three friends: B,C,D. There are two contexts, 1 and 2. The friends have the following data:
(The formatting below is Session ID - User,Context)
- 1 - B,1: Open session
- 2 - B,2: Closed session that started on Feb-27
- 3 - B,2: Closed session that started on Feb-26
- 4 - C,1: Closed session that started on Feb-27
- 5 - C,1: Closed session that started on Feb-26
- 6 - C,2: Closed session that started on Feb-26
- 7 - C,2: Closed session that started on Feb-25
- 8 - D,1: Open session
- 9 - D,2: Open session
The query should get me: B: Session 1 (All open sessions) C: Session 4 (Latest closed session) D: Sessions 8,9 (All open sessions)
Current state
My query works in three steps:
- Get all the friends of the user
- For each friend:
- Get all the open sessions for the friend
- If there is any open session, return all the open sessions
- Get the latest session for the friend, return that session
Obviously this is a lot of queries. For starters, I'm going to take step 2 above and convert it into a single query. My concerns are related to that second query. The question is - how to make it more optimized. The problem can be therefore rephrased:
"Given a set of N friend IDs, get all the open sessions or the latest session for all these friends."
Suggested solutions
There are basically two solutions we came up with and we're contemplating what would be better.
The table solution says to keep a new table that will correlate between user, context, and latest session. The implications of this solution are:
- Create a new entity & table for "latest sessions"
- The table will have these columns:
- User
- Context
- Latest session ID
- The table will be updated by the session entity on post persist, so that any newly persisted session will automatically update this table.
- The new query will fetch all the records for all the friends of the user from this table and work on them to create the final result.
The column solution says to keep a "latest" flag column on the sessions table. The implications of this solution are:
- Create a new field for the latest (a boolean)
- The column will be set by the post persist of the session entity, so that the former "latest" session will no longer be the latest, and the new session will become the latest one.
- The new query will fetch all the latest records (by incorporating the new column into the condition of the statement) for all the friends of the user from the original sessions table and work on them to create the final result.
There are pros and cons to each of these, and we don't seem to have a winner yet. Obviously there may be other, better solutions we have not considered. What I'd like to see is which of the above is better and why, or a new better approach of your own.
The difference between your two solutions should be marginal. Table solution might be cleaner depending on activity.
However, do note that 'you are doing it wrong' (according to the theory).
The RDBMS application design principle clearly states that you should not try to specify how your queries should be executed, but what data you want. The database will find optimal path to your solution (the RDBMS sits closest to the data and depending on your architecture might save network round trips, storage round trips and so on; scalability can be seriously crippled here and you might not be aware of it if you don't do decent stress testing; furthermore RDBMS knows about indexes and internal statistics that determine if scans or seeks will be more effective and it knows how to optimally execute joins).
In practice, try to raise the question why different database for friendships? (is it really different db or different schema on the same db?).
Furthermore, if you really want to go the way you do it (disabling the RDBMS to look for optimal execution plan), then the most important factors are:
- indexes (will affect the performance in orders of magnitude)
- usage patterns (indexes will improve performance of SELECTs, but too many indexes will slow down updates)
- application/client layer caching (can affect performance and scalability in orders of magnitude)
EDIT: So, considering "Given a set of N friend IDs, get all the open sessions or the latest session for all these friends." here is a query that should be tested before introducing new structures
Sessions (SessionID, User, Context, Start, End)
SELECT *
FROM Sessions s
WHERE s.End IS NULL
AND s.User IN (:friendsList)
UNION ALL
SELECT *
FROM Sessions s
WHERE s.User NOT IN (SELECT User
FROM Sessions s2
WHERE s2.User IN (:friendsList)
AND s2.End IS NULL)
AND s.User IN (:friendsList)
AND s.End IN (SELECT MAX(End)
FROM Sessions s2
WHERE s2.User = s.User)
There are more ways to write the above to try to help the optimizer, in particular if your DB supports CTE the above can be rewritten more efficiently.
Notes:
:friendsList
- list of Users that are friends.
Also, I am assuming open sessions have NULL as value of the End
for open sessions. You might already be choosing some other approach (maybe you have a field denoting it; or there are two tables, one for open sessions, one for closed)
The above query will benefit from certain indexes (principle is to first try to optimize with indexes, then with restructuring; first index I would try is composite index on User, End
) and on relatively small number of friends (assumed from the fact that it passed around as a string), this should perform decently already.
why not cache objects? You do not need to hit the DB.
Your main bottle neck appears to be the fact that the information you need is distributed over two databases. Thus, you acquire a list of friends and itterate through them.
I would suggest that you attempt to remove the itteration, reducing it to a single query instead.
The way I would achieve this is build up a comma delimited string of user ids, and pass that string to the second database. The sql in the second database could then (using a function, for example) translate the string intol a single field table of ids, and join on that.
It feel very inellegant to me, yet it's something I do all the time.
The only practical alternative that I have used is to build up a single query that inserts the IDs in to a table, then join on that. Either a temporary table, or a permanent table with a SessionID field allowing multiple sessions to use it concurrently.
Whatever approach you use, have a single query for step 2, using a set based approach rather than itteration, should yield significant benefits.
精彩评论