Selecting distinct entities across a large google app engine table
I was wondering if anyone can help me with this problem.
We have an idea we'd lik开发者_如何学JAVAe to implement, and we're currently unable to do this efficiently.
I've anonymised the data as best as possible, but the structure is the same.
We have two entities, Car and CarJourney. Each Car has 0 to many CarJourney's. Each Car Journey has (amongst other properties) a date associated with it - the date the journey was started.
I wish to query by time over car journeys. I'll have two times, a start date and an end date, where start date <= endDate, and I want to receive the most recently started journey in that period.
So, if I had a particular car in mind, say car 123, I'd write a query that limits by Car.key and Car.startDate, where Car.key == 123 and Journey.startDate >= startDate and Journey.startDate <= endDate with an ordering on Journey.startDate descending and a limit of 1.
e.g. Car A has 3 journeys, taken on 1st, 2nd and the 3rd of the month. The query start date is 1st and the query end date is the 2nd. The result of this query would be one Car journey, the 2nd.
Once the result of that query is returned, a very small amount of processing is done to return a result to the user.
That's the easy bit.
But, instead of over 1 Car, I want a list of cars, where the list contains N keys to cars.
So, I want to run the above query N times, once for every car. And I want the latest journey for each car.
Because the time range is flexible (and thus can't be known beforehand) we can't implement a "isMostRecent" flag, because while it might be the most recent for now, it might not be the most recent for the specified date parameters.
We also need to ensure that this returns promptly (current queries are around the 3-5 second mark for a small set of data) as this goes straight back to the user. This means that we can't use task queues, and because the specified dates are arbitrary we can't implement mass indexing of "isWithinDate" fields.
We tried using an async query, but because the amount of processing is negligible the bottleneck is still the queries on the datastore (because the async api still sends the requests synchronously, it just doesn't block).
Ideally, we'd implement this as a select on car journeys ordered by startDate where the Car.key is distinct, but we can't seem to pull this off in GAE.
There are numerous small optimisations we can make (for example, some MemCaching of repeated queries) but none have made a significant dent in our query time. And MemCaching can only help for a maximum of 1-2 minutes (due to the inevitable forward march of time!)
Any ideas are most welcome and highly appreciated.
Thanks, Ed
It sounds like the best option is to execute the many queries yourself. You say you tried asynchronous queries, but the bottleneck was sending the query. This seems extremely odd - you should be able to have many queries in flight at the same time, substantially cutting down your latency. How did you determine this?
First of all I'd recommend using objectify. JDO/JPA on appengine just fool people into thinking that appengine datastore is just a SQL database, which, as you realized, is far from the truth.
If I understand correctly you have a Car which contains a List of CarJourneys?
List properties on appengine are limited to 5000 entries and any time you access/change them they have to be serialized/deserialized in whole. So if you plan to have a lot of CarJourneys per Car than this will get slow. Also because appengine creates an index entry for every value in the collection this can lead to exploding indexes.
Instead, just create a property Car inside CarJourney that points to the Car that made the journey: a one-to-one relationship from CarJourney to Car. The type can be Key or just string/long containing the id of the Car. When querying just add filter for Car property.
I suggest watching Brett Slatkin's video: Scalable, Complex Apps on App Engine.
You can also use one query and filter distinct cars by yourself. Like select CarJouney startDate >= startDate and startDate <= endDate order by startData
and iterate (+filter on your side) through this query until you find enough data to show.
Denormalization should solve your problem - having a last_journey reference property in your car, so everytime you start a journey, you'd also update the Car entity - this way you'd be able to query all cars and have their lastest journey on the resultset. It's worth noting that when you access last_journey, a new get() will be issued to the datastore, so if you're listing a lot of cars, you could build a list with all the last_journey keys and fetch then all at once passing that to db.get().
Scalable, Complex Apps on App Engine is definately a must watch (sadly the sound is terrible on this video)
I have faced same kind of problem some time ago. I tried some solutions (in memory sort and filtering, encoding things into keys etc. and I have benchmarked those for both latency and cpu cycles using some test data around 100K entities) An other approach I have taken is encoding the date as an integer (day since start of epoch or day since start of year, same for hour of day or month depending on how much detail you need in your output) and saving this into a property. This way you turn your date query filter into an equality only filter which does not even needs to specify an index) then you can sort or filter on other properties. Benchmarking the latest solution I have found that when the filtered result set is a small fraction of the unfiltered original set, is 1+ order of magnitude faster and cpu-eficient. Worst case when no reduction of the result set due to filtering the latency and cpu usage was comparable to the previous solutions)
Hope this helps, or did I missed something ?
Happy coding-:)
You can also make this queries in parallel by calling it right from client, using ajax. I mean that you can return to the user an empty html page, just with cars definitions, and then make ajax calls for journeys for every car on this page.
As JB nizet suggested I am wondering if the answer might be something such as a single query, possibly with a temporary table, or anonymous intermediate table (I don't know what google supports to this end) using a group by (thus eliminating extra transfer of data and the need for Java to do the processing). I am thinking something along the lines of
CREATE TEMPORARY TABLE temp1 AS
SELECT * FROM car_journey
WHERE start_date > ? AND
end_date < ?
SELECT car_id, journey_id
FROM temp1 t1, (
SELECT car_id, MIN(start_date)
FROM temp1
GROUP BY car_id
) t2
WHERE t1.car_id = t2.car_id AND
t1.start_date = t2.start_date
With the temporary table you can greatly reduce the time for the secondary query, since theoretically the data will be much smaller than the full table.
Finally, again not knowing what google supports, I would ask if you have indices defined on the appropriate columns, which may help speed up the query.
精彩评论