How do you react to the absence of an event in a distributed system?
I have a system that collects session data. A session consists of a number of distinct events, for example "session started" and "action X performed". There is no way to determine when a session ends, so instead heartbeat events are sent at regular intervals.
This is the main complication: without a way to determine whether a session has ended, the only option is to react to the absence of an event, i.e. no more heartbeats. How can I do this efficiently and correctly in a distributed system?
Here is some more background to the problem:
The events must then be assembled into objects representing sessions. The session objects are later updated with additional data from other systems, and eventually they are used to calculate things like the number of sessions, average session length, etc.
The system must scale horizontally, so there are multiple servers that receive the events, and multiple servers that process them. Events belonging to the same session can be sent to and processed by different servers. This means that there's no guarantee that they will be processed in order, and there are additional complications that mean that events can be duplicated (and there's always the risk that some are lost, either before they reach our servers, or when processed).
Most of this exists already, but I have no good solution to how to efficiently and correctly determine when a session has ended. The way I do it now is to periodically search through the collection of "incomplete" session objects looking for any that have not been updated in an amount of time equal to two heartbeats, and moving these to another collection with "complete" sessions. This operation is time consuming and inefficient, and it doesn't scale well horizontally. Basically it consists of sorting a table on a column representing the last timestamp and filtering out any rows that aren't old enough. Sounds simple, but it's hard to parallelize. If you do it too often you won't be doing anything else, because the database will be busy filtering your data; if you don't do it often enough, each run will be slow because there's too much to process.
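To make the current approach concrete, here is a rough sketch of the sweep using SQLite. The table names, column names and the 30-second heartbeat interval are illustrative assumptions, not the real schema.

```python
import sqlite3

HEARTBEAT_INTERVAL = 30  # seconds; illustrative value, not the real interval


def sweep(conn, now):
    """Move sessions that have been silent for two heartbeats to the complete table."""
    cutoff = now - 2 * HEARTBEAT_INTERVAL
    with conn:  # run both statements in one transaction
        conn.execute(
            "INSERT INTO complete_sessions (id, last_seen) "
            "SELECT id, last_seen FROM incomplete_sessions WHERE last_seen <= ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM incomplete_sessions WHERE last_seen <= ?", (cutoff,)
        )
    return cur.rowcount  # number of sessions moved


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incomplete_sessions (id TEXT PRIMARY KEY, last_seen INTEGER)")
conn.execute("CREATE TABLE complete_sessions (id TEXT PRIMARY KEY, last_seen INTEGER)")
conn.executemany(
    "INSERT INTO incomplete_sessions VALUES (?, ?)", [("a", 100), ("b", 10)]
)
moved = sweep(conn, now=100)  # "b" has been silent for 90 seconds, "a" is fresh
```

Every run has to consider the whole incomplete table, which is where the cost comes from.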
I'd like to react to when a session has not been updated for a while, not poll every session to see if it's been updated.
Update: Just to give you a sense of scale: there are hundreds of thousands of sessions active at any time, and eventually there will be millions.
One possibility that comes to mind:
In your database table that keeps track of sessions, add a timestamp field (if you don't have one already) that records the last time the session was "active". Update the timestamp whenever you get a heartbeat.
When you create a session, schedule a "timer event" to fire after some suitable delay to check whether the session should be expired. When the timer event fires, check the session's timestamp to see if there's been more activity during the interval that the timer was waiting. If so, the session is still active, so schedule another timer event to check again later. If not, the session has timed out, so remove it.
If you use this approach, each session will always have one server responsible for checking whether it's expired, but different servers can be responsible for different sessions, so the workload can be spread around evenly. When a heartbeat comes in, it doesn't matter which server handles it, because it just updates a timestamp in a database that's (presumably) shared between all the servers.
There's still some polling involved since you'll get periodic timer events that make you check whether a session is expired even when it hasn't expired. That could be avoided if you could just cancel the pending timer event each time a heartbeat arrives, but with multiple servers that's tricky: the server that handles the heartbeat may not be the same one that has the timer scheduled. At any rate, the database query involved is lightweight: just looking up one row (the session record) by its primary key, with no sorting or inequality comparisons.
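A minimal single-process sketch of the timer idea, assuming a 30-second heartbeat and a timeout of two heartbeats. The `SessionExpirer` class, its method names, and the in-memory dict standing in for the shared database table are all made up for illustration. As a small variation on the description above, the timer is rescheduled for the moment the session would time out given its latest heartbeat, rather than after a fixed delay.

```python
import heapq

HEARTBEAT_INTERVAL = 30            # seconds; assumed value
TIMEOUT = 2 * HEARTBEAT_INTERVAL   # expire after two missed heartbeats


class SessionExpirer:
    def __init__(self):
        self.last_seen = {}  # stand-in for the shared session table: id -> timestamp
        self.timers = []     # min-heap of (fire_at, session_id) owned by this server

    def start_session(self, session_id, now):
        self.last_seen[session_id] = now
        heapq.heappush(self.timers, (now + TIMEOUT, session_id))

    def heartbeat(self, session_id, now):
        # Any server could do this: it only touches the timestamp.
        # max() keeps the newest value if heartbeats arrive out of order.
        # Heartbeats for unknown sessions are ignored in this sketch.
        if session_id in self.last_seen:
            self.last_seen[session_id] = max(self.last_seen[session_id], now)

    def run_due_timers(self, now):
        """Fire every timer that is due; return the sessions that just expired."""
        expired = []
        while self.timers and self.timers[0][0] <= now:
            _, session_id = heapq.heappop(self.timers)
            last = self.last_seen.get(session_id)
            if last is None:
                continue  # session already removed
            if now - last >= TIMEOUT:
                del self.last_seen[session_id]
                expired.append(session_id)
            else:
                # Still active: check again when it would time out.
                heapq.heappush(self.timers, (last + TIMEOUT, session_id))
        return expired
```

Each timer check is a single lookup by session id, and a session that keeps sending heartbeats costs one check per timeout period instead of one row in every sweep.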
So you're collecting heartbeats; I'm wondering if you could have a batch process (or something) that ran across the collected heartbeats looking for patterns that implied the end of a session.
The level of accuracy is governed by how regular the heartbeats are and how often you scan across the collected heartbeats.
The advantage is you're processing all heartbeats through a single mechanism (in one spot - you don't have to poll each heartbeat on its own) so that should be able to scale - if it was a database-centric solution that should be able to cope with lots of data, right?
There might be a more elegant solution but my brain's a bit full just now :)
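Something like this rough sketch is what I have in mind, assuming the collected heartbeats can be read as (session id, timestamp) pairs. The function name, the tuple layout and the 60-second timeout are just placeholders. The "pattern" here is simply that the newest heartbeat for a session is older than the timeout.

```python
def find_ended_sessions(heartbeats, now, timeout):
    """One pass over collected heartbeats; returns ids of sessions that look ended."""
    latest = {}
    for session_id, ts in heartbeats:
        # Duplicates and out-of-order heartbeats are harmless:
        # only the newest timestamp per session is kept.
        if ts > latest.get(session_id, float("-inf")):
            latest[session_id] = ts
    return sorted(sid for sid, ts in latest.items() if now - ts >= timeout)


heartbeats = [("a", 0), ("a", 30), ("b", 0), ("a", 30), ("b", 90)]
ended = find_ended_sessions(heartbeats, now=100, timeout=60)  # ["a"]
```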