Querying a Cassandra column family for rows that have not been updated in X days
I'm moving an existing MySQL-based application over to Cassandra. So far, finding the equivalent Cassandra data model has been quite easy, but I've stumbled on the following problem, for which I'd appreciate some input:
Consider a MySQL table holding millions of entities:
CREATE TABLE entities (
id INT AUTO_INCREMENT NOT NULL,
entity_information VARCHAR(...),
entity_last_updated DATETIME,
PRIMARY KEY (id),
KEY (entity_last_updated)
);
Every five minutes the table is queried for entities that need to be updated:
SELECT id FROM entities
WHERE entity_last_updated IS NULL
OR entity_last_updated < DATE_ADD(NOW(), INTERVAL -7*24 HOUR)
ORDER BY entity_last_updated ASC;
The entities returned by this query are then updated using the following query:
UPDATE entities
SET entity_information = ?,
entity_last_updated = NOW()
WHERE id = ?;
What would be the corresponding Cassandra data model that would allow me to store the given information and effectively query the entities table for entities that need to be updated (that is: entities that have not been updated in the last seven days)?
To achieve what you described, use the timestamp as the column name and call the get slice function with a start time and an end time; it will return everything whose column name falls within that range. Also sort by column name, so that the results come back ordered by time.
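To make the idea concrete, here is a minimal in-memory sketch in Python of such a time-ordered index. It does not talk to Cassandra; the class `LastUpdatedIndex` and its methods `touch` and `slice` are hypothetical names that only model one wide row whose column names are (timestamp, entity id) pairs kept in sorted order. It also assumes that never-updated entities are indexed under timestamp 0 instead of NULL, and that timestamps are plain integers such as epoch seconds.

```python
import bisect


class LastUpdatedIndex:
    """In-memory stand-in for one wide row with time-sorted column names."""

    def __init__(self):
        # Sorted list of (timestamp, entity_id), mimicking sorted column names.
        self._columns = []
        # entity_id -> timestamp under which the entity is currently indexed.
        self._current = {}

    def touch(self, entity_id, timestamp):
        """Record an update: drop the old index column, add the new one."""
        old = self._current.get(entity_id)
        if old is not None:
            i = bisect.bisect_left(self._columns, (old, entity_id))
            del self._columns[i]
        bisect.insort(self._columns, (timestamp, entity_id))
        self._current[entity_id] = timestamp

    def slice(self, start, end):
        """Return entity ids with start <= timestamp < end, oldest first."""
        # A 1-tuple (t,) sorts before every (t, entity_id), so these bounds
        # select exactly the columns whose timestamp lies in [start, end).
        lo = bisect.bisect_left(self._columns, (start,))
        hi = bisect.bisect_left(self._columns, (end,))
        return [entity_id for _, entity_id in self._columns[lo:hi]]
```

With this layout, "entities not updated in the last seven days" would be `index.slice(0, now - 7 * 24 * 3600)`. Note that every update has to remove the old index column as well as write the new one, otherwise stale entries keep showing up in the slice.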
You'd have to scan all the rows and grab the timestamp from the column(s) you're interested in. If this is something you run every day or so, doing this in a Hadoop job should be fine. If it's something you run every few minutes, then you'll need to come up with another approach.
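As a rough illustration of what the full scan boils down to, here is a small Python sketch that filters (entity id, last updated) pairs against a seven-day cutoff. The function name `find_stale` and the shape of `rows` are assumptions made for this example; in a real batch job the pairs would come from iterating over the whole column family.

```python
from datetime import datetime, timedelta


def find_stale(rows, now, max_age=timedelta(days=7)):
    """Return ids of entities never updated or last updated before the cutoff.

    rows is an iterable of (entity_id, last_updated) pairs, where
    last_updated is a datetime or None for never-updated entities.
    """
    cutoff = now - max_age
    stale = [(ts, eid) for eid, ts in rows if ts is None or ts < cutoff]
    # Never-updated entities first, then oldest first, like ORDER BY ... ASC
    # in the MySQL query (MySQL sorts NULLs first in ascending order).
    stale.sort(key=lambda p: (p[0] is not None, p[0] or datetime.min, p[1]))
    return [eid for _, eid in stale]
```

The cost is proportional to the total number of rows on every run, which is why this only suits an infrequent batch job.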