How can I improve this PHP/MySQL news feed?
Let me start right off the bat by saying that I know this is not the best solution. I know it's kludgy and a hack of a feature. But that's why I'm here!
This question/work builds off some discussion on Quora with Andrew Bosworth, creator of Facebook's news feed.
I'm building a news feed of sorts. It's built solely in PHP and MySQL.
The MySQL schema
The relational model for the feed is composed of two tables. One table functions as an activity log; in fact, it's named activity_log. The other table is newsfeed. These tables are nearly identical.
The schema for the log is activity_log(uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP)...
...and the schema for the feed is newsfeed(uid INT(11), poster_uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP).
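For concreteness, here is a DDL sketch of those two tables. The ENUM values, NOT NULL constraints, and index names are assumptions beyond what's stated above; the composite (uid, date) indexes are a suggestion, since both the cron job and the feed read filter by user and sort by date.

```sql
-- Sketch only: ENUM values and index names are illustrative assumptions.
CREATE TABLE activity_log (
    uid         INT(11)   NOT NULL,
    activity    ENUM('question', 'answer', 'comment') NOT NULL,
    activity_id INT(11)   NOT NULL,
    title       TEXT      NOT NULL,
    date        TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_user_date (uid, date)
);

CREATE TABLE newsfeed (
    uid         INT(11)   NOT NULL,  -- feed owner
    poster_uid  INT(11)   NOT NULL,  -- friend who generated the activity
    activity    ENUM('question', 'answer', 'comment') NOT NULL,
    activity_id INT(11)   NOT NULL,
    title       TEXT      NOT NULL,
    date        TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_owner_date (uid, date)
);
```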
Any time a user does something relevant to the news feed, for example asking a question, it will get logged to the activity log immediately.
Generating the news feeds
Then every X minutes (5 minutes at the moment, will change to 15-30 minutes later), I run a cron job that executes the script below. This script loops through all of the users in the database, finds all the activities for all of that user's friends, and then writes those activities to the news feed.
At the moment, the SQL that culls the activity (called in ActivityLog::getUsersActivity()) has a LIMIT 100 imposed for performance* reasons.
*Not that I know what I'm talking about.
<?php
$user = new User();
$activityLog = new ActivityLog();
$friend = new Friend();
$newsFeed = new NewsFeed();

// Get all the users
$usersArray = $user->getAllUsers();

foreach ($usersArray as $userArray) {
    $uid = $userArray['uid'];

    // Get the user's friends
    $friendsJSON = $friend->getFriends($uid);
    $friendsArray = json_decode($friendsJSON, true);

    // Get the activity of each friend
    foreach ($friendsArray as $friendArray) {
        $array = $activityLog->getUsersActivity($friendArray['fid2']);

        // Only write if the friend has activity
        if (!empty($array)) {
            // Add each piece of activity to the news feed
            foreach ($array as $news) {
                $newsFeed->addNews($uid, $friendArray['fid2'], $news['activity'],
                    $news['activity_id'], $news['title'], $news['time']);
            }
        }
    }
}
Displaying the news feeds
In the client code, when fetching the user's news feed, I do something like:
$feedArray = $newsFeed->getUsersFeedWithLimitAndOffset($uid, 25, 0);

foreach ($feedArray as $feedItem) {
    // Use a switch to determine the activity type here, and display based on type,
    // e.g. "User Name asked A Question",
    // where "A Question" == $feedItem['title']
}
Improving the news feed
Now forgive my limited understanding of the best practices for developing a news feed, but I understand the approach I'm using to be a limited version of what's called fan-out on write, limited in the sense that I'm running a cron job as an intermediate step instead of writing to the users' news feeds directly. But this is very different from a pull model, in the sense that the user's news feed is not compiled on load, but rather on a regular basis.
This is a large question that probably deserves a large amount of back and forth, but I think it can serve as a touchstone for many important conversations that new developers like myself need to have. I'm just trying to figure out what I'm doing wrong, how I can improve, or how I should maybe even start from scratch and try a different approach.
One other thing that bugs me about this model is that it works based on recency rather than relevancy. If anyone can suggest how relevancy can be worked in, I would be all ears. I'm using Directed Edge's API for generating recommendations, but it seems that for something like a news feed, recommenders won't work (since nothing's been favorited previously!).
Really cool question. I'm actually in the middle of implementing something like this myself. So, I'm going to think out loud a bit.
Here are the flaws I see in your current implementation:
You are processing all of the friends for all users, but you will end up processing the same users many times, since the same groups of people tend to have similar friends.
If one of my friends posts something, it won't show up on my news feed for up to 5 minutes, whereas it should show up immediately, right?
We are reading the entire news feed for each user. Don't we just need to grab the new activities since the last time we crunched the logs?
This doesn't scale well.
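One way to address the third point, sketched as a pure function (the helper name and the last-run bookkeeping are my assumptions, not part of the question's code): record when the cron last ran and only fan out log rows newer than that, instead of re-reading the whole LIMIT 100 window every time.

```php
<?php
// Sketch: keep only the activities newer than the last cron run.
// $activities uses the same row shape as in the question:
// each row has 'activity', 'activity_id', 'title', 'time'.
function newActivitiesSince(array $activities, int $lastRunTs): array
{
    return array_values(array_filter(
        $activities,
        function (array $a) use ($lastRunTs) {
            return $a['time'] > $lastRunTs;
        }
    ));
}

$activities = [
    ['activity' => 'question', 'activity_id' => 1, 'title' => 'Old post', 'time' => 100],
    ['activity' => 'answer',   'activity_id' => 2, 'title' => 'New post', 'time' => 200],
];

// Only rows with time > 150 survive.
$fresh = newActivitiesSince($activities, 150);
```

In practice the same filter belongs in the SQL itself (a WHERE date > :last_run clause), so MySQL never returns the already-processed rows at all.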
The newsfeed table holds the exact same data as the activity log; I would stick with the single activity_log table.
If you shard your activity logs across databases, it will be easier to scale. You can shard your users as well if you wish, but even with 10 million user records in one table, MySQL should be fine doing reads. Then, whenever you look up a user, you know which shard to read that user's logs from. If you archive your older logs every so often and only maintain a fresh set, you won't have to shard as much, or maybe even at all. You can manage many millions of records in MySQL if it is tuned even moderately well.
I would leverage memcached for your users table and possibly even the logs themselves. Memcached allows cache entries up to 1 MB in size, and if you are smart in organizing your keys you could potentially retrieve all of the most recent logs from the cache.
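A sketch of what that key scheme could look like. The key format, TTL, and the read-through pattern below are my assumptions; the pure key-builder is the part that matters, since a versioned per-user key makes entries easy to invalidate.

```php
<?php
// Sketch: one cache entry per user holding that user's recent log rows.
// The v1 segment lets you invalidate everything by bumping the version.
function recentLogsKey(int $uid): string
{
    return "recent_logs:v1:" . $uid;
}

// Hypothetical read-through usage with the pecl Memcached extension:
// $mc = new Memcached();
// $mc->addServer('127.0.0.1', 11211);
// $rows = $mc->get(recentLogsKey($uid));
// if ($rows === false) {
//     $rows = $activityLog->getUsersActivity($uid); // fall back to MySQL
//     $mc->set(recentLogsKey($uid), $rows, 300);    // 5-minute TTL
// }

$key = recentLogsKey(42);
```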
This would be more work as far as architecture is concerned, but it will allow you to work in real-time and scale out in the future...especially when you want users to start commenting on each posting. ;)
Did you see this article?
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
Alternatively, you can use per-user flags and caching. Say you add a new field to the user record, last_activity, and update it whenever the user logs any activity. Also keep a flag recording up to what time you last generated the feeds, say feed_updated_on.
Now update $user->getAllUsers() to return only users whose last_activity is later than feed_updated_on. This excludes all the users that don't have any new activity in the log :). Apply a similar filter to each user's friends.
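In code, that filter could look like the sketch below (the last_activity field and feed_updated_on flag follow the suggestion above; the function name is mine):

```php
<?php
// Sketch: only users whose last_activity is newer than the time of the
// last feed generation need to be processed by the cron job at all.
function usersWithNewActivity(array $users, int $feedUpdatedOn): array
{
    return array_values(array_filter(
        $users,
        function (array $u) use ($feedUpdatedOn) {
            return $u['last_activity'] > $feedUpdatedOn;
        }
    ));
}

$users = [
    ['uid' => 1, 'last_activity' => 50],   // idle since last run: skipped
    ['uid' => 2, 'last_activity' => 120],  // fresh activity: processed
];

$active = usersWithNewActivity($users, 100);
```

As with the LIMIT clause, the real win comes from pushing this predicate into the getAllUsers() query itself so idle users never leave the database.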
You can also use caching, such as memcached or file-level caching.
Or use some NoSQL DB to store each user's feed as one document.
I'm trying to build a Facebook-style news feed on my own. Instead of creating another table to log users' activities, I calculated the 'edge' from the UNION of posts, comments etc.
With a bit of mathematics, I calculate the 'edge' using an exponential decay model, with time elapsed as the independent variable, taking into account the number of comments, likes, etc. each post has to formulate the lambda constant. The edge decreases fast at first but gradually flattens to almost 0 after a few days (it will never reach 0).
When showing the feed, each edge is multiplied by RAND(). Posts with a higher edge will appear more often.
This way, more popular posts have a higher probability of appearing in the news feed, and for a longer time.
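A sketch of that scoring in PHP. The answer doesn't give exact constants, so the lambda formula below is purely illustrative: more engagement lowers lambda, which slows the decay.

```php
<?php
// Sketch of the exponential-decay edge score described above.
// $hoursElapsed is the post's age; comments and likes slow the decay.
function edgeScore(float $baseScore, int $comments, int $likes, float $hoursElapsed): float
{
    // Illustrative lambda: more engagement => slower decay. Never zero,
    // so the score flattens toward 0 but never reaches it.
    $lambda = 0.05 / (1 + $comments + $likes);
    return $baseScore * exp(-$lambda * $hoursElapsed);
}

// At display time, multiply by a random factor so higher-edge posts
// appear more often, but not deterministically:
// $displayScore = edgeScore(1.0, 10, 25, 48.0) * (mt_rand() / mt_getrandmax());
```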
Instead of running a cron job, consider a post-commit script of some sort. I don't know specifically what the capabilities of PHP and MySQL are in this regard; if I recall correctly, MySQL's InnoDB allows more advanced features than the other storage engines, but I don't remember whether the latest version has things like triggers.
Anyway, a simple variety that doesn't rely on a lot of database magic:
when user X adds content:
1) do an asynchronous call from your PHP page after the database commit (async of course so that the user viewing the page doesn't have to wait for it!)
The call starts an instance of your logic script.
2) the logic script goes only through the list of friends [A,B,C] of the user who committed the new content (as opposed to list of everyone in the DB!) and appends the action of user X to feeds for each of these users.
You could just store these feeds as straight-up JSON files and append new data to the end of each. Better of course to keep the feeds in cache with a backup to filesystem or BerkeleyDB or Mongo or whatever you like.
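The fan-out in step 2 could be sketched like this, using in-memory arrays to stand in for whatever store holds the feeds (the data structures and function name are my assumptions, not the poster's actual code):

```php
<?php
// Sketch of fan-out on write: when $posterUid adds content, append the
// activity to each friend's feed only, not to every user in the DB.
function fanOutActivity(array $feeds, array $friendUids, int $posterUid, array $activity): array
{
    foreach ($friendUids as $fid) {
        // Each feed is a simple append-only list, newest entry last.
        $feeds[$fid][] = ['poster_uid' => $posterUid] + $activity;
    }
    return $feeds;
}

$feeds = [];
// User 1 posts a question; friends 2 and 3 each get it in their feed.
$feeds = fanOutActivity($feeds, [2, 3], 1, ['activity' => 'question', 'title' => 'Hello?']);
```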
This is just a basic idea for feeds based on recency, not relevance. You COULD store the data sequentially in this manner and then do additional parsing on a per-user basis to filter by relevance, but this is a hard problem in any application and probably not one that can be easily addressed by an anonymous web user without detailed knowledge of your requirements ;)
I have a slightly different mechanism to generate the user feed, with two-level caching. My assumptions about scale are based on theoretical experience, but the same method can be used at a different scale as required.
The above diagram tries to explain this entire feed generation architecture
Let us say you have 100M users. Based on the 80-20 rule, 20% of your active users generate 80% of your traffic. Considering that each active user generates 20 posts per day, you have 20M users generating 400M new posts daily. Consider that every active user has around 1k friends, of which 20% are active, i.e. 200 active friends with recent posts. Every user then has (200 active friends) * (20 posts each) = 4,000 posts eligible to appear in the feed.
Create a cache which stores 24-48 hours of recent posts, i.e. around 800M posts. Store these posts against their owner as userid: posts[], where userid is the user who created the post and posts[] contains his last 24-48 hours of posts.
Create a Feed Generator service which, for every active user (20M), fetches the posts of that user's 200 active friends and generates another array of posts eligible for the feed in the Feed cache as userid: posts[], where userid is the user who is opening his feed and posts[] is the superset of all the posts that are presented.
This Feed Generator service can run for every active user periodically, and for every inactive user on demand. Once the Feed cache is populated, the Feed Generator service can run at short intervals just to populate the delta posts based on updated rows in the Recent Posts cache.
The Feed service can connect to the Feed's cache and show the posts based on relevance, importance, recency or any other logic.
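The Feed Generator's merge step could be sketched as a pure function (cache plumbing omitted; the names, the PHP 7 operators, and the recency comparator are my assumptions, and the comparator is exactly where relevance logic would be swapped in):

```php
<?php
// Sketch: merge the recent posts of a user's active friends and keep
// the newest $limit entries for that user's slot in the Feed cache.
function generateFeed(array $recentPostsByFriend, array $activeFriendUids, int $limit): array
{
    $merged = [];
    foreach ($activeFriendUids as $fid) {
        foreach ($recentPostsByFriend[$fid] ?? [] as $post) {
            $merged[] = $post;
        }
    }
    // Recency ordering, newest first; replace this comparator with a
    // relevance score to change the ranking logic.
    usort($merged, function (array $a, array $b) {
        return $b['time'] <=> $a['time'];
    });
    return array_slice($merged, 0, $limit);
}

// Posts keyed by owner uid, as in the Recent Posts cache described above.
$posts = [
    7 => [['title' => 'a', 'time' => 10], ['title' => 'b', 'time' => 30]],
    9 => [['title' => 'c', 'time' => 20]],
];
$feed = generateFeed($posts, [7, 9], 2);
```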
Would you add statistical keywording? I made a (crude) implementation via exploding the body of my document, stripping HTML, removing common words, and counting the most common words. I made that a few years ago just for fun (as with any such project, the source is gone), but it worked for my temporary test-blog/forum setup. Maybe it will work for your news feed...