
How to design the user table for an online dating site?

I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.

The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.

The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile, and I have to be able to search by at least 30-40 of them.

As you can imagine I'm a little wary to make a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.

So, what do you think is the best way to go about it?


This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my own experience.

I once tried this model with a table of 120+ attributes (growing by 5-10 every year) and about 100k+ new rows every 6 months. The indexes grew so big that it took forever to add or update a single user_id.

The problem I find with this type of design (not that it's completely unfit for every situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually pick a generous length, thus inflating the indexes. In my case, attribs could be anywhere from 3 to 130 chars. The value column most certainly suffers from the same problem.

And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.

Also, as the OP suggests, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a group_concat(), given its length limitation.

My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now much smaller, and searches are easier.

EDIT: Also, there are no normalization problems. Either have lookup tables for attribute values or make them ENUM().

EDIT 2: Of course, one could say I should have a lookup table for the possible attribute values (reducing index sizes), but then I would need a join on that table.
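For illustration, here is a rough sketch of what such a wide table could look like in MySQL; the attribute names and ENUM values are made up, not taken from my actual schema:

-- One column per attribute; ENUM() keeps the stored values compact
-- and indexable without an extra join to a lookup table.
CREATE TABLE user_attributes (
    user_id   INT UNSIGNED NOT NULL PRIMARY KEY,
    smoking   ENUM('never', 'occasionally', 'regularly'),
    body_type ENUM('slim', 'average', 'athletic', 'heavy'),
    eye_color ENUM('brown', 'blue', 'green', 'grey', 'other'),
    -- ... one column for each of the remaining ~120 attributes ...
    KEY idx_smoking (smoking),
    KEY idx_body_type (body_type)
) ENGINE=InnoDB;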


What you could do is split the user data across two tables.

1) Table: user

This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.

Profile-related information can go in its own table. This will be an infinitely expandable table with a key => val nature.

2) Table: user_profile

Fields: user_id, option, value

user_id: 1
option: profile_image
value: /uploads/12/myimage.png

and

user_id: 1
option: questions_answered
value: 24
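A minimal sketch of how these two tables might be declared; the column types are assumptions, only the column names come from the description above:

CREATE TABLE user (
    id                INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username          VARCHAR(50)  NOT NULL,
    email             VARCHAR(255) NOT NULL,
    firstname         VARCHAR(50),
    lastname          VARCHAR(50),
    role_id           INT UNSIGNED,
    registration_date DATETIME
) ENGINE=InnoDB;

-- One row per (user, option) pair; `option` is backticked because it is
-- a reserved word in MySQL.
CREATE TABLE user_profile (
    user_id  INT UNSIGNED NOT NULL,
    `option` VARCHAR(64)  NOT NULL,
    value    VARCHAR(255),
    PRIMARY KEY (user_id, `option`),
    FOREIGN KEY (user_id) REFERENCES user(id)
) ENGINE=InnoDB;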

Hope this helps, Paul.


The entity-attribute-value model might be a good fit for you:

http://en.wikipedia.org/wiki/Entity-attribute-value_model

Rather than having 100 (and growing) columns, add one table with three columns:

user_id, property, value.
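For what it's worth, searching with this layout means one self-join per searched property, roughly like the sketch below (the table and property names are only illustrative):

-- Find users whose 'smoking' is 'never' and whose 'city' is 'Berlin'.
-- Every additional searched property costs one more self-join.
SELECT a.user_id
FROM user_property a
JOIN user_property b ON b.user_id = a.user_id
WHERE a.property = 'smoking' AND a.value = 'never'
  AND b.property = 'city'    AND b.value = 'Berlin';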


In general, you shouldn't sacrifice database integrity for performance.

The first thing I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
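One quick way to generate that dummy data entirely in SQL is a digits cross-join, roughly like this (users_test and its columns are invented for the test):

-- Helper table of digits 0-9; cross-joining it six times yields 10^6 rows.
CREATE TABLE digits (d TINYINT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

CREATE TABLE users_test (
    id      INT UNSIGNED NOT NULL PRIMARY KEY,
    age     TINYINT UNSIGNED,
    city_id SMALLINT UNSIGNED
) ENGINE=InnoDB;

-- 1,000,000 dummy users with arbitrary but varied attribute values.
INSERT INTO users_test (id, age, city_id)
SELECT a.d + b.d*10 + c.d*100 + e.d*1000 + f.d*10000 + g.d*100000,
       18 + (a.d + b.d*10) % 60,
       (a.d + c.d*10 + f.d*100) % 500
FROM digits a, digits b, digits c, digits e, digits f, digits g;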

If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from the MySQL query cache to HTML caching), getting better hardware, etc. This should work out in most cases.


In general you should always get the schema formally correct before you worry about performance!

That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.

You definitely should go down the two-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.

Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than on an expression of intention)...

At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes in close proximity in an N-dimensional space; unfortunately, most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for it). So most people would probably start with something like:

SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value = current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;

However, this forces the system to compare every available candidate to find the best match. Apply a few heuristics and you can get a much more effective query:

SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
      BETWEEN current_user_attrs.attr_value - $tolerance
      AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;

(The value of $tolerance will affect the number of rows returned and query performance, assuming you've got an index on attr_type, attr_value.)
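On the hypothetical attributes table used in these queries, that index might be declared like this:

-- The composite index lets the attr_type equality and the attr_value range
-- be resolved from the index; the second index serves the current-user lookup.
ALTER TABLE attributes
    ADD INDEX idx_type_value (attr_type, attr_value),
    ADD INDEX idx_user_type (user_id, attr_type);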

This can be further refined into a points scoring system:

SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value))
       ) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
      BETWEEN current_user_attrs.attr_value - $tolerance
      AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;

This approach lets you do lots of different things - including searching by a subset of attributes, e.g.

SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value))
       ) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs,
     attribute_subsets s
WHERE current_user_attrs.user_id = $current_user
  AND candidate.id <> $current_user
  AND candidate.id = candidate_attrs.user_id
  AND candidate_attrs.attr_type = current_user_attrs.attr_type
  AND s.subset_name = $required_subset
  AND s.attr_type = current_user_attrs.attr_type
  AND candidate_attrs.attr_value
      BETWEEN current_user_attrs.attr_value - $tolerance
      AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;

Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.

If you want to add more attributes, then you don't need to make any changes to your PHP code or the database schema - it can be completely data-driven.

Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space - and then work out which of these a particular user is closest to. This collapses all the attributes down to a single composite identifier; then you just need to apply the same matching approach within the subset of candidates who have also been matched to that stereotype.
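A rough sketch of that idea, assuming a hypothetical stereotype_attributes table that stores one reference attr_value per (stereotype_id, attr_type): the user's stereotype is the one with the smallest total squared distance.

SELECT s.stereotype_id,
       SUM((ua.attr_value - s.attr_value)
         * (ua.attr_value - s.attr_value)) AS distance
FROM attributes ua,
     stereotype_attributes s
WHERE ua.user_id = $current_user
  AND s.attr_type = ua.attr_type
GROUP BY s.stereotype_id
ORDER BY distance ASC
LIMIT 1;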


Can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in one table.

Also, you can easily enforce referential integrity with foreign keys by using transactions and the InnoDB engine.
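For example (table and column names here are just for illustration):

-- InnoDB enforces the constraint; with MyISAM the FOREIGN KEY clause is
-- parsed but silently ignored.
CREATE TABLE user_photo (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    path    VARCHAR(255) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES user(id) ON DELETE CASCADE
) ENGINE=InnoDB;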
