开发者

Advanced MySQL: Find correlations between poll responses

I've got four MySQL tables:

users (id, name)

polls (id, text)

options (id, poll_id, text)

responses (id, poll_id, option_id, user_id)

Given a particular poll and a particular option, I'd like to generate a table that shows which options from other polls are most strongly correlated.

Suppose this is our data set:

TABLE users:
+------+-------+
| id   | name  |
+------+-------+
|    1 | Abe   |
|    2 | Bob   |
|    3 | Che   |
|    4 | Den   |
+------+-------+

TABLE polls:
+------+-----------------------+
| id   | text                  |
+------+-----------------------+
|    1 | Do you like apples?   |
|    2 | What is your gender?  |
|    3 | What is your height?  |
|    4 | Do you like polls?    |
+------+-----------------------+

TABLE options:

+------+----------+---------+
| id   | poll_id  | text    |
+------+----------+---------+
|    1 | 1        | Yes     |
|    2 | 1        | No      |
|    3 | 2        | Male    |
|    4 | 2        | Female  |
|    5 | 3        | Short   |
|    6 | 3        | Tall    |
|    7 | 4        | Yes     |
|    8 | 4        | No      |
+------+----------+---------+

TABLE responses:

+------+----------+------------+----------+
| id   | poll_id  | option_id  | user_id  |
+------+----------+------------+----------+
|    1 | 1        | 1          | 1        |
|    2 | 1        | 2          | 2        |
|    3 | 1        | 2          | 3        |
|    4 | 1        | 2          | 4        |
|    5 | 2        | 3          | 1        |
|    6 | 2        | 3          | 2        |
|    7 | 2        | 3          | 3        |
|    8 | 2        | 4          | 4        |
|    9 | 3        | 5          | 1        |
|   10 | 3        | 6          | 2        |
|   10 | 3        | 5          | 3        |
|   10 | 3        | 6          | 4        |
|   10 | 4        | 7          | 1        |
|   10 | 4        | 7          | 2        |
|   10 | 4        | 7          | 3        |
|   10 | 4        | 7          | 4        |
+------+----------+------------+----------+

Given the poll ID 1 and the option ID 2, the generated table should be something like this:

+--------开发者_JAVA百科--+------------+-----------------------+
| poll_id  | option_id  | percent_correlated    |
+----------+------------+-----------------------+
| 4        | 7          | 100                   |
| 2        | 3          | 66.66                 |
| 3        | 6          | 66.66                 |
| 2        | 4          | 33.33                 |
| 3        | 5          | 33.33                 |
| 4        | 8          | 0                     |
+----------+------------+-----------------------+

So basically, we're identifying all of the users who responded to poll ID 1 and selected option ID 2, and we're looking through all the other polls to see what percentage of them also selected each other option.


Don't have an instance handy to test, can you see if this gets proper results:

select
        poll_id,
        option_id,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
from
(
    select 
        poll_id,
        option_id,
        SUM(score) AS sum1,
        SUM(score_rev) AS sum2,
        SUM(score * score) AS sum1sq,
        SUM(score_rev * score_rev) AS sum2sq,
        SUM(score * score_rev) AS psum,
        COUNT(*) AS n
    from
    (
            select 
                responses.poll_id, 
                responses.option_id,
                CASE 
                    WHEN user_resp.user_id IS NULL THEN SELECT 0
                    ELSE SELECT 1
                END CASE as score,
                CASE 
                    WHEN user_resp.user_id IS NULL THEN SELECT 1
                    ELSE SELECT 0
                END CASE as score_rev,
            from responses left outer join 
                    (
                        select 
                            user_id
                        from 
                            responses 
                        where
                            poll_id = 1 and 
                            option_id = 2
                    )user_resp  
                        ON (user_resp.user_id = responses.user_id)
    ) temp1 
    group by
        poll_id,
        option_id
)components 


After a few hours of trial and error, I managed to put together a query that works correctly:

SELECT poll_id AS p_id, 
       option_id AS o_id, 
       COUNT(*) AS optCount, 

       (SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN 
          (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) /
       (SELECT COUNT(*) FROM response WHERE poll_id = p_id  AND user_id IN 
          (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) 
       AS percentage 

FROM response 
INNER JOIN 
   (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids
ON response.user_id = user_ids.user_id
WHERE poll_id != '1' 

GROUP BY option_id DESC 
ORDER BY percentage DESC, optCount DESC

Based on a tests with a small data set, this query looks to be reasonably fast, but I'd like to modify it so the "IN" subquery is not repeated three times. Any suggestions?


This seems to give the right results for me:

select poll_stats.poll_id,
       option_stats.option_id,
       (100 * option_responses / poll_responses) as percent_correlated
from (select response.poll_id,
             count(*) as poll_responses
      from response selecting_response
           join response on response.user_id = selecting_response.user_id
      where selecting_response.poll_id = 1 and selecting_response.option_id = 2
      group by response.poll_id) poll_stats
      join (select options.poll_id,
                   options.id as option_id,
                   count(response.id) as option_responses
            from options
                 left join response on response.poll_id = options.poll_id
                           and response.option_id = options.id
                           and exists (
                            select 1 from response selecting_response
                            where selecting_response.user_id = response.user_id
                                  and selecting_response.poll_id = 1
                                  and selecting_response.option_id = 2)
            group by options.poll_id, options.id
           ) as option_stats
       on option_stats.poll_id = poll_stats.poll_id
where poll_stats.poll_id <> 1
order by 3 desc, option_responses desc
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜