Selecting the top 3 values (based on their count) from each category
I'm trying to get the top 3 values of col2 for each letter in col1, based on the count of col2.
My table looks like this:
col1 | col2 |
---|---|
A | 1 |
B | 2 |
A | 2 |
B | 2 |
B | 1 |
B | 1 |
B | 1 |
A | 3 |
A | 2 |
B | 4 |
A | 2 |
B | 2 |
A | 3 |
A | 4 |
So, for example, here A=1 appears one time, A=2 three times, A=3 two times, and A=4 one time (and similarly for B).
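Those per-pair counts are just a GROUP BY aggregation. As a quick sanity check, here is a minimal PySpark sketch (assuming the sample DataFrame df that is built in the answer below):
df.groupBy('col1', 'col2').count().orderBy('col1', 'col2').show()
# one row per (col1, col2) pair with its count, e.g. (A, 2) -> 3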
I would like my output to look like this:
col1 | col2 |
---|---|
A | 2 |
A | 3 |
A | 1 |
B | 1 |
B | 2 |
B | 4 |
That is, it shows me the top 3 values of col2 (based on their count) for A, and the same for B. (Note that A=1 and A=4 are tied at one occurrence each, so either could be the third value for A.)
I tried doing:
SELECT col1, col2
FROM data
WHERE col2 = (SELECT COUNT(col2)
FROM data
WHERE col1 = data.col1
ORDER BY COUNT(col2) desc
LIMIT(3)
)
GROUP BY col1, col2
but I get an empty table
I also tried:
SELECT col1, col2, rn
FROM (SELECT col1, col2,
ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
FROM data) tmp
WHERE rn <= 3
ORDER BY col1, rn
but it doesn't work (I think the problem is the COUNT, since there is no GROUP BY for it to aggregate over).
Any ideas? I'm pretty new to SQL and can't figure it out. I could also do it with DataFrames (PySpark), but I'm not sure what the equivalent of OVER (PARTITION BY ...) is in the DataFrame API.
UPDATE: I figured it out. This is the correct query:
SELECT col1, col2, x
FROM (SELECT col1, col2, COUNT(col2) AS x,
             ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
      FROM data
      GROUP BY col1, col2) tmp
WHERE rn <= 3
ORDER BY col1
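Since the question also asks about PySpark: the same query can be run unchanged from Python by registering the DataFrame as a temporary view. A minimal sketch, assuming an existing SparkSession named spark and the sample DataFrame df from the answer below:
df.createOrReplaceTempView('data')
spark.sql("""
    SELECT col1, col2, x
    FROM (SELECT col1, col2, COUNT(col2) AS x,
                 ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
          FROM data
          GROUP BY col1, col2) tmp
    WHERE rn <= 3
    ORDER BY col1
""").show()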
try this
SELECT col1, col2
FROM (
    SELECT col1, col2,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY cnt DESC) AS rn
    FROM (
        SELECT col1, col2, COUNT(1) AS cnt
        FROM `table`
        GROUP BY col1, col2
    ) t0
) t1
WHERE rn <= 3
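The inner subquery t0 collapses the raw rows into one row per (col1, col2) pair with its count; the window function then ranks those rows within each col1. For the sample data, t0 looks like this:
col1 | col2 | cnt |
---|---|---|
A | 1 | 1 |
A | 2 | 3 |
A | 3 | 2 |
A | 4 | 1 |
B | 1 | 3 |
B | 2 | 3 |
B | 4 | 1 |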
In PySpark:
df = spark.createDataFrame(
[
('A','1')
,('B','2')
,('A','2')
,('B','2')
,('B','1')
,('B','1')
,('B','1')
,('A','3')
,('A','2')
,('B','4')
,('A','2')
,('B','2')
,('A','3')
,('A','4')
],
['col1','col2']
)
from pyspark.sql import Window
import pyspark.sql.functions as F

# count each (col1, col2) pair, then rank the pairs within each col1 by that count
counts = df.groupBy('col1', 'col2').agg(F.count('*').alias('cnt'))

# order by count descending; col2 serves as a deterministic tiebreaker
w = Window.partitionBy('col1').orderBy(F.col('cnt').desc(), F.col('col2'))

counts\
    .withColumn('rn', F.row_number().over(w))\
    .filter(F.col('rn') <= 3)\
    .select('col1', 'col2')\
    .orderBy('col1', 'col2')\
    .show()
# +----+----+
# |col1|col2|
# +----+----+
# | A| 1|
# | A| 2|
# | A| 3|
# | B| 1|
# | B| 2|
# | B| 4|
# +----+----+
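One caveat: row_number always returns exactly 3 rows per group, with ties decided by the tiebreaker (in the sample data A=1 and A=4 each occur once, so only the smaller col2 value, A=1, makes the cut). If all tied values should be kept instead, rank() can be swapped in. A minimal sketch of that variant, reusing df, counts, and the imports from above:
w_ties = Window.partitionBy('col1').orderBy(F.col('cnt').desc())
counts\
    .withColumn('rk', F.rank().over(w_ties))\
    .filter(F.col('rk') <= 3)\
    .select('col1', 'col2')\
    .orderBy('col1', 'col2')\
    .show()
# A now keeps all four of its values, because A=1 and A=4 tie at rank 3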