How do I find the top N batters per year?
I'm playing around with the Lahman Baseball Database in a MySQL instance. I want to find the players who topped home runs (HR) for each year. The relevant parts of the Batting table's schema are:
+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| playerID  | varchar(9)           | NO   | PRI |         |       |
| yearID    | smallint(4) unsigned | NO   | PRI | 0       |       |
| HR        | smallint(3) unsigned | YES  |     | NULL    |       |
+-----------+----------------------+------+-----+---------+-------+
For each year, every player has an entry (between hundreds and 12k per year, going back to 1871). Getting the top N hitters for a single year is easy:
SELECT playerID,yearID,HR
FROM Batting
WHERE yearID=2009
ORDER BY HR DESC LIMIT 3;
+-----------+--------+------+
| playerID  | yearID | HR   |
+-----------+--------+------+
| pujolal01 |   2009 |   47 |
| fieldpr01 |   2009 |   46 |
| howarry01 |   2009 |   45 |
+-----------+--------+------+
But I'm interested in finding the top 3 from every year. I've found solutions like this one, which describe how to select the top rows from each category, and I've tried to apply them to my problem, only to end up with a query that never returns:
SELECT
b.yearID, b.playerID, b.HR
FROM
Batting AS b
LEFT JOIN
Batting b2
ON
(b.yearID=b2.yearID AND b.HR <= b2.HR)
GROUP BY b.yearID HAVING COUNT(*) <= 3;
Where have I gone wrong?
Something like this should work:
SELECT b.playerID, b.yearID, b.HR
FROM Batting b
WHERE b.HR >= (
    -- HR total of the third-ranked batter in the same year
    SELECT b2.HR
    FROM Batting b2
    WHERE b2.yearID = b.yearID
    ORDER BY b2.HR DESC
    LIMIT 2, 1
)
ORDER BY b.yearID DESC, b.HR DESC;
Explanation: this selects every row with at least as many home runs as the third-ranked batter for that year. It won't break ties, so if several batters share that third-place total, they'll all show up and a year can return more than three rows.
The results are then ordered from the most recent year, sub-ordered by home runs within each year.
Note: the LIMIT offset is 0-based, so LIMIT 2, 1 means skip the first two rows and grab one row, i.e. the third row.
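To see the tie behaviour without touching the real data, here is a minimal sketch that runs the same query shape against a toy table. It uses Python's built-in sqlite3 module as a stand-in for MySQL, on the assumption that SQLite accepts the same correlated subquery and the LIMIT offset, count form. The player IDs and home run totals are invented, and playerID is added to the ORDER BY only to make the toy output deterministic.

```python
import sqlite3

# Toy stand-in for the Batting table; every row here is invented.
ROWS = [
    ("aaa01", 2008, 40), ("bbb01", 2008, 38), ("ccc01", 2008, 35),
    ("ddd01", 2008, 20),
    ("aaa01", 2009, 47), ("bbb01", 2009, 46), ("ccc01", 2009, 45),
    ("ddd01", 2009, 45),  # tied for third place in 2009
    ("eee01", 2009, 30),
]

TOP3_SQL = """
SELECT b.playerID, b.yearID, b.HR
FROM Batting b
WHERE b.HR >= (
    SELECT b2.HR
    FROM Batting b2
    WHERE b2.yearID = b.yearID
    ORDER BY b2.HR DESC
    LIMIT 2, 1          -- skip two rows, return the third
)
ORDER BY b.yearID DESC, b.HR DESC, b.playerID  -- playerID: deterministic toy output
"""


def top3_per_year(rows):
    """Load rows into an in-memory SQLite table and run the top-3 query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER)"
    )
    conn.executemany("INSERT INTO Batting VALUES (?, ?, ?)", rows)
    result = conn.execute(TOP3_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top3_per_year(ROWS):
        print(row)
```

With this toy data, 2009 comes back with four rows because two batters share the third-place total, while 2008 comes back with exactly three.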
Wow, random. I happened to be doing the same query (for salaries though) on the Lahman Baseball DB using an article on emulating Oracle Analytic functions. This version of the query is snappy, but not so intuitive.
select *
from (
    select
        b.yearID as year,
        b.teamID as team,
        m.nameFirst as first,
        m.nameLast as last,
        -- `rank` is a reserved word as of MySQL 8.0.2, hence the backticks
        find_in_set(b.HR, x.yearRank) as `rank`,
        b.HR as HR
    from
        Batting b
        inner join Master m on m.playerID = b.playerID
        inner join (
            -- distinct HR totals per year, highest first, as one comma-separated string
            select yearID, group_concat(distinct HR order by HR desc) as yearRank
            from Batting
            group by yearID
        ) x on x.yearID = b.yearID
) x
where
    `rank` <= 10 and `rank` > 0
order by
    year desc, `rank`
Or the top 5 HR totals per team for the year 2010...
select *
from (
    select
        b.yearID as year,
        b.teamID as team,
        m.nameFirst as first,
        m.nameLast as last,
        b.HR as HR,
        -- `rank` is a reserved word as of MySQL 8.0.2, hence the backticks
        find_in_set(b.HR, x.teamRank) as `rank`
    from
        Batting b
        inner join Master m on m.playerID = b.playerID
        inner join (
            select teamID, group_concat(distinct HR order by HR desc) as teamRank
            from Batting
            where yearID = 2010
            group by teamID
        ) x on x.teamID = b.teamID
    where
        b.yearID = 2010
) x
where
    `rank` <= 5 and `rank` > 0
order by
    team, `rank`
limit 12
Shows these results...
year  team  first   last       HR  rank
2010  ARI   Mark    Reynolds   32  1
2010  ARI   Chris   Young      27  2
2010  ARI   Kelly   Johnson    26  3
2010  ARI   Adam    LaRoche    25  4
2010  ARI   Justin  Upton      17  5
2010  ATL   Brian   McCann     21  1
2010  ATL   Jason   Heyward    18  2
2010  ATL   Troy    Glaus      16  3
2010  ATL   Martin  Prado      15  4
2010  ATL   Eric    Hinske     11  5
2010  BAL   Luke    Scott      27  1
2010  BAL   Ty      Wigginton  22  2
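The ranking trick in both queries rests on two MySQL functions. group_concat(distinct HR order by HR desc) builds a comma-separated list of a group's distinct HR totals, highest first, and find_in_set returns the 1-based position of a value in that list (0 if it is absent). The result behaves like a dense rank, so tied players share a rank. The Python sketch below mimics that logic on made-up numbers; the two helper functions are hypothetical imitations written for illustration, not MySQL APIs.

```python
def group_concat_distinct_desc(values):
    """Imitate GROUP_CONCAT(DISTINCT v ORDER BY v DESC); NULLs (None) are skipped."""
    distinct = sorted({v for v in values if v is not None}, reverse=True)
    return ",".join(str(v) for v in distinct)


def find_in_set(value, str_list):
    """Imitate FIND_IN_SET: 1-based position in a comma-separated list, 0 if absent."""
    if value is None:
        return None  # MySQL returns NULL when an argument is NULL
    items = str_list.split(",") if str_list else []
    try:
        return items.index(str(value)) + 1
    except ValueError:
        return 0


# Made-up home run totals for one group (a single year or a single team).
hr_by_player = {"p1": 32, "p2": 27, "p3": 27, "p4": 25, "p5": None}

ranks_list = group_concat_distinct_desc(hr_by_player.values())
ranks = {player: find_in_set(hr, ranks_list) for player, hr in hr_by_player.items()}

print(ranks_list)
print(ranks)
```

Here p2 and p3 share a rank, and the player with no HR value gets no rank at all, which is why the queries filter on rank > 0. On real data, keep in mind that MySQL truncates group_concat output at group_concat_max_len (1024 bytes by default); that is roomy for distinct HR totals but may matter for wider values such as salaries.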