One of the primary source data sets I use to create baseball visualizations is the amazingly detailed information captured by the Retrosheet project, a dedicated group of volunteers providing play-by-play and game-level information for each MLB season. Their coverage recently passed the 100-year milestone, with data from the 1921 and 1922 seasons now available. I have some catching up to do on the older seasons, but I just downloaded the 2022 season to add to my databases.
The data comes in two distinct sets: game logs and play-by-play files. The game logs are much the easier of the two to work with because of their smaller size. Each game played in a season is captured at a summary level (about 2,400 records per season), with information on the score, players, umpires, attendance, and much more. This information feeds my game summary visualizations:
As you can see, these are bite-sized summaries showing some of the most important data for each game. They can be filtered to find specific teams, pitchers, scores, and much more. These visualizations currently cover the 1955-2019 seasons; one of my immediate goals is to add the 2020, 2021, and 2022 seasons before working in reverse through the pre-1955 campaigns.
Fortunately, I have lots of SQL code built up over the years to make the data update process fairly simple; the 2022 game logs have already been added, and now I’ll get to work on the play-by-play data. Stay tuned for updates, and thanks for reading!