I’ve been creating MLB pennant race charts for years now, covering every season from 1901 through 2019, with 2020, 2021, and 2022 to come soon. These charts have been available on the site in single charts for each season at a league (American or National) and division level (since 1969). This has always worked reasonably well, but I have always yearned for something a bit more interactive, where users could go to one place and enter the season and league they want to view. Finally, courtesy of the Exploratory Server, such a solution is now available.
Here’s a glimpse of what I’m talking about – first, the old way of doing things, which I’ll continue to maintain. The process starts with a visit to the pennant races page on this site:
Selecting a specific menu option will display a single pennant race, such as the 1901 American League race shown here:
These charts work well, and provide some interactivity, but it is strictly one chart per link, so not very efficient.
Now, here’s the alternative option using the Exploratory server. Here I can create very similar charts but with a parameter-driven menu enabling users to select a season and a league:
Here’s a case where we select the 1901 season and the American League filters, with the following result:
The real power in this approach comes with the seasons from 1969-2019, where each league had two and then three divisions. Selecting the 2019 season and the American League filter options will now deliver all three divisional charts on a single page!
You can try this out yourself; just make sure to set the Parameters interactive mode to “On” which will activate the filters; you can control the display as well to show one or more columns. I find that a single column works best for the pennant race charts.
A quick note – just added game summaries from the 2020 & 2021 seasons; waiting for some data updates before processing the 2022 season. Here’s a quick view from 2021 of Jacob deGrom’s home field starts as an example for how you can use filters to find the information you are seeking:
One of the primary source data sets I use to create baseball visualizations is the amazingly detailed information captured by the Retrosheet project, a dedicated group of volunteers providing play-by-play and game level information for each MLB season. They have recently passed the 100-year milestone, with data from the 1921 & 1922 seasons now available. I have some catching up to do on the older seasons, but just downloaded the 2022 season for adding to my databases.
The data comes in two distinct sets – game logs being the much easier of the two to work with, due to the smaller data size. Each game played in a season is captured at a summary level (~ 2,400 records), with information pertaining to the score, players, umpires, attendance, and much more. This information is used to feed my game summary visualizations:
As you can see, these are bite-sized summaries of every game, showing some of the important summary data for a game. They can be filtered to find specific teams, pitchers, scores, and much more. These visualizations are currently available covering the 1955-2019 seasons; one of my immediate goals is to add the 2020, 2021, and 2022 seasons, before starting to work in reverse with pre-1955 campaigns.
Fortunately, I have lots of SQL code built up over the years to make the data update process fairly simple; the 2022 game logs have already been added, and now I’ll get to work on the play-by-play data. Stay tuned for updates, and thanks for reading!
This is the first in a series of posts where I take a look at notoriously one-sided baseball trades, using the baseball trade networks published on this site earlier in 2022. I won’t necessarily rank these deals in any sort of order; rather I will pick out a few from the network trade graphs and provide some analysis and context for some of the most notorious transactions.
If you haven’t seen the trade networks previously, here’s a link.
The networks were built using data from Retrosheet and Neil Paine, loaded into Gephi, a network analysis and visualization tool, and ultimately pushed to the web where I could finish styling the graphs. Graph nodes (the circles in the networks) are sized based on the total future WAR (Wins Above Replacement) accrued by the teams involved in the trade. All values must occur at the major league level (MLB), so players involved in the deal who don’t reach the MLB level with their new team will have a zero value. Only the cumulative WAR value while playing for the new team is included; we are not calculating WAR once a player leaves one of the teams involved in the transaction.
Finding a bad trade by scanning the networks is more an art than a science; the key is to look for large nodes (indicating a lot of future WAR value), and then dissect the trade to see how much value each team received. Alternatively, if we already know the player(s) we are looking for, we can perform a simple search to find the trade. Here’s a classic example that Red Sox fans would love to forget – trading future Hall of Famer Jeff Bagwell for journeyman reliever Larry Andersen. Let’s go to the Red Sox trade network and search for Jeff Bagwell.
Typing in Jeff Bagwell locates him quickly within the trade network. Note that even if a player is involved in multiple trades to or from the same team (rare but possible) the search will locate each transaction. Here’s the Bagwell transaction, showing his player node and future WAR value connected to the transaction node; every player involved in that transaction will be connected to the trade node, as long as there is some future WAR value. If a player in the trade did not play in the majors for the receiving team, they will not be reflected in the graph. Here’s a view of Jeff Bagwell relative to the trade:
We can also click on the transaction node to see the value provided to each team by all of the players involved in the trade, again assuming they spent time with the team and were not limited to the minor leagues. Clicking on that node will display the respective WAR values in the sidebar on the left of the screen:
Here’s where we get to the details of the trade, and specifically the direct benefits accrued to each team. The Red Sox received 1.1 future WAR from Larry Andersen; to put this in perspective, we might expect this sort of value for an average player for a single season. The Astros, on the other hand, received an incredible 93.8 WAR from Jeff Bagwell, or close to 6 WAR per season for 16 years! That is a Hall of Fame level performance, and it eventually led to his selection to Cooperstown in 2017. Here’s a profile that mentions the one-sided trade.
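For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The WAR values and the 16-season count are taken straight from the paragraph above; the variable names are just illustrative.

```python
# Future WAR received by each team, as quoted in the text above.
trade = {"Red Sox": 1.1, "Astros": 93.8}
seasons_with_astros = 16  # season count quoted in the text

# How lopsided was the deal, and what does Bagwell's total average out to?
differential = round(trade["Astros"] - trade["Red Sox"], 1)
per_season = round(trade["Astros"] / seasons_with_astros, 2)

print(f"WAR differential: {differential}")
print(f"Average WAR per season: {per_season}")
```

The same two calculations can be applied to any transaction node once you have read the WAR values from the sidebar.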
While we have the Red Sox network open, let’s see if there are any other disastrous transactions (other than the cash sale of Babe Ruth to the Yankees, technically not a trade). After scanning the network, we find this one from 1928:
This one is clearly not a Bagwell-level disaster, but was still quite negative for the Red Sox, with a WAR differential of 30 points. The primary villain here is Buddy Myer, a solid infielder who hit .300 or better seven times for the Senators. Not a major star, but the owner of a very nice career, including leading the American League in batting average in 1935.
Let’s try to find one more before closing this piece, this time favoring the Red Sox. We zero in on this deal:
The Red Sox netted nearly 45 future WAR value while surrendering just 0.1; most of the benefit was generated by slugging future Hall of Famer Jimmie Foxx, but they also received a nice three season contribution from pitcher Johnny Marcum. Note how we also removed nodes not involved in the trade by clicking on the edges icon on the bottom left of the display area; this makes it easier to focus on the details.
Feel free to try your hand at finding more of these one-sided deals in the Red Sox or any other trade networks. I’ll be back with some other teams before long. Thanks for reading!
The final 10 MLB WAR Trade Networks have now been published, bringing the total number of graphs to 31 – 30 teams and one overall network with all teams and transactions. For more information on the trade networks, click here. Here are the remaining networks:
Happy day! Just finished uploading the 2021 baseball dataset from the Lahman baseball archive and Baseball-Databank, just in time for the 2022 season. Next step is inserting and updating the existing tables (with data back to 1901!) with the 2021 season stats. I can then move on to the fun side of the equation – updating existing visualizations and creating some new analyses and visuals. Stay tuned!
Welcome to Visual-baseball.com! I’ve just completed an update where data for the 2019 season is now part of the Batting Explorer 2010-2019 interactive visualization. This is a tool where we have every batter in a given season depicted via a baseball card type of framework, showing their key stats for the season, as well as the positions played in the field (by number of games), using a visual of a baseball diamond. It’s a highly interactive way to filter through multiple seasons worth of data by team, player, position, and more.
So while you’re waiting for the on-field action to start, have a look at the Batting Explorer to answer some of the questions on your baseball mind. For example, here’s a quick look at all of the left fielders who played at least 110 games at the position in 2019 (by using the filters pill at the top left):
We can take the same results and sort them by the number of home runs, from high to low, to see who the power hitters in the group were for 2019:
Now we see Kyle Schwarber and Juan Soto at the top of the display. Let’s look at Schwarber’s details by simply hovering over his card:
We now have a pop-up within Schwarber’s card telling a mini-story about his batting stats for the 2019 season. Here’s a closer look:
From this, we learn that Schwarber hit 38 home runs, batted .250, and had an OPS (On-Base Plus Slugging) of .871, as well as multiple other details. Additionally, you likely noticed the “View the full stats at Baseball-Reference.com” pop-up tag. To get there, simply click on Schwarber’s card, and you’ll be transported (in a new tab) to his page at Baseball-Reference:
Pretty cool, right? Give it a try, or pick your own filters, look at specific teams and seasons, and so on. Here’s the Batting Explorer page, with every decade back to 1900-1910 available for your curiosity.
More updates to come soon on the 2019 data, and thanks for reading!
In our previous post, we looked at how to acquire and load our baseball player data into Gephi. In this second installment, the focus will be on creating a player network graph in Gephi, and customizing many settings to deliver a network graph we can export to the web. Player networks detail the connections between players who are linked to one another in some fashion. In this instance, the link is based on players having played for the same team in one or more common seasons. So let’s begin with the process of creating the graph using our raw data from the first installment.
Importing .csv data into Gephi is quite simple – we create individual node and edge files (as we showed in the previous post), and use the Gephi import functions to pull the data in. I always start with the node file, since it will typically have additional information not included in the edges file. After importing the node data, I then import the edge data, which gives us the information to form our initial graph. If we were to start with the edge file, Gephi will create our node data automatically, and we will not have the detail needed for our graph. This approach may work for simple graphs, but not for our current case.
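To make the two-file structure concrete, here is a small Python sketch that writes a node file and an edge file and then checks that every edge endpoint has a matching node. The player IDs, labels, and file names are made up for illustration; the column names (Id, Label, Source, Target, Type, Weight) follow the headers Gephi looks for when importing spreadsheets.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; real files would come from the database queries.
nodes = [
    {"Id": "player_a", "Label": "Player A 1901-1903", "Size": 3},
    {"Id": "player_b", "Label": "Player B 1902-1902", "Size": 1},
    {"Id": "player_c", "Label": "Player C 1902-1905", "Size": 4},
]
edges = [
    {"Source": "player_a", "Target": "player_b", "Type": "Undirected", "Weight": 1},
    {"Source": "player_a", "Target": "player_c", "Type": "Undirected", "Weight": 2},
]

def write_csv(path, rows):
    """Write a list of dicts to a .csv file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

out_dir = Path(tempfile.mkdtemp())
nodes_path = out_dir / "nodes.csv"
edges_path = out_dir / "edges.csv"
write_csv(nodes_path, nodes)
write_csv(edges_path, edges)

# Any endpoint missing from the node file would be auto-created by Gephi
# without a Label or Size, so it is worth checking before importing.
node_ids = {row["Id"] for row in nodes}
missing = {edge[key] for edge in edges for key in ("Source", "Target")} - node_ids
print("Edge endpoints without a node row:", sorted(missing))
```

An empty result from the final check means the edge file only references players that the node file already describes.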
Once both data files have been imported, we can begin thinking about what we want from our graph. Here are several questions we might pose:
How will we use color?
What sort of layout will be best?
Which measures should we calculate?
How should we depict node sizes?
In many cases, the answers to these questions come about through trial and error. We may have some ideas going into the process, but invariably, there will be modifications along the way. So be patient, and be willing to experiment as you create network graphs. The graph you will see in this post went through many of these modifications, which I won’t take the time to detail. Instead, this post will detail my final choices, along with some explanations for why these choices were made. So let’s take a walk through the various facets of the visualization.
While a network will retain the same underlying structure from a statistical point of view (degrees, centrality, eccentricity, etc.) regardless of our layout choices, it is still important to select a layout that will visually represent the underlying patterns in the network. Otherwise, we could just as well deliver a spreadsheet with all of the network statistics. So layout selection is critical, and often involves an iterative process.
For the baseball network graphs I built in 2014, I eventually settled on the ARF layout algorithm, which ran quickly and created an attractive circular network graph display using the player connection data. Alas, there is no ARF algorithm available for Gephi 0.9.2, so I required a different approach for the updates. Ultimately, this led to a 2-step approach using a pair of layout algorithms – OpenOrd followed by Force Atlas 2. OpenOrd is especially effective at creating a quick layout from large datasets, although with far less precision than some other force-directed approaches. Still, it is a great tool for creating a general understanding of the structure of a network very quickly. Force Atlas 2 is the near opposite of OpenOrd – a very precise approach that can be tweaked easily using the various settings in Gephi. It is ideal for putting the finishing touches on what OpenOrd started.
Here are the settings I eventually settled on for Force Atlas 2, after much trial and error:
Some of the more important things to note here are the Scaling and Gravity settings. I reduced the scaling to 0.5 so the network would display appropriately in a single window without the need for scrolling. The Gravity setting was increased to 2.5 to force nodes slightly toward the center of the display. The LinLog mode and Prevent Overlap options are also selected in order to make this particular graph more visually effective. For other graphs, I have used the Dissuade Hubs option, forcing large nodes to the perimeter of the graph; in this case, that was not an ideal choice.
The use of color is also important within a network graph display. Color can be used to highlight nuances in the data that distinguish one or more nodes relative to another group of nodes. Often we use color to visually represent clusters within the graph, as grouped using the modularity classes statistic or some similar input. In the case of this series of graphs (ultimately one graph per team), I made a decision to use the official team colors to differentiate each graph. Thus my initial graph for the Boston Red Sox would be based on the two primary hex colors for the current team (these colors do change over time for many teams).
Here are the Red Sox primary colors:
After capturing current team colors in a spreadsheet for easy reference, I used the color-hex.com site to select complementary colors for the Red Sox graph. Using complementary colors allows me to differentiate clusters in the graph while remaining true to the original concept of employing team colors for each graph. So instead of a wide range of colors one would normally see in a Gephi output, I was able to input the complementary colors for each group. Thus, one team color could be used for the graph background, while the other color (and its complements) could be used for the graph structure (nodes & edges). We’ll share the effect later in this post.
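If you would rather compute a complement than look it up, here is a rough Python sketch. It uses simple RGB inversion, which is only one of several definitions of a complementary color and may not match what color-hex.com reports; the sample hex value is arbitrary and is not a team color.

```python
def complement(hex_color: str) -> str:
    """Return the RGB-inverted complement of a #RRGGBB color."""
    value = hex_color.lstrip("#")
    red, green, blue = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return "#{:02X}{:02X}{:02X}".format(255 - red, 255 - green, 255 - blue)

sample = "#336699"  # arbitrary example color
print(sample, "->", complement(sample))
```

Applying the function twice returns the original color, which is a quick way to sanity check the result.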
Graph statistics are critical to the full understanding of the structure of a network. While we can view a graph and begin to understand the general structure of a network, the various statistics will aid and reinforce our initial visual comprehension. Gephi provides a nice range of statistical measures to choose from:
Eccentricity (the number of steps needed to reach the most distant node in the network from a given node)
Centrality – betweenness, eigenvector, closeness, harmonic closeness (various measures of importance of an individual node)
Clustering coefficient (to discern cliques in the network)
Number of triangles (a friends of friends measure)
Modularity Class (clusters)
Degrees (the number of connections)
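To give a feel for what a few of these measures mean, here is a plain Python sketch on a tiny made-up graph of four nodes. Gephi computes all of this for you; the code below is only an illustration, it assumes a connected undirected graph, and it leaves out the centrality and modularity calculations.

```python
from collections import deque
from itertools import combinations

# Hypothetical four-node network: A, B and C form a triangle, D hangs off C.
edge_list = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
graph = {}
for source, target in edge_list:
    graph.setdefault(source, set()).add(target)
    graph.setdefault(target, set()).add(source)

def degree(node):
    """Number of connections for a node."""
    return len(graph[node])

def eccentricity(node):
    """Steps needed to reach the most distant node (breadth-first search)."""
    distance = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return max(distance.values())

def triangles(node):
    """Count pairs of neighbors that are also connected to each other."""
    return sum(1 for a, b in combinations(sorted(graph[node]), 2) if b in graph[a])

def clustering(node):
    """Share of possible neighbor pairs that are actually connected."""
    k = degree(node)
    if k < 2:
        return 0.0
    return triangles(node) / (k * (k - 1) / 2)

for name in sorted(graph):
    print(name, degree(name), eccentricity(name), triangles(name), round(clustering(name), 2))
```

Node C sits in the middle of this toy network, so it has the most connections and the lowest eccentricity, which is the kind of pattern these statistics help confirm in a much larger player graph.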
Node sizing is another key element of effective graph design. In this case, there were a few options I could pursue for node sizing – the number of seasons played (I used this in the 2014 graphs), one of the various centrality measures we calculated, or the number of degrees (connections) an individual player possesses. After computing each of these statistics, I eventually decided to use the number of degrees as a representation of influence in the graph. Visually, I want to show how many other players a single individual is related to, and using node size is an effective means of doing so.
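Gephi handles this mapping through its own ranking controls, but the idea behind sizing by degree can be sketched in a few lines of Python. The size range of 10 to 50 and the player IDs are hypothetical values chosen for illustration, and the linear (min-max) scaling is an assumption rather than a description of Gephi's internals.

```python
def scale_sizes(degrees, min_size=10.0, max_size=50.0):
    """Map each node's degree linearly onto a node size range."""
    low, high = min(degrees.values()), max(degrees.values())
    if high == low:
        # Every node has the same degree, so use one mid-range size.
        return {node: (min_size + max_size) / 2 for node in degrees}
    span = high - low
    return {
        node: min_size + (value - low) / span * (max_size - min_size)
        for node, value in degrees.items()
    }

# Hypothetical degree counts for three players.
sizes = scale_sizes({"player_a": 5, "player_b": 25, "player_c": 45})
print(sizes)
```

The player with the fewest connections gets the smallest node, the player with the most gets the largest, and everyone else falls proportionally in between.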
Our final graph in Gephi is shown below; the eventual web-based version will differ slightly and include additional functionality, but that’s for another post.
My third and final post in this series will address exporting this graph to the web using the sigma.js plugin, and making some additional customization to the web version. Thanks for reading, and see you soon!
A few years back, I used Gephi and sigma.js to create a series of interactive baseball team networks, one for each current MLB franchise. These networks displayed all players through the 2013 season, going all the way back to 1901 for the original American and National League franchises. Now that we have data through the 2017 season, it’s time for an update, not only from a data perspective, but also stylistically. This post will walk through the process of creating one of these networks using Toad for MySQL, Gephi, and sigma.js to create web-based interactive network visualizations.
Here’s a typical network from the 2013 series; the full list of networks can be found here. We’ll use the existing networks as a baseline for the new networks, although a few modifications will be made.
Source Data & MySQL Queries
Let’s start our discussion with the source data. Season-level baseball data is available through the seanlahman.com website, in the form of .csv files or Microsoft Access database tables. I use the .csv format, as it can be easily added to existing MySQL databases on the visual-baseball.com server. MySQL also makes it simple to add derived fields through some simple coding. These fields can be utilized later for a variety of activities.
For the purpose of our network graphs, there are a handful of critical fields we want to use. These include the following:
playerID, a unique identifier for every player who ever donned a major league uniform
player name, which can be used to provide a meaningful reference based on the playerID field
yearID, which refers to the season (or seasons) a player suited up for a specific franchise
franchID, a unique identifier for each MLB franchise
We also need to do a little manipulation of the source data in our code to deliver our results in the proper form for use in Gephi. This means we need to create two input files – one for nodes, and a second for edges. The node file will contain information about each player: the number of seasons played for the franchise, plus the first and last seasons. The span from first to last season may be longer than the number of seasons played, as players frequently leave a franchise only to return later in their career. Here’s our node code:
SELECT Id, Label, MAX(Size) AS Size
FROM (
    -- batters: one row per player with seasons played for the franchise
    SELECT bp.playerID AS Id,
           CONCAT(bp.name, ' ', MIN(bp.yearID), '-', MAX(bp.yearID)) AS Label,
           COUNT(bp.yearID) AS Size
    FROM BattingPlus bp
    WHERE bp.franchID = 'BOS' AND bp.yearID >= 1901
    GROUP BY bp.playerID, bp.name
    UNION ALL
    -- pitchers: same shape (PitchingPlus is an assumed table name, by analogy with BattingPlus)
    SELECT pp.playerID AS Id,
           CONCAT(pp.name, ' ', MIN(pp.yearID), '-', MAX(pp.yearID)) AS Label,
           COUNT(pp.yearID) AS Size
    FROM PitchingPlus pp
    WHERE pp.franchID = 'BOS' AND pp.yearID >= 1901
    GROUP BY pp.playerID, pp.name
) a
GROUP BY Id
ORDER BY Id;
Here’s the simple interpretation – since we are attempting to display all players for a given franchise, we are executing a UNION ALL statement to combine batters and pitchers into a single result file. We have used the playerID field to create the required Id value for Gephi, while also creating a Label field by combining the player’s name with their first and last years playing for this franchise. Finally, we have created a Size field based on the number of seasons played for the franchise. We can then use this in Gephi to size each node, if we so choose.
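If you want to experiment with this pattern without a MySQL server, here is a small sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the toy rows, the player IDs, the PitchingPlus table name, and the switch from CONCAT to SQLite's || operator. It also relies on SQLite's documented behavior of taking the Label from the row that holds the MAX(Size) value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BattingPlus (playerID TEXT, name TEXT, yearID INTEGER, franchID TEXT);
CREATE TABLE PitchingPlus (playerID TEXT, name TEXT, yearID INTEGER, franchID TEXT);
""")

# Hypothetical rows: p1 bats in three seasons and pitches in two,
# p2 only bats, p3 only pitches, p4 plays for a different franchise.
batting = [
    ("p1", "Player One", 1901, "BOS"), ("p1", "Player One", 1902, "BOS"),
    ("p1", "Player One", 1903, "BOS"), ("p2", "Player Two", 1902, "BOS"),
    ("p4", "Player Four", 1902, "NYY"),
]
pitching = [
    ("p1", "Player One", 1901, "BOS"), ("p1", "Player One", 1902, "BOS"),
    ("p3", "Player Three", 1903, "BOS"), ("p3", "Player Three", 1904, "BOS"),
]
conn.executemany("INSERT INTO BattingPlus VALUES (?, ?, ?, ?)", batting)
conn.executemany("INSERT INTO PitchingPlus VALUES (?, ?, ?, ?)", pitching)

node_sql = """
SELECT Id, Label, MAX(Size) AS Size
FROM (
    SELECT playerID AS Id,
           name || ' ' || MIN(yearID) || '-' || MAX(yearID) AS Label,
           COUNT(yearID) AS Size
    FROM BattingPlus
    WHERE franchID = 'BOS' AND yearID >= 1901
    GROUP BY playerID, name
    UNION ALL
    SELECT playerID AS Id,
           name || ' ' || MIN(yearID) || '-' || MAX(yearID) AS Label,
           COUNT(yearID) AS Size
    FROM PitchingPlus
    WHERE franchID = 'BOS' AND yearID >= 1901
    GROUP BY playerID, name
) AS a
GROUP BY Id
ORDER BY Id;
"""
node_rows = conn.execute(node_sql).fetchall()
for row in node_rows:
    print(row)
```

Each player appears once in the result even when they show up in both tables, because the outer query collapses the combined rows by Id and keeps the larger season count.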
We also need to create the edge file for Gephi. In this case, we want to understand how many seasons two players were on the same team. This code is a bit trickier: because this will be an undirected graph, we want to show only one connection between two players. More on that distinction later. Here’s our edge code:
SELECT b.playerID AS Source, m.playerID AS Target, 'Undirected' as Type, ' ' as Id, ' ' as Label, count(*) as weight
FROM (SELECT a.playerID, CONCAT(m.nameFirst, ' ', m.nameLast) name, a.yearID, a.franchID
      FROM Appearances a INNER JOIN Master m ON a.playerID = m.playerID
      WHERE a.franchID = 'BOS' and a.yearID >= 1901) b
INNER JOIN Appearances a ON b.yearID = a.yearID and b.franchID = a.franchID and b.playerID <> a.playerID and a.playerID > b.playerID
INNER JOIN Master m ON a.playerID = m.playerID
GROUP BY b.playerID, a.playerID
ORDER BY b.playerID;
Here we use the Master table to provide player name information, and we also gather the ID information to match the node values. The critical piece in this code is in our join criteria:
INNER JOIN Appearances a ON b.yearID = a.yearID and b.franchID = a.franchID and b.playerID <> a.playerID and a.playerID > b.playerID
Here we are matching players based on the same season and the same franchise. We then specify that we do not want to connect any player to himself, and that we want only values where the playerID value from our main query is greater than the playerID value from the sub-query. This gives us a single connection between two players, which is what we need for an undirected graph. We then define a Source node (required by Gephi) and a Target node (also required), as well as specifying ‘Undirected’ as the edge type. We leave the Id and Label values empty, and then summarize the number of seasons played together as an edge weight. This value can be used in Gephi to show the strength of a connection between two nodes (e.g., did they spend one season together, or 10 seasons together?).
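The same pairing logic can be sketched in plain Python, which may help if the self-join is hard to picture. The appearance rows and player IDs below are invented for illustration. Sorting each roster before pairing plays the role of the playerID comparison in the join, so every pair of teammates is produced exactly once.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (playerID, yearID, franchID) rows.
appearances = [
    ("p1", 1901, "BOS"), ("p2", 1901, "BOS"),
    ("p1", 1902, "BOS"), ("p2", 1902, "BOS"), ("p3", 1902, "BOS"),
    ("p4", 1901, "NYY"),  # different franchise, so it is filtered out
]

# Build one roster per season for the franchise we care about.
rosters = {}
for player, year, franchise in appearances:
    if franchise == "BOS" and year >= 1901:
        rosters.setdefault(year, set()).add(player)

# Count shared seasons; sorted() guarantees Source < Target for each pair.
weights = Counter()
for roster in rosters.values():
    for source, target in combinations(sorted(roster), 2):
        weights[(source, target)] += 1

edge_rows = [
    {"Source": source, "Target": target, "Type": "Undirected", "Weight": weight}
    for (source, target), weight in sorted(weights.items())
]
for row in edge_rows:
    print(row)
```

Players p1 and p2 share two seasons in this toy data, so their edge carries a weight of 2, while each pairing with p3 carries a weight of 1.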
After exporting each of these files to a .csv format, we have our source data for Gephi. In Part 2 our focus will shift to creating the network in Gephi.