Bad Trades, Red Sox Edition

This is the first in a series of posts where I take a look at notoriously one-sided baseball trades, using the baseball trade networks published on this site earlier in 2022. I won’t necessarily rank these deals in any sort of order; rather I will pick out a few from the network trade graphs and provide some analysis and context for some of the most notorious transactions.

If you haven’t seen the trade networks previously, here’s a link.

The networks were built using data from Retrosheet and Neil Paine, loaded into Gephi, a network analysis and visualization tool, and ultimately pushed to the web where I could finish styling the graphs. Graph nodes (the circles in the networks) are sized based on the total future WAR (Wins Above Replacement) accrued by the teams involved in the trade. All values must occur at the major league level (MLB), so players involved in the deal who don’t reach the MLB level with their new team will have a zero value. Only the cumulative WAR value while playing for the new team is included; we are not calculating WAR once a player leaves one of the teams involved in the transaction.

Finding a bad trade by scanning the networks is more an art than a science; the key is to look for large nodes (indicating a lot of future WAR value), and then dissect the trade to see how much value each team received. The alternative applies when we already know the player(s) we are looking for; in these cases we can perform a simple search to find the trade. Here’s a classic example that Red Sox fans would love to forget – trading future Hall of Famer Jeff Bagwell for journeyman reliever Larry Andersen. Let’s go to the Red Sox trade network and search for Jeff Bagwell.

Red Sox trade network

Typing in Jeff Bagwell locates him quickly within the trade network. Note that even if a player is involved in multiple trades to or from the same team (rare but possible) the search will locate each transaction. Here’s the Bagwell transaction, showing his player node and future WAR value connected to the transaction node; every player involved in that transaction will be connected to the trade node, as long as there is some future WAR value. If a player in the trade did not play in the majors for the receiving team, they will not be reflected in the graph. Here’s a view of Jeff Bagwell relative to the trade:

Jeff Bagwell transaction

We can also click on the transaction node to see the value provided to each team by all of the players involved in the trade, again assuming they spent time with the team and were not limited to the minor leagues. Clicking on that node will display the respective WAR values in the sidebar on the left of the screen:

WAR values of the trade

Here’s where we get to the details of the trade, and specifically the direct benefits accrued to each team. The Red Sox received 1.1 future WAR from Larry Andersen; to put this in perspective, we might expect this sort of value for an average player for a single season. The Astros, on the other hand, received an incredible 93.8 WAR from Jeff Bagwell, or a little more than 6 WAR per season over his 15 seasons in Houston! That is a Hall of Fame level performance, and it eventually led to his selection to Cooperstown in 2017. Here’s a profile that mentions the one-sided trade.
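To make the comparison concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and the dictionary layout are my own choices, and the two WAR figures are the ones quoted above.

```python
def war_differential(received):
    """Return (winner, margin) for a two-team trade.

    `received` maps each team to the future WAR it got back in the deal.
    """
    (team_a, war_a), (team_b, war_b) = received.items()
    margin = round(abs(war_a - war_b), 1)
    winner = team_a if war_a > war_b else team_b
    return winner, margin


# Future WAR values quoted in the post for the Bagwell deal.
bagwell_trade = {"Red Sox": 1.1, "Astros": 93.8}
winner, margin = war_differential(bagwell_trade)
print(f"{winner} won the trade by {margin} WAR")
```

The same helper could be pointed at any transaction in the sidebar to rank deals by how lopsided they were.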

While we have the Red Sox network open, let’s see if there are any other disastrous transactions (other than the cash sale of Babe Ruth to the Yankees, technically not a trade). After scanning the network, we find this one from 1928:

Transaction 59324 – Buddy Myer

This one is clearly not a Bagwell-level disaster, but was still quite negative for the Red Sox, with a WAR differential of 30 points. The primary villain here is Buddy Myer, a solid infielder who hit .300 or better seven times for the Senators. Not a major star, but the owner of a very nice career, including leading the American League in batting average in 1935.

Let’s try to find one more before closing this piece, this time favoring the Red Sox. We zero in on this deal:

Transaction 59403 – Jimmie Foxx

The Red Sox netted nearly 45 future WAR value while surrendering just 0.1; most of the benefit was generated by slugging future Hall of Famer Jimmie Foxx, but they also received a nice three season contribution from pitcher Johnny Marcum. Note how we also removed nodes not involved in the trade by clicking on the edges icon on the bottom left of the display area; this makes it easier to focus on the details.

Feel free to try your hand at finding more of these one-sided deals in the Red Sox or any other trade networks. I’ll be back with some other teams before long. Thanks for reading!


First 10 WAR Trade Networks Published!

The first 10 WAR (Wins Above Replacement) Trade Networks are now available for exploring! This initial group includes nine team networks and one overall graph with all teams included. Here’s a list of the 10 graphs:

Each of these and any upcoming WAR trade networks can be found on this page.

Let’s walk through how the graphs work, using the Detroit Tigers network as an example. We’ll begin with an anatomy of the graph display:

As the image shows, the primary focus will be the main graph area in the center of the window. This is where all nodes (transactions, teams, and players) will reside, connected by edges based on common relationships. Transaction nodes will vary in size based on the total value of a trade with the largest nodes indicating a trade that created significant future WAR for one or both teams. Team and player nodes are set to constant sizes so that the initial visual focus will be on the transaction nodes. The size differences become more noticeable when we zoom in to the network. More on that shortly.

Edges are also sized based on WAR value; this is where we see the value provided to a team and by specific players. Edge sizes (weights) will be more easily seen when we zoom in to the network.

On the left are some graph controls to assist in navigating the graph. We can zoom in using the slider control or the plus/minus buttons adjacent to the slider. Zooming can also be done with a mouse scroll if you prefer that option. The fisheye lens can be toggled on or off and can be used to highlight certain areas of the graph by hovering over a selected region. Finally, the edges button will enable showing or hiding edges and connected nodes. This is useful when you wish to reduce surrounding nodes and focus on specific transactions. We can also pan the graph by dragging it using a mouse – this is helpful in centering a network or viewing specific regions of the graph.

At the upper left of the window is a color legend for each node type, and hidden on the left (not shown in our image) is an information pane that will show specifics about the network. More on that in a bit.

Now let’s examine the information window – this is what makes the network truly powerful. When the network is first displayed or the browser window is refreshed the information pane displays information about the graph (open it by clicking on the arrows icon at the top left):

You can see the simple overview of the graph, the source data, and what it aims to accomplish. Here’s an enlarged version for easier reading:

If we zoom in and select a specific transaction the pane displays the relevant details for that selection:

Now we have the details for the transaction – the season, teams, and players involved. Here’s the enlarged view:

You can do this for any transaction in a graph, or you could choose to select a team or player to see how they fit into the network. The possibilities are nearly endless and it’s a fun way to understand the relationships between teams, players, and trades.

We’ll do more exploring of the networks in upcoming posts; I’ll also be adding more teams until we have a complete set of trade networks. In the meantime, feel free to explore the graphs to learn more about the best (and worst) trades your favorite team has made over the last 120 years. Enjoy, and thanks for reading!


MLB Trade Networks Part 3: Edges Code

In our previous post I shared the SQL code I created to pull data for our upcoming set of trade networks based on WAR (Wins Above Replacement) numbers from the Neil Paine 538 MLB data set. The prior post dealt with creating nodes for a network graph; this post will share code for edge creation. In simple terms, a graph needs edges that connect related nodes; for our case we need to connect transaction (trade) nodes to the teams and players involved in each transaction.

Part of what makes this case interesting is my desire to show edge weights based on the future WAR value each team received. Showing edges with varying weights will quickly help users to identify the relative importance of a trade. Wider edges will indicate a trade that involved high future value for one or both teams. In seeing the individual players involved in a common trade we can pinpoint where the future value (or lack thereof) comes from. This will become much clearer when the graphs are posted; I’ll do one or more posts on how to use and interpret each graph.

For now let’s examine the code. Gephi requires users to identify Source nodes and Target nodes whether the edges are Undirected (i.e., it doesn’t matter which node leads to the other) or Directed. Our initial code is for transactions to teams:

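SELECT CONCAT(tr.TransactionID, '-', tr.PrimaryDate) AS Source, t.franchID AS Target, CONCAT('The ', t.name, ' received ', ROUND(SUM(h.WAR162),1),
' wins in future WAR value') AS Label,
-- weight rule reconstructed from the description that follows: floor the edge weight at 1.0
IF(ROUND(SUM(h.WAR162),1) < 1, 1.0, ROUND(SUM(h.WAR162),1)) AS Weight
-- FROM and JOIN clauses are missing from this listing; the aliases are h (WAR data), tr (trades), and t (teams)
WHERE h.year_ID >= tr.season AND tr.Type = 'T' AND tr.Season >= 1901 AND LENGTH(tr.TeamTo) = 3 AND LENGTH(tr.TeamFrom) = 3
AND tr.Season = t.yearID
GROUP BY tr.TransactionID, tr.PrimaryDate, t.franchID, t.name;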

With this code we are linking every transaction to the teams receiving one or more players in a trade. Note that we are summing the WAR value to create an edge weight based on the total value received by each team. If four players were involved (two to each team) these edge weights will reflect the combined values of these players. Note that we are setting edge weight = 1.0 if the future WAR is less than 1 (some will actually be negative so we need a minimal edge to show). Here’s a sample of results:

In contrast, the edges linking a transaction to individual players are based solely on that one player’s value. In the case cited above we will wind up with four lines of varying weights. Otherwise the code is quite similar:

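SELECT CONCAT(tr.TransactionID, '-', tr.PrimaryDate) AS Source, p.playerID AS Target, CONCAT(p.nameFirst, ' ', p.nameLast, ' provided ', ROUND(SUM(h.WAR162),1),
' wins in future WAR value for the ', t.name) AS Label,
-- weight rule reconstructed from the description above: floor the edge weight at 1.0
IF(ROUND(SUM(h.WAR162),1) < 1, 1.0, ROUND(SUM(h.WAR162),1)) AS Weight
-- FROM and JOIN clauses are missing from this listing; the aliases are h (WAR data), p (players), tr (trades), and t (teams)
WHERE h.year_ID >= tr.season AND tr.Type = 'T' AND tr.Season >= 1901 AND LENGTH(tr.TeamTo) = 3 AND LENGTH(tr.TeamFrom) = 3
AND tr.Season = t.yearID AND t.franchID = h.franch_ID
GROUP BY tr.TransactionID, tr.PrimaryDate, p.nameFirst, p.nameLast, p.playerID, t.name;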

The same logic on edge weights applies but now at the player level. Here are a few results:
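Separately from those results, the weighting rule itself is easy to sketch outside of SQL. The Python below is a minimal illustration, assuming a four-player trade with made-up WAR values (the team and player names are hypothetical): team edges use the combined value a team received, player edges use each player's own value, and anything below 1 is floored at 1.0.

```python
def edge_weight(war_values):
    """Mirror the IF(...) rule: sum the WAR, round to one decimal, floor at 1.0."""
    total = round(sum(war_values), 1)
    return 1.0 if total < 1 else total


# Hypothetical future WAR for the players each team received in one trade.
received = {
    "Team A": {"player_1": 12.4, "player_2": -0.3},
    "Team B": {"player_3": 0.2, "player_4": 0.5},
}

# Transaction-to-team edges use the combined value of everything a team received.
team_edges = {team: edge_weight(players.values()) for team, players in received.items()}

# Transaction-to-player edges use each player's own value.
player_edges = {
    name: edge_weight([war])
    for players in received.values()
    for name, war in players.items()
}

print(team_edges)
print(player_edges)
```

Note how the negative and near-zero values all end up with the minimal weight, so every edge remains visible in the graph.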

I hope this makes sense – it will all become much more clear when the network graphs are produced. The good news is that I already have three graphs created and many more to come shortly. I’ll have some of them available on the site later this week. As always, thanks for reading.


Trade Network Updates, Part 1

A few years back (2016 to be specific) I created network graphs displaying the history of trades made for each MLB franchise, using transactions data from the wonderful Retrosheet project. These graphs presented more than a few challenges in how to present the data but I wound up with what I consider to be a very interesting set of results, which you can find here. I also created some posts on the process at that time, found here and here.

Here’s a snapshot within a graph:

Six seasons have elapsed since I created those graphs, so I thought it was beyond time to update them, but this time with a twist. Last fall I came across a great dataset that captures an array of advanced sabermetric statistics which I hope to use on a regular basis. These statistics can be used to assess a player’s true value relative to his peers each season. What if I could incorporate those into the trade network updates to show the post-trade value of each player to their new team? Ideally, this will help to show the value of each trade and which team wound up getting the better part of the deal.

Of course this would involve adding a degree of complexity to the MySQL code for pulling the data and shaping it for use in creating network graphs. However, the end result could be very revealing and worthwhile. Today I’m at the start of the process, tinkering with SQL code to extract the data in a proper format. Here’s an example:

SELECT h.player_name, p.playerID, tr.season, tr.TransactionID, tr.TeamFrom, tr.TeamTo, ROUND(SUM(h.WAR162),1) as WAR

FROM historical_WAR_and_more h
INNER JOIN People p
ON h.key_bbref = p.bbrefID
INNER JOIN trades2021 tr
ON p.retroID = tr.Player

WHERE tr.season >= 1901 AND h.year_ID > tr.season AND h.team_ID = tr.TeamTo AND tr.Type = 'T'

GROUP BY h.player_name, p.playerID, tr.season, tr.TransactionID, tr.TeamFrom, tr.TeamTo

In this case, I’m looking at the cumulative WAR (Wins Above Replacement) for each traded player with their new team. This could be a single season total or the sum of many years in some cases. Here are some results:

We now have post-trade results (starting in the season following the trade) as measured by WAR for each traded player. We see one fairly substantial figure – the second Aaron Harang trade, which netted 16.9 WAR points for his new team, the Cincinnati Reds (CIN in the results). Given that a single season WAR above 3 or 4 is considered substantial, it’s clear that his new team probably benefited from a few of those high-value seasons. What we can’t see yet is what they gave away in their half of the trade.
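The filtering logic in the query can be tried out on a toy dataset. The sketch below is an assumption-laden stand-in rather than the real thing: it uses Python's built-in sqlite3 module instead of MySQL, two simplified tables with hypothetical names (`trades` and `war`), no People lookup table, and invented rows that are not real WAR figures.

```python
import sqlite3

# Toy rows invented for illustration only; they are not real WAR figures.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (TransactionID INTEGER, season INTEGER, Player TEXT,
                     TeamFrom TEXT, TeamTo TEXT, Type TEXT);
CREATE TABLE war (player TEXT, year_ID INTEGER, team_ID TEXT, WAR162 REAL);
INSERT INTO trades VALUES (1, 2000, 'playerA', 'AAA', 'BBB', 'T'),
                          (1, 2000, 'playerB', 'BBB', 'AAA', 'T');
INSERT INTO war VALUES ('playerA', 2000, 'AAA', 2.0),
                       ('playerA', 2001, 'BBB', 3.1),
                       ('playerA', 2002, 'BBB', 4.2),
                       ('playerA', 2003, 'CCC', 5.0),
                       ('playerB', 2001, 'AAA', 0.4);
""")

# Keep only seasons after the trade that were played for the receiving team.
rows = conn.execute("""
SELECT tr.TransactionID, tr.Player, tr.TeamTo, ROUND(SUM(w.WAR162), 1) AS WAR
FROM war w
INNER JOIN trades tr ON w.player = tr.Player
WHERE w.year_ID > tr.season AND w.team_ID = tr.TeamTo AND tr.Type = 'T'
GROUP BY tr.TransactionID, tr.Player, tr.TeamTo
ORDER BY tr.Player
""").fetchall()

for row in rows:
    print(row)
```

Only playerA's two seasons with team BBB are counted; the season before the trade and the later season with a third team are excluded. As in the real results, each row covers only one side of the deal.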

Fortunately, we can access this using the TransactionID field, which provides all the information for each party within the trade. But we’ll save that for another day as I figure out the next progression of the code. As always, thanks for reading!


Radial Franchise Networks for all MLB Franchises

All 30 MLB Radial Networks are now complete and available for you to explore. One thing to notice is that each network will have a slightly different (or radically different) shape, depending on how many (or few) players debuted with the team in a single season. If the team was in the midst of a successful run, the spokes will tend to be short, as fewer rookies or acquired players will debut. On the other hand, teams that are retooling will tend to have long spokes, as there are many new players making the team. This could also be reflected in the number of players getting a September call-up from the minors.

While these networks are pretty attractive to view as static images (IMHO), the real fun comes from the interactivity, where you can click, zoom, pan, and see all the details for who played with whom over the course of a franchise’s history. Note that this is based on seasonal rosters, so not all connections actually played together at the same time of the season.

Anyhow, check out a handful of examples, and then try them out yourself at the Franchise Radial Networks 1901-2019 page.


Updating Player Networks – Part 3

In Part 1 of this series, we looked at how to generate node and edge data for all players within a single franchise’s history. Part 2 examined how we could take that data and create a network using Gephi, adding graph statistical measures along the way. In this, the final part of the series, our focus is on moving the graph beyond Gephi and on to the web, where users can interact with the data and interrogate the player network using sigma.js software. So let’s pick up with the process of moving the network from Gephi to sigma.js.

Recall our basic network structure in Gephi, which looks like this:

One of our goals when we export the graph to the web is to enable user interaction, so the above graph becomes a bit less intimidating. As a reminder, this is at most a moderate sized network; the need to provide interactive capabilities becomes even greater for large networks.

There are a few ways we can create files suitable for web deployment using Gephi. In this case, the choice is to use the simple sigma.js export plugin located at File > Export > Sigma.js template. Selecting this option will provide a set of options similar to this:

This template allows for a modest level of customization, including network descriptions, titles, author info, and other attributes relevant to the network. When all fields are filled to your satisfaction, click on the OK button to save the template. Your network will be saved to the location specified in the blank space at the top of the template window (grayed out in this case). A word of caution is in order here – if you make some custom entries to the template, and then make adjustments to your network, be sure to specify a new location to save the generated files. Otherwise, the initial set will be overwritten. This is especially critical if you have gone behind the scenes to customize colors, fonts, and other display attributes. More on that capability in a moment.

Once the template is complete and the OK button is clicked, a set of folders and files is generated that can then easily be copied to the web. Here’s a view of the created file structure:

These files and sub-folders are all housed within a single folder named ‘network’. If you wish to tinker with your graph in Gephi, rename the network folder to something else prior to exporting a second (or 3rd or 4th time). This will help keep you sane. 🙂

Without going into great detail here, let’s talk about the key files:

  • data.json stores all of your graph data, including positioning attributes, statistics created in Gephi, plus node and edge details
  • config.json contains many of the primary graph settings that can be easily edited for optimal web display. It’s quite easy to go through a trial and error process, since the file is so small. Simply make changes, then refresh your browser to see the result.
  • index.html has a few basic settings relevant to web display, most notably the title information that the browser will use

Within the css folder are .css files where you can make changes to many display attributes. This is typically where you will adjust fonts and font sizes, as well as some colors. The js folder has JavaScript files that can be edited to a certain degree, although caution is recommended if you’re not a JavaScript guru. Finally, the images folder contains any relevant image files to be used for web display, such as logos.

Alright, now that we have had a brief view of the technical details, let’s have a look at the network graph in the browser. Note that this is still a bit experimental at this stage; I’m attempting to customize each graph based on the official team colors or close variations in the color family.

To see some of the interactive functionality, let’s select a specific player. Simply type Ted Williams (the greatest Red Sox batter of all time) in the search box, and view the results:

Now we see only the direct connections (a 1st degree ego network) for Ted Williams (a degree of 270 in this case, meaning 270 direct connections), as well as a wealth of statistical information previously calculated in Gephi, seen in the right panel. At the bottom of the panel are hyperlinks where any one of the 270 connections may be clicked, allowing us to view their network. As you can see, sigma.js quickly provides great interactivity for graph viewers.
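For anyone curious what a 1st degree ego network amounts to in data terms, here is a small Python sketch. It is not how sigma.js does it internally; the edge list and the short player IDs are made up for illustration.

```python
from collections import defaultdict

# Hypothetical undirected teammate edges (Source, Target); not real roster data.
edges = [
    ("williams", "doerr"),
    ("williams", "pesky"),
    ("doerr", "pesky"),
    ("pesky", "other_player"),
]

# Build an adjacency lookup so each player maps to their direct connections.
neighbors = defaultdict(set)
for source, target in edges:
    neighbors[source].add(target)
    neighbors[target].add(source)


def ego_network(player):
    """Return the player's direct connections and the edges among that group."""
    group = neighbors[player] | {player}
    kept = [(s, t) for s, t in edges if s in group and t in group]
    return sorted(neighbors[player]), kept


connections, ego_edges = ego_network("williams")
degree = len(connections)
print(degree, connections, ego_edges)
```

The degree is simply the number of direct connections, and the player two steps away drops out of the view.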

Even better, we can scroll in to the network at any time:

Hovering on a node generates a pop-up title for that node, as seen for Ted Williams in this instance. We also begin to see the names of other prominent players at this zoom level. Additional zooming will reveal more player titles – a great way to embed information without making the original graph visually chaotic by displaying all titles at every level.

For the current web version of this graph, click here. I’ll try to keep this version active, even if I make improvements to the final network. Once again, thanks for reading!

Updating Player Networks – Part 2

In our previous post, we looked at how to acquire and load our baseball player data into Gephi. In this second installment, the focus will be on creating a player network graph in Gephi, and customizing many settings to deliver a network graph we can export to the web. Player networks are used to detail the connections between all players who are connected to one another in some fashion. In this instance, it is based on players having played for the same team in one or more common seasons. So let’s begin with the process of creating the graph using our raw data from the first installment.

Importing .csv data into Gephi is quite simple – we create individual node and edge files (as we showed in the previous post), and use the Gephi import functions to pull the data in. I always start with the node file, since it will typically have additional information not included in the edges file. After importing the node data, I then import the edge data, which gives us the information to form our initial graph. If we were to start with the edge file, Gephi will create our node data automatically, and we will not have the detail needed for our graph. This approach may work for simple graphs, but not for our current case.
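For reference, the two files might look something like the output of this Python sketch. The column names follow the node and edge queries from the first installment (Id, Label, Size and Source, Target, Type, Weight); the rows themselves are made up, and the `to_csv` helper is just an illustrative name.

```python
import csv
import io

# Made-up rows for illustration; real files would come from the SQL queries.
nodes = [
    {"Id": "p1", "Label": "Player One 1939-1960", "Size": 19},
    {"Id": "p2", "Label": "Player Two 1937-1951", "Size": 14},
]
edges = [{"Source": "p1", "Target": "p2", "Type": "Undirected", "Weight": 10}]


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes_csv = to_csv(nodes)
edges_csv = to_csv(edges)
print(nodes_csv)
print(edges_csv)
```

The edge file carries only IDs and a weight, which is why starting from it alone would leave the nodes without their labels and sizes.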

Once both data files have been imported, we can begin thinking about what we want from our graph. Here are several questions we might pose:

  • How will we use color?
  • What sort of layout will be best?
  • Which measures should we calculate?
  • How should we depict node sizes?

In many cases, the answers to these questions come about through trial and error. We may have some ideas going into the process, but invariably, there will be modifications along the way. So be patient, and be willing to experiment as you create network graphs. The graph you will see in this post went through many of these modifications, which I won’t take the time to detail. Instead, this post will detail my final choices, along with some explanations for why these choices were made. So let’s take a walk through the various facets of the visualization.

Layout

While a network will retain the same underlying structure from a statistical point of view (degrees, centrality, eccentricity, etc.) regardless of our layout choices, it is still important to select a layout that will visually represent the underlying patterns in the network. Otherwise, we could just as well deliver a spreadsheet with all of the network statistics. So layout selection is critical, and often involves an iterative process.

For the baseball network graphs I built in 2014, I eventually settled on the ARF layout algorithm, which ran quickly and created an attractive circular network graph display using the player connection data. Alas, there is no ARF algorithm available for Gephi 0.9.2, so I required a different approach for the updates. Ultimately, this led to a 2-step approach using a pair of layout algorithms – OpenOrd followed by Force Atlas 2. OpenOrd is especially effective at creating a quick layout from large datasets, although with far less precision than some other force-directed approaches. Still, it is a great tool for creating a general understanding of the structure of a network very quickly. Force Atlas 2 is the near opposite of OpenOrd – a very precise approach that can be tweaked easily using the various settings in Gephi. It is ideal for putting the finishing touches on what OpenOrd started.

Here are the settings I eventually settled on for Force Atlas 2, after much trial and error:

Force_Atlas_2

Some of the more important things to note here are the Scaling and Gravity settings. I reduced the scaling to 0.5 so the network would display appropriately in a single window without the need for scrolling. The Gravity setting was increased to 2.5 to force nodes slightly toward the center of the display. The LinLog mode and Prevent Overlap options are also selected in order to make this particular graph more visually effective. For other graphs, I have used the Dissuade Hubs option, forcing large nodes to the perimeter of the graph; in this case, that was not an ideal choice.

Color

The use of color is also important within a network graph display. Color can be used to highlight nuances in the data that distinguish one or more nodes relative to another group of nodes. Often we use color to visually represent clusters within the graph, as grouped using the modularity classes statistic or some similar input. In the case of this series of graphs (ultimately one graph per team), I made a decision to use the official team colors to differentiate each graph. Thus my initial graph for the Boston Red Sox would be based on the two primary hex colors for the current team (these colors do change over time for many teams).

Here are the Red Sox primary colors:

#c8102e and #0c2340

After capturing current team colors in a spreadsheet for easy reference, I used the color-hex.com site to select complementary colors for the Red Sox graph. Using complementary colors allows me to differentiate clusters in the graph while remaining true to the original concept of employing team colors for each graph. So instead of a wide range of colors one would normally see in a Gephi output, I was able to input the complementary colors for each group. Thus, one team color could be used for the graph background, while the other color (and its complements) could be used for the graph structure (nodes & edges). We’ll share the effect later in this post.
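If you would rather compute a complement than look it up, one common definition is to rotate the hue by 180 degrees while keeping lightness and saturation. The Python sketch below does that with the standard colorsys module. Treat it as an assumption: color-hex.com may derive its suggestions differently, and the `complement` function name is my own.

```python
import colorsys


def complement(hex_color):
    """Rotate the hue by 180 degrees, keeping lightness and saturation."""
    value = hex_color.lstrip("#")
    r, g, b = (int(value[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    r2, g2, b2 = colorsys.hls_to_rgb((h + 0.5) % 1.0, l, s)
    return "#{:02x}{:02x}{:02x}".format(*(round(c * 255) for c in (r2, g2, b2)))


# The two Red Sox hex values mentioned above.
for team_color in ("#c8102e", "#0c2340"):
    print(team_color, "->", complement(team_color))
```

Whatever method is used, the idea is the same: stay within a small, related palette rather than Gephi's default rainbow.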

Statistics

Graph statistics are critical to the full understanding of the structure of a network. While we can view a graph and begin to understand the general structure of a network, the various statistics will aid and reinforce our initial visual comprehension. Gephi provides a nice range of statistical measures to choose from:

  • Eccentricity (the greatest number of steps needed to reach any other node from a given node)
  • Centrality – betweenness, eigenvector, closeness, harmonic closeness (various measures of importance of an individual node)
  • Clustering coefficient (to discern cliques in the network)
  • Number of triangles (a friends of friends measure)
  • Modularity Class (clusters)
  • Degrees (the number of connections)
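To give a feel for a few of these measures, here is a small pure-Python sketch on a made-up four-node network. It only illustrates the textbook definitions of degree, eccentricity, triangles, and the local clustering coefficient; Gephi's own implementations may differ in details such as normalization.

```python
from collections import deque
from itertools import combinations

# A tiny made-up undirected network: A, B, C form a triangle; D hangs off C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def eccentricity(start):
    """Longest shortest-path distance from `start` to any other node (BFS)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return max(distance.values())


def triangles(node):
    """Count pairs of the node's neighbors that are themselves connected."""
    return sum(1 for u, v in combinations(sorted(graph[node]), 2) if v in graph[u])


def clustering(node):
    """Share of possible neighbor pairs that are actually connected."""
    k = len(graph[node])
    return 0.0 if k < 2 else triangles(node) / (k * (k - 1) / 2)


# Columns: node, degree, eccentricity, triangles, clustering coefficient.
for node in sorted(graph):
    print(node, len(graph[node]), eccentricity(node), triangles(node), round(clustering(node), 2))
```

Node C has the most connections and the lowest eccentricity, which is the kind of pattern these statistics help confirm once a graph gets too large to read by eye.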

Sizing

Node sizing is another key element of effective graph design. In this case, there were a few options I could pursue for node sizing – the number of seasons played (I used this in the 2014 graphs), one of the various centrality measures we calculated, or the number of degrees (connections) an individual player possesses. After computing each of these statistics, I eventually decided to use the number of degrees as a representation of influence in the graph. Visually, I want to show how many other players a single individual is related to, and using node size is an effective means of doing so.
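The idea of sizing by degree can be sketched as a simple linear scaling. This is an assumption about one reasonable approach, not a description of Gephi's internals; the size bounds of 10 and 50 and the degree counts are arbitrary illustrative values.

```python
def scale_sizes(degrees, min_size=10.0, max_size=50.0):
    """Linearly map each degree onto the range [min_size, max_size]."""
    low, high = min(degrees.values()), max(degrees.values())
    if high == low:  # every node has the same degree
        return {node: min_size for node in degrees}
    span = high - low
    return {
        node: min_size + (value - low) / span * (max_size - min_size)
        for node, value in degrees.items()
    }


# Hypothetical degree counts for three players.
degrees = {"player_a": 270, "player_b": 140, "player_c": 10}
sizes = scale_sizes(degrees)
print(sizes)
```

The most connected player gets the largest node, and everyone else falls proportionally in between.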

Summary

Our final graph in Gephi is shown below; the eventual web-based version will differ slightly and include additional functionality, but that’s for another post.

red_sox_20180610

Next Post

My third and final post in this series will address exporting this graph to the web using the sigma.js plugin, and making some additional customization to the web version. Thanks for reading, and see you soon!

 

 

 

 

 


Updating Baseball Team Networks – Part 1

A few years back, I used Gephi and sigma.js to create a series of interactive baseball team networks, one for each current MLB franchise. These networks displayed all players through the 2013 season, going all the way back to 1901 for the original American and National League franchises. Now that we have data through the 2017 season, it’s time for an update, not only from a data perspective, but also stylistically. This post will walk through the process of creating one of these networks using Toad for MySQL, Gephi, and sigma.js to create web-based interactive network visualizations.

Here’s a typical network from the 2013 series; the full list of networks can be found here. We’ll use the existing networks as a baseline for the new networks, although a few modifications will be made.

baseball_team_network_2013
a 2013 baseball team network for the Boston Red Sox

Source Data & MySQL Queries

Let’s start our discussion with the source data. Season-level baseball data is available through the seanlahman.com website, in the form of .csv files or Microsoft Access database tables. I use the .csv format, as it can be easily added to existing MySQL databases on the visual-baseball.com server. MySQL also makes it simple to add derived fields through some simple coding. These fields can be utilized later for a variety of activities.

For the purpose of our network graphs, there are a handful of critical fields we want to use. These include the following:

  • playerID, a unique identifier for every player who ever donned a major league uniform
  • player name, which can be used to provide a meaningful reference based on the playerID field
  • yearID, which refers to the season (or seasons) a player suited up for a specific franchise
  • franchID, a unique identifier for each MLB franchise

We also need to do a little manipulation of the source data in our code to deliver our results in the proper form for use in Gephi. This means we need to create two input files – one for nodes, and a second for edges. The nodes will contain information about each player, the number of seasons played for the franchise and the first and last seasons, which may differ from the number of seasons, as players frequently leave a franchise only to return later in their career. Here’s our node code:

SELECT Id, Label, MAX(Size) AS Size
FROM
(SELECT bp.playerID AS Id, CONCAT(bp.name, ' ', MIN(bp.yearID), '-', MAX(bp.yearID)) AS Label, COUNT(bp.yearID) AS Size
FROM BattingPlus bp
WHERE bp.franchID = 'BOS' AND bp.yearID >= 1901
GROUP BY bp.playerID, bp.name

UNION ALL

-- pitchers: the table name PitchingPlus is assumed here as the pitching counterpart of BattingPlus
SELECT pp.playerID AS Id, CONCAT(pp.name, ' ', MIN(pp.yearID), '-', MAX(pp.yearID)) AS Label, COUNT(pp.yearID) AS Size
FROM PitchingPlus pp
WHERE pp.franchID = 'BOS' AND pp.yearID >= 1901
GROUP BY pp.playerID, pp.name) a
GROUP BY Id
ORDER BY Id;

Here’s the simple interpretation – since we are attempting to display all players for a given franchise, we are executing a UNION ALL statement to combine batters and pitchers into a single result file. We have used the playerID field to create the required Id value for Gephi, while also creating a Label field by combining the player’s name with their first and last years playing for this franchise. Finally, we have created a Size field based on the number of seasons played for the franchise. We can then choose to use this in Gephi to size each node, if we so choose.
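The shape of the node output can also be sketched in Python. The rows and names below are made up, and the sketch counts distinct seasons across both inputs, which is a slight variation on the MAX(Size) step in the query rather than an exact translation.

```python
from collections import defaultdict

# Made-up (playerID, name, yearID) season rows for one franchise.
batting = [("smithj01", "John Smith", 1950), ("smithj01", "John Smith", 1951),
           ("smithj01", "John Smith", 1955), ("doej01", "Jim Doe", 1951)]
pitching = [("doej01", "Jim Doe", 1951), ("doej01", "Jim Doe", 1952)]


def node_rows(*tables):
    """Build Gephi node rows (Id, Label, Size) from season-level records."""
    seasons = defaultdict(set)
    names = {}
    for table in tables:
        for player_id, name, year in table:
            seasons[player_id].add(year)
            names[player_id] = name
    return [
        {"Id": pid, "Label": f"{names[pid]} {min(years)}-{max(years)}", "Size": len(years)}
        for pid, years in sorted(seasons.items())
    ]


nodes = node_rows(batting, pitching)
for node in nodes:
    print(node)
```

The hypothetical John Smith shows why Size and the year range can disagree: three seasons played, but spread across 1950 to 1955.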

We also need to create the edge file for Gephi. In this case, we want to understand how many seasons two players were on the same team. This code is a bit trickier, since we want to show only one connection between two players, since this will be an undirected graph. More on that distinction later. Here’s our edge code:

SELECT b.playerID AS Source, m.playerID AS Target, 'Undirected' AS Type, ' ' AS Id, ' ' AS Label, COUNT(*) AS weight
FROM
(SELECT a.playerID, CONCAT(m.nameFirst, ' ', m.nameLast) name, a.yearID, a.franchID

FROM Appearances a
INNER JOIN Master m
ON a.playerID = m.playerID

WHERE a.franchID = 'BOS' AND a.yearID >= 1901) b

INNER JOIN Appearances a
ON b.yearID = a.yearID AND b.franchID = a.franchID AND b.playerID <> a.playerID AND a.playerID > b.playerID
INNER JOIN Master m
ON a.playerID = m.playerID

GROUP BY b.playerID, a.playerID
ORDER BY b.playerID;

Here we use the Master table to provide player name information, and we also gather the ID information to match the node values. The critical piece in this code is in our join criteria:

INNER JOIN Appearances a
ON b.yearID = a.yearID and b.franchID = a.franchID and b.playerID <> a.playerID and a.playerID > b.playerID

Here we are matching players based on the same season and the same franchise. We then specify that we do not want to connect any player to himself, and that we want only values where the playerID value from our main query is greater than the playerID value from the sub-query. This gives us a single connection between two players, which is what we need for an undirected graph. We then define a Source node (required by Gephi) and a Target node (also required), as well as specifying ‘Undirected’ as the graph type. We leave the ID and Label values empty, and then summarize the number of seasons played together as an edge weight. This value can be used in Gephi to show the strength of a connection between two nodes (e.g., did they spend one season together, or 10 seasons together?).
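The pairing logic may be easier to see outside of SQL. This Python sketch, using made-up roster rows, produces one undirected edge per pair of teammates and counts shared seasons as the weight. Sorting each roster plays the same role as the greater-than comparison in the join.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Made-up (playerID, yearID) roster rows for a single franchise.
appearances = [("a01", 1950), ("b01", 1950), ("c01", 1950),
               ("a01", 1951), ("b01", 1951)]

# Group players into one roster per season.
rosters = defaultdict(set)
for player_id, year in appearances:
    rosters[year].add(player_id)

# Sorting each roster puts the smaller ID first in every pair, so a given
# pair of teammates is only ever counted under one (Source, Target) key.
weights = Counter()
for roster in rosters.values():
    for source, target in combinations(sorted(roster), 2):
        weights[(source, target)] += 1

edges = [
    {"Source": s, "Target": t, "Type": "Undirected", "Weight": w}
    for (s, t), w in sorted(weights.items())
]
for edge in edges:
    print(edge)
```

Players a01 and b01 shared two seasons, so their edge carries a weight of 2, while the other pairs get a weight of 1.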

After exporting each of these files to a .csv format, we have our source data for Gephi. In Part 2 our focus will shift to creating the network in Gephi.