All 30 MLB Radial Networks are now complete, and available for you to explore. One thing to notice is that each network will have a slightly different (or radically different) shape, depending on how many (or few) players started in a single season. If the team was in the midst of a successful run, the radians will tend to be short, as fewer rookies or acquired players will debut. On the other hand, teams that are retooling will tend to have long radians, as there are many new players making the team. This could also be reflected in the number of players getting a September call-up from the minors.
While these networks are pretty attractive to view as static images (IMHO), the real fun comes from the interactivity, where you can click, zoom, pan, and see all the details for who played with whom over the course of a franchise’s history. Note that this is based on seasonal rosters, so two connected players were not necessarily on the team at the same point in a season.
A few years back, I created network graphs for many MLB franchises, using data from the Lahman Baseball Database. These graphs displayed the connections between teammates throughout the history of a given franchise from 1901-2013. Any players who were on a roster within the same season (or seasons) were connected to one another, with each node in the network representing a single player. These were then sized to reflect the number of seasons played with that franchise. Every graph was customized by using one or more of their official team colors, resulting in a visualization like this:
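For readers who like to see the logic spelled out, here is a minimal Python sketch of the rule described above: players sharing a roster season get an edge, and each node is sized by seasons with the franchise. The function name, the (player, year) input shape, and the toy rows are my own assumptions for illustration, not the actual Lahman queries used for the graphs.

```python
from collections import defaultdict
from itertools import combinations


def build_teammate_network(roster_rows):
    """Build node sizes and teammate edges for one franchise.

    roster_rows: iterable of (player_id, year) pairs (hypothetical shape).
    Returns (nodes, edges) where nodes maps player -> seasons played and
    edges maps (player_a, player_b) -> number of shared seasons.
    """
    players_by_year = defaultdict(set)
    seasons_by_player = defaultdict(set)
    for player, year in roster_rows:
        players_by_year[year].add(player)
        seasons_by_player[player].add(year)

    # Any two players on the roster in the same season are connected.
    edges = defaultdict(int)
    for players in players_by_year.values():
        for a, b in combinations(sorted(players), 2):
            edges[(a, b)] += 1

    # Node size reflects the number of seasons with the franchise.
    nodes = {p: len(years) for p, years in seasons_by_player.items()}
    return nodes, dict(edges)


# Toy rows with placeholder ids, not real roster data.
rows = [("a", 1977), ("b", 1977), ("c", 1977), ("a", 1978), ("b", 1978)]
nodes, edges = build_teammate_network(rows)
```

The edge weight (shared seasons) is an extra I added for the sketch; the graphs themselves only need to know that a connection exists.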
A full roster of the live versions can be found here.
Now that 2019 data is available, I thought it was about time to update the graphs, but this time with a new look that might make the graphs a bit more intuitive and perhaps even more visually attractive. Out of this was born the radial axis franchise graph, as shown for the 1901-2019 Detroit Tigers:
I’m pretty excited about the look the Radial Axis layout provides for this sort of data, and I think you’ll see why it is an effective method for visualizing all the players over the course of 119 seasons. Let’s have a look at the anatomy of this graph.
The graph runs in a counter-clockwise manner, starting with 1901 at the bottom of the graph, and working all the way around until we get to 2019, also at the bottom of the graph. Each set of nodes along the way represents the collection of players with their first Tigers season in that radian. We can see some years where there were many new players (1912 has an exceptional number), while other years had very few new players, 1915 being an example. Here’s a general diagram to help with this concept:
We can also identify a handful of players with especially large nodes, which indicate the number of seasons with the Tigers. These are sorted to make the graph clean and easy to interpret; players with the most seasons will all be closest to the center of the graph, with teammates from the same starting year sorted from longest to shortest tenure. The perimeter of the graph will be populated almost exclusively with one-season players.
For context, let’s examine a few of the large nodes, and identify who they are:
I hope this is starting to make sense. Each radian represents a season, and each node on a radian depicts a player, sized by their longevity with the team. The third critical aspect of the graph is the connectivity between players, represented by the thin gray lines running between them. These are called edges, and are at the heart of a network graph. Let’s have a closer look at the edges for Alan Trammell, as one example.
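To make the anatomy concrete, here is a rough Python sketch of how positions on such a layout could be computed. It is not the Gephi layout code, just my reading of the description: one axis per season starting at the bottom and moving counter-clockwise, with players on each axis sorted from longest to shortest tenure. The function name, the radius parameters, and the sample players are all assumptions.

```python
import math


def radial_positions(players, first_year, last_year,
                     inner_radius=10.0, step=1.0):
    """Place players on one axis per debut season.

    players: list of (name, debut_year, seasons) tuples (hypothetical shape).
    Returns {name: (x, y)} in a y-up coordinate system; on a y-down screen
    the rotation direction would appear mirrored.
    """
    n_axes = last_year - first_year + 1
    by_year = {}
    for name, debut, seasons in players:
        by_year.setdefault(debut, []).append((seasons, name))

    positions = {}
    for debut, group in by_year.items():
        # Start at the bottom (-90 degrees) and move counter-clockwise.
        angle = -math.pi / 2 + 2 * math.pi * (debut - first_year) / n_axes
        # Longest tenure sits closest to the center of the graph.
        group.sort(key=lambda item: (-item[0], item[1]))
        for rank, (_seasons, name) in enumerate(group):
            r = inner_radius + rank * step
            positions[name] = (r * math.cos(angle), r * math.sin(angle))
    return positions


# Placeholder players; the year range is chosen so 1931 lands a quarter turn in.
sample = [("p1", 1901, 20), ("p2", 1901, 1), ("p3", 1931, 5)]
layout = radial_positions(sample, 1901, 2020)
```

With this arrangement the one-season players naturally end up on the perimeter, which matches what we see in the finished graph.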
If we click on the Alan Trammell node, the graph is reduced to him and the players he played with over his career – or at least those who were on the roster in those seasons. This is the fun part of the graphs, as it facilitates exploration and pattern discovery. Here is a portion of the Trammell network, zoomed in so we can see the connections:
Now the edges are a bit more visible, and the graph detail begins to reveal itself. Notice the multiple large nodes in line behind Trammell; it turns out that this is the celebrated 1977 class, many of whom would ultimately be members of the great 1984 World Series champs. So while the 1977 Tigers were not a good team, they were beginning to see the fruits of a strong minor league pipeline. In order, here are the players in that group, and their connection to the 1984 team:
Alan Trammell (20 seasons, WS Champ)
Lou Whitaker (19 seasons, WS Champ)
Jack Morris (14 seasons, WS Champ)
Lance Parrish (10 seasons, WS Champ)
Milt Wilcox (9 seasons, WS Champ)
Dave Rozema (8 seasons, WS Champ)
Trammell is obviously connected to many other players who started in different seasons, given his 20 years with the team. In fact, he has a degree measure of 333, representing the number of players with a connection to Trammell. Thus, we can also see many connections to players who started their Tigers journey in other seasons. The three large nodes to the upper right of Trammell are interesting – who are they?
Closest to Trammell is John Hiller (1965-80)
Next is Mickey Stanley (1964-78)
Followed by Willie Horton (1963-77)
What we have here are the three remaining holdovers from the 1968 World Series champs, finishing up their careers just as Trammell was starting his long tenure with the Tigers. Discovering patterns like this makes network graphs very interesting for me, and I hope you will also find them interesting. I’m currently in the process of refining the Tigers graph, which will be followed by graphs for each MLB franchise, once again using their team colors to provide some visual context. Hope you found this informative, and we’ll see you soon.
In Part 1 of this series, we looked at how to generate node and edge data for all players within a single franchise’s history. Part 2 examined how we could take that data and create a network using Gephi, adding graph statistical measures along the way. In this, the final part of the series, our focus is on moving the graph beyond Gephi and on to the web, where users can interact with the data and interrogate the player network using sigma.js software. So let’s pick up with the process of moving the network from Gephi to sigma.js.
Recall our basic network structure in Gephi, which looks like this:
One of our goals when we export the graph to the web is to enable user interaction, so the above graph becomes a bit less intimidating. As a reminder, this is at most a moderate-sized network; the need to provide interactive capabilities becomes even greater for large networks.
There are a few ways we can create files suitable for web deployment using Gephi. In this case, the choice is to use the simple sigma.js export plugin located at File > Export > Sigma.js template. Selecting this option will provide a set of options similar to this:
This template allows for a modest level of customization, including network descriptions, titles, author info, and other attributes relevant to the network. When all fields are filled to your satisfaction, click on the OK button to save the template. Your network will be saved to the location specified in the blank space at the top of the template window (grayed out in this case). A word of caution is in order here – if you make some custom entries to the template, and then make adjustments to your network, be sure to specify a new location to save the generated files. Otherwise, the initial set will be overwritten. This is especially critical if you have gone behind the scenes to customize colors, fonts, and other display attributes. More on that capability in a moment.
Once the template is complete and the OK button is clicked, a set of folders and files is generated that can then easily be copied to the web. Here’s a view of the created file structure:
These files and sub-folders are all housed within a single folder named ‘network’. If you wish to tinker with your graph in Gephi, rename the network folder to something else prior to exporting a second (or 3rd or 4th time). This will help keep you sane. 🙂
Without going into great detail here, let’s talk about the key files:
data.json stores all of your graph data, including positioning attributes, statistics created in Gephi, plus node and edge details
config.json contains many of the primary graph settings that can be easily edited for optimal web display. It’s quite easy to go through a trial and error process, since the file is so small. Simply make changes, then refresh your browser to see the result.
index.html has a few basic settings relevant to web display, most notably the title information that the browser will use
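If you want to sanity-check an export before putting it on the web, a few lines of Python will do. The sketch below assumes the exported data has top-level "nodes" and "edges" lists, with "id" and "label" on nodes and "source" and "target" on edges; those key names are my assumption, so confirm them against your own data.json. The sample is inline placeholder data rather than a real export.

```python
import json
from collections import Counter


def summarize_graph(data):
    """Return node count, edge count, and the best-connected nodes.

    Assumes (not guaranteed) the keys 'nodes', 'edges', 'id', 'label',
    'source', and 'target' are present in the exported structure.
    """
    degree = Counter()
    for edge in data.get("edges", []):
        degree[edge["source"]] += 1
        degree[edge["target"]] += 1
    labels = {n["id"]: n.get("label", n["id"]) for n in data.get("nodes", [])}
    return {
        "node_count": len(labels),
        "edge_count": len(data.get("edges", [])),
        "top": [(labels.get(i, i), d) for i, d in degree.most_common(3)],
    }


# Inline placeholder data; for a real export you would use
# json.load(open("network/data.json", encoding="utf-8")) instead.
sample = json.loads("""
{
  "nodes": [
    {"id": "n1", "label": "Player A"},
    {"id": "n2", "label": "Player B"},
    {"id": "n3", "label": "Player C"}
  ],
  "edges": [
    {"id": "e1", "source": "n1", "target": "n2"},
    {"id": "e2", "source": "n1", "target": "n3"}
  ]
}
""")
summary = summarize_graph(sample)
```

A quick check like this is handy after re-exporting, since it confirms the new files contain the network you expect before you overwrite anything on the server.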
Alright, now that we have had a brief view of the technical details, let’s have a look at the network graph in the browser. Note that this is still a bit experimental at this stage; I’m attempting to customize each graph based on the official team colors or close variations in the color family.
To see some of the interactive functionality, let’s select a specific player. Simply type Ted Williams (the greatest Red Sox batter of all time) in the search box, and view the results:
Now we see only the direct connections (a 1st degree ego network) for Ted Williams (a degree of 270 in this case), as well as a wealth of statistical information previously calculated in Gephi, seen in the right panel. At the bottom of the panel are hyperlinks where any one of the 270 connections may be clicked, allowing us to view their network. As you can see, sigma.js quickly provides great interactivity for graph viewers.
Even better, we can scroll in to the network at any time:
Hovering on a node generates a pop-up title for that node, as seen for Ted Williams in this instance. We also begin to see the names of other prominent players at this zoom level. Additional zooming will reveal more player titles – a great way to embed information without making the original graph visually chaotic by displaying all titles at every level.
For the current web version of this graph, click here. I’ll try to keep this version active, even if I make improvements to the final network. Once again, thanks for reading!
Ego networks are an interesting concept within the realm of network visualization using graph analysis, as they allow us to easily see direct connections within the network of a particular individual. Using Gephi, we can navigate large networks using this technique, which enables us to filter and view only those connections relevant to our current criteria. All remaining nodes and edges are simply filtered out from a visual perspective, giving a very clean look at individual networks. The ego network can be set to a depth of 1 if the goal is to show only direct connections, or to 2 or even 3 if our goal is to see the so-called “friends of friends” via indirect connections.
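The depth idea is easy to express in code. Here is a minimal Python sketch of an ego network as a breadth-first search that stops after a set number of hops. This is my own illustration of the concept, not Gephi's filter implementation, and the adjacency dictionary is placeholder data.

```python
from collections import deque


def ego_network(adjacency, ego, depth=1):
    """Return the set of nodes within `depth` hops of `ego`.

    adjacency: dict mapping each node to an iterable of its neighbors.
    depth=1 keeps direct connections only; depth=2 adds friends of friends.
    """
    distance = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# A simple chain a - b - c - d as placeholder data.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
direct = ego_network(chain, "a", depth=1)
friends_of_friends = ego_network(chain, "a", depth=2)
```

Note that the degree of the central node is simply the size of the depth 1 result minus one, since the ego itself is included in the set.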
My latest venture uses a network of all MLB players between 1901 and 2015, which consists of a somewhat unwieldy mass of nearly 17,000 players with close to 1.2 million connections. Even when we cluster the results using Gephi’s modularity class option, it is still a difficult network to navigate, both from a visual perspective and a resource allocation viewpoint. Here’s a view of the network as a whole:
While the modularity class coloring helps identify groups of related players, there is an awful lot of small detail that is not easily discerned, and the graph is computationally expensive, often crashing my version of Gephi if I try to do too many things with the full graph. Fortunately, ego networks are a great way to filter the data for greater understanding of some of the details within the network.
Using the ego network option as a filter, I am able to view the individual network of any player in the graph with ease. Here’s a look at my settings for the Miguel Cabrera ego network, and the resulting network, which is now a very manageable 300 nodes and 11k edges:
With a little editing in Gephi, such as increasing the size and adjusting the color for the central node, I can easily create a series of ego networks that can later be exported to a JSON format for use with Sigma.js. These can then be turned into interactive web-based networks quite easily. Here, we change the existing node settings so that the Cabrera node stands out in the graph. First, we locate Cabrera’s record in the data worksheet, and then select the node edit menu option:
This then takes us to the node properties, where size and color can be edited:
If this step causes some overlap in the graph, we can easily run the Noverlap layout algorithm to optimize graph spacing. Here’s a view of the completed Cabrera network after using Sigma.js and tweaking a few of the config settings:
As of now, there are five of these ego networks available for viewing on the visual-baseball site. They can be found here. I promise more to come in 2017 as time permits. Update – 25 networks as of 1/15/2017.
Welcome to Part 2 in our miniseries on building baseball (MLB) trade networks with Gephi. In the first post, the focus was on procuring and preparing the data using MySQL. The goal was to create nodes and edges that could be easily imported to Gephi. Gephi does allow for some data manipulation post-import, but I’ve learned from experience to do the main parts of the job with either SQL code or within spreadsheet software like Excel or Calc.
With our data readied for import, we’ll now move on to the more fun parts of the process, where we get to visualize the data and see any underlying patterns. Gephi is an ideal tool for this, as it allows us to try out many different algorithms, especially in version 0.8.2. The newer 0.9 versions are faster, but have not fully caught up on the plugin side at this writing, so options are a bit more limited. One other caveat – I frequently run into Java issues when using Gephi, so save your work often and be prepared to shut down and restart Gephi periodically.
We’ll kick off this part of the process by importing the data, nodes first, followed by edges. The reason I prefer this order is that nodes will be automatically created if we start with the edge file import, and they won’t contain any extra fields you may have added using your database or spreadsheet processes.
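As a small illustration of the two files, here is a Python sketch that writes a node table and an edge table using the column headers Gephi's spreadsheet import recognizes (Id and Label for nodes; Source, Target, Type, and Weight for edges). The extra Category column, the Undirected edge type, and the sample rows are assumptions of mine; the real files for this project came out of MySQL.

```python
import csv
import io


def write_node_table(stream, nodes):
    """Write nodes as (id, label, category) rows with Gephi-style headers."""
    writer = csv.writer(stream)
    # 'Category' is a hypothetical extra field that an edge-only import would lose.
    writer.writerow(["Id", "Label", "Category"])
    for node_id, label, category in nodes:
        writer.writerow([node_id, label, category])


def write_edge_table(stream, edges):
    """Write edges as (source, target, weight) rows with Gephi-style headers."""
    writer = csv.writer(stream)
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for source, target, weight in edges:
        # 'Undirected' is an assumption; use 'Directed' if direction matters.
        writer.writerow([source, target, "Undirected", weight])


# In-memory buffers keep the sketch self-contained; swap in
# open("nodes.csv", "w", newline="") to write real files.
node_buffer = io.StringIO(newline="")
edge_buffer = io.StringIO(newline="")
write_node_table(node_buffer, [("p1", "Player A", "player"), ("t1", "Team A", "team")])
write_edge_table(edge_buffer, [("p1", "t1", 1)])
```

Every Source and Target value in the edge table should match an Id in the node table, which is exactly why importing the nodes first keeps the extra fields intact.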
Here’s the node import window, showing the appropriate file input:
Once the node import process is complete, we turn to the edges file, and follow a similar sequence of steps. Here’s our starting point:
After the data has been imported, it’s time to move to the Overview tab, where we’ll see a dense mess of nodes and edges, especially if we have a fair sized dataset. Something like this:
Gephi offers a variety of interesting algorithms, each more or less appropriate based on the underlying dataset. In our case, the dataset is of a moderate size, with more than 8,000 nodes and nearly 62,000 edges. This immediately rules out the use of simple layouts such as the circular algorithms, as it would prove immensely challenging to display, even when we take it to an interactive output. At the other end of the sophistication level lie the force-directed layouts, which apply a significant dose of science and math within their respective algorithms. In Gephi, the Force Atlas 2 is quite popular, but it tends to run very slowly unless coupled with enormous levels of RAM. So where to go with our choice for this data?
I elected to take a two step approach, using the extremely fast (if less precise) OpenOrd algorithm for the original data. This provides a nice view of the network within a few minutes, making it a good starting point for our next steps.
The goal of this exercise was to create team level graphs, which will each have a small subset of the entire dataset. One easy way to achieve this is to use the Ego Network filter to select a single team and its connections. Setting the ego network to a depth of 1 limits the display to only first degree connections; in this case players traded to or from our selected team.
Once this step has been taken, we can then refine the display by applying another algorithm; in this case I have chosen the Yifan Hu option, and adjusted the settings until they created an aesthetically pleasing graph. The Yifan Hu adds further precision within each of the team graphs, and provides them with a common look & feel inasmuch as their respective data allows.
We have now completed our basic graph creation in Gephi, and can export our results to a variety of formats. Our choice here is to create a GEXF file, which we can then plug in to an existing template. There is one more step with respect to the GEXF data. In order to relate the graphs back to their respective teams, I chose to apply official team colors to elements in the graph. Specifically, each node should reflect the individual team, while the edges remain the same across all graphs so that users have a common understanding of the types of connections between players and teams. To update the node colors, simply use a code editor that can perform batch updates. I typically work with Brackets for this task, but use whichever tool you prefer. Here’s a view of the GEXF output prior to applying color changes:
Now here’s the updated version that reflects the Tigers navy blue coloring:
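For anyone who would rather script the batch update than use an editor's find and replace, here is a hedged Python sketch. It assumes node colors are stored as a color child element with r, g, and b attributes inside each node, and it registers the GEXF 1.2draft namespaces so the output keeps its prefixes; check both assumptions against your own file. The sample document and the navy RGB value are illustrative only, not the exact values from my graphs.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for a GEXF 1.2draft export; adjust to match your file.
ET.register_namespace("", "http://www.gexf.net/1.2draft")
ET.register_namespace("viz", "http://www.gexf.net/1.2draft/viz")


def recolor_nodes(gexf_text, rgb):
    """Set every node's color to `rgb`, leaving edge colors untouched.

    Returns (updated_xml_text, number_of_nodes_changed).
    """
    root = ET.fromstring(gexf_text)
    r, g, b = (str(value) for value in rgb)
    changed = 0
    for element in root.iter():
        # Match on the local tag name so the namespace URI does not matter.
        if element.tag.rsplit("}", 1)[-1] != "node":
            continue
        for child in element:
            if child.tag.rsplit("}", 1)[-1] == "color":
                child.set("r", r)
                child.set("g", g)
                child.set("b", b)
                changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# A tiny placeholder document, not a real export.
SAMPLE = """<gexf xmlns="http://www.gexf.net/1.2draft"
      xmlns:viz="http://www.gexf.net/1.2draft/viz" version="1.2">
  <graph>
    <nodes>
      <node id="1" label="Player A"><viz:color r="153" g="153" b="153"/></node>
      <node id="2" label="Player B"><viz:color r="153" g="153" b="153"/></node>
    </nodes>
    <edges>
      <edge id="0" source="1" target="2"><viz:color r="200" g="200" b="200"/></edge>
    </edges>
  </graph>
</gexf>"""

updated, count = recolor_nodes(SAMPLE, (12, 35, 64))  # an illustrative navy
```

Because only color elements sitting directly under a node are touched, the edge colors stay consistent across every team graph.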
After a bit of massaging the config and CSS settings, the result is visually appealing as well as highly functional. Here’s a zoomed in look at the Toronto Blue Jays network on the web:
All of the networks can be found by clicking the link below, with new ones being added until all teams have been represented:
This is part two of a brief series sharing components of my presentation titled Analyzing Complex Networks Using Open Source Software at ODSC East in Boston on May 21st. The first post looked at a few examples from a Boston Red Sox players network, while this one examines a Miles Davis album and musician network. I’ll share a few examples of network analysis within the context of the Miles Davis graph.
The Miles Davis network could be described as a tripartite network, or one with three layers. Miles is at the center, and connects to each of nearly 50 recordings. Other musicians then connect to the respective recording(s) they played on, but not to one another. This approach provides a very clear look at musical phases in the career of the legendary trumpeter, without the graph being clouded by excessive detail. Here’s a view of the final network, after which we’ll look at some components of the graph.
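The three-layer structure can be sketched in a few lines of Python. This is only an illustration of the edge rules described above (leader to album, musician to album, never musician to musician); the function name is my own, and the two albums carry just a partial list of personnel as sample input.

```python
def build_album_network(leader, albums):
    """Build edges for a leader -> album <- musician network.

    albums: dict mapping album title -> list of musicians on that album.
    Musicians connect only to albums, never directly to one another.
    """
    edges = []
    for album, musicians in albums.items():
        edges.append((leader, album))
        for musician in musicians:
            if musician != leader:
                edges.append((musician, album))
    return edges


# Partial personnel for two albums, used purely as sample input.
sample_albums = {
    "Milestones": ["John Coltrane", "Red Garland"],
    "Kind of Blue": ["John Coltrane", "Bill Evans"],
}
album_edges = build_album_network("Miles Davis", sample_albums)
```

A musician who appears on several albums simply picks up several album edges, which is what pulls those nodes toward a position between the recordings they played on.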
We see some interesting patterns in the graph, specifically in viewing the pink circles, which represent individual albums. Musicians playing on a recording can be seen adjacent to that recording, except in the case of musicians present on multiple albums. We would expect them to be positioned relative to all of the recordings they played on. A quick visual scan leads to five distinct clusters, as seen in the next screenshot.
Now that we have identified these clusters, it would be helpful to understand their meaning and relevance to Miles’s career. Using the graph in interactive fashion, we can learn more about the recordings and musicians, and begin to formulate some insights. These can be confirmed by referring to album links on the web or in Wikipedia, which give context to what we are viewing. Based on these steps, here is a quick overview of the five clusters.
A final step might be to add some verbiage using PowerPoint or Inkscape, which I’ve done below in very minimalist fashion. We could also add this to a web version using CSS attributes to position the text, although this could get tricky as we pan and zoom on the graph. We might be better off using some sort of stylized marker (color or shape) to communicate some of this information.
There is much more that could be done, but I hope this brief example shed some light on the usefulness of network graphs, especially from a pure visual perspective.
Hi – I just launched another network project courtesy of Gephi and Sigma.js, my two favorite tools of the moment. You can find it here, or in a full web version here. This one, like its immediate predecessor, is founded in politics, and more specifically in tracking political contributions – who gives to whom. The paths in this network detail thousands of political candidates, and the many PACs, corporations, foundations, and trade associations that help fund their campaign efforts. Of course these connections also create a sort of influence network that could never be achieved by individual voters, and help explain why so many decisions are made that run counter to the will of the people.
While this one doesn’t focus on dollar amounts, it nonetheless paints a compelling picture for how political influence is meted out. Fringe candidates, frequently outside the embedded American two party system, are depicted near the perimeter of the graph, receiving little or no support from most major donors. Incumbent Democrats and Republicans, on the other hand, are situated at the center of the network, receiving contributions from dozens or even hundreds of PACs, unions, corporations, and trade associations.
The multiple colors reflect the multitude of political parties (yes, beyond the dominant two-party monopoly) plus the hordes of contributors – corporations, unions, trade associations, and more.
One of the great features of interactive networks is the ability to dive into the details. For starters, let’s take a look at the Nancy Pelosi neighbor network, which should provide a nice glimpse into the donor network for an entrenched, influential Democratic candidate:
What we see is a well-connected network populated by dozens of contributors. Now let’s go to the other side of the aisle and take a look at the donor network of John Boehner, an influential Republican incumbent:
The Boehner network is even more dense than the Pelosi network. We should note that many contributing organizations may be found in both the Pelosi and Boehner camps, although the overlap will be somewhat mitigated by the Democrat versus Republican differences. What they do have in common are a huge number of contributors determined to influence policy, often at the expense of the voting public.
Our final screenshot displays many of the PACs in the network – more than 2,600 in total. The attribute pane on the right of the display will show each and every one of these when you use the category filter to the left of the screen:
I hope you find some value in navigating and learning more about the scores of organizations involved in trying to influence policy through congressional gatekeepers. Bear in mind we haven’t even touched on the unelected portions of the government residing in the halls of the CIA, FBI, and Department of Defense. That will be the subject of a future network.
As I worked through a just completed project chronicling the diverse musical career of Neil Young, some valuable (if unintended) insights were reinforced once more. I work on a regular basis with a variety of large datasets that require analysis, interpretation, and ultimately visualization and presentation. Often, these goals are not easily reconciled, which leads to unsatisfactory results across one or more of these factors.
As much as we as analysts need to depict the data accurately and meaningfully, if we don’t do so with an attractive visual approach we risk not having our message get communicated at all. Merely presenting our data in a table may technically get the job done, but is also likely to bore the reader to tears while simultaneously failing to deliver the key messages. At the other extreme, we can pull out individual bits of the data and spend our time creating flashy infographics that may capture attention but fail to represent the data in its proper context. All flash, no substance. Neither approach is terribly effective.
At the same time, we may present all of the information using a reasonable visual approach that preserves the integrity of the data while still falling short of creating a fulfilling user experience. This is what I recently experienced with the Neil Young project, as I’ll detail below.
After spending a few days getting the data from the AllMusic site into Excel, and eventually as node and edge files into Gephi, it was finally time to create the network data visualization. I was determined to attempt one of the many force-based methods used in network graph analysis to create the graph. These methods are very popular and useful for creating graphs out of a variety of data networks, allowing viewers to see the larger patterns at work within the data.
After a few iterations, I wound up with a serviceable graph that covered most of the basics I spoke of earlier – all the data was exposed, element types were sized and color-coded for easier interpretation, and the project was navigable via the web. Here’s a look:
Not bad, but there was something nagging at me as I viewed it, tweaked it, played with the styling, and so on. Everything was technically fine, but something was missing. So back I went to Gephi to find the answer. The next day, it occurred to me – I was using the wrong approach for the type of data I was trying to depict. Where the force-directed approach is ideal for dense, social media type networks, this was a unique network that didn’t possess the same structure. Therefore, it was not as aesthetically appealing or as intuitive as it could be.
After iterating through a few approaches, I came across a winner that best exploits the structure of the underlying data while conveying a far more intuitive feel to the end user. Why not have Neil at the center of the graph, surrounded by all of his albums, ordered by release date? On top of this, I could then have the style and mood data form an outer ring, as they needed only to link to the albums in some fashion. Now we have something that conveys the same information as the first attempt, but in a much more pleasing layout relative to this dataset. See for yourself:
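The idea behind this layout can be sketched as a pair of concentric rings in Python. This is my own simplified illustration, not the layout settings actually used in Gephi: the artist sits at the origin, albums are spaced around an inner ring in release order, and moods and styles fill an outer ring. The radii, function names, and mood tags are assumptions.

```python
import math


def ring(items, radius):
    """Space items evenly around a circle of the given radius."""
    count = len(items)
    return {
        item: (radius * math.cos(2 * math.pi * i / count),
               radius * math.sin(2 * math.pi * i / count))
        for i, item in enumerate(items)
    }


def concentric_layout(center, albums, tags,
                      album_radius=100.0, tag_radius=200.0):
    """Center node, albums on an inner ring by year, tags on an outer ring.

    albums: list of (title, release_year) tuples; tags: list of names.
    """
    ordered_titles = [title for title, _year in sorted(albums, key=lambda a: a[1])]
    positions = {center: (0.0, 0.0)}
    positions.update(ring(ordered_titles, album_radius))
    positions.update(ring(sorted(tags), tag_radius))
    return positions


# Two albums and two placeholder tags as sample input.
sample_layout = concentric_layout(
    "Neil Young",
    [("Harvest", 1972), ("After the Gold Rush", 1970)],
    ["mood_a", "mood_b"],
)
```

The edges need no special handling here; once albums and tags have fixed positions, the links between them simply span the gap between the two rings.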
The new version addresses the issues of aesthetics and intuition where the first graph fell short. All moods and styles are now easily found; the same is true for all albums. Highlighting a single mood (or album) also provides an information-rich view for how the music changed over periods in Young’s career. This was nearly impossible to see in the initial layout.
So the message is this: effective visualizations don’t need to sacrifice aesthetics and intuition. On the contrary, they should take advantage of these attributes to increase their appeal and impact. Don’t be afraid to experiment until you find the right formula, as it seldom presents itself the first time around, and trust your instincts.
I’m having a lot of fun creating network graphs using Gephi, and I’m excited about the possibilities for displaying a wide range of baseball information. My initial pass at showing team connections uses the Tigers, with this version incorporating all players from 1901 through 2013. Try it out here.
This is done using a radial axis layout with Chinese Whispers Clustering, based on a paper by Chris Biemann. The colors along each of the axes are based on the clusters created by this algorithm. Once again, I’ve used Sigma.js to create the interactive version, so you can dive into the graphic and gain an understanding of how the data is displayed. Here’s a static view of the graph:
Kind of colorful, isn’t it? For those of you who are long-time Tiger fans, you’ll soon detect that each cluster (color) represents a specific era in Tiger history, and by clicking on individual nodes, you’ll be able to see which players connect across multiple clusters. Typically, this will be players like Alan Trammell, Ty Cobb, and others who played many years with the club, and thus transcend their own cluster position.
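For the curious, the label-passing idea behind Chinese Whispers can be sketched in a short Python function. This is a simplified illustration and not the Gephi plugin's code: every node starts in its own class, then repeatedly adopts the class with the greatest total edge weight among its neighbors. One deliberate departure, made only for reproducibility, is that ties go to the smallest label rather than being broken at random. The adjacency format and sample graph are assumptions.

```python
import random


def chinese_whispers(adjacency, iterations=10, seed=0):
    """Cluster nodes by repeatedly adopting the strongest neighboring label.

    adjacency: dict mapping node -> {neighbor: edge_weight}.
    Returns {node: cluster_label}. Ties go to the smallest label here,
    which is a simplification of the usual random tie-breaking.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # each node starts alone
    nodes = list(adjacency)
    for _ in range(iterations):
        rng.shuffle(nodes)  # visit nodes in a random order on each pass
        for node in nodes:
            scores = {}
            for neighbor, weight in adjacency[node].items():
                label = labels[neighbor]
                scores[label] = scores.get(label, 0) + weight
            if scores:
                best = max(scores.values())
                labels[node] = min(l for l, s in scores.items() if s == best)
    return labels


# Two separate triangles as placeholder data.
triangles = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1},
    "x": {"y": 1, "z": 1}, "y": {"x": 1, "z": 1}, "z": {"x": 1, "y": 1},
}
clusters = chinese_whispers(triangles)
```

On a franchise graph, players from the same era share far more edges with each other than with anyone else, which is why the resulting clusters line up so neatly with periods in team history.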
Not sure if this will be the style for all the franchise graphs I plan to do this year, but it feels close, given the ability to display more than 1500 players and nearly 48,000 connections without having things too cluttered.
Currently I’m way into using Gephi with the Sigma.js plugin, which takes the cool graph output from Gephi and makes it interactive and absolutely fun to play with for anyone interested in either network graphs or baseball history – or both.
Just recently, I created a piece featuring the career connections of Octavio Dotel, who has played for a record 13 MLB franchises, and may wind up with number 14 this season. Now, I’m moving to the team level, and have just created a 1901-1949 Detroit Tigers network as a test. Ultimately, I may wind up getting everything from 1901 through 2013 in the graph, but need to test for usability first.
Have a go at it here or visit the Network Graphs portfolio page to view this graph and others.