Pennant Race Charts Updated!

The last of my big three annual updates is now complete, as all 2016 & 2017 pennant race charts have been created, and now reside in the Visual-Baseball Project portfolio. These charts are created using NVD3, which is built on top of the powerful d3.js framework developed by Mike Bostock. These tools help make the charts highly interactive, allowing you to see where each team stands at any given point in the season, and also providing the ability to zoom in using a smaller sub-chart beneath the primary display.

The structure of the charts is based on every team’s relationship to a .500 winning percentage – a situation where a team wins exactly as many games as it loses. This structure allows for easy interpretation of the results, as we can see which teams hover near the .500 mark (i.e.- consistent mediocrity), others that rise well above this level, and also those teams that descend far below the breakeven point. Allow me to illustrate these thoughts using the 2017 American League Central division, and my hometown Detroit Tigers, who suffered through their worst season since 2003.

2017_AL_Central

As you can see, the darker orange line representing the Tigers takes a steep dive starting in early August, culminating in a final record 34 games below the .500 percentage. Meanwhile, the rival Cleveland Indians (light blue line) present a near mirror image of the Tigers failure, with a sensational month of September that ultimately lands then 42 games over the .500 break-even level.

Similar charts have been created for the other divisions for both the 2016 & 2017 seasons. In fact, you can now view any season, league, and divisional splits dating back to the 1901 campaigns, a total of 380 pennant races to explore! Find all the pennant race charts here. Have fun exploring, and as always, thanks again for reading!

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Baseball Game Summaries Updated!

Thanks to some unusually cold and rainy weather, I’ve been able to focus on updating both my source databases as well as some of the visualizations built from the data. That’s a roundabout way of saying that the baseball Game Summary exhibits have been updated for both the 2016 & 2017 seasons. They can be found in the portfolio section of the site by following this link.

As a refresher, the baseball game summaries give you a sort of visual box score for every game played in a season, featuring the line score for the game, winning and losing pitchers, attendance, and much more information pertaining to each specific game. The real power comes from the ability to filter results to find all games that match specific criteria.

game_summary_filter_1

As you can see, there are many available filter options, right down to who the home plate umpire is for every game.

Here’s a quick illustration of how the filters can be used. We’ll filter 2017 results where Clayton Kershaw was the starting pitcher at home, and gave up 4 home runs (a very rare event!). First, we select Kershaw as the Home Starter, and then we open the Visitor HR filter, and select 4 (there’s just one instance). We can then apply these filters to see at which game this unusual event took place.

Kershaw_4HR

Closing the filter window, we see the single game box score returned by our filters:

Kershaw_4HR_game

Ironically, we can see that the Dodgers not only won this game, with Kershaw as the winning pitcher, but that they too hit 4 home runs (Home HR in the box score). We can also see that the Mets struck out 13 times (Visitor SO) and the Dodgers 12 times (Home SO). Must have been a wild day at Dodger Stadium on June 19th for the 43,266 in attendance!

As you can see, a lot of information can be gleaned using just a couple of selections to filter the data. There are nearly endless possibilities for using the filters to return the information that most interests you. So have a look at the game summaries and any other items in the portfolio section. Enjoy, and thanks for reading!

 

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Batting Explorers Updated Through 2017 Season

I’m pleased to share that the Batting Explorer visualization for the 2010s decade has now been updated with 2016 & 2017 statistics. This ongoing project captures batting statistics at the season level for every major league batter, and visualizes them in a baseball card type of format, as seen below:

Batting Explorer
Batting Explorer

A number of filters are provided to make it easy to browse across a wide range of attributes, including all of the major batting categories:

Filters
Filters

One of my favorite aspects of the Batting Explorer is the ability to link to greater detail by clicking on a specific player card, which will transport you to the Baseball-Reference page for that player:

baseball_ref_1

This project uses the Exhibit project software originally developed years ago as part of the MIT Simile project, as well as a lot of HTML & CSS for styling purposes. Give it a try, and thanks for reading.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

2017 Baseball Data Updates Coming Soon!

Happy to report that the annual 2017 baseball data has been downloaded from seanlahman.com, meaning it’s time to start the annual updates. This data is a great resource for building many of my data visualizations that are featured in the portfolio section of this site. Likewise, the Retrosheet game event data has also been downloaded, meaning the onus is now on me to run the various upload and update processes for my databases.

Looking forward to putting these resources to work as I make updates to the following data visualizations (among others):

Stay tuned for updates on these and other projects, and please pay a visit to my JazzGraphs site, where the focus is on network graph analysis of jazz artists, labels, releases, and songs. Thanks for reading!

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Recapping 2017

Observers of this blog will note that posts were scarce in 2017 – in fact this is the only one, and it’s being completed in 2018! This is the result of a variety of causes, including external projects, busy schedules, and focus that was shifted in other, unrelated directions. Still, 2017 was not without its moments.

For starters, I managed to create three data visualization courses for Packt:

Learning Data Visualization

Data Visualization Techniques

Advanced Data Visualization

Retrosheet data for the 2016 and 2017 seasons has also been downloaded, and is in the update process as we speak, which will enable some new visualization work (and perhaps a new book title) in 2018. Soon, annual season data from the Baseball-Databank and Sean Lahman will be available as well.

I’m also in the process of launching a new site at jazzgraphs.com, where I’ll use network visualizations to uncover the complex web of relationships between jazz musicians, labels, and recordings. Posters and a book are in the plans for 2018, so stay tuned.

Wishing all a happy and prosperous 2018, and I promise more content to come this year!

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Player Ego Network Visualization

Ego networks are an interesting concept within the realm of network visualization using graph analysis, as they allow us to easily see direct connections within the network of a particular individual. Using Gephi, we can navigate large networks using this technique, which enables us to filter and view only those connections relevant to our current criteria. All remaining nodes and edges are simply filtered out from a visual perspective, giving a very clean look at individual networks. The ego network can be set to a depth of 1 if the goal is to show only direct connections, or to 2 or even 3 if our goal is to see the so-called “friends of friends” via indirect connections.

My latest venture uses a network of all MLB players between 1901 and 2015, which consists of a somewhat unwieldy mass of nearly 17,000 players with close to 1.2 million connections. Even when we cluster the results using Gephi’s modularity class option, it is still a difficult network to navigate, both from a visual perspective and a resource allocation viewpoint. Here’s a view of the network as a whole:

mlb_players_20161230

While the modularity class coloring helps identify groups of related players, there is an awful lot of small detail that is not easily discerned, and the graph is computationally expensive, often crashing my version of Gephi if I try to do too many things with the full graph. Fortunately, ego networks are a great way to filter the data for greater understanding of some of the details within the network.

Using the ego network option as a filter, I am able to view the individual network of any player in the graph with ease. Here’s a look at my settings for the Miguel Cabrera ego network, and the resulting network, which is now a very manageable 300 nodes and 11k edges:

mlb_ego_filter_20161230

With a little editing in Gephi, such as increasing the size and adjusting the color for the central node, I can easily create a series of ego networks that can later be exported to a JSON format for use with Sigma.js. These can then be turned into interactive web-based networks quite easily. Here, we change the existing node settings so that the Cabrera node stands out in the graph. First, we locate Cabrera’s record in the data worksheet, and then select the node edit menu option:

mlb_edit_node1_20161230

This then takes us to the node properties, where size and color can be edited:

mlb_edit_node2_20161230

If this step causes some overlap in the graph, we can easily run the Noverlap layout algorithm to optimize graph spacing. Here’s a view of the completed Cabrera network after using Sigma.js and tweaking a few of the config settings:

Cabrera_20161130

As of now, there are five of these ego networks available for viewing on the visual-baseball site. They can be found here. I promise more to come in 2017 as time permits. Update – 25 networks as of 1/15/2017.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Updated Batting Explorer Baseball Visualizations

One of my ongoing visualization projects has been the Batting Explorer, a semantic-based discovery concept built using the Simile Exhibit open source tools. I’ve been updating this on a not very timely basis the last few years, but have now caught up through the 2015 season. Of course, 2016 stats will be available in the next 2-3 months, so it will be time to repeat the process once more. For the moment, I’ve just added the 2014 & 2015 seasons into one of the decade-based examples, so you can now search for all batters covering seasons from 1901 through 2015.

The explorers work in much the same way as many travel sites you’ve visited on the web. Each page can be filtered using a wide array of facets (filters) that allow you to quickly narrow down results by team, season, batting category, and a bunch of other options. I’ll show this in a moment. Let’s first start with a basic view of the 2010-15 explorer:

Batting Explorer
Batting Explorer

Each of the Batting Explorers has a consistent look & feel, with the underlying data as the only difference. All individual player-season combinations are laid out in a baseball card sort of format, although you can’t flip them over or get any bubble gum either :) Nonetheless, each card contains a wealth of information, including the number of games played by position, laid out on a baseball diamond. In addition, hovering over a card loads a pop-up summary of the season for each individual batter, as seen here:

Mouseover for Info
Mouseover for Info

An additional benefit comes when you click on a selected card. Every batter card has a personalized link to the massive Baseball-Reference.com site. Here’s what you’ll see when clicking on the Juan Uribe link:

Juan Uribe at Baseball-Reference
Juan Uribe at Baseball-Reference

I mentioned earlier the ability to filter using a wide range of facets. Here’s a glimpse of the many categorical and numerical options present in each Batting Explorer:

Filters
Filters

As you can see, there are dozens of possible filters that can be used. If you want to see only batters with more than 40 home runs in a season, simply select the HR facet and check the conveniently provided ranges. Or how about viewing players from a single team? Simple, using the Team facet. Likewise for filtering by season, number of doubles, stolen bases, walks, strikeouts, and so much more. And of course these filters can be used together to quickly find matching results.

Finally, there are a multitude of sort capabilities, or you can choose to have nothing sorted. If you do wish to choose one or more sort attributes, here are your options:

Sort options
Sort options

Your sorts can be many layers deep – just keep adding variables!

This has been a very brief overview – to learn more, go to the Portfolio section and begin exploring! While you’re on the site, take some time to view the Game Summary exhibits, set up in much the same fashion using Exhibit. Or, if networks are your thing, check out a large collection of franchise player or team trade networks. Hope you enjoy the site, and thanks for reading.

Batting Explorers

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Major League Baseball Trade Networks, Part 2

Welcome to Part 2 in our miniseries on building baseball (MLB) trade networks with Gephi. In the first post, the focus was on procuring and preparing the data using MySQL. The goal was to create nodes and edges that could be easily imported to Gephi. Gephi does allow for some data manipulation post-import, but I’ve learned from experience to do the main parts of the job with either SQL code or within spreadsheet software like Excel or Calc.

With our data readied for import, we’ll now move on to the more fun parts of the process, where we get to visualize the data and see any underlying patterns. Gephi is an ideal tool for this, as it allows us to try out many different algorithms, especially in version 0.8.2. The newer 0.9 versions are faster, but have not fully caught up on the plugin side at this writing, so options are a bit more limited. One other caveat – I frequently run into Java issues when using Gephi, so save your work often and be prepared to shut down and restart Gephi periodically.

We’ll kick off this part of the process by importing the data, nodes first, followed by edges. The reason I prefer this order is that nodes will be automatically created if we start with the edge file import, and they won’t contain any extra fields you may have added using your database or spreadsheet processes.

Here’s the node import window, showing the appropriate file input:

Gephi node import window
Gephi node import window

Once the node import process is complete, we turn to the edges file, and follow a similar sequence of steps. Here’s our starting point:

Gephi edge import window
Gephi edge import window

After the data has been imported, it’s time to move to the Overview tab, where we’ll see a dense mess of nodes and edges, especially if we have a fair sized dataset. Something like this:

Impossibly dense hairball network
Impossibly dense hairball network

Gephi offers a variety of interesting algorithms, each more or less appropriate based on the underlying dataset. In our case, the dataset is of a moderate size, with more than 8,000 nodes and nearly 62,000 edges. This immediately rules out the use of simple layouts such as the circular algorithms, as it would prove immensely challenging to display, even when we take it to an interactive output. At the other end of the sophistication level lie the force-directed layouts, which apply a significant dose of science and math within their respective algorithms. In Gephi, the Force Atlas 2 is quite popular, but it tends to run very slowly unless coupled with enormous levels of RAM. So where to go with our choice for this data?

I elected to take a two step approach, using the extremely fast (if less precise) OpenOrd algorithm for the original data. This provides a nice view of the network within a few minutes, making it a good starting point for our next steps.

Trade network with OpenOrd layout
Trade network with OpenOrd layout

The goal of this exercise was to create team level graphs, which will each have a small subset of the entire dataset. One easy way to achieve this is to use the Ego Network filter to select a single team and its connections. Setting the ego network to a depth of 1 limits the display to only first degree connections; in this case players traded to or from our selected team.

Ego network with depth = 1
Ego network with depth = 1

Once this step has been taken, we can then refine the display by applying another algorithm; in this case I have chosen the Yifan Hu option, and adjusted the settings until they created an aesthetically pleasing graph. The Yifan Hu adds further precision within each of the team graphs, and provides them with a common look & feel inasmuch as their respective data allows.

We have now completed our basic graph creation in Gephi, and can output our results to a variety of output formats. Our choice here is to create a GEXF file, which we can then plug in to an existing template. We do have another step with respect to the GEXF data. In order to relate the graphs back to their respective teams, I chose to apply official team colors to elements in the graph. Specifically, each node should reflect the individual team; we want the edges to remain the same across all graphs so that users have a common understanding for the types of connections between players and teams. So to update the node colors, simply use a code editor that can perform batch updates. I typically work with Brackets for this task, but choose your tool of choice. Here’s a view of the GEXF output prior to applying color changes:

GEXF nodes with original colors
GEXF nodes with original colors

Now here’s the updated version that reflects the Tigers navy blue coloring:

GEXF nodes updated with team colors
GEXF nodes updated with team colors

Once this step has been completed, I can upload the files to the web server, where other minor changes can be made. These include any updates to the CSS styling, adjustments to the config.js file, and minor changes to the index.html file so the proper team information is displayed. The easiest way to do this is to create a common set of directories for the basic javascript and CSS files, leaving only the individual config, html, and gexf files in each team’s directory.

After a bit of massaging the config and CSS settings, the result is visually appealing as well as highly functional. Here’s a zoomed in look at the Toronto Blue Jays network on the web:

Blue Jays trade network close-up
Blue Jays trade network close-up

All of the networks can be found by clicking the link below, with new ones being added until all teams have been represented:

Trade Networks

If you want to see each network in its own window or tab, use the right-click options in Firefox or Chrome.

I hope this has been helpful, and please feel free to reach out via the LinkedIn or Facebook Gephi groups, or leave a comment on this site.

Thanks for reading!

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Major League Baseball Trade Networks, Part 1

One topic that has long fascinated me as network graph material is trade data between major league baseball (MLB) teams. I have previously created a static visualization showing activity at a macro level, i.e.- the number of trades between teams over a 100+ year period. Yet there was a desire to do something more, and to make it interactive so users would be able to sift through the data for their favorite teams to understand trade patterns through a visual representation. Today, after weeks spent tinkering with this topic, I finally have something to share, and will walk through how it is created and how to engage with it online. If you want to play with it before reading further, visit the Tigers trade network.

Here’s what we’ll wind up with:

Tigers trade network
Tigers trade network

My tools of choice in this endeavor are familiar ones to anyone working with baseball data, network graphs, or perhaps both, although I haven’t seen many instances of the latter. The trade data can be found at Retrosheet, as part of a seemingly boundless array of baseball data, both statistical and historical in nature. Gephi, the open source network analysis tool is again my choice for creating the network structure from the raw data, and Sigma.js is once more the tool for web implementation. Mix in a bit of Excel and PowerPoint for good measure, and we have all the tools necessary to create a pretty cool (IMHO) finished graph.

So let’s get started. Our first step is to go to the Retrosheeet site and download trade data, found at Retrosheet transactions. Be aware that there is much more than trade data in this dataset; free agent transactions, releases, and many other transaction types are available. My approach is to grab the entire dataset, which I can then load into a MySQL database for filtering and matching to other baseball data from both Retrosheet and the Lahman archives. For our example, only trades will be used; this leaves open the future possibility to examine free agent signings or other transaction types.

Once I have the data in MySQL (I’m purposely skipping over this process), the coding steps begin. This was a very iterative process as I gradually figured out how Gephi would play with the output data, but I won’t bore you with my multiple missteps. Instead, let’s have a look at the code snippets, and I’ll explain their usage and the thought process behind them. We’ll start with a view of the code, created within the (free) Toad for MySQL tool. In creating this code, we need to understand how Gephi (or other network analysis tools) work. At the risk of over-simplifying, Gephi only needs nodes and edges. Nodes will represent the players or teams in our visual, while edges will show the linkages within a single trade – who was traded for whom, and which players moved together from one organization to another.

Node creation is simple – we just grab all players involved in a trade, and do likewise for all teams. Here’s the code for players:

SELECT t.Player as Player, CONCAT(m.nameFirst, ” “, m.nameLast) as Name, count(*) as transactions

FROM trades2015 t
INNER JOIN Master m
ON t.Player = m.retroID

WHERE t.Type = ‘T’ and (t.TeamFrom > ‘A’ OR t.TeamTo > ‘A’)
GROUP BY t.Player, CONCAT(m.nameFirst, ” “, m.nameLast)

All we’re doing here is creating a node size for each player, based on the number of trades they are involved in.

For Teams, the logic is a bit more complex; since team names have changed from season to season, we need to join on both team and season to get the correct name assignments. We also want to account for the direction of each transaction, which we do using a UNION query.

SELECT b.Team AS Id, b.Name As Label, SUM(b.transactions) as Size
FROM
(SELECT t.TeamFrom as Team, te.name as Name, count(*) as transactions

FROM trades2015 t
INNER JOIN Teams te
ON te.teamID = t.TeamFrom and t.Season = te.yearID

WHERE t.Type = ‘T’ and t.TeamFrom > ‘A’
GROUP BY t.TeamFrom, te.name

UNION ALL

SELECT t.TeamTo as Team, te.name as Name, count(*) as transactions

FROM trades2015 t
INNER JOIN Teams te
ON te.teamID = t.TeamTo and t.Season = te.yearID

WHERE t.Type = ‘T’ and t.TeamFrom > ‘A’
GROUP BY t.TeamTo, te.name) b
GROUP BY b.Team, b.Name

After running the queries, we have results that can be posted into Excel or other spreadsheet software, where a tab-delimited file can be saved for use in Gephi. Our file data looks like this:

Id Label Size
aardd001 David Aardsma 4
aaroh101 Hank Aaron 1
aased001 Don Aase 1
abadf001 Fernando Abad 1
abbae101 Ed Abbaticchio 1
abbeb101 Bert Abbey 1
abbof101 Fred Abbott 1
abboj001 Jim Abbott 2

and for the team entries:


OAK Oakland Athletics 936
PH4 Philadelphia Athletics 6
PHA Philadelphia Athletics 355
PHI Philadelphia Blue Jays 28
PHI Philadelphia Phillies 1445
PHI Philadelphia Quakers 3
PIT Pittsburg Alleghenys 9
PIT Pittsburgh Pirates 1416

This is all Gephi requires for displaying nodes – an ID, a Label, and size. Even the label and size are not required fields, but they do make things easier if done in advance. So far, so good. Next we’ll move on to the somewhat more involved process of creating edge files.

As I progressed deeper into this project, it became evident to me that there were four different types of edges to display. The first two were obvious and easy – players being traded to a team, or from a team. Yet I also wanted to see the other players involved in each transaction, which necessitated the addition of two more edge type – traded with other players, and traded for other players. Note that in many cases just two or three of these might come into play, and for many prominent players, we’ll have none at all. Thus, the likes of an Al Kaline or Ted Williams will not be found in any of these graphs, as they remained with a single team for their entire careers.

Here’s the final edge code I wound up with to create the four categories of trades to be displayed in a graph. Gephi requires three edge attributes – a source value, a target value, and an edge type. The edge type must be either undirected or directed; for our graph, all edges will be directed, since we intend to show the bi-directional movements within each transaction. The first bit of code is for instances where a player was traded from a team:

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.TeamFrom AS Source, tr.Player as Target,
CASE WHEN tr.Type = ‘T’ THEN ‘Trade’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Release’ END as Category,
CONCAT(m.nameFirst, ” “, m.nameLast, ” “, CASE WHEN tr.Type = ‘T’ THEN ‘Traded’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Released’ END, ” on “, tr.PrimaryDate, ” from “, t.name) AS Label, ‘Directed’ as Type, ‘Traded From’ as CategoryDetail

FROM trades2015 tr
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Teams t ON tr.TeamFrom = t.teamIDretro and t.yearID = tr.season

WHERE tr.type = ‘T’

Note the legacy code covering free agency and releases, rendered moot by the WHERE clause. These will have to wait for another set of graphs. In a similar fashion we have code for trades where a player comes to a team.

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.Player AS Source, tr.TeamTo as Target,
CASE WHEN tr.Type = ‘T’ THEN ‘Trade’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Release’ END as Category,
CONCAT(m.nameFirst, ” “, m.nameLast, ” “, CASE WHEN tr.Type = ‘T’ THEN ‘Traded’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Released’ END, ” on “, tr.PrimaryDate, ” to “, t.name) AS Label, ‘Directed’ as Type, ‘Traded To’ as CategoryDetail

FROM trades2015 tr
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Teams t ON tr.TeamTo = t.teamIDretro and t.yearID = tr.season

WHERE tr.type = ‘T’

Next, it’s time to create linkages with players from the same transaction, first those moving in the same direction (traded with) in the trade.

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.Player AS Source, tr2.Player AS Target,
CASE WHEN tr.Type = ‘T’ THEN ‘Trade’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Release’ END as Category,
CONCAT(m.nameFirst, ” “, m.nameLast, ” “, CASE WHEN tr.Type = ‘T’ THEN ‘Traded’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Released’ END, ” on “, tr.PrimaryDate, ” with “, m2.nameFirst, ” “, m2.nameLast) AS Label, ‘Directed’ as Type,
‘Traded With’ as CategoryDetail

FROM trades2015 tr
INNER JOIN trades2015 tr2
ON tr.TransactionID = tr2.TransactionID
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Master m2 ON tr2.player = m2.retroID

WHERE tr.type = ‘T’

Note the need to duplicate the Master table in the code, since we now require multiple player names to populate the Source and Target fields in Gephi. The same holds true for our last snippet, where players are traded for one another.

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.Player AS Source, tr2.Player AS Target,
CASE WHEN tr.Type = ‘T’ THEN ‘Trade’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Release’ END as Category,
CONCAT(m.nameFirst, ” “, m.nameLast, ” “,CASE WHEN tr.Type = ‘T’ THEN ‘Traded’ WHEN tr.Type = ‘F’ THEN ‘Free Agent Signing’ WHEN tr.Type = ‘Fg’
THEN ‘Free Agent Granted’ WHEN tr.Type = ‘R’ THEN ‘Released’ END, ” on “, tr.PrimaryDate, ” for “, m2.nameFirst, ” “, m2.nameLast) AS Label, ‘Directed’ as Type,
‘Traded For’ as CategoryDetail

FROM trades2015 tr
INNER JOIN trades2015 tr2
ON tr.TransactionID = tr2.TransactionID
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Master m2 ON tr2.player = m2.retroID

WHERE tr.type = ‘T’

Each of these bits of code outputs results, which are then copied and pasted into our edges spreadsheet. Here are five rows showing each of our four trade categories:

Season TransactionID PrimaryDate Source Target Category Label Type CategoryDetail
2010 62908 20100731 KCA ankir001 Trade Rick Ankiel Traded on 20100731 from Kansas City Royals Directed Traded From
2010 60709 20100831 TEX ariaj001 Trade Joaquin Arias Traded on 20100831 from Texas Rangers Directed Traded From
2010 62264 20101118 COL barmc001 Trade Clint Barmes Traded on 20101118 from Colorado Rockies Directed Traded From
2010 72627 20101217 TBA bartj001 Trade Jason Bartlett Traded on 20101217 from Tampa Bay Rays Directed Traded From
2010 72622 20100709 TEX beavb001 Trade Blake Beavan Traded on 20100709 from Texas Rangers Directed Traded From

2010 62908 20100731 ankir001 ATL Trade Rick Ankiel Traded on 20100731 to Atlanta Braves Directed Traded To
2010 60709 20100831 ariaj001 NYN Trade Joaquin Arias Traded on 20100831 to New York Mets Directed Traded To
2010 62264 20101118 barmc001 HOU Trade Clint Barmes Traded on 20101118 to Houston Astros Directed Traded To
2010 72627 20101217 bartj001 SDN Trade Jason Bartlett Traded on 20101217 to San Diego Padres Directed Traded To
2010 72622 20100709 beavb001 SEA Trade Blake Beavan Traded on 20100709 to Seattle Mariners Directed Traded To

2010 62908 20100731 ankir001 blang001 Trade Rick Ankiel Traded on 20100731 for Gregor Blanco Directed Traded For
2010 62908 20100731 ankir001 chavj001 Trade Rick Ankiel Traded on 20100731 for Jesse Chavez Directed Traded For
2010 62908 20100731 ankir001 collt001 Trade Rick Ankiel Traded on 20100731 for Tim Collins Directed Traded For
2010 60709 20100831 ariaj001 franj004 Trade Joaquin Arias Traded on 20100831 for Jeff Francoeur Directed Traded For
2010 72627 20101217 bartj001 figuc001 Trade Jason Bartlett Traded on 20101217 for Cole Figueroa Directed Traded For

2010 62908 20100731 ankir001 farnk001 Trade Rick Ankiel Traded on 20100731 with Kyle Farnsworth Directed Traded With
2010 66840 20101219 betay001 greiz001 Trade Yuniesky Betancourt Traded on 20101219 with Zack Greinke Directed Traded With
2010 72622 20100709 beavb001 luekj001 Trade Blake Beavan Traded on 20100709 with Josh Lueke Directed Traded With
2010 72622 20100709 beavb001 smoaj001 Trade Blake Beavan Traded on 20100709 with Justin Smoak Directed Traded With
2010 62908 20100731 blang001 chavj001 Trade Gregor Blanco Traded on 20100731 with Jesse Chavez Directed Traded With

We have now successfully prepared the data for Gephi. In our next post, I’ll examine the process starting with the Gephi data import phase. Thanks for reading!

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

A Network Graph of MLB All-Stars

With the Major League Baseball All-Star game this week, I was suddenly possessed by an urge to create a network graph of all the players who have been selected for the game from 1933-2015. The goal, as with most of my network graph efforts, was to see where the data took me, and what stories it might tell. Along for the ride was my trusty companion Gephi, version 0.9.1 in this instance. Data for this exercise comes, as it so often does, from the Lahman baseball database, which happens to have a nice table with all the necessary all-star information.

The challenge inherent in this data, as with many temporal datasets, is to create an interesting graph that isn’t entirely driven by the time element, in this case as baseball seasons. Some otherwise excellent layout algorithms tend to turn this sort of data into long, worm-like displays that are not visually appealing. I could just as easily use a timeline if that were the goal of the visualization. So in an effort to balance aesthetics with the underlying data, I finally settled on the radial axis layout in Gephi. With this layout, we have the ability to create multiple axes radiating from the center of the graph, grouped by something meaningful. In this case, that turns out to be the modularity class, a sort of clustering mechanism that groups nodes together based on common or similar characteristics.

After some trial and error, I wound up with 13 distinct classes, a nice manageable number for this type of display. The end result can be interpreted as some sort of exotic, colorful starfish, or perhaps as a multi-colored fireworks display. In any event, I believe it tells an interesting story in a visually appealing manner, and allows for understanding the common threads within each cluster of players. Here’s the complete graph layout:

allstar_graph_base

I’ll spend the rest of the post with some quick analyses of each group, and then point you to the entire graph in interactive form so you can discover your own patterns and learn more about the all-star connections of individual players. I’ll also provide a quick overview for how to read the sidebar output for the graph when you interact with the data.

Our 13 clusters (you can think of them as cohorts) tell us some interesting things about the history of all-star participants. Let’s walk through each of the 13 (numbered 0 through 12) to learn more. The clusters begin with 0 (in green) at the upper left and move counter-clockwise around the graph. Each is rank ordered from small to large radiating out from the center, so the member with the most years as an all-star will be at the tip of each group. That individual will serve as our focal point in each of the following screenshots, followed by a brief overview of other members of the cohort.

First up is our Cohort 0, headlined by Miguel Cabrera, with 10 selections through 2015. Obviously, this would appear to be a cohort of current or recent all-stars based on Cabrera’s appearance. We can easily navigate the graph to see if that’s the case. Who else is prominent in the group? Yadier Molina, Matt Holliday, and Robinson Cano, to name a few, all big name stars for most of their careers. How about at the low end of the spectrum, players with a single all-star selection? Here’s where we find the likes of Billy Butler, R.A. Dickey, and Melky Cabrera. All long-tenured, noteworthy players, but certainly not in the same category as the first group.

allstar_0

Cohort 1 takes us on some time travel, with Johnny Mize as the representative star, also with 10 all-star selections and Hall of Fame membership as well. Joining Mize in the group are Bobby Doerr, Vern Stephens, Joe Gordon, and Bob Feller, all Hall of Famers with the exception of Stephens. At the other end of the cohort, each with one appearance, are Oscar Grimes, Red Barrett, and Nick Etten, among others. Based on the career arcs of the stars in this group, we could characterize it as primarily a 1940s-based cohort, certainly with overlap into the surrounding decades.

allstar_1

Derek Jeter is our icon for Cohort 2, so it figures to be a group that immediately precedes the Cabrera-led Cohort 0. Perhaps the focus here will be on stars from the early 2000s, at the center point of Jeter’s long career. Mariano Rivera, Albert Pujols, and David Ortiz are among the top stars here, confirming the hypothesis that this group is largely post-2000 in nature. Among the lesser knowns with a single selection each are Gil Meche, Joe Crede, and Ryan Ludwick.

allstar_2

Cohort 3 is headed up by the legendary Stan Musial, whose career covered the entirety of the 40s and 50s. Given that Cohort 1 was largely concentrated on players from the 40s, we might anticipate more of a skew towards 1950 and beyond. We’ll see in a moment if that’s true. Next to Musial we have Ted Williams and Warren Spahn, two more whose careers spanned both decades, so perhaps we have players here with greater longevity compared to the Mize cohort. Let’s go a bit deeper, where we find Roy Campanella, Larry Doby, and Robin Roberts, all with career pinnacles primarily in the 50s. So while there will certainly be connections across the two groups, Cohort 3 does appear to span more of the 1950s compared to Cohort 1.

allstar_3

With Cohort 4 we see a very large group fronted by all-time hits leader Pete Rose. So we could be focused on the 1960s or 1970s here; Rod Carew, Reggie Jackson, and Mike Schmidt, Hall of Famers all, are included, so the 1970s would seem to be the dominant theme. Perhaps we shouldn’t be surprised at the size of this group, as expansion in the 1960s afforded more players the opportunity to become an all-star. A few interesting figures can be found at the single game end of the radian – Bob Horner, Kent Hrbek, and Lonnie Smith, all with at least momentary brushes with greatness, but good enough to qualify for just one all-star nod apiece.

allstar_4

Joe DiMaggio is the lead for our next group, joined by the likes of Mel Ott, Bill Dickey, and Joe Medwick. The skew is toward the late 1930s and beyond; many members of this cohort would have had limited all-star game opportunities, as the game originated only in 1933. As proof of this, we find Hall of Famers Heinie Manush, Goose Goslin, and Kiki Cuyler at the low end of the radian, each with just a single all-star credit.

allstar_5

Hall of Famer Al Kaline heads up Cohort 6, so we know we’re in either the 50s or 60s, or more likely, a bit of both decades. Along with Kaline we have Mickey Mantle, Yogi Berra, and Ernie Banks, each of who had multiple appearances covering both decades. At the more modest end of the group we find Rocky Bridges, Bob Cerv, and perhaps surprisingly, the slugger Joe Adcock, each with just a single season as all-stars.

allstar_6

Cohort 7 is our one group that’s difficult to explain. Our graph modularity settings forced a small cohort of just nine players; Hank Aaron with 21 seasons, and eight others with a single season each. Consider this one a bit of a fluke.

allstar_7

The great Willie Mays leads the relatively small Cohort 8, joined by both Brooks and Frank Robinson, as well as Roberto Clemente. This would indicate a cohort of players who began in the 1950s and perhaps peaked in the 60s. Lower down the list this group features lesser known players like Tito Francona, Dick Howser, and Joey Jay, all one season all-stars.

allstar_8

Cohort 9 is a very large group led by Barry Bonds, Ivan “Pudge” Rodriguez, and Ken Griffey. Here we have three stars with illustrious careers launched around 1990 and extending into the new millenium. At the other extreme we find one-timers such as Jay Buhner, Mark Grudzelianek, and Lance Johnson.

allstar_9

Cohort 10 is a mid-sized group headed up by Alex Rodriguez, and featuring Manny Ramirez, John Smoltz, and Scott Rolen. This would appear to be a very similar group (if a few years later) to the prior cohort, and we should expect to find a great number of crossover connections between the two, as they each cover players from similar time periods.

allstar_10

Down to the final two groups! Cohort 11 is led by Cal Ripken, Ozzie Smith, and Roger Clemens, all major impact players in both the 1980s and 90s. This is another group featuring players who were quite talented but dented the all-star ranks just a single time. Among these were outfielders Jesse Barfield and Kevin Bass, and pitcher Teddy Higuera.

allstar_11

With Cohort 12, we see a large group led by 18-time all-star Carl Yastrzemski, supported by Johnny Bench, Tom Seaver, and Harmon Killebrew. These are all players who were at their productive peaks in the late 1960s through mid-1970s, an era where the National League was dominating the annual game. Chuck Hinton, Jerry Lumpe, and Joe Azcue are among those who can claim a single trip to the all-star game as a career highlight.

allstar_12

Quick note on the sidebar – you’ll see a few measures which I won’t go into too deeply here; there are 3 centrality measures (influence within the network), eccentricity (the number of steps to traverse the network, think six degrees of Kevin Bacon), and size, reflecting the number of seasons as an all-star. The key part of the sidebar lies in the listing of all connected players to the one currently selected, with numbers indicating the number of games as co-all-stars. Use these links to navigate through the network quickly. It’s fun!

So that’s it for our brief analysis. Now it’s time to explore for yourself by opening the MLB All-Star Network visualization. Be patient as the data loads; once it has cached the graph should be fairly fast at zooming, panning, and allowing you to explore to your heart’s content.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More