A New Book Resolution for 2016

As we enter a new year, I find myself eager to create a new book that explores the world of baseball data using a wide array of data visualization approaches. This idea has been in my head for several years at least, and has found partial fulfillment in my previously published pennant races book. However, I wish to tackle something broader that will touch a number of baseball categories as well as multiple data visualization approaches.

The working title for the book is ‘Baseball Grafika’, grafika being the Czech and Polish word for graphics, a word which still conveys the intent of the book regardless of language. If all goes well, the book will be available early in the 2016 baseball season, and will cover the following topics:

  • Franchise player networks
  • Trade pattern networks
  • Hall of Fame connection network
  • Franchise location maps
  • Player birthplace maps
  • Pennant race charts
  • Standings charts
  • Career trajectory graphs
  • Baseball dashboards

Fortunately, much work has been done over the last several years on at least a few of these topics, so we’re not starting from scratch, but this will still be a considerable, yet rewarding, challenge. Updates to come.

Data Visualization Summit 2015

I’m in the process of pulling together a presentation for next month’s Data Visualization Summit in Boston, a conference organized by the Innovation Enterprise team. The event attracts 150-200 industry folks to see what can be done using data visualization approaches. I committed to share some insights on using network visualization to visually analyze customer behavior, and after a few weeks of tossing ideas around, have settled on a final approach. Now it’s time to actually put some data together and create some impressive visualizations for the presentation.

The end goal is to share how interactive network graphs can be used to tap into customer insights from several angles. There are three levels of analysis I’m hoping to share with the group, using some wholly fictitious data for a consumer products company. In order, the three stages are:

  1. Create a network that displays customer purchase patterns by product, providing a quick yet insightful visual overview showing who buys what, and how different products intersect with one another. For example, we might see a strong visual correlation where the shoppers who purchase Product A also buy Product D, but rarely purchase Product C. This in itself should provide some value, although other visualization methods could also perform this task, albeit in a less elegant fashion.
  2. Stage two is to focus on overall customer satisfaction levels (with the company rather than individual products), and potentially on an individual product basis, although this gets a bit more complex to execute. Through the effective use of color, we can scale satisfaction levels using the original purchase graph, thus providing a more powerful visual image. Decision makers can now easily view multiple attributes in a single visualization, something that is often difficult to achieve using conventional charts or tables.
  3. The third stage provides viewers with the ability to see actual customer comments, including summarized versions of said comments. This will enable analysts and decision makers to discover common themes that may be linked to low (or high) satisfaction levels. Again, this would be a challenging task using other visualization approaches, but can be handled effectively using well-designed network graphs.
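The first two stages can be roughed out in a few lines of Python before any data reaches Gephi. This is only a minimal sketch: the purchase records, the 1-to-5 satisfaction scale, and the function names are all assumptions made up for illustration, not part of the actual presentation data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (customer, product) purchase records, invented for illustration
purchases = [
    ("c1", "A"), ("c1", "D"),
    ("c2", "A"), ("c2", "D"),
    ("c3", "A"), ("c3", "C"),
    ("c4", "B"),
]


def co_purchase_counts(records):
    """Count how many customers bought each pair of products."""
    baskets = {}
    for customer, product in records:
        baskets.setdefault(customer, set()).add(product)
    pair_counts = Counter()
    for products in baskets.values():
        # Sorting keeps each pair in a stable (alphabetical) order
        for pair in combinations(sorted(products), 2):
            pair_counts[pair] += 1
    return pair_counts


def satisfaction_color(score, low=1.0, high=5.0):
    """Map a satisfaction score onto a red-to-green hex color (assumed 1-5 scale)."""
    t = (score - low) / (high - low)
    t = max(0.0, min(1.0, t))  # clamp scores that fall outside the scale
    red = round(255 * (1 - t))
    green = round(255 * t)
    return f"#{red:02x}{green:02x}00"


pairs = co_purchase_counts(purchases)
```

Here `pairs` shows Products A and D bought together more often than A and C, which is the kind of pattern stage one is meant to surface, while `satisfaction_color` is one simple way to turn a score into a node color for stage two.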

So how do we pack all this information into a single, easy to use visualization? For starters, we employ Gephi, the powerful network graph tool that allows us to convert purchase behavior data into nodes and edges that define our network graph. We can use Gephi to define the best layout for our dataset, create specific groups, make adjustments to sizes and colors, and so on. From there, we’ll be exporting the graph file using the Gexf-JS Web Viewer plugin, which will enable user interactivity through a browser. Finally, we can tweak some of the settings to deliver an attractive, intuitive, highly useful network graph visualization.
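To make the "nodes and edges" step concrete, here is a minimal sketch of turning purchase records into the two spreadsheets Gephi can import. The `Id`/`Label` and `Source`/`Target`/`Type` headers are the column names Gephi's spreadsheet import looks for; the `kind` column, the sample records, and the function name are assumptions for illustration only.

```python
import csv
import io

# Hypothetical (customer, product) purchase records, invented for illustration
purchases = [
    ("cust_1", "prod_A"),
    ("cust_1", "prod_D"),
    ("cust_2", "prod_A"),
]


def to_gephi_csv(records):
    """Return (nodes_csv, edges_csv) text for a customer-product network."""
    customers = sorted({c for c, _ in records})
    products = sorted({p for _, p in records})

    nodes = io.StringIO()
    writer = csv.writer(nodes, lineterminator="\n")
    writer.writerow(["Id", "Label", "kind"])  # "kind" is an extra, assumed attribute
    for c in customers:
        writer.writerow([c, c, "customer"])
    for p in products:
        writer.writerow([p, p, "product"])

    edges = io.StringIO()
    writer = csv.writer(edges, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type"])
    for c, p in records:
        writer.writerow([c, p, "Undirected"])

    return nodes.getvalue(), edges.getvalue()


nodes_csv, edges_csv = to_gephi_csv(purchases)
```

Writing the two strings to files gives a node table and an edge table that can be loaded through Gephi's spreadsheet import, after which layout, grouping, sizing, and coloring happen inside Gephi itself.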

Before I forget, I must mention that the brilliant Aylien text analysis service will be used to analyze and categorize our customer comments. The results can then be included in our Gephi source files, adding another layer or two of rich insights to the data and ultimately the network graph. Integrating text analysis results with transactional customer information is an area that continues to evolve, and is a key component in understanding the present and predicting the future of customer behavior.

I hope to share the final deck at a future point, or at least the network graph that makes up the primary component of the presentation. Until then, happy visualizing!

Political Contributions Network

Hi – I just launched another network project courtesy of Gephi and Sigma.js, my two favorite tools of the moment. You can find it here, or in a full web version here. This one, like its immediate predecessor, is founded in politics, and more specifically in tracking political contributions – who gives to whom. The paths in this network detail thousands of political candidates, and the many PACs, corporations, foundations, and trade associations that help fund their campaign efforts. Of course these connections also create a sort of influence network that could never be achieved by individual voters, and help explain why so many decisions are made that run counter to the will of the people.

While this one doesn’t focus on dollar amounts, it nonetheless paints a compelling picture of how political influence is meted out. Fringe candidates, frequently outside the embedded American two-party system, are depicted near the perimeter of the graph, receiving little or no support from most major donors. Incumbent Democrats and Republicans, on the other hand, are situated at the center of the network, receiving contributions from dozens or even hundreds of PACs, unions, corporations, and trade associations.

Here are a few screenshots from the graph, which is fully interactive through the use of filters, scrolling, zooming, and panning, thanks to the wonders of javascript via Sigma.js. First up is a shot of the full network:

galaxy_all

The multiple colors reflect the multitude of political parties (yes, beyond the dominant two-party monopoly) plus the hordes of contributors – corporations, unions, trade associations, and more.

One of the great features of interactive networks is the ability to dive into the details. For starters, let’s take a look at the Nancy Pelosi neighbor network, which should provide a nice glimpse into the donor network for an entrenched, influential Democratic candidate:

galaxy_pelosi

What we see is a well-connected network populated by dozens of contributors. Now let’s go to the other side of the aisle and take a look at the donor network of John Boehner, an influential Republican incumbent:

galaxy_boehner

The Boehner network is even more dense than the Pelosi network. We should note that many contributing organizations may be found in both the Pelosi and Boehner camps, although the overlap will be somewhat mitigated by the Democrat versus Republican differences. What they do have in common is a huge number of contributors determined to influence policy, often at the expense of the voting public.

Our final screenshot displays many of the PACs in the network – more than 2,600 in total. The attribute pane on the right of the display will show each and every one of these when you use the category filter to the left of the screen:

galaxy_pacs

I hope you find some value in navigating and learning more about the scores of organizations involved in trying to influence policy through congressional gatekeepers. Bear in mind we haven’t even touched on the unelected portions of the government residing in the halls of the CIA, FBI, and Department of Defense. That will be the subject of a future network.

New US House Voting Patterns Network

Anyone who knows me well is aware of my general lack of enthusiasm for politics and politicians, so my latest network graph may come as a bit of a surprise. While I can’t express a lot of support for how my tax dollars are spent by the folks in DC, I can still make use of some of the data patterns they generate. Using data provided by govtrack.us (a non-government site), my latest Gephi project looks at the US House of Representatives votes over the last four months of 2014, specifically the ‘aye’ (yes) votes for each House vote.

The resulting graph lets us take a look at some general patterns, such as many cases where there is strong bi-partisan support for a bill. We can also see votes that failed, primarily in cases where the Democratic minority was unable to generate enough Republican support to pass a measure. Here are a few screenshots from the Gephi project; after that I’ll send you over to the interactive web version where you can search, zoom, pan, and otherwise interact with the data to your heart’s content.

First up is an overall view of the network, created using the Force Atlas 2 layout:

screenshot_082109
Here we can see the stereotypical view of Congress, with the blue Democrats on the left and red Republicans on the right. In the center are some very large nodes that depict near unanimous votes (nodes are sized by the number of ‘aye’ votes) with bi-partisan support. Darker gray nodes represent failed votes; note how many of these are at the far left, indicating support from only the Democrats in most cases. To the far right are bills that passed with primarily Republican support, as noted by their smaller size.

Our next view used node sizing to show only those representatives who cast 45 or fewer aye votes (of the more than 80 votes cast in this period). These voters are shown as oversized nodes relative to their colleagues. While missed votes may contribute to this classification, we also note the predominance of Democrats in this view. Given the Republican majority, it is hardly surprising that more Democrats would be likely to refrain from casting aye votes that are likely to reflect the Republican influence.

screenshot_45_vote_max

Next we take a look at those who cast at least 60 aye votes and are unsurprised to see that this one swings toward the Republican side of the graph. This view was achieved using some Gephi filters to hide individuals not meeting the selected criteria. Clearly, the most enthusiastic ‘aye’ voters in this period are primarily Republican.

screenshot_60_plus

Our final view for now (we could do dozens more) focuses on national security – generally considered to be a bi-partisan subject where both parties want to appear patriotic, regardless of whether the legislation actually advances security. To focus quickly on this topic, I have used Gephi to recolor all security-related nodes to yellow. Notice how these votes are almost uniformly bi-partisan, with overwhelming support from both parties.

screenshot_security

These are just a few examples for how Gephi can help dissect a reasonably complex network and provide quick visual insights. There are of course many other methods available in Gephi that would take this analysis much deeper.
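The aye-count views described above (45 or fewer, at least 60) boil down to counting edges per node and filtering on that count. Here is a minimal sketch of the same idea outside Gephi; the representative and vote identifiers are hypothetical, and the thresholds in the example are shrunk to fit the toy data.

```python
from collections import Counter

# Hypothetical (representative, vote_id) pairs, one per 'aye' vote cast
aye_votes = [
    ("rep_a", "v1"), ("rep_a", "v2"), ("rep_a", "v3"),
    ("rep_b", "v1"),
    ("rep_c", "v1"), ("rep_c", "v2"),
]


def aye_counts(pairs):
    """Number of 'aye' votes cast by each representative."""
    return Counter(rep for rep, _ in pairs)


def vote_sizes(pairs):
    """Number of 'aye' votes each measure received (used to size vote nodes)."""
    return Counter(vote for _, vote in pairs)


def filter_by_ayes(counts, minimum=None, maximum=None):
    """Keep representatives whose aye count falls within the given bounds."""
    return {
        rep for rep, n in counts.items()
        if (minimum is None or n >= minimum)
        and (maximum is None or n <= maximum)
    }


counts = aye_counts(aye_votes)
sizes = vote_sizes(aye_votes)
infrequent = filter_by_ayes(counts, maximum=1)  # analogous to "45 or fewer"
frequent = filter_by_ayes(counts, minimum=2)    # analogous to "at least 60"
```

With the real data, `maximum=45` and `minimum=60` would reproduce the two filtered views, and `sizes` corresponds to sizing each vote node by its number of ‘aye’ votes.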

Now that we’ve done a brief examination of this data, time to move on to the interactive example on the web, where you can do your own clicking, searching, zooming, and panning to uncover patterns in the data. This functionality all comes courtesy of Sigma.js, an outstanding Gephi plugin. You can find the network here: http://visual-baseball.com/gephi/us_house/network/index.html#.

At some point, I may attempt to link back to the actual voting data at govtrack.us, but for now I hope you find this to be a useful (and fun) way to examine voting patterns. Enjoy!

Data Visualization, Aesthetics and Intuition

As I worked through a just completed project chronicling the diverse musical career of Neil Young, some valuable (if unintended) insights were reinforced once more. I work on a regular basis with a variety of large datasets that require analysis, interpretation, and ultimately visualization and presentation. Often, these goals are not easily reconciled, which leads to unsatisfactory results across one or more of these factors.

As much as we as analysts need to depict the data accurately and meaningfully, if we don’t do so with an attractive visual approach we risk not having our message get communicated at all. Merely presenting our data in a table may technically get the job done, but is also likely to bore the reader to tears while simultaneously failing to deliver the key messages. At the other extreme, we can pull out individual bits of the data and spend our time creating flashy infographics that may capture attention but fail to represent the data in its proper context. All flash, no substance. Neither approach is terribly effective.

At the same time, we may present all of the information using a reasonable visual approach that preserves the integrity of the data while still falling short of creating a fulfilling user experience. This is what I recently experienced with the Neil Young project, as I’ll detail below.

After spending a few days getting the data from the AllMusic site into Excel, and eventually as node and edge files into Gephi, it was finally time to create the network data visualization. I was determined to attempt one of the many force-based methods used in network graph analysis to create the graph. These methods are very popular and useful for creating graphs out of a variety of data networks, allowing viewers to see the larger patterns at work within the data.

After a few iterations, I wound up with a serviceable graph that covered most of the basics I spoke of earlier – all the data was exposed, element types were sized and color-coded for easier interpretation, and the project was navigable via the web. Here’s a look:

neil_young_gephi_20141023

Not bad, but there was something nagging at me as I viewed it, tweaked it, played with the styling, and so on. Everything was technically fine, but something was missing. So back I went to Gephi to find the answer. The next day, it occurred to me – I was using the wrong approach for the type of data I was trying to depict. Where the force-directed approach is ideal for dense, social media type networks, this was a unique network that didn’t possess the same structure. Therefore, it was not as aesthetically appealing or as intuitive as it could be.

After iterating through a few approaches, I came across a winner that best exploits the structure of the underlying data while conveying a far more intuitive feel to the end user. Why not have Neil at the center of the graph, surrounded by all of his albums, ordered by release date? On top of this, I could then have the style and mood data form an outer ring, as they needed only to link to the albums in some fashion. Now we have something that conveys the same information as the first attempt, but in a much more pleasing layout relative to this dataset. See for yourself:

neil_young_gephi_20141024

The new version addresses the issues of aesthetics and intuition where the first graph fell short. All moods and styles are now easily found; the same is true for all albums. Highlighting a single mood (or album) also provides an information-rich view for how the music changed over periods in Young’s career. This was nearly impossible to see in the initial layout.
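For anyone curious how such a layout could be computed by hand, here is a minimal sketch of the ring idea: the artist at the origin, albums on an inner ring in release order, and moods and styles on an outer ring. The album titles, years, mood tags, and radii are placeholders invented for illustration, and this is not the Gephi layout that produced the graph above.

```python
import math


def ring_positions(names, radius):
    """Place names evenly around a circle of the given radius, in list order."""
    n = len(names)
    return {
        name: (
            radius * math.cos(2 * math.pi * i / n),
            radius * math.sin(2 * math.pi * i / n),
        )
        for i, name in enumerate(names)
    }


# Hypothetical albums (title, release year) and mood tags
albums = [("Album B", 1975), ("Album A", 1969), ("Album C", 1989)]
moods = ["Reflective", "Raw", "Earnest", "Wistful"]

positions = {"Neil Young": (0.0, 0.0)}  # the artist sits at the center

# Inner ring: albums ordered by release date
ordered_albums = [title for title, year in sorted(albums, key=lambda a: a[1])]
positions.update(ring_positions(ordered_albums, radius=1.0))

# Outer ring: moods and styles, which only need to link back to albums
positions.update(ring_positions(sorted(moods), radius=2.0))
```

The resulting x and y values could be attached to nodes as fixed coordinates, so the chronological order of the albums is preserved no matter how the edges pull.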

So the message is this – visualizations don’t need to sacrifice aesthetics and intuition in order to be effective; rather, they should take advantage of these attributes to increase their appeal and impact. Don’t be afraid to experiment until you find the right formula, as it seldom presents itself the first time around, and trust your instincts.

Network Data Adventures with Gephi

Posting to the blog has become a luxury recently, what with a summer full of youth baseball, some organizational changes at work, summer home projects, and of course the upcoming Gephi book. I’ve learned that one has to be especially good at finding synergies between projects in order to get everything done. So it is with creating any new work in Gephi while writing the book. Any new projects will necessarily be created while in the process of developing material for the book.

Earlier this year, I created a host of network visuals, one per franchise, showing the relationships between all players who suited up for a given major league baseball team. This data made for some interesting visuals that were fun to explore. What the graphs didn’t do was to provide visual cues about how the players could have been grouped – by decade, position, birthplace, and so on. So the logical evolution was to take this idea and extend it as an example for how to use partitioning and clustering to visually segment a network graph.

Recently I began playing with this idea by looking at a few of these examples, and have included some in one of the book chapters. I’ll use some slightly different cases here to avoid redundancy, but the principles are identical. I’ll walk through an example for how we can extract intelligence from a network graph in a few easy steps, using the Boston Red Sox from 1901 through 2013.

  1. Start with the base graph, having used a layout algorithm to arrange it in some fashion. I used the ARF approach for this example.
  2. Size the nodes in the graph using some criterion, such as the number of games played as a catcher. This will help users to quickly spot the dominant players at that position.
  3. Color the nodes using a categorical variable like decades. In this case, the color will reflect the first decade a player suited up for the Red Sox.

In sequence, here are the three graphs:

redsox_1
Kind of dull – nothing but a lot of identical nodes and their connections. Let’s apply sizing based on the number of games played as a catcher:

redsox_2

Now that pops a few things! We have some easy starting points to work from. How about coloring the nodes by decade to see if that adds to the story:

redsox_3

Hmmm. Maybe this gives us some additional insight as well. Certain decades are split amongst multiple catchers, while in other cases we have a single dominant player. Of course we would want to allow the user to identify each of these cases (for example, the large green node at the top left is Jason Varitek) through some labeling or interactivity.

So you get the idea of how a couple of simple tweaks can change the way we view a graph. I’ll be using a similar approach in the book to help readers create powerful stories with their own data.
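The sizing and coloring steps are done through Gephi's interface, but the logic behind them is simple enough to sketch in code. Everything below is an assumption for illustration: the player records, the size range, and the color palette are made up, and the linear scaling is just one reasonable choice.

```python
# Hypothetical records: (name, first season with the team, games played at catcher)
players = [
    ("Catcher A", 1997, 1400),
    ("Catcher B", 1971, 900),
    ("Infielder C", 1975, 0),
]

# Assumed palette: one color per decade
DECADE_COLORS = {1970: "#1f77b4", 1990: "#2ca02c"}
DEFAULT_COLOR = "#999999"


def first_decade(year):
    """Decade in which a player first suited up, e.g. 1997 -> 1990."""
    return (year // 10) * 10


def scale_size(value, max_value, min_size=5.0, max_size=50.0):
    """Linearly scale a value into a node size between min_size and max_size."""
    if max_value <= 0:
        return min_size
    return min_size + (max_size - min_size) * value / max_value


max_games = max(games for _, _, games in players)
styled = {
    name: {
        "size": scale_size(games, max_games),
        "color": DECADE_COLORS.get(first_decade(year), DEFAULT_COLOR),
    }
    for name, year, games in players
}
```

Players who never caught a game fall to the minimum size, which is what makes the dominant catchers stand out, while the decade color adds the second dimension.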

Mastering Gephi Book Update

Lest anyone think of me as a full-time author, rest assured that the likes of J.K. Rowling are not trembling in fear. Even if I had the ability to conjure up creative plots, I type too damn slow to make it as a full-time literary lion. Fortunately, I don’t have to depend on my keyboarding skill (or lack thereof) as a full-time pursuit. Which brings me around to my topic – the current book I’m authoring on Gephi.

For those not exposed to networks and network analysis, Gephi is a French-based open source project that makes it possible for all sorts of users (including moi) to create interesting graphs from connected datasets. By connected I am referring to data where the individual nodes are connected in some way, shape, or form. This could be anything: movie actor databases, Facebook friend networks, baseball player connections, and so on. Anyone with a spreadsheet full of data and a bit of effort and persistence can use Gephi to create cool looking graphs that also tell a story of some sort.

My job in writing the book is to help people make sense of all the features and capabilities within Gephi, some of which are a bit complex to master. In the process, I get to learn more about the theory behind network analysis, and with it terms such as contagion, diffusion, clustering, and homophily. It’s really fascinating if you’re into understanding how people and institutions interact, contagion processes function, or how product adoption can be affected by the structure of a network. My higher math skills are not good enough to be at an academic level with this stuff, so I have to compensate with some logic and visual acuity.

Anyhow, here’s some of the stuff that Gephi can create:

7344OS_cover_01

7344OS_cover_02

7344OS_cover_03

I’m hoping that one of these images will serve as the book cover come publishing time, which should be sometime this fall. In the meantime, I have six more chapters to write (of 10 total), and will have the added joy of working through chapter edits where others catch the mistakes I’ve made.

3 More MLB Network Graphs

Getting rolling now, using a templated approach to create a handful of franchise graphs, with many more to come. The first five cover the Tigers, Cubs, Red Sox, Dodgers, and Giants, showing all the connections between players from 1901-2013 within each franchise’s history. All credit is due to Gephi, the ARF layout, and the Chinese Whispers clustering algorithm. Data is courtesy of Sean Lahman’s baseball database. I’m merely the conductor who gets to bring these great tools together.
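For readers wondering how player connections might be derived from season-level data, here is a minimal sketch. It assumes two players are connected when they appear on the same franchise's roster in the same season; that rule, the sample rows, and the function name are my assumptions for illustration rather than the exact recipe behind these graphs.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (player, season) rows for a single franchise
appearances = [
    ("p1", 1901), ("p2", 1901),
    ("p1", 1902), ("p2", 1902), ("p3", 1902),
]


def teammate_network(rows):
    """Return (seasons per player, shared seasons per player pair)."""
    rosters = defaultdict(set)
    seasons = Counter()
    for player, season in rows:
        if player not in rosters[season]:  # ignore duplicate rows
            rosters[season].add(player)
            seasons[player] += 1

    edges = Counter()
    for roster in rosters.values():
        # Every pair on the same season roster shares one more season
        for a, b in combinations(sorted(roster), 2):
            edges[(a, b)] += 1
    return seasons, edges


seasons, edges = teammate_network(appearances)
```

The `seasons` counts could drive node sizes, and the `edges` counts could serve as edge weights when the network is loaded into Gephi.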

Here’s the roster if you want to go to a single graph, or you can go to the network graphs gallery on my website:

Check them out and let me know what you think.

Final Network Graph MLB Template

After what seemed an eternity (in reality, just 10 days), I’ve settled on a template and formula for depicting the player networks for MLB teams dating back to 1901. Throughout this process, the hometown Tigers have been my trial balloon to see if and how this idea would work. I’m happy to report that the idea not only works, but it makes for a beautiful (and highly addictive) interactive graph.

After several days of testing a variety of graph algorithms, I’ve landed back at the ARF method used for the Octavio Dotel graphs created earlier this year. There’s something about a circular layout that is visually appealing and informationally dense at the same time. Players are clustered by color, reflecting the primary peer group they belong to, although many will connect across two or more groups. The size of each player node reflects the number of seasons played with the team. Alan Trammell and Ty Cobb have large nodes, while Eddie Miller has a very small node, reflecting his single season in a Tigers uniform. (UPDATE: Node sizes not behaving as planned – still tweaking.) Check it out for yourself:

Tigers Network Graph

To play with the live version, click here.

It took a while to get a satisfying result, but after setting a few parameters in Gephi and tweaking some options I’m thrilled with the graph. Now I’m poised to do the same for all MLB franchises, using the same settings to allow each franchise’s patterns to come to the fore. I’m eager to create the entire series of graphs, and to start assessing the differences and how they relate to team success patterns.

Another Tigers Network Graph

I’m having a lot of fun creating network graphs using Gephi, and I’m excited about the possibilities for displaying a wide range of baseball information. My initial pass at showing team connections uses the Tigers, with this version incorporating all players from 1901 through 2013. Try it out here.

Update – new version with Tigers colors: Updated version

This is done using a radial axis layout with Chinese Whispers Clustering, based on a paper by Chris Biemann. The colors along each of the axes are based on the clusters created by this algorithm. Once again, I’ve used Sigma.js to create the interactive version, so you can dive into the graphic and gain an understanding for how the data is displayed. Here’s a static view of the graph:

Tigers-1901-2013-graph

Kind of colorful, isn’t it? For those of you who are long-time Tiger fans, you’ll soon detect that each cluster (color) represents a specific era in Tiger history, and by clicking on individual nodes, you’ll be able to see which players connect across multiple clusters. Typically, this will be players like Alan Trammell, Ty Cobb, and others who played many years with the club, and thus transcend their own cluster position.
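The core idea behind Chinese Whispers is easy to sketch: every node starts in its own cluster, then repeatedly adopts whichever label carries the most edge weight among its neighbors. The version below is a simplified illustration and not the Gephi plugin's implementation. In particular, ties are broken by the smallest label (the usual formulation breaks them randomly) so that the example is repeatable, and the edge list is hypothetical.

```python
import random


def chinese_whispers(edges, iterations=20, seed=0):
    """Simplified Chinese Whispers over (node_a, node_b, weight) edges."""
    neighbors = {}
    for a, b, weight in edges:
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    labels = {node: node for node in neighbors}  # each node starts alone
    rng = random.Random(seed)
    nodes = sorted(neighbors)

    for _ in range(iterations):
        rng.shuffle(nodes)  # visit nodes in a random order on each pass
        for node in nodes:
            totals = {}
            for other, weight in neighbors[node].items():
                label = labels[other]
                totals[label] = totals.get(label, 0) + weight
            # Strongest label wins; ties go to the smallest label (an assumption)
            labels[node] = min(totals, key=lambda lab: (-totals[lab], lab))
    return labels


# Hypothetical network: two groups of teammates with no links between them
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("x", "y", 1), ("y", "z", 1), ("x", "z", 1),
]
clusters = chinese_whispers(edges)
```

On a real franchise network, long-tenured players end up with neighbors in several clusters, which is why they appear to bridge eras even though each node carries only one label.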

Not sure if this will be the style for all the franchise graphs I plan to do this year, but it feels close, given the ability to display more than 1500 players and nearly 48,000 connections without having things too cluttered.
