ODSC: Analyzing Complex Networks Part 2

This is part two of a brief series sharing components of my presentation titled Analyzing Complex Networks Using Open Source Software at ODSC East in Boston on May 21st. The first post looked at a few examples from a Boston Red Sox players network, while this one examines a Miles Davis album and musician network. I’ll share a few examples of network analysis within the context of the Miles Davis graph.

The Miles Davis network could be described as a tripartite network, or one with three layers. Miles is at the center, and connects to each of nearly 50 recordings. Other musicians then connect to the respective recording(s) they played on, but not to one another. This approach provides a very clear look at musical phases in the career of the legendary trumpeter, without the graph being clouded by excessive detail. Here’s a view of the final network, after which we’ll look at some components of the graph.

miles_1
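For readers who want to reproduce the three-layer structure described above, here is a minimal sketch in plain Python. The album sample is a tiny illustrative subset, not the full dataset behind the graph, and the function names are hypothetical.

```python
from collections import Counter

# Small illustrative sample: album -> a few musicians who played on it.
# This is NOT the full dataset used for the graph.
ALBUMS = {
    "Kind of Blue": ["John Coltrane", "Bill Evans", "Paul Chambers"],
    "Milestones": ["John Coltrane", "Paul Chambers", "Red Garland"],
    "Bitches Brew": ["Wayne Shorter", "Chick Corea"],
}


def build_edges(center, albums):
    """Return (source, target) pairs: center -> album, album -> musician."""
    edges = []
    for album, musicians in albums.items():
        edges.append((center, album))
        for musician in musicians:
            # Musicians link only to albums, never to one another.
            edges.append((album, musician))
    return edges


def albums_per_musician(albums):
    """Count how many albums each musician appears on."""
    counts = Counter()
    for musicians in albums.values():
        counts.update(set(musicians))
    return counts


edges = build_edges("Miles Davis", ALBUMS)
appearances = albums_per_musician(ALBUMS)
print(len(edges), "edges")
print(appearances.most_common(2))
```

Musicians with a count above one are the ones a layout would pull between several album nodes rather than next to a single one.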

We see some interesting patterns in the graph, specifically in viewing the pink circles, which represent individual albums. Musicians playing on a recording can be seen adjacent to that recording, except in the case of musicians present on multiple albums, who are positioned relative to all of the recordings they played on. A quick visual scan reveals five distinct clusters, as seen in the next screenshot.

miles_2

Now that we have identified these clusters, it would be helpful to understand their meaning and relevance to Miles's career. Using the graph in interactive fashion, we can learn more about the recordings and musicians, and begin to formulate some insights. These can be confirmed by referring to album links on the web or in Wikipedia, which give context to what we are viewing. Based on these steps, here is a quick overview of the five clusters.

miles_3

A final step might be to add some verbiage using PowerPoint or Inkscape, which I’ve done below in very minimalist fashion. We could also add this to a web version using CSS attributes to position the text, although this could get tricky as we pan and zoom on the graph. We might be better off using some sort of stylized marker (color or shape) to communicate some of this information.

miles_4

There is much more that could be done, but I hope this brief example shed some light on the usefulness of network graphs, especially from a pure visual perspective.

ODSC: Analyzing Complex Networks Using Open Source Software

I’ll be presenting at the 2016 ODSC East event in Boston May 20-22. ODSC stands for Open Data Science Conference, where the focus is on using open data or open source tools to do clever things in the information space. The topic of my presentation is Analyzing Complex Networks Using Open Source Software, where I’ll talk through several example networks built using Gephi and Sigma.js.

While the slides are not all prepared at this stage, I’ll share a few bits that will wind up in the talk. My goal is to convey to the audience how networks can be used to statistically and visually understand complex information. After providing an overview of network analysis (at a very high level), I’ll be sharing slides from three very different networks – a Miles Davis album network (created in 2014 and rebuilt in 2016), a Boston Red Sox player network (also built in 2014), and a brand new example using data from the amazing GDELT Project.

Here’s a glimpse into what I’ll be sharing, starting with the Red Sox examples, where we examine the networks of three well-known players from the last 100 years. First, Ted Williams’s network:

odsc_williams

Followed by Carl Yastrzemski:

odsc_yaz

Now Jason Varitek, longtime catcher and captain for two World Series championship teams:

odsc_varitek

In talking through each of these networks, I will attempt to highlight some differences in their respective structures based on the era in which each player spent time with the Red Sox. For example, there are many more connections in the Varitek network compared to Williams and Yaz, despite a shorter duration with the team. Why would this be the case? Playing in an era of higher salaries, larger pitching staffs, and free agency may go a long way toward explaining why Jason Varitek crossed paths with far more players than his predecessors did.
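The connection counts being compared here are node degrees. As a rough sketch of how they could be computed, the example below assumes that two players are linked when they appear on the same roster in the same season; that linking rule and the roster rows are assumptions for illustration, not the actual data or method behind the graphs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical roster rows: (player, season). Not real data.
ROSTER = [
    ("Player A", 1950), ("Player B", 1950), ("Player C", 1950),
    ("Player A", 1951), ("Player D", 1951),
    ("Player D", 1952), ("Player E", 1952),
]


def teammate_edges(roster):
    """Link every pair of players who share at least one season."""
    by_season = defaultdict(set)
    for player, season in roster:
        by_season[season].add(player)
    edges = set()
    for players in by_season.values():
        # Sorting makes each pair canonical, so (A, B) is never also (B, A).
        for a, b in combinations(sorted(players), 2):
            edges.add((a, b))
    return edges


def degree(edges):
    """Count distinct teammates per player."""
    counts = defaultdict(int)
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


edges = teammate_edges(ROSTER)
degrees = degree(edges)
print(sorted(degrees.items(), key=lambda kv: -kv[1]))
```

Under this rule, heavy roster turnover raises a player's degree even over a short career, which is the pattern noted for the Varitek network.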

Stay tuned for additional posts featuring the Miles Davis and GDELT networks.

A New Book Resolution for 2016

As we enter a new year, I find myself eager to create a new book that explores the world of baseball data using a wide array of data visualization approaches. This idea has been in my head for several years at least, and has found partial fulfillment in my previously published pennant races book. However, I wish to tackle something broader that will touch a number of baseball categories as well as multiple data visualization approaches.

The working title for the book is ‘Baseball Grafika’, grafika being the Czech and Polish word for graphics, a word which still conveys the intent of the book regardless of language. If all goes well, the book will be available early in the 2016 baseball season, and will cover the following topics:

  • Franchise player networks
  • Trade pattern networks
  • Hall of Fame connection network
  • Franchise location maps
  • Player birthplace maps
  • Pennant race charts
  • Standings charts
  • Career trajectory graphs
  • Baseball dashboards

Fortunately, much work has been done over the last several years on at least a few of these topics, so we’re not starting from scratch, but this will still be a considerable, yet rewarding, challenge. Updates to come.

Data Visualization Summit 2015

I’m in the process of pulling together a presentation for next month’s Data Visualization Summit in Boston, a conference organized by the Innovation Enterprise team. The event attracts 150-200 industry folks to see what can be done using data visualization approaches. I committed to share some insights on using network visualization to visually analyze customer behavior, and after a few weeks of tossing ideas around, have settled on a final approach. Now it’s time to actually put some data together and create some impressive visualizations for the presentation.

The end goal is to share how interactive network graphs can be used to tap into customer insights from several angles. There are three levels of analysis I’m hoping to share with the group, using some wholly fictitious data for a consumer products company. In order, the three stages are:

  1. Create a network that displays customer purchase patterns by product, providing a quick yet insightful visual overview showing who buys what, and how different products intersect with one another. For example, we might see a strong visual correlation where the shoppers who purchase Product A also buy Product D, but rarely purchase Product C. This in itself should provide some value, although other visualization methods could also perform this task, albeit in a less elegant fashion.
  2. Stage two is to focus on overall customer satisfaction levels (with the company rather than individual products), and potentially on an individual product basis, although this gets a bit more complex to execute. Through the effective use of color, we can scale satisfaction levels using the original purchase graph, thus providing a more powerful visual image. Decision makers can now easily view multiple attributes in a single visualization, something that is often difficult to achieve using conventional charts or tables.
  3. The third stage provides viewers with the ability to see actual customer comments, including summarized versions of said comments. This will enable analysts and decision makers to discover common themes that may be linked to low (or high) satisfaction levels. Again, this would be a challenging task using other visualization approaches, but can be handled effectively using well-designed network graphs.
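The first two stages can be sketched in a few lines of Python. Everything here is fictitious in the same spirit as the presentation data: the customers, products, the 1 to 10 satisfaction scale, and the red-to-green color ramp are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Fictitious purchases: customer -> set of products bought.
PURCHASES = {
    "cust_1": {"Product A", "Product D"},
    "cust_2": {"Product A", "Product B", "Product D"},
    "cust_3": {"Product C"},
    "cust_4": {"Product A", "Product C"},
}

# Fictitious overall satisfaction scores on an assumed 1-10 scale.
SATISFACTION = {"cust_1": 9, "cust_2": 7, "cust_3": 3, "cust_4": 5}


def co_purchase_counts(purchases):
    """Stage 1: count how often each pair of products shares a customer."""
    counts = Counter()
    for products in purchases.values():
        for pair in combinations(sorted(products), 2):
            counts[pair] += 1
    return counts


def satisfaction_color(score, low=1, high=10):
    """Stage 2: map a score to a hex color from red (low) to green (high)."""
    t = (score - low) / (high - low)
    t = max(0.0, min(1.0, t))  # clamp out-of-range scores
    red = round(255 * (1 - t))
    green = round(255 * t)
    return f"#{red:02x}{green:02x}00"


pairs = co_purchase_counts(PURCHASES)
colors = {cust: satisfaction_color(s) for cust, s in SATISFACTION.items()}
print(pairs.most_common(1))
print(colors)
```

The pair counts could serve as edge weights between product nodes, while the colors would be attached to customer nodes before import into the graph tool.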

So how do we pack all this information into a single, easy to use visualization? For starters, we employ Gephi, the powerful network graph tool that allows us to convert purchase behavior data into nodes and edges that define our network graph. We can use Gephi to define the best layout for our dataset, create specific groups, make adjustments to sizes and colors, and so on. From there, we’ll be exporting the graph file using the Gexf-JS Web Viewer plugin, which will enable user interactivity through a browser. Finally, we can tweak some of the settings to deliver an attractive, intuitive, highly useful network graph visualization.
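One common way to get purchase data into Gephi is a pair of CSV files, one for nodes and one for edges. The sketch below writes both using the column headers that Gephi's spreadsheet import conventionally recognizes (`Id` and `Label` for nodes; `Source`, `Target`, `Type`, and `Weight` for edges). The `Category` column and the sample rows are hypothetical additions for illustration.

```python
import csv
import io


def write_nodes(handle, nodes):
    """Write (id, label, category) rows; 'Category' is a custom attribute."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Id", "Label", "Category"])
    for node_id, label, category in nodes:
        writer.writerow([node_id, label, category])


def write_edges(handle, edges):
    """Write (source, target, weight) rows as undirected edges."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Weight"])
    for source, target, weight in edges:
        writer.writerow([source, target, "Undirected", weight])


# Fictitious sample rows.
NODES = [("c1", "Customer 1", "customer"), ("pA", "Product A", "product")]
EDGES = [("c1", "pA", 3)]

node_buffer = io.StringIO()
edge_buffer = io.StringIO()
write_nodes(node_buffer, NODES)
write_edges(edge_buffer, EDGES)
print(node_buffer.getvalue())
print(edge_buffer.getvalue())
```

To produce real files, pass handles opened with `open("nodes.csv", "w", newline="")` in place of the in-memory buffers.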

Before I forget, I must mention that the brilliant Aylien text analysis service will be used to analyze and categorize our customer comments. The results can then be included in our Gephi source files, adding another layer or two of rich insights to the data and ultimately the network graph. Integrating text analysis results with transactional customer information is an area that continues to evolve, and is a key component in understanding the present and predicting the future of customer behavior.

I hope to share the final deck at a future point, or at least the network graph that makes up the primary component of the presentation. Until then, happy visualizing!

Political Contributions Network

Hi – I just launched another network project courtesy of Gephi and Sigma.js, my two favorite tools of the moment. You can find it here, or in a full web version here. This one, like its immediate predecessor, is founded in politics, and more specifically in tracking political contributions – who gives to whom. The paths in this network detail thousands of political candidates, and the many PACs, corporations, foundations, and trade associations that help fund their campaign efforts. Of course these connections also create a sort of influence network that could never be achieved by individual voters, and help explain why so many decisions are made that run counter to the will of the people.

While this one doesn’t focus on dollar amounts, it nonetheless paints a compelling picture for how political influence is meted out. Fringe candidates, frequently outside the embedded American two party system, are depicted near the perimeter of the graph, receiving little or no support from most major donors. Incumbent Democrats and Republicans, on the other hand, are situated at the center of the network, receiving contributions from dozens or even hundreds of PACs, unions, corporations, and trade associations.

Here are a few screenshots from the graph, which is fully interactive through the use of filters, scrolling, zooming, and panning, thanks to the wonders of javascript via Sigma.js. First up is a shot of the full network:

galaxy_all

The multiple colors reflect the multitude of political parties (yes, beyond the dominant two-party monopoly) plus the hordes of contributors – corporations, unions, trade associations, and more.

One of the great features of interactive networks is the ability to dive into the details. For starters, let’s take a look at the Nancy Pelosi neighbor network, which should provide a nice glimpse into the donor network for an entrenched, influential Democratic candidate:

galaxy_pelosi

What we see is a well-connected network populated by dozens of contributors. Now let’s go to the other side of the aisle and take a look at the donor network of John Boehner, an influential Republican incumbent:

galaxy_boehner

The Boehner network is even denser than the Pelosi network. We should note that many contributing organizations may be found in both the Pelosi and Boehner camps, although the overlap will be somewhat mitigated by the Democrat versus Republican differences. What they do have in common is a huge number of contributors determined to influence policy, often at the expense of the voting public.
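The overlap between two candidates' donor neighborhoods can be measured with simple set operations. The sketch below uses made-up contributor names and adds the Jaccard index as one possible overlap score; that measure is a suggestion here, not something computed in the original project.

```python
def shared_contributors(donors_a, donors_b):
    """Contributors that appear in both candidates' neighbor networks."""
    return donors_a & donors_b


def jaccard(donors_a, donors_b):
    """Overlap score: shared contributors divided by all contributors."""
    union = donors_a | donors_b
    if not union:
        return 0.0
    return len(donors_a & donors_b) / len(union)


# Hypothetical contributor sets; none of these names come from the real data.
candidate_x = {"PAC 1", "PAC 2", "Union 1"}
candidate_y = {"PAC 2", "PAC 3", "Corp 1", "Union 1"}

both = shared_contributors(candidate_x, candidate_y)
score = jaccard(candidate_x, candidate_y)
print(sorted(both), score)
```

A score near 1 would mean two candidates draw on almost the same donors, while a score near 0 would mean their donor bases barely intersect.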

Our final screenshot displays many of the PACs in the network – more than 2,600 in total. The attribute pane on the right of the display will show each and every one of these when you use the category filter to the left of the screen:

galaxy_pacs

I hope you find some value in navigating and learning more about the scores of organizations involved in trying to influence policy through congressional gatekeepers. Bear in mind we haven’t even touched on the unelected portions of the government residing in the halls of the CIA, FBI, and Department of Defense. That will be the subject of a future network.

New US House Voting Patterns Network

Anyone who knows me well is aware of my general lack of enthusiasm for politics and politicians, so my latest network graph may come as a bit of a surprise. While I can’t express a lot of support for how my tax dollars are spent by the folks in DC, I can still make use of some of the data patterns they generate. Using data provided by govtrack.us (a non-government site), my latest Gephi project looks at the US House of Representatives votes over the last 4 months of 2014, specifically the ‘aye’ (yes) votes for each House vote.

The resulting graph lets us take a look at some general patterns, such as many cases where there is strong bi-partisan support for a bill. We can also see votes that failed, primarily in cases where the Democratic minority was unable to generate enough Republican support to pass a measure. Here are a few screenshots from the Gephi project; after that I’ll send you over to the interactive web version where you can search, zoom, pan, and otherwise interact with the data to your heart’s content.

First up is an overall view of the network, created using the Force Atlas 2 layout:

screenshot_082109

Here we can see the stereotypical view of Congress, with the blue Democrats on the left and red Republicans on the right. In the center are some very large nodes that depict near unanimous votes (nodes are sized by the number of ‘aye’ votes) with bi-partisan support. Darker gray nodes represent failed votes; note how many of these are at the far left, indicating support from only the Democrats in most cases. To the far right are bills that passed with primarily Republican support, as noted by their smaller size.

Our next view used node sizing to show only those representatives who cast 45 or fewer aye votes (of the more than 80 votes cast in this period). These voters are shown as oversized nodes relative to their colleagues. While missed votes may contribute to this classification, we also note the predominance of Democrats in this view. Given the Republican majority, it is hardly surprising that more Democrats would be likely to refrain from casting aye votes that are likely to reflect the Republican influence.

screenshot_45_vote_max

Next we take a look at those who cast at least 60 aye votes and are unsurprised to see that this one swings toward the Republican side of the graph. This view was achieved using some Gephi filters to hide individuals not meeting the selected criteria. Clearly, the most enthusiastic ‘aye’ voters in this period are primarily Republican.

screenshot_60_plus
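The two threshold views above boil down to counting aye votes per representative and filtering on the totals. Here is a minimal sketch of that logic outside Gephi; the row layout and sample values are hypothetical, and the thresholds are scaled down to fit the tiny sample.

```python
from collections import Counter

# Hypothetical rows: (representative, party, vote_id, position).
VOTES = [
    ("Rep A", "D", "v1", "aye"), ("Rep A", "D", "v2", "no"),
    ("Rep B", "R", "v1", "aye"), ("Rep B", "R", "v2", "aye"),
    ("Rep C", "R", "v1", "aye"), ("Rep C", "R", "v2", "aye"),
    ("Rep D", "D", "v1", "no"),
]


def aye_counts(rows):
    """Aye votes per representative, including those with zero ayes."""
    counts = Counter()
    for rep, _party, _vote_id, position in rows:
        counts[rep] += 1 if position == "aye" else 0
    return counts


def vote_sizes(rows):
    """Aye votes per roll call, usable as the size of each vote node."""
    return Counter(vote_id for _rep, _party, vote_id, pos in rows if pos == "aye")


def filter_by_ayes(counts, minimum=None, maximum=None):
    """Keep representatives whose aye total falls within the given bounds."""
    return {
        rep: n for rep, n in counts.items()
        if (minimum is None or n >= minimum) and (maximum is None or n <= maximum)
    }


counts = aye_counts(VOTES)
low_aye = filter_by_ayes(counts, maximum=1)   # analogous to "45 or fewer"
high_aye = filter_by_ayes(counts, minimum=2)  # analogous to "at least 60"
sizes = vote_sizes(VOTES)
print(low_aye, high_aye, dict(sizes))
```

With the real data, the bounds would be 45 and 60 rather than 1 and 2.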

Our final view for now (we could do dozens more) focuses on national security – generally considered to be a bi-partisan subject where both parties want to appear patriotic, regardless of whether the legislation actually advances security. To focus quickly on this topic, I have used Gephi to recolor all security-related nodes to yellow. Notice how these votes are almost uniformly bi-partisan, with overwhelming support from both parties.

screenshot_security

These are just a few examples of how Gephi can help dissect a reasonably complex network and provide quick visual insights. There are of course many other methods available in Gephi that would take this analysis much deeper.

Now that we’ve done a brief examination of this data, time to move on to the interactive example on the web, where you can do your own clicking, searching, zooming, and panning to uncover patterns in the data. This functionality all comes courtesy of Sigma.js, an outstanding Gephi plugin. You can find the network here: http://visual-baseball.com/gephi/us_house/network/index.html#.

At some point, I may attempt to link back to the actual voting data at govtrack.us, but for now I hope you find this to be a useful (and fun) way to examine voting patterns. Enjoy!

Data Visualization, Aesthetics and Intuition

As I worked through a just completed project chronicling the diverse musical career of Neil Young, some valuable (if unintended) insights were reinforced once more. I work on a regular basis with a variety of large datasets that require analysis, interpretation, and ultimately visualization and presentation. Often, these goals are not easily reconciled, which leads to unsatisfactory results across one or more of these factors.

As much as we as analysts need to depict the data accurately and meaningfully, if we don’t do so with an attractive visual approach we risk not having our message get communicated at all. Merely presenting our data in a table may technically get the job done, but is also likely to bore the reader to tears while simultaneously failing to deliver the key messages. At the other extreme, we can pull out individual bits of the data and spend our time creating flashy infographics that may capture attention but fail to represent the data in its proper context. All flash, no substance. Neither approach is terribly effective.

At the same time, we may present all of the information using a reasonable visual approach that preserves the integrity of the data while still falling short of creating a fulfilling user experience. This is what I recently experienced with the Neil Young project, as I’ll detail below.

After spending a few days getting the data from the AllMusic site into Excel, and eventually as node and edge files into Gephi, it was finally time to create the network data visualization. I was determined to attempt one of the many force-based methods used in network graph analysis to create the graph. These methods are very popular and useful for creating graphs out of a variety of data networks, allowing viewers to see the larger patterns at work within the data.

After a few iterations, I wound up with a serviceable graph that covered most of the basics I spoke of earlier – all the data was exposed, element types were sized and color-coded for easier interpretation, and the project was navigable via the web. Here’s a look:

neil_young_gephi_20141023

Not bad, but there was something nagging at me as I viewed it, tweaked it, played with the styling, and so on. Everything was technically fine, but something was missing. So back I went to Gephi to find the answer. The next day, it occurred to me – I was using the wrong approach for the type of data I was trying to depict. Where the force-directed approach is ideal for dense, social media type networks, this was a unique network that didn’t possess the same structure. Therefore, it was not as aesthetically appealing or as intuitive as it could be.

After iterating through a few approaches, I came across a winner that best exploits the structure of the underlying data while conveying a far more intuitive feel to the end user. Why not have Neil at the center of the graph, surrounded by all of his albums, ordered by release date? On top of this, I could then have the style and mood data form an outer ring, as they needed only to link to the albums in some fashion. Now we have something that conveys the same information as the first attempt, but in a much more pleasing layout relative to this dataset. See for yourself:

neil_young_gephi_20141024
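The concentric arrangement can be expressed as simple trigonometry: the artist at the origin, albums evenly spaced on an inner ring in release order, and styles and moods on an outer ring. The sketch below is one way to compute such coordinates in plain Python; it is not how Gephi produced the layout above. The radii, the clockwise start at twelve o'clock, and the tag labels are assumptions.

```python
import math


def ring_positions(items, radius, start_angle=math.pi / 2):
    """Place items evenly on a circle, clockwise from the top."""
    n = len(items)
    positions = {}
    for i, item in enumerate(items):
        angle = start_angle - 2 * math.pi * i / n
        positions[item] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


def radial_layout(center, albums, tags, inner=100.0, outer=200.0):
    """albums: (title, release_year) pairs; tags: style or mood labels."""
    ordered = [title for title, _year in sorted(albums, key=lambda a: a[1])]
    layout = {center: (0.0, 0.0)}
    layout.update(ring_positions(ordered, inner))
    layout.update(ring_positions(sorted(tags), outer))
    return layout


# A few albums with release years; the tags are illustrative labels only.
ALBUMS = [
    ("Harvest", 1972),
    ("After the Gold Rush", 1970),
    ("Rust Never Sleeps", 1979),
    ("Harvest Moon", 1992),
]
TAGS = ["Folk-Rock", "Country-Rock", "Reflective"]

layout = radial_layout("Neil Young", ALBUMS, TAGS)
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.1f}, {y:.1f})")
```

Because the album ring is ordered by year, reading clockwise around it follows the discography chronologically, which is what makes shifts in style over time easy to spot.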

The new version addresses the issues of aesthetics and intuition where the first graph fell short. All moods and styles are now easily found; the same is true for all albums. Highlighting a single mood (or album) also provides an information-rich view for how the music changed over periods in Young’s career. This was nearly impossible to see in the initial layout.

So the message is this – visualizations don’t need to sacrifice aesthetics and intuition in order to be effective. On the contrary, they should take advantage of these attributes to increase their appeal and impact. Don’t be afraid to experiment until you find the right formula, as it seldom presents itself the first time around, and trust your instincts.

Network Data Adventures with Gephi

Posting to the blog has become a luxury recently, what with a summer full of youth baseball, some organizational changes at work, summer home projects, and of course the upcoming Gephi book. I’ve learned that one has to be especially good at finding synergies between projects in order to get everything done. So it is with creating any new work in Gephi while writing the book. Any new projects will necessarily be created while in the process of developing material for the book.

Earlier this year, I created a host of network visuals, one per franchise, showing the relationships between all players who suited up for a given major league baseball team. This data made for some interesting visuals that were fun to explore. What the graphs didn’t do was to provide visual cues about how the players could have been grouped – by decade, position, birthplace, and so on. So the logical evolution was to take this idea and extend it as an example for how to use partitioning and clustering to visually segment a network graph.

Recently I began playing with this idea by looking at a few of these examples, and have included some in one of the book chapters. I’ll use some slightly different cases here to avoid redundancy, but the principles are identical. I’ll walk through an example for how we can extract intelligence from a network graph in a few easy steps, using the Boston Red Sox from 1901 through 2013.

  1. Start with the base graph, having used a layout algorithm to arrange it in some fashion. I used the ARF approach for this example.
  2. Size the nodes in the graph using some criterion, such as the number of games played as a catcher. This will help users to quickly spot the dominant players at that position.
  3. Color the nodes using a categorical variable like decades. In this case, the color will reflect the first decade a player suited up for the Red Sox.
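Steps 2 and 3 amount to deriving two attributes per node: a size from a numeric column and a color from a category. Here is a minimal sketch of that derivation in Python; the player rows, size range, and palette are all made up for illustration, and in practice Gephi's own ranking and partition tools do this work.

```python
# Hypothetical sample rows; names, years, and game counts are made up.
PLAYERS = [
    {"name": "Catcher A", "first_year": 1908, "games_at_catcher": 120},
    {"name": "Catcher B", "first_year": 1972, "games_at_catcher": 900},
    {"name": "Catcher C", "first_year": 1997, "games_at_catcher": 1400},
    {"name": "Outfielder D", "first_year": 1975, "games_at_catcher": 0},
]

# Arbitrary palette; colors repeat if there are more decades than entries.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def first_decade(year):
    """Round a year down to the start of its decade, e.g. 1997 -> 1990."""
    return year // 10 * 10


def scale_size(value, max_value, min_size=5.0, max_size=50.0):
    """Linearly scale a value into the node size range."""
    if max_value <= 0:
        return min_size
    return min_size + (max_size - min_size) * value / max_value


def style_nodes(players):
    """Return name -> {size, decade, color} for each player."""
    decades = sorted({first_decade(p["first_year"]) for p in players})
    color_for = {d: PALETTE[i % len(PALETTE)] for i, d in enumerate(decades)}
    top = max((p["games_at_catcher"] for p in players), default=0)
    styled = {}
    for p in players:
        decade = first_decade(p["first_year"])
        styled[p["name"]] = {
            "size": scale_size(p["games_at_catcher"], top),
            "decade": decade,
            "color": color_for[decade],
        }
    return styled


styled = style_nodes(PLAYERS)
for name, attrs in styled.items():
    print(name, attrs)
```

Players who never caught a game keep the minimum size, so the dominant catchers stand out while the rest of the roster stays visible as context.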

In sequence, here are the three graphs:

redsox_1

Kind of dull – nothing but a lot of identical nodes and their connections. Let’s apply sizing based on the number of games played as a catcher:

redsox_2

Now that pops a few things! We have some easy starting points to work from. How about coloring the nodes by decade to see if that adds to the story:

redsox_3

Hmmm. Maybe this gives us some additional insight as well. Certain decades are split amongst multiple catchers, while in other cases we have a single dominant player. Of course we would want to allow the user to identify each of these cases (for example, the large green node at the top left is Jason Varitek) through some labeling or interactivity.

So you get the idea of how a couple of simple tweaks can change the way we view a graph. I’ll be using a similar approach in the book to help readers create powerful stories with their own data.

Mastering Gephi Book Update

Lest anyone think of me as a full-time author, rest assured that the likes of J.K. Rowling are not trembling in fear. Even if I had the ability to conjure up creative plots, I type too damn slow to make it as a full-time literary lion. Fortunately, I don’t have to depend on my keyboarding skill (or lack thereof) as a full-time pursuit. Which brings me around to my topic – the current book I’m authoring on Gephi.

For those not exposed to networks and network analysis, Gephi is a French-based open source project that makes it possible for all sorts of users (including moi) to create interesting graphs from connected datasets. By connected I am referring to data where the individual nodes are connected in some way, shape, or form. This could be anything from movie actor databases to Facebook friend networks to baseball player connections. Anyone with a spreadsheet full of data and a bit of effort and persistence can use Gephi to create cool looking graphs that also tell a story of some sort.

My job in writing the book is to help people make sense of all the features and capabilities within Gephi, some of which are a bit complex to master. In the process, I get to learn more about the theory behind network analysis, and with it terms such as contagion, diffusion, clustering, and homophily. It’s really fascinating if you’re into understanding how people and institutions interact, contagion processes function, or how product adoption can be affected by the structure of a network. My higher math skills are not good enough to be at an academic level with this stuff, so I have to compensate with some logic and visual acuity.

Anyhow, here’s some of the stuff that Gephi can create:

7344OS_cover_01

7344OS_cover_02

7344OS_cover_03

I’m hoping that one of these images will serve as the book cover come publishing time, which should be sometime this fall. In the meantime, I have six more chapters to write (of 10 total), and will have the added joy of working through chapter edits where others catch the mistakes I’ve made.

3 More MLB Network Graphs

Getting rolling now, using a templated approach to create a handful of franchise graphs, with many more to come. The first five cover the Tigers, Cubs, Red Sox, Dodgers, and Giants, showing all the connections between players from 1901-2013 within each franchise’s history. All credit is due to Gephi, the ARF layout, and the Chinese Whispers clustering algorithm. Data is courtesy of Sean Lahman’s baseball database. I’m merely the conductor who gets to bring these great tools together.

Here’s the roster if you want to go to a single graph, or you can go to the network graphs gallery on my website:

Check them out and let me know what you think.
