A Network Graph of MLB All-Stars

With the Major League Baseball All-Star game this week, I was suddenly possessed by an urge to create a network graph of all the players who have been selected for the game from 1933-2015. The goal, as with most of my network graph efforts, was to see where the data took me, and what stories it might tell. Along for the ride was my trusty companion Gephi, version 0.9.1 in this instance. Data for this exercise comes, as it so often does, from the Lahman baseball database, which happens to have a nice table with all the necessary all-star information.

The challenge inherent in this data, as with many temporal datasets, is to create an interesting graph that isn’t entirely driven by the time element, in this case as baseball seasons. Some otherwise excellent layout algorithms tend to turn this sort of data into long, worm-like displays that are not visually appealing. I could just as easily use a timeline if that were the goal of the visualization. So in an effort to balance aesthetics with the underlying data, I finally settled on the radial axis layout in Gephi. With this layout, we have the ability to create multiple axes radiating from the center of the graph, grouped by something meaningful. In this case, that turns out to be the modularity class, a sort of clustering mechanism that groups nodes together based on common or similar characteristics.

After some trial and error, I wound up with 13 distinct classes, a nice manageable number for this type of display. The end result can be interpreted as some sort of exotic, colorful starfish, or perhaps as a multi-colored fireworks display. In any event, I believe it tells an interesting story in a visually appealing manner, and allows for understanding the common threads within each cluster of players. Here’s the complete graph layout:


I’ll spend the rest of the post with some quick analyses of each group, and then point you to the entire graph in interactive form so you can discover your own patterns and learn more about the all-star connections of individual players. I’ll also provide a quick overview for how to read the sidebar output for the graph when you interact with the data.

Our 13 clusters (you can think of them as cohorts) tell us some interesting things about the history of all-star participants. Let’s walk through each of the 13 (numbered 0 through 12) to learn more. The clusters begin with 0 (in green) at the upper left and move counter-clockwise around the graph. Each is rank ordered from small to large radiating out from the center, so the member with the most years as an all-star will be at the tip of each group. That individual will serve as our focal point in each of the following screenshots, followed by a brief overview of other members of the cohort.

First up is our Cohort 0, headlined by Miguel Cabrera, with 10 selections through 2015. Obviously, this would appear to be a cohort of current or recent all-stars based on Cabrera’s appearance. We can easily navigate the graph to see if that’s the case. Who else is prominent in the group? Yadier Molina, Matt Holliday, and Robinson Cano, to name a few, all big name stars for most of their careers. How about at the low end of the spectrum, players with a single all-star selection? Here’s where we find the likes of Billy Butler, R.A. Dickey, and Melky Cabrera. All long-tenured, noteworthy players, but certainly not in the same category as the first group.


Cohort 1 takes us on some time travel, with Johnny Mize as the representative star, also with 10 all-star selections and Hall of Fame membership as well. Joining Mize in the group are Bobby Doerr, Vern Stephens, Joe Gordon, and Bob Feller, all Hall of Famers with the exception of Stephens. At the other end of the cohort, each with one appearance, are Oscar Grimes, Red Barrett, and Nick Etten, among others. Based on the career arcs of the stars in this group, we could characterize it as primarily a 1940s-based cohort, certainly with overlap into the surrounding decades.


Derek Jeter is our icon for Cohort 2, so it figures to be a group that immediately precedes the Cabrera-led Cohort 0. Perhaps the focus here will be on stars from the early 2000s, at the center point of Jeter’s long career. Mariano Rivera, Albert Pujols, and David Ortiz are among the top stars here, confirming the hypothesis that this group is largely post-2000 in nature. Among the lesser knowns with a single selection each are Gil Meche, Joe Crede, and Ryan Ludwick.


Cohort 3 is headed up by the legendary Stan Musial, whose career covered the entirety of the 40s and 50s. Given that Cohort 1 was largely concentrated on players from the 40s, we might anticipate more of a skew towards 1950 and beyond. We’ll see in a moment if that’s true. Next to Musial we have Ted Williams and Warren Spahn, two more whose careers spanned both decades, so perhaps we have players here with greater longevity compared to the Mize cohort. Let’s go a bit deeper, where we find Roy Campanella, Larry Doby, and Robin Roberts, all with career pinnacles primarily in the 50s. So while there will certainly be connections across the two groups, Cohort 3 does appear to span more of the 1950s compared to Cohort 1.


With Cohort 4 we see a very large group fronted by all-time hits leader Pete Rose. So we could be focused on the 1960s or 1970s here; Rod Carew, Reggie Jackson, and Mike Schmidt, Hall of Famers all, are included, so the 1970s would seem to be the dominant theme. Perhaps we shouldn’t be surprised at the size of this group, as expansion in the 1960s afforded more players the opportunity to become an all-star. A few interesting figures can be found at the single game end of the radian – Bob Horner, Kent Hrbek, and Lonnie Smith, all with at least momentary brushes with greatness, but good enough to qualify for just one all-star nod apiece.


Joe DiMaggio is the lead for our next group, joined by the likes of Mel Ott, Bill Dickey, and Joe Medwick. The skew is toward the late 1930s and beyond; many members of this cohort would have had limited all-star game opportunities, as the game originated only in 1933. As proof of this, we find Hall of Famers Heinie Manush, Goose Goslin, and Kiki Cuyler at the low end of the radian, each with just a single all-star credit.


Hall of Famer Al Kaline heads up Cohort 6, so we know we’re in either the 50s or 60s, or more likely, a bit of both decades. Along with Kaline we have Mickey Mantle, Yogi Berra, and Ernie Banks, each of who had multiple appearances covering both decades. At the more modest end of the group we find Rocky Bridges, Bob Cerv, and perhaps surprisingly, the slugger Joe Adcock, each with just a single season as all-stars.


Cohort 7 is our one group that’s difficult to explain. Our graph modularity settings forced a small cohort of just nine players; Hank Aaron with 21 seasons, and eight others with a single season each. Consider this one a bit of a fluke.


The great Willie Mays leads the relatively small Cohort 8, joined by both Brooks and Frank Robinson, as well as Roberto Clemente. This would indicate a cohort of players who began in the 1950s and perhaps peaked in the 60s. Lower down the list this group features lesser known players like Tito Francona, Dick Howser, and Joey Jay, all one season all-stars.


Cohort 9 is a very large group led by Barry Bonds, Ivan “Pudge” Rodriguez, and Ken Griffey. Here we have three stars with illustrious careers launched around 1990 and extending into the new millenium. At the other extreme we find one-timers such as Jay Buhner, Mark Grudzelianek, and Lance Johnson.


Cohort 10 is a mid-sized group headed up by Alex Rodriguez, and featuring Manny Ramirez, John Smoltz, and Scott Rolen. This would appear to be a very similar group (if a few years later) to the prior cohort, and we should expect to find a great number of crossover connections between the two, as they each cover players from similar time periods.


Down to the final two groups! Cohort 11 is led by Cal Ripken, Ozzie Smith, and Roger Clemens, all major impact players in both the 1980s and 90s. This is another group featuring players who were quite talented but dented the all-star ranks just a single time. Among these were outfielders Jesse Barfield and Kevin Bass, and pitcher Teddy Higuera.


With Cohort 12, we see a large group led by 18-time all-star Carl Yastrzemski, supported by Johnny Bench, Tom Seaver, and Harmon Killebrew. These are all players who were at their productive peaks in the late 1960s through mid-1970s, an era where the National League was dominating the annual game. Chuck Hinton, Jerry Lumpe, and Joe Azcue are among those who can claim a single trip to the all-star game as a career highlight.


Quick note on the sidebar – you’ll see a few measures which I won’t go into too deeply here; there are 3 centrality measures (influence within the network), eccentricity (the number of steps to traverse the network, think six degrees of Kevin Bacon), and size, reflecting the number of seasons as an all-star. The key part of the sidebar lies in the listing of all connected players to the one currently selected, with numbers indicating the number of games as co-all-stars. Use these links to navigate through the network quickly. It’s fun!

So that’s it for our brief analysis. Now it’s time to explore for yourself by opening the MLB All-Star Network visualization. Be patient as the data loads; once it has cached the graph should be fairly fast at zooming, panning, and allowing you to explore to your heart’s content.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Exploratory DataViz Part 2

Having discussed some of Exploratory’s cool features in a prior post, I thought it would be fun to continue the exploration using JSON data as a starting point. I happen to have a fair amount of JSON on hand, thanks to a series of network graphs produced using Gephi and sigma.js, so why not put it to use with Exploratory and start creating a new dataviz?

If you have previously worked with JSON, you’re no doubt aware that it can be a bit fickle – miss a bracket or brace in one place and the entire file fails to load a visualization.However, knowing that my JSON has been successful in producing network graphs (see here for examples), I figured it was worth a shot with Exploratory.

To begin, start with the local import option, selecting the json option, and pointing it to your local file. Give it a name, run the process and cross your fingers! After a few seconds, I’ve got my results, and Exploratory has done a good job categorizing the data:


Since this is network data, we have nodes and edges, as well as any additional attributes, such as color or size. Exploratory has picked up those groupings, first the edges, and now the nodes.


Finally, the attribute values:


Since we’re satisfied with the import, we can move on to the summary data, which in this case doesn’t make a whole lot of sense. No matter, let’s see what can be done with some charts and analysis.


To start with, we have x and y values associated with each node, which sounds like a perfect candidate for a scatter plot. We add the x value to the x-axis (how convenient was that!), the y value to the y-axis, node size as the Size attribute, and finally the Eccentricity attribute for color. FWIW, eccentricity is not a measure of flakiness, but rather the distance between the most remote points in a graph. This is where the six degrees of separation (or Kevin Bacon, take your pick) concept comes into play; an eccentricity value of 6 equates to 6 degrees of distance. Here’s our result:


Not bad, eh? We can also hover over each node to see who it is (after adding Id to the Label field):


We still have a lot of activity in a limited space, so now let’s use a simple filter (see the command line at top) to grab the top 50 values, and see the results:


Now let’s create a new branch to explore further. I would like to sort my dataset using the Betweenness Centrality attribute, but there’s one problem – it’s a character value at the moment, so it doesn’t sort numerically. No matter, we can fix that easily using the Mutate command to convert the variable type. This can be seen in the right margin, where Exploratory conveniently stores all actions. Now we can sort our values in descending order to understand who is most influential in the network (at least by this measure). FYI – Betweenness Centrality tells us which nodes others must pass through most frequently to connect elsewhere within the network. Typically, but not always, it is someone centrally located within a network; sometimes it may be a less influential character (Pedro Borbon in this case) who connects more distant groups to one another.


So there you have it, another quick walk-through with Exploratory. Before I sign off, here’s the live scatter plot you can play with via the Exploratory server. Be sure to use the simple zoom features to traverse the chart!

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Open Source Data Viz: Exploratory

It’s absolutely a great time to be alive and involved in data viz, courtesy of the wealth of exceptional open source projects. Several recent open source discoveries are currently on my radar, and worthy of further exploration. Over the next few weeks I’ll examine a few of these options, using baseball data (of course) to illustrate the possibilities within each application. Specifically, we’ll take a look at Trelliscope, bokeh, rbokeh, and Exploratory, and provide some insight and examples into how each of these projects function. This post will focus on Exploratory, an exciting new tool from Kan Nishida.

Exploratory is another R-based application that leverages a multitude of R capabilities while providing its own intuitive interface. While still in beta testing, Exploratory appears to have a very bright future as a powerful visualization tool that allows non-coders to tap into the enormous power of R. The ability to harness a considerable portion of the R language through Exploratory’s GUI is a powerful option for those (like me) with limited R experience and expertise.

Exploratory has a very clean, intuitive interface that may feel a little unusual to long-time R users accustomed to multiple panes and workspaces. Yet beneath the surface, it possesses considerable power, as we’ll see in this tutorial. To start our process, we’ll need a data frame, a familiar object for R users. Let’s begin by examining our data frame options.

First up, we can load a local source file in a variety of formats:
Some of the usual suspects are here – text and Excel files, but we also have the ability to load json data as well as some of the more prominent statistical formats including SAS and SPSS data. Very cool. We’ll come back to this later.

Now let’s see the remote options:


Great! Not only can we gain direct access to MySQL databases (a huge plus for me), Exploratory also provides access to a diverse range of option including Twitter search, MongoDB, and web scraping. We’re going to look at some specific examples later, but for now, here’s a glimpse of the MySQL data import window:


As with the entire app, the design is clean and intuitive. In a bit, I’m going to load details into this window so we can test the MySQL functionality.

A third import option exists in the availability to access any existing R scripts you may have previously created:


I’m not going to spend a lot of time here, due to the fact that I don’t have a lot (any?) of personal scripts. However, for seasoned R coders, this seems like a great feature.

Now let’s walk through some of Exploratory’s capabilities using a MySQL connection. The MySQL setup is really easy – just fill in your database connection parameters and you’re good to go. Here’s what it looks like for this example, with a few fields grayed out for security reasons.

MySQL connection

Once the connection is established, Exploratory will display the initial rows in the dataset. If we click the Run button, our data is pulled into a Summary view, where every variable in the data is summarized. This is a great way to see if our data looks as expected, and allows us to determine if the correct variable type (integer, date, etc.) is associated with each field.


If everything looks good, we can move on to the Table option, which will resemble the MySQL view we just saw. No surprises here:


If we’re satisfied so far, then it’s time to move on to the fun aspects of Exploratory. For me, this starts with viewing data using the Charts selection. As of this writing, there are 10 chart options (two are actually mapping selections for geo data) including bars, scatter plots, box plots, heatmaps, and more. For me, this is a real strength of Exploratory; the ease with which we can see plots of our data is great! Here I’ve chosen a couple stat fields (at bats (AB) and runs (R)) to illustrate the scatter plot functionality.


The charts are clean and attractive, and provide some additional options. For scatter plots, labels can be added via a simple check box. This permits me to add hover labels, as seen below:


Pretty nice so far, don’t you think? But as the old commercials used to say ‘wait, there’s more’. The considerable power of R lies beneath the surface, enabling statistical testing, filtering, data manipulation, and so much more. Here’s a glimpse of just a handful of available options for working with your data:


Let’s select a filter option, where we’ll reduce the data to look only at players age 30 or greater. One of the other great aspects of Exploratory is it’s exposition of R code. We can use the built in menu commands while viewing the actual R code. For experienced R users, the functions can be entered directly in a text box, and for us less experienced coders, we can learn on the fly by seeing the output.


Now we see the same scatter plot populated with players 30 and older.

Another great feature is the ability to create branches within a project. This facilitates going down multiple paths within one workspace, rather than having to retrace our steps or rerun charts each time something changes. All we need to do is click the branch button, and a new tab is created for us. Very simple and intuitive, as is virtually everything in Exploratory.


In this instance, we’ve elected to run a correlation on the chart variables in our main flow, while we create a new box plot in our branch.


I’ve been very impressed thus far with Exploratory, and have barely scratched the surface. My next step will be to create some real content that can be shared in a post or via some new visualizations on the site. I love the ease of accessing my data via MySQL, and immediately having the ability to create plots, filter data, and run statistical explorations.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

ODSC: Analyzing Complex Networks Part 2

This is part two of a brief series sharing components of my presentation titled Analyzing Complex Networks Using Open Source Software at ODSC East in Boston on May 21st. The first post looked at a few examples from a Boston Red Sox players network, while this one examines a Miles Davis album and musician network. I’ll share a few examples of network analysis within the context of the Miles Davis graph.

The Miles Davis network could be described as a tripartite network, or one with three layers. Miles is at the center, and connects to each of nearly 50 recordings. Other musicians then connect to the respective recording(s) they played on, but not to one another. This approach provides a very clear look at musical phases in the career of the legendary trumpeter, without the graph being clouded by excessive detail. Here’s a view of the final network, after which we’ll look at some components of the graph.


We see some interesting patterns in the graph, specifically in viewing the pink circles, which represent individual albums. Musicians playing on a recording can be seen adjacent to that recording, except in the case of musicians present on multiple albums. We would expect them to be positioned relative to all of the recordings they played on. A quick visual scan leads to five distinct clusters, as seen in the next screenshot.


Now that we have identified these clusters, it would be helpful to understand their meaning and relevance to Miles career. Using the graph in interactive fashion, we can learn more about the recordings and musicians, and begin to formulate some insights. These can be confirmed by referring to album links on the web or in Wikipedia, which give context to what we are viewing. Based on these steps, here is a quick overview of the five clusters.


A final step might be to add some verbiage using PowerPoint or Inkscape, which I’ve done below in very minimalist fashion. We could also add this to a web version using CSS attributes to position the text, although this could get tricky as we pan and zoom on the graph. We might be better off using some sort of stylized marker (color or shape) to communicate some of this information.


There is much more that could be done, but I hope this brief example shed some light on the usefulness of network graphs, especially from a pure visual perspective.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

ODSC: Analyzing Complex Networks Using Open Source Software

I’ll be presenting at the 2016 ODSC East event in Boston May 20-22. ODSC stands for Open Data Science Conference, where the focus is on using open data or open source tools to do clever things in the information space. The topic of my presentation is Analyzing Complex Networks Using Open Source Software, where I’ll talk through several example networks built using Gephi and Sigma.js.

While the slides are not all prepared at this stage, I’ll share a few bits that will wind up in the talk. My goal is to convey to the audience how networks can be used to statistically and visually understand complex information. After providing an overview of network analysis (at a very high level), I’ll be sharing slides from three very different networks – a Miles Davis album network (created in 2014 and rebuilt in 2016), a Boston Red Sox player network (also built in 2014), and a brand new example using data from the amazing GDELT Project.

Here’s a glimpse into what I’ll be sharing, starting with the Red Sox examples, where we examine the networks of three well known players from the last 100 years. First, Ted Williams network:


Followed by Carl Yastrzemski:


Now Jason Varitek, longtime catcher and captain for two World Series championship teams:


In talking through each of these networks, I will attempt to highlight some differences in their respective structures based on the era in which each player spent time with the Red Sox. For example, there are many more connections in the Varitek network compared to Williams and Yaz, despite a shorter duration with the team. Why would this be the case? Perhaps spending time in the era of higher salaries, larger pitching staffs, and the evolution of free agency might go a long way towards explaining why Jason Varitek crossed paths with far more players than did his earlier predecessors.

Stay tuned for additional posts featuring the Miles Davis and GDELT networks.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Baseball Grafika Book: Excel Dashboards

Baseball stats are ideally suited for display using a wide variety of charts, network graphs, and other visualization approaches. This is true whether we are using spreadsheet tools such as Microsoft Excel or OpenOffice Calc, data mining tools like Orange, RapidMiner or R, network analysis software such as Gephi and Cytoscape, or web-based visualization tools like D3 or Tableau Public. The sheer scope and variety of available baseball statistics can be brought to life using any one of these or countless other tools.

This is why I felt the need to create a book merging the rich statistical and historical data in the baseball archive with the advanced analytic and visual capabilities of the aforementioned tools. With a bit of good luck and perseverance, the book will be published in late April, coinciding with the early stages of the 2016 baseball season, under the title Baseball Grafika. This series of articles will share a few pieces from the book, which is still undergoing additions and revisions at this stage. I hope these will help provide some insight into how I view the possibilities for visualization, and perhaps generate your interest for how other datasets could benefit from a similar approach.

One of the chapters of the book deals with the creation of dashboards in Excel that allow us to distill large datasets into a single page summarizing information. Here’s an example of a single pennant race, and how it’s unique story can be told using an array of charts, tables, and graphics.


Now that we’ve seen an entire dashboard, we’ll look at the component pieces and how they were built. As a reminder, this is all created in Excel, which is often maligned as a visualization tool. Used well, Excel can produce highly effective visualizations, although deploying them to the web is not practical. In the book, I walk through how to create this dashboard using Excel, taking readers through all the steps needed to create formulas, charts, text summaries, and more.

Creating flexible, powerful data displays in Excel frequently involves the use of pivot tables and slicers (filters) that allow for data manipulation. Building charts on top of these tools permits maximum flexibility. Done effectively, this means we can create a template that can be used over and over, with only the source data changing according to our slicer selections. Here’s an example pivot table with slicer options:


The slicer selections allow us to choose the data elements from our base dataset that are to be displayed in a pivot table. From there, name ranges and formulas can be used to select the data programatically, and feed it into charts that are not dependent on any additional manual intervention. One chart, used over and over, makes it simple to display new data with a single click of a slicer button.

Name ranges can be used extensively to automate the dashboard to a high degree, using native Excel functionality. Here’s a screenshot showing a name being defined in Excel:


A virtually unlimited number of name ranges can be created, and then used as references in Excel cell formulas, making it easy to populate cells, tables, or charts with updated information.

Each of the following sections of the final dashboard are populated using one or more name ranges based on pivot table data in most cases. All that is required in the dashboard is a simple formula to grab the right data based on the slicer selection.

First, we create a basic text summary recapping each season, which is then pulled into the top section of the dashboard:


This is then followed by the pennant race section of the dashboard, including both the pennant race charts as well as a table of season-ending standings information. One pivot table and its references populate the chart, while a second pivot is used to provide the table data, with cell-level formulas performing calculations.


Our third section makes use of the wonderful Sparklines for Excel add-in. Our dashboard benefits from the use of horizon and variance charts, as well as box plots. In between, we’re able to add some additional Excel cell calculations to display metric values.


The final section of the dashboard takes advantage of some cell formulas to create dotplots displaying relative values within a category. This allows readers to see who was higher or lower in a specific measure, maximizing space along the way, which is often critical when building dashboards.


The book will provide much more, including tutorials on creating this type of dashboard, in addition to other visual displays of baseball information. Ultimately, the goal is to share some of my approaches and hope that they drive others to create their own unique approaches, all in the interest of advancing the discipline of baseball data visualization.

Future posts will examine other ways we can explore our baseball data. Text mining, statistical distributions, interactive charts, historical maps, and network graphs will be among our future topics. See you soon.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

A New Book Resolution for 2016

As we enter a new year, I find myself eager to create a new book that explores the world of baseball data using a wide array of data visualization approaches. This idea has been in my head for several years at least, and has found partial fulfillment in my previously published pennant races book. However, I wish to tackle something broader that will touch a number of baseball categories as well as multiple data visualization approaches.

The working title for the book is ‘Baseball Grafika’, grafika being the Czech and Polish word for graphics, a word which still conveys the intent of the book regardless of language. If all goes well, the book will be available early in the 2016 baseball season, and will cover the following topics:

  • Franchise player networks
  • Trade pattern networks
  • Hall of Fame connection network
  • Franchise location maps
  • Player birthplace maps
  • Pennant race charts
  • Standings charts
  • Career trajectory graphs
  • Baseball dashboards

Fortunately, much work has been done over the last several years on at least a few of these topics, so we’re not starting from scratch, but this will still be a considerable, yet rewarding, challenge. Updates to come.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Warsaw Data+ Presentation

I have the good fortune to be the keynote speaker at this year’s Data+ conference in Warsaw, Poland on November 26th, so the traditional American Thanksgiving meal will not be in store for 2015. This is a very exciting opportunity, and comes on the heels of having presented in Boston at the Data Visualization Summit in September 2015, so it’s been a busy last few months getting presentations squared away.

My topic at the conference is Data Driven Storytelling, where I’ll walk the audience through some of my approach and philosophy about using data visualization to deliver information and insights about specific topics. In addition to the talk, I’ve created a story on my visualidity.com site that chronicles the last 21 seasons of play in the Ekstraklasa, the top level of play in Polish football.

Thus far it has been an absolute joy working with the folks at IDG/Computerworld, who are responsible for running the event. Patrycja Kuriata, Program Director for the conference, has been incredibly responsive and helpful with any questions or details, and has made the entire process a pleasurable one.

I’m putting the wraps on my content as October comes to an end, and look forward to visiting Poland in a few weeks, and reporting back on the conference as well as on the few days of sightseeing in my plans.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Data Visualization Summit 2015

I’m in the process of pulling together a presentation for next month’s Data Visualization Summit in Boston, a conference organized by the Innovation Enterprise team. The event attracts 150-200 industry folks to see what can be done using data visualization approaches. I committed to share some insights on using network visualization to visually analyze customer behavior, and after a few weeks of tossing ideas around, have settled on a final approach. Now it’s time to actually put some data together and create some impressive visualizations for the presentation.

The end goal is to share how interactive network graphs can be used to tap into customer insights from several angles. There are three levels of analysis I’m hoping to share with the group, using some wholly fictitious data for a consumer products company. In order, the three stages are:

  1. Create a network that displays customer purchase patterns by product, providing a quick yet insightful visual overview showing who buys what, and how different products intersect with one another. For example, we might see a strong visual correlation where the shoppers who purchase Product A also buy Product D, but rarely purchase Product C. This in itself should provide some value, although other visualization methods could also perform this task, albeit in a less elegant fashion.
  2. Stage two is to focus on overall customer satisfaction levels (with the company rather than individual products), and potentially on an individual product basis, although this gets a bit more complex to execute. Through the effective use of color, we can scale satisfaction levels using the original purchase graph, thus providing a more powerful visual image. Decision makers can now easily view multiple attributes in a single visualization, something that is often difficult to achieve using conventional charts or tables.
  3. The third stage providers viewers with the ability to see actual customer comments, including summarized versions of said comments. This will enable analysts and decision makers to discover common themes that may be linked to low (or high) satisfaction levels. Again, this would be a challenging task using other visualization approaches, but can be handled effectively using well designed network graphs.

So how do we pack all this information into a single, easy to use visualization? For starters, we employ Gephi, the powerful network graph tool that allows us to convert purchase behavior data into nodes and edges that define our network graph. We can use Gephi to define the best layout for our dataset, create specific groups, make adjustments to sizes and colors, and so on. From there, we’ll be exporting the graph file using the Gexf-JS Web Viewer plugin, which will enable user interactivity through a browser. Finally, we can tweak some of the settings to deliver an attractive, intuitive, highly useful network graph visualization.

Before I forget, I must mention that the brilliant Aylien text analysis service will be used to analyze and categorize our customer comments. The results can then be included in our Gephi source files, adding another layer or two of rich insights to the data and ultimately the network graph. Integrating text analysis results with transactional customer information is an area that continues to evolve, and is a key component in understanding the present and predicting the future of customer behavior.

I hope to share the final deck at a future point, or at least the network graph that makes up the primary component of the presentation. Until then, happy visualizing!

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More

Eyeo Day 4 – Reality, Fantasy, and Being Human

Day 4 (the final day) at Eyeo is always a little bittersweet, knowing that the great energy one feeds off is about to come to an end. At the same time, there is still one more day of great talks, followed by a closing party that provides yet another opportunity to talk with creative types from a variety of places and backgrounds.

Leading off the day’s schedule was Nicky Case, heretofore unknown to me and perhaps many others in the room. That was about to change, as Case took us on a tour of his personal and professional life, delivered with great panache. Turns out he is a masterful storyteller, embodied by his interactive work on projects like the ‘Parable of Polygons’ and ‘Explorable Emotions’. Every year, there are at least a couple talks that go way beyond expectations, and this turned out to be one of them for this version of Eyeo.

Next up was Beatrice Lartigue, a French artist who walked through a few of her interesting projects, including some interactive installations. Of particular note was an active learning project based on Prokofiev’s classic ‘Peter and the Wolf’.

The afternoon began with ‘Mapping Police Violence’ presented by Deray McKesson and Sam Sinyangwe. McKesson has employed Twitter as a powerful platform for protest, while Sinyangwe has taken the route of documenting police violence by mapping incidents, thus allowing for a more factual approach to identifying police forces with chronic issues.

To cap the afternoon session, Nick Hardeman and Theo Watson of design i/o took the audience through a magical tour of their ‘Connected Worlds’ project, now installed at the New York Hall of Science. This highly interactive exhibit is full of engaging characters that will surely draw children in while simultaneously teaching lessons about interactions with nature.

On to dinner, once again navigating my way to the hot North Loop area, this time to Borough. Borough not only has some terrific food, but also offers a very intriguing wine list with seldom seen options from across the globe. A couple glasses of wine, a delightful halibut terrine, and a very good guinea hen dish later, it was time to head to Nicollet Island for the closing talks and party.

Eyeo veterans Jake Barton and Zach Lieberman closed this year’s festival, with Barton discoursing on memory and future, as seen through a compelling exhibition created for the 9/11 Museum. Lieberman delivered a heartfelt, emotional farewell to his recently departed father, who told him that ‘storytelling is not about technique, but being fully human.’

The final party provided an opportunity to chat with a few more festival friends, prior to becoming a bit melancholy when the realization sinks in that Eyeo is coming to a close for another year. I took one last look at the creative people talking, sharing, and enjoying the scene, before electing to walk back to the hotel. A long walk seemed to be the best way to process my thoughts, think about what I learned and who I met, and how it might inform and inspire my work over the coming months. So long, Eyeo.

facebooktwittergoogle_plusredditpinterestlinkedinmailfacebooktwittergoogle_plusredditpinterestlinkedinmailby feather
facebooktwittergoogle_pluslinkedinrssfacebooktwittergoogle_pluslinkedinrssby feather

Read More