Batting Explorers Updated Through 2017 Season

I’m pleased to share that the Batting Explorer visualization for the 2010s decade has now been updated with 2016 & 2017 statistics. This ongoing project captures batting statistics at the season level for every major league batter, and visualizes them in a baseball card type of format, as seen below:

Batting Explorer

A number of filters are provided to make it easy to browse across a wide range of attributes, including all of the major batting categories:

Filters

One of my favorite aspects of the Batting Explorer is the ability to link to greater detail by clicking on a specific player card, which will transport you to the Baseball-Reference page for that player:

baseball_ref_1

This project uses the Exhibit project software originally developed years ago as part of the MIT Simile project, as well as a lot of HTML & CSS for styling purposes. Give it a try, and thanks for reading.
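For anyone curious how season-level rows might be shaped for Exhibit, here is a minimal sketch. It assumes the common Exhibit JSON convention of a top-level "items" list in which each item carries a "label". The field names (team, hr, avg), the "BattingSeason" type, and the sample values are hypothetical and are not taken from the actual Batting Explorer data.

```python
import json

# Hypothetical season-level rows; the values are invented for illustration only.
seasons = [
    {"player": "Example Batter", "season": 2016, "team": "DET", "hr": 18, "avg": 0.285},
    {"player": "Example Batter", "season": 2017, "team": "DET", "hr": 25, "avg": 0.301},
]


def to_exhibit_items(rows):
    """Wrap rows in an Exhibit-style {"items": [...]} structure, one item per player season."""
    items = []
    for row in rows:
        items.append({
            "label": f"{row['player']} ({row['season']})",  # unique display label per card
            "type": "BattingSeason",                        # hypothetical item type
            "player": row["player"],
            "season": row["season"],
            "team": row["team"],
            "hr": row["hr"],
            "avg": row["avg"],
        })
    return {"items": items}


exhibit_json = json.dumps(to_exhibit_items(seasons), indent=2)
print(exhibit_json)
```

Each item then becomes one card, and any of the fields can be exposed as a filter.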


Eyeo Day 3 – Words, Maps, Code

Day 3 at Eyeo had another intriguing assortment of speakers and topics to learn from, as well as the final planned night at Nye’s prior to its August closing. Starting the day in the McGuire Theater was Allison Parrish, who delivered an engaging talk on words and semantic tools. This is an area of interest for me, specifically in striving to visualize word connections and context, so there was much to like about the talk. Parrish has put together several interesting Twitterbots, including the Power Vocab Tweet, Library of Emoji, and Deep Question Bot, all of which I am adding to my following list. They make often silly yet clever use of semantics in the Twitter space.

Next up was Ingrid Burrington, who posed some challenging questions about the scope and invasiveness of fiber optic paths and other high tech connectivity infrastructure.

The afternoon session began with one of my favorite presentations of the festival, delivered by Ben Vershbow and Mauricio Giraldo from the NYPL Labs team. NYPL stands for New York Public Library, an institution filled with incredible resources that Vershbow and Giraldo shared. In a funny, engaging, and informative talk, they took attendees through some of the great work going on at the library, including the oldnyc.org project, which maps historical photos, and the community sourcing of Building Inspector, with its classic motto – Kill Time. Make History. Great work, fantastic talk.

Next up, Harlo Holmes talked about a few of her projects and interests, with a focus on security tools that protect users from surveillance and intrusion.

To finish the day, Ramsey Nasser presented a splendid talk on coding that was so much more than that. He talked about the need to expand the world of coding to encompass more than the traditional English-only, left to right text that underpins virtually all coding frameworks and languages. Nasser was very entertaining while driving home an important message about the need to make revolutionary changes if we are to maximize the potential for coding.

Another great set of daytime talks, and now there was to be some downtime before heading to dinner and the evening gathering at Nye’s. At least that was the original plan, until the great folks at M|I|C/A (Maryland Institute College of Art) announced a happy hour at The Third Bird, just across Loring Park. Not wanting to disappoint a generous sponsor, I joined dozens of others for a couple beers before making the 1.5 mile walk back to the hotel in a steady rain. One must be able to make sacrifices!

Dinner was planned for Pizza Nea, a spot I visited during the 2012 Eyeo Festival, for one of their excellent thin crust pies. Expecting a slim crowd on a rainy Wednesday at 8:00, I was surprised to find a nearly full restaurant. Taking a seat at the bar adjacent to the pizza making area, I was informed that I just missed perhaps their busiest Wednesday in memory. Why? Unbeknownst to me, the Rolling Stones were playing at the nearby University of Minnesota football stadium that evening, which presumably filled all the restaurants in the neighborhood prior to the show.

Finally, I made the two block walk to Nye’s, joining dozens of other Eyeoans for beer, booze, and piano bar frivolity, with Jer Thorpe in particularly good voice for his annual rendition of Neil Diamond’s Sweet Caroline. Where else but Eyeo!


MLB Batter Dashboard in Tableau Public

Tableau has revolutionized visual analysis for many users by providing a tool that makes it easy to create exceptional visualizations without the need to write code. For some, Tableau Desktop has been a godsend to rescue them from the challenges of creating meaningful charts using Excel, Cognos, SAS, or any number of other tools. For others, Tableau Public has provided an opportunity to enter the world of visualization. In my case, I use one at work and the other for my side projects, one of which I’ll introduce here.

I’ve long worked with major league baseball data provided by Retrosheet, Baseball-Databank, or Sean Lahman, and thought Tableau Public could help me to create some fascinating dashboards for users to navigate. Baseball visualization is a relatively untapped area, and one where I expect to spend more time in 2015. In the meantime, I have a prototype to share via the Tableau Public site, as seen here:

The full dashboard can be found here:

https://public.tableausoftware.com/views/MLBBatters1980sDashboard-WIP/RankingsDashboard?:showVizHome=no#1

This dashboard allows you to filter by team, batter type (left or right handed or switch hitter), league, season, or age (as of July 1st each season). In addition, filters can be set based on the number of games played at a given position. Multiple filters can be combined to provide a variety of results across 15 offensive categories. Have fun with it, and let me know what you think.
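The age filter depends on a fixed cutoff date, so here is a small sketch of how an age as of July 1st could be computed for a given season. The function name and the sample birth dates are hypothetical; the dashboard itself does this inside Tableau, not in Python.

```python
from datetime import date


def age_on_july_1(birth_date: date, season: int) -> int:
    """Return a player's age in whole years on July 1 of the given season."""
    cutoff = date(season, 7, 1)
    age = cutoff.year - birth_date.year
    # A birthday that falls after July 1 has not happened yet at the cutoff.
    if (birth_date.month, birth_date.day) > (cutoff.month, cutoff.day):
        age -= 1
    return age


# Hypothetical birth dates, evaluated for the 1985 season.
print(age_on_july_1(date(1960, 5, 12), 1985))
print(age_on_july_1(date(1960, 8, 20), 1985))
```

Using one cutoff per season gives each player a single age for that year, which keeps the filter simple even when a birthday falls mid-season.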

More will be coming – some of it can be seen currently on Tableau Public, with more to follow. I’m also planning to expand the data beyond the 1980s, so we can see patterns from more than 100 years of data. Stay tuned.


Data Visualization, Aesthetics and Intuition

As I worked through a just completed project chronicling the diverse musical career of Neil Young, some valuable (if unintended) insights were reinforced once more. I work on a regular basis with a variety of large datasets that require analysis, interpretation, and ultimately visualization and presentation. Often, these goals are not easily reconciled, which leads to unsatisfactory results across one or more of these factors.

As much as we as analysts need to depict the data accurately and meaningfully, if we don’t do so with an attractive visual approach we risk not having our message get communicated at all. Merely presenting our data in a table may technically get the job done, but is also likely to bore the reader to tears while simultaneously failing to deliver the key messages. At the other extreme, we can pull out individual bits of the data and spend our time creating flashy infographics that may capture attention but fail to represent the data in its proper context. All flash, no substance. Neither approach is terribly effective.

At the same time, we may present all of the information using a reasonable visual approach that preserves the integrity of the data while still falling short of creating a fulfilling user experience. This is what I recently experienced with the Neil Young project, as I’ll detail below.

After spending a few days getting the data from the AllMusic site into Excel, and eventually as node and edge files into Gephi, it was finally time to create the network data visualization. I was determined to attempt one of the many force-based methods used in network graph analysis to create the graph. These methods are very popular and useful for creating graphs out of a variety of data networks, allowing viewers to see the larger patterns at work within the data.
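For readers who would rather script the node and edge files than build them in Excel, here is a rough sketch. It assumes the column headers Gephi's spreadsheet import recognizes (Id and Label for nodes, Source and Target for edges). The album names and styles are placeholders rather than AllMusic data, and the helper names are hypothetical.

```python
import csv
import io

# Placeholder album records; not real AllMusic data.
albums = [
    {"album": "Album A", "year": 1970, "styles": ["Folk-Rock", "Country-Rock"]},
    {"album": "Album B", "year": 1979, "styles": ["Hard Rock", "Folk-Rock"]},
]


def build_node_edge_rows(artist, albums):
    """Create artist, album, and style nodes, plus artist->album and album->style edges."""
    nodes = [{"Id": artist, "Label": artist, "Type": "artist"}]
    edges = []
    seen_styles = set()
    for rec in albums:
        nodes.append({"Id": rec["album"], "Label": rec["album"], "Type": "album"})
        edges.append({"Source": artist, "Target": rec["album"]})
        for style in rec["styles"]:
            if style not in seen_styles:  # add each style node only once
                seen_styles.add(style)
                nodes.append({"Id": style, "Label": style, "Type": "style"})
            edges.append({"Source": rec["album"], "Target": style})
    return nodes, edges


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text; write the result to a file before importing it."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


nodes, edges = build_node_edge_rows("Neil Young", albums)
nodes_csv = to_csv(nodes, ["Id", "Label", "Type"])
edges_csv = to_csv(edges, ["Source", "Target"])
print(nodes_csv)
print(edges_csv)
```

The extra Type column is an assumption on my part; it simply gives Gephi an attribute to size or color by.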

After a few iterations, I wound up with a serviceable graph that covered most of the basics I spoke of earlier – all the data was exposed, element types were sized and color-coded for easier interpretation, and the project was navigable via the web. Here’s a look:

neil_young_gephi_20141023

Not bad, but there was something nagging at me as I viewed it, tweaked it, played with the styling, and so on. Everything was technically fine, but something was missing. So back I went to Gephi to find the answer. The next day, it occurred to me – I was using the wrong approach for the type of data I was trying to depict. Where the force-directed approach is ideal for dense, social media type networks, this was a unique network that didn’t possess the same structure. Therefore, it was not as aesthetically appealing or as intuitive as it could be.

After iterating through a few approaches, I came across a winner that best exploits the structure of the underlying data while conveying a far more intuitive feel to the end user. Why not have Neil at the center of the graph, surrounded by all of his albums, ordered by release date? On top of this, I could then have the style and mood data form an outer ring, as they needed only to link to the albums in some fashion. Now we have something that conveys the same information as the first attempt, but in a much more pleasing layout relative to this dataset. See for yourself:

neil_young_gephi_20141024

The new version addresses the issues of aesthetics and intuition where the first graph fell short. All moods and styles are now easily found; the same is true for all albums. Highlighting a single mood (or album) also provides an information-rich view for how the music changed over periods in Young’s career. This was nearly impossible to see in the initial layout.
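The concentric idea (artist at the center, albums on an inner ring in release order, styles and moods on an outer ring) can be sketched with a little trigonometry. This is only an illustration of the geometry, not the Gephi layout actually used for the graph; the function names, radii, and sample records are hypothetical.

```python
import math


def ring_positions(names, radius, start_angle=math.pi / 2):
    """Place names evenly around a circle, starting at the top and moving clockwise."""
    n = len(names)
    positions = {}
    for i, name in enumerate(names):
        angle = start_angle - 2 * math.pi * i / n
        positions[name] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


def concentric_layout(artist, albums, outer_labels):
    """Artist at the origin, albums on ring 1 by release year, styles/moods on ring 2."""
    ordered = [a["album"] for a in sorted(albums, key=lambda a: a["year"])]
    layout = {artist: (0.0, 0.0)}
    layout.update(ring_positions(ordered, radius=1.0))
    layout.update(ring_positions(sorted(outer_labels), radius=2.0))
    return layout


# Placeholder records, deliberately out of order to show the sort by year.
albums = [
    {"album": "Album A", "year": 1975},
    {"album": "Album B", "year": 1969},
    {"album": "Album C", "year": 1989},
]
layout = concentric_layout("Neil Young", albums, {"Folk-Rock", "Hard Rock", "Reflective"})
for name, (x, y) in layout.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Because album order is fixed by release date, highlighting one mood traces a path around the ring that reads as a timeline.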

So the message is this – visualizations don’t need to sacrifice aesthetics and intuition in order to be effective; rather, they should take advantage of these attributes to increase their appeal and impact. Don’t be afraid to experiment until you find the right formula, as it seldom presents itself the first time around, and trust your instincts.


Network Data Adventures with Gephi

Posting to the blog has become a luxury recently, what with a summer full of youth baseball, some organizational changes at work, summer home projects, and of course the upcoming Gephi book. I’ve learned that one has to be especially good at finding synergies between projects in order to get everything done. So it is with creating any new work in Gephi while writing the book. Any new projects will necessarily be created while in the process of developing material for the book.

Earlier this year, I created a host of network visuals, one per franchise, showing the relationships between all players who suited up for a given major league baseball team. This data made for some interesting visuals that were fun to explore. What the graphs didn’t do was to provide visual cues about how the players could have been grouped – by decade, position, birthplace, and so on. So the logical evolution was to take this idea and extend it as an example for how to use partitioning and clustering to visually segment a network graph.
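As a rough illustration of how such a franchise network could be assembled, the sketch below links two players whenever they appear on the same roster in the same season. That definition of a "relationship" is an assumption on my part for the example, and the player names and seasons are placeholders rather than Lahman database rows.

```python
from collections import defaultdict
from itertools import combinations

# Placeholder (player, season) appearances for a single franchise.
appearances = [
    ("Player A", 1901), ("Player B", 1901), ("Player C", 1901),
    ("Player B", 1902), ("Player C", 1902), ("Player D", 1902),
]


def teammate_edges(appearances):
    """Return {(player_1, player_2): shared_seasons} for every pair of teammates."""
    roster_by_season = defaultdict(set)
    for player, season in appearances:
        roster_by_season[season].add(player)

    weights = defaultdict(int)
    for roster in roster_by_season.values():
        # Sorting keeps each pair in a consistent (a, b) order across seasons.
        for a, b in combinations(sorted(roster), 2):
            weights[(a, b)] += 1
    return dict(weights)


edges = teammate_edges(appearances)
for (a, b), shared in sorted(edges.items()):
    print(a, b, shared)
```

The shared-season count can double as an edge weight, so long-time teammates are tied together more strongly than one-year overlaps.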

Recently I began playing with this idea by looking at a few of these examples, and have included some in one of the book chapters. I’ll use some slightly different cases here to avoid redundancy, but the principles are identical. I’ll walk through an example for how we can extract intelligence from a network graph in a few easy steps, using the Boston Red Sox from 1901 through 2013.

  1. Start with the base graph, having used a layout algorithm to arrange it in some fashion. I used the ARF approach for this example.
  2. Size the nodes in the graph using some criterion, such as the number of games played as a catcher. This will help users to quickly spot the dominant players at that position.
  3. Color the nodes using a categorical variable like decades. In this case, the color will reflect the first decade a player suited up for the Red Sox.
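Steps 2 and 3 boil down to mapping one numeric attribute to node size and one categorical attribute to node color. Gephi handles this through its appearance settings, so the sketch below is only meant to show the logic; the player records, size range, and color palette are all hypothetical.

```python
def first_decade(seasons):
    """Return the decade (e.g. 1990) of a player's first season with the team."""
    return min(seasons) // 10 * 10


def scale_size(value, max_value, min_size=5.0, max_size=50.0):
    """Linearly scale a value into the range [min_size, max_size]."""
    if max_value == 0:
        return min_size
    return min_size + (max_size - min_size) * value / max_value


# Placeholder players; games_at_catcher values are invented.
players = {
    "Player A": {"seasons": [1998, 1999, 2004], "games_at_catcher": 1200},
    "Player B": {"seasons": [1972, 1975], "games_at_catcher": 600},
    "Player C": {"seasons": [1995], "games_at_catcher": 0},
}
palette = {1970: "#1f77b4", 1990: "#2ca02c"}  # hypothetical decade colors


def style_nodes(players, palette, default_color="#999999"):
    """Attach a size (games caught) and a color (first decade) to every player node."""
    max_games = max(p["games_at_catcher"] for p in players.values())
    styled = {}
    for name, p in players.items():
        decade = first_decade(p["seasons"])
        styled[name] = {
            "size": scale_size(p["games_at_catcher"], max_games),
            "decade": decade,
            "color": palette.get(decade, default_color),
        }
    return styled


styled = style_nodes(players, palette)
for name, attrs in styled.items():
    print(name, attrs)
```

Players who never caught a game still get the minimum size, so they stay visible as context without competing with the dominant catchers.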

In sequence, here are the three graphs:

redsox_1

Kind of dull – nothing but a lot of identical nodes and their connections. Let’s apply sizing based on the number of games played as a catcher:

redsox_2

Now that pops a few things! We have some easy starting points to work from. How about coloring the nodes by decade to see if that adds to the story:

redsox_3

Hmmm. Maybe this gives us some additional insight as well. Certain decades are split amongst multiple catchers, while in other cases we have a single dominant player. Of course we would want to allow the user to identify each of these cases (for example, the large green node at the top left is Jason Varitek) through some labeling or interactivity.

So you get the idea for how a couple simple tweaks can change the way we view a graph. I’ll be using a similar approach in the book to help readers create powerful stories with their own data.


3 More MLB Network Graphs

Getting rolling now, using a templated approach to create a handful of franchise graphs, with many more to come. The first five cover the Tigers, Cubs, Red Sox, Dodgers, and Giants, showing all the connections between players from 1901-2013 within each franchise’s history. All credit is due to Gephi, the ARF layout, and the Chinese Whispers clustering algorithm. Data is courtesy of Sean Lahman’s baseball database. I’m merely the conductor who gets to bring these great tools together.

Here’s the roster if you want to go to a single graph, or you can go to the network graphs gallery on my website:

Check them out and let me know what you think.


Tigers Network Graph Final Update

The Tigers network graph is now complete, with all player nodes sized according to the number of years spent with the team. So now it becomes very easy to see the handful of players with the team for 10, 15, or even 20 seasons.

tigers-graph

What I also find interesting is the clustering behavior, as depicted by the node colors. Some of the clusters are huge, including more than two hundred players based on their similar patterns within the network. I allowed the ARF algorithm to run longer this time (8 minutes rather than 5), which spread the network out while creating more clusters. Some, as I noted, are huge, while others have fewer than 10 members. These smaller groupings contain players who spent many years with the club.

Here are a few of the small clusters and their members, each of which can be selected using the Group Selector dropdown list.

  • Group 16: Charlie Gehringer and Lou Whitaker
  • Group 15: Dick McAuliffe, Dizzy Trout, Donie Bush, Jack Morris
  • Group 18: Al Kaline and Ty Cobb
  • Group 19: Alan Trammell

It’s interesting to note the clustering across time, where players with similar longevity and influence are grouped together. This makes for a more interesting result (in my mind) versus simple time-based clustering by peer group.

Now it’s on to more teams, using an identical approach, in order to facilitate comparisons.


Another Tigers Network Graph

I’m having a lot of fun creating network graphs using Gephi, and excited about the possibilities for displaying a wide range of baseball information. My initial pass at showing team connections uses the Tigers, with this version incorporating all players from 1901 through 2013. Try it out here.

Update – new version with Tigers colors: Updated version

This is done using a radial axis layout with Chinese Whispers Clustering, based on a paper by Chris Biemann. The colors along each of the axes are based on the clusters created by this algorithm. Once again, I’ve used Sigma.js to create the interactive version, so you can dive into the graphic and gain an understanding of how the data is displayed. Here’s a static view of the graph:

Tigers-1901-2013-graph

Kind of colorful, isn’t it? For those of you who are long-time Tiger fans, you’ll soon detect that each cluster (color) represents a specific era in Tiger history, and by clicking on individual nodes, you’ll be able to see which players connect across multiple clusters. Typically, this will be players like Alan Trammell, Ty Cobb, and others who played many years with the club, and thus transcend their own cluster position.
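To give a feel for how the clusters form, here is a simplified sketch of the Chinese Whispers idea: every node starts in its own class, then repeatedly adopts whichever class carries the most edge weight among its neighbors. This is not the Gephi plugin's code. The fixed iteration count, the seeded shuffle, and breaking ties toward the higher class id are simplifications I've assumed so the example is repeatable, and the toy graph is made up.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (node_a, node_b, weight) tuples."""
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = w
        neighbors[b][a] = w

    nodes = sorted(neighbors)
    labels = {node: i for i, node in enumerate(nodes)}  # one class per node to start
    rng = random.Random(seed)

    for _ in range(iterations):
        order = nodes[:]
        rng.shuffle(order)  # visit nodes in a random order on each pass
        for node in order:
            scores = defaultdict(float)
            for nb, w in neighbors[node].items():
                scores[labels[nb]] += w
            # Adopt the strongest neighboring class; ties go to the higher class id.
            labels[node] = max(scores, key=lambda lab: (scores[lab], lab))
    return labels


# Toy graph: two separate groups of three mutually connected players.
toy_edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("x", "y", 1), ("y", "z", 1), ("x", "z", 1),
]
labels = chinese_whispers(toy_edges)
print(labels)
```

On a real franchise graph, players with long careers have neighbors in several eras, which is why they are the ones that bridge clusters.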

Not sure if this will be the style for all the franchise graphs I plan to do this year, but it feels close, given the ability to display more than 1500 players and nearly 48,000 connections without having things too cluttered.


Mapping Redux

Last week I blogged about an interactive map I was working on that would enable users to view all major league players by birthplace and possibly birthdate. This could be done using Leaflet, a very cool JavaScript mapping library built on the CloudMade platform. Well, a few things have changed since then!

After a lot of geocoding, testing, and sense checks, my direction has changed multiple times over the last several days. We now have a winner – good old Google Earth, using the GE web plugin, appears to offer the best combination of capability and wow! factor for this project.

One of the drivers behind this decision is the global nature of the data, with players born in such diverse places as Venezuela, Australia, the Netherlands, and of course, all parts of the United States. Looking at this data via a 2-dimensional map just didn’t work for me, so enter Google Earth and 3-D. There are also ways to categorize and filter the data using GE that will approximate what I could have done working with Exhibit or Leaflet.
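Google Earth reads placemarks from KML, so geocoded birthplaces need to end up in that format one way or another. Here is a minimal sketch of writing placemarks with Python's standard library. The player names are placeholders, the coordinates are approximate city locations, and the function name is hypothetical; it is not a description of how the actual project files were built.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def players_to_kml(players):
    """Build a KML document with one Placemark per player birthplace."""
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for p in players:
        pm = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(pm, f"{{{KML_NS}}}name").text = p["name"]
        ET.SubElement(pm, f"{{{KML_NS}}}description").text = p["birthplace"]
        point = ET.SubElement(pm, f"{{{KML_NS}}}Point")
        # KML coordinates are ordered longitude,latitude,altitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{p['lon']},{p['lat']},0"
    return ET.tostring(kml, encoding="unicode")


# Placeholder players with approximate city coordinates.
players = [
    {"name": "Player A", "birthplace": "Caracas, Venezuela", "lat": 10.48, "lon": -66.90},
    {"name": "Player B", "birthplace": "Sydney, Australia", "lat": -33.87, "lon": 151.21},
]
kml_text = players_to_kml(players)
print(kml_text)
```

The longitude-first ordering is the detail most likely to trip you up, since geocoders usually return latitude first.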

If all goes according to plan (naive thought, I realize) this project should be online and functional in the next 7 days. I promise, it will be both fun to use and slick to look at, giving you the ability to sift and zoom through more than 17k player records.

