d3 Drillable Bar Charts

A couple of weeks back I began tinkering with interactive bar charts using d3, the JavaScript charting toolkit. d3 stands for Data-Driven Documents, and is the latest from the great Mike Bostock, a developer of Protovis, another great charting tool.

Anyhow, one of the samples on the d3 site used a drillable bar chart, where users can start at an aggregate level and then dig into the details by clicking on a given bar – a simple concept that is beautiful when d3 is the platform.

This has led me to my first chart, with more to come. The initial example looks at home runs by team for the 1995-2010 seasons, and begins at the season level. Once a season is selected, we drill into homers at the team level, and then at the individual batter level. It's fast, intuitive, and a heck of a lot more esthetically pleasing than your typical report output from any one of dozens of high-powered reporting tools.
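
The drill path described above (season, then team, then batter) comes down to filtering on the selections made so far and summing at the next level. Here is a minimal sketch in plain JavaScript, independent of d3 and not the code behind the chart itself. The record layout, the `levels` array, and the `barsFor` function are hypothetical names of my own, and the sample numbers are made up.

```javascript
// Hypothetical flat records: one row per batter per season (values are made up).
const records = [
  { season: 2009, team: "NYY", batter: "Batter A", hr: 30 },
  { season: 2009, team: "NYY", batter: "Batter B", hr: 25 },
  { season: 2009, team: "BOS", batter: "Batter C", hr: 20 },
  { season: 2010, team: "NYY", batter: "Batter A", hr: 28 },
];

// The drill path: each click moves one level deeper.
const levels = ["season", "team", "batter"];

// Return the bars to draw after the user has clicked through `selections`,
// e.g. [] for the top level or [2009] after picking a season.
function barsFor(rows, selections) {
  const depth = selections.length;
  if (depth >= levels.length) return []; // nothing below the batter level

  // Keep only the rows that match every selection made so far.
  const filtered = rows.filter(row =>
    selections.every((value, i) => row[levels[i]] === value)
  );

  // Sum home runs by the key at the current depth.
  const totals = new Map();
  for (const row of filtered) {
    const key = row[levels[depth]];
    totals.set(key, (totals.get(key) || 0) + row.hr);
  }
  return Array.from(totals, ([label, value]) => ({ label, value }));
}
```

Calling `barsFor(records, [])` gives one bar per season, `barsFor(records, [2009])` gives one bar per team for that season, and `barsFor(records, [2009, "NYY"])` gives one bar per batter. Each click in the chart simply appends to the selections and redraws.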

d3 Bar Chart

To see it in action, click here

 


Using Scatterplots to Analyze Team Success

I've previously posted on Orange, the wonderful open source project for statistical analysis and visualization. My latest foray involves using a variety of charts to examine success patterns by team, as measured by wins in a season. We'll talk first about scatterplots.

Scatterplots, for the uninitiated, are basically two-dimensional charts that allow users to see the relationship between two elements – say, hits on one axis and runs on the other. They can be a great tool for quickly spotting correlations between elements (e.g., are more hits consistently associated with more runs over the course of a season?). In Orange, we have the luxury of adding third and fourth elements to the picture, using the color of the markers as well as the size of each marker. Here's an example, using BB and HR on the axes, with wins for the size and runs for the color of the markers.

 
Note that the sizes of the circles are not proportional to the number of wins, but rather scale according to the range of the data. This is helpful in spotting patterns, but could be considered a distortion if one adheres closely to Edward Tufte's advice that the size of a graphic element be directly proportional to the quantity it represents.
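
To make the difference concrete, here is a small sketch of the two sizing approaches. I don't know the exact formula Orange uses, so `rangeScaledRadius` is only my assumption of what "scaling to the range of the data" looks like, and the pixel sizes are arbitrary. `proportionalRadius` is the alternative, where circle area tracks the value itself.

```javascript
// Range scaling (assumed): the smallest value gets rMin, the largest gets rMax,
// and everything else is interpolated linearly in between.
function rangeScaledRadius(value, min, max, rMin, rMax) {
  if (max === min) return (rMin + rMax) / 2; // avoid dividing by zero
  const t = (value - min) / (max - min);
  return rMin + t * (rMax - rMin);
}

// Proportional scaling: circle AREA is proportional to the value,
// so the radius grows with the square root of the value.
function proportionalRadius(value, max, rMax) {
  return rMax * Math.sqrt(value / max);
}

// Two win totals mentioned in this post; the radii (4 and 20 px) are arbitrary.
const lowWins = 57;
const highWins = 92;

const scaled = [lowWins, highWins].map(w =>
  rangeScaledRadius(w, lowWins, highWins, 4, 20)
);
const proportional = [lowWins, highWins].map(w =>
  proportionalRadius(w, highWins, 20)
);
```

With range scaling, the 57-win marker gets the minimum radius and the 92-win marker the maximum, so the visual gap is far larger than the actual gap in wins. With proportional scaling the two circles differ in area by the same ratio as the win totals, which is more faithful but makes the pattern harder to see.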

The scatterplot enables us to spot some easy correlations; notice that virtually all of the teams with high walk totals (x-axis) are also big winners – the Yankees, Rays, Braves, and Red Sox were all at 89 or more wins for the season. So apparently, with the exception of the Diamondbacks, walks were a good predictor of wins (we're not accounting for the pitching side of the equation here, which derailed the Diamondbacks).

 
On the Y-axis, home runs also appear to have a positive relationship to wins, although perhaps a slightly weaker one than walks. And woe be to any team with low positions on both axes – this is where the worst teams in baseball wound up in 2010, including the 57-win Pirates, the 61-win Mariners, and the 66-win Orioles. No on-base ability, coupled with no HR power, seems to be a strong predictor of failure, even without knowing anything about pitching for these teams.
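
Eyeballing a scatterplot is a good first step, but the strength of a relationship like walks versus wins can also be put into a single number. Below is a minimal sketch of the Pearson correlation coefficient in plain JavaScript. The `pearson` function name is my own, and the sample arrays are made-up placeholders rather than actual 2010 team totals.

```javascript
// Pearson correlation between two equal-length arrays of numbers.
// Returns a value between -1 and 1, or NaN if either array has no variation.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length === 0) {
    throw new Error("pearson: arrays must be non-empty and the same length");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  const denominator = Math.sqrt(varianceX * varianceY);
  return denominator === 0 ? NaN : covariance / denominator;
}

// Made-up season totals for five hypothetical teams (not real data).
const sampleWalks = [450, 500, 520, 580, 620];
const sampleWins = [62, 75, 70, 88, 93];
const walksVsWins = pearson(sampleWalks, sampleWins);
```

A value near 1 means the two stats rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship. As with the chart, this says nothing about pitching or any other factor left out of the comparison.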
 
Let's change the axes for another view; we'll focus on defense in this case, putting Errors on the x-axis and double plays on the Y.
 
 
Notice any patterns? Our underachieving teams are heavily skewed to the lower right of the chart, committing lots of errors with few double plays. The pattern is perhaps less obvious than in our first example, but it is intuitive. We also see a pair of teams, the Tigers and Braves, who appear to have overcome some of their proneness to errors by turning a lot of double plays. At the other extreme, we see the Giants, with very few double plays but also the fewest errors in MLB, reaching 92 wins in spite of their low run scoring ability. Obviously, pitching helped, but it appears their defense provided relatively error-free support to their mound corps.
 
Finally, we'll take one more view – home runs (HR) versus home runs allowed (HRA). 
 
 
Once again, there appears to be a fairly clear distinction, with the lower half of the display littered with the weaker teams that tended to surrender more homers than they hit themselves. Teams with a large positive home run differential (upper left of the chart) tended to have high win totals, although there were a few teams (Phillies, Rangers, Rays) with low differentials who still fared well, due to their strong pitching corps.
 
These are just a few examples created with Orange; I'll get these and many others out to the Collections page on the site as time permits. In summary, scatterplots, regardless of the tool used, are a great way to quickly view patterns in the underlying data and their relationship to a dependent variable such as wins.

 


A More Thorough Look at d3

In this post, I want to take a deeper look at d3, the JavaScript-based charting library from Mike Bostock of Protovis fame. For those who like to view information that is intuitively, accurately, and esthetically well presented, d3 offers a marvelous range of possibilities.

I previously wrote about Calendar Views, a supremely intuitive means of viewing information that can be measured at a daily level; today I'll look at a host of other approaches that I find most compelling for their potential to work with the baseball data that is my focus. Each chart type will feature a screenshot from the d3 site, as I haven't applied any of these approaches to the baseball data just yet.

OK, time to look at a few more chart types, with my brief synopsis of each.

Streamgraphs are an esthetically pleasing way to show information that might typically be shared via an area chart. However, as with all the d3 charts, there is far more flexibility available to the chart creator, albeit with some added complexity. The results, IMHO, are worth the trouble, compared to the traditional lifeless output we see from Excel or, worse yet, PowerPoint charts. I have created thousands of Excel charts, and employed most of the available tricks and hacks, but there are still significant limitations to what one can do; as for PowerPoint, forget about it as a useful charting tool.

Here's an example from the d3 site:

Not quite what we're used to seeing from Excel, is it? While the eye candy aspect is pleasing, I refuse to use charts only on that basis. Lord knows there are plenty of charts out there that have instant eye catching appeal, but they are the equivalent of Britney Spears versus the John Coltranes or Mozarts one can create with d3. All fluff, no substance, and destined to be banished to the scrap heap with all the other inaccurate, data distorting charts that preceded them.

 

As for potential uses with baseball data, I envision some historical tracking of hits data (singles, doubles, triples, home runs) for starters, and will doubtlessly find other relevant examples to share in the future.
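
For the curious, what separates a streamgraph from a stacked area chart is mostly the baseline: instead of stacking everything up from zero, the layers float around a center line. Here is a rough sketch of the simplest version of that idea, where the stack is centered on zero at every time step. This is my own simplification in plain JavaScript, the `stackCentered` name and the sample counts are hypothetical, and the d3 example may well use a more refined baseline.

```javascript
// layers[i][t] = value of series i at time step t.
// Returns, for each series and time step, the lower (y0) and upper (y1) edge
// of its band, with the whole stack centered on zero.
function stackCentered(layers) {
  const steps = layers[0].length;
  const result = layers.map(() => []);
  for (let t = 0; t < steps; t++) {
    // Total height of the stack at this time step.
    const total = layers.reduce((sum, layer) => sum + layer[t], 0);
    // Start half the total below zero so the stack straddles the center line.
    let y0 = -total / 2;
    layers.forEach((layer, i) => {
      result[i].push({ y0, y1: y0 + layer[t] });
      y0 += layer[t];
    });
  }
  return result;
}

// Made-up counts for two hit types across two seasons (not real data).
const hitLayers = [
  [2, 4], // e.g. triples
  [6, 4], // e.g. home runs
];
const bands = stackCentered(hitLayers);
```

Each band's thickness still equals the underlying value, so the data is not distorted; only the position of the band moves. For the hits idea above, each layer would be one hit type and each time step one season.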

Another chart type I intend to use is the Scatterplot Matrix. While this chart type is certainly not exclusive to d3, the ability to thoroughly customize the output makes its use extra appealing. This chart type fits into the category known as "small multiples" coined by the legendary Edward Tufte, and provides the user with a considerable wealth of information in a limited space, while simultaneously making the information easy to grasp.

I haven't figured out how to use these with baseball stats just yet – certainly there could be a multitude of ways to do so. Perhaps looking at multiple offensive categories from a batter's career would be one application; the individual data points could represent specific seasons or age levels associated with each category.

In any event, the scatterplot matrix, as well as other types of small multiples, are exceptional at providing large volumes of information in a limited space, making it easy for the observer to detect patterns and relationships – and isn't that the goal of most data display?
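
Structurally, a scatterplot matrix is just every pairing of the chosen stats laid out on a grid, one small scatterplot per cell. Here is a minimal sketch of building that grid in plain JavaScript; the `matrixCells` function and the choice of stats are my own illustration, not part of d3.

```javascript
// Build one cell per (row, column) pairing of the given stat names.
// Each cell says which stat goes on its x-axis and which on its y-axis.
function matrixCells(fields) {
  const cells = [];
  fields.forEach((yField, row) => {
    fields.forEach((xField, col) => {
      cells.push({ row, col, xField, yField });
    });
  });
  return cells;
}

// Three offensive categories give a 3 x 3 grid of nine small charts.
const careerCells = matrixCells(["HR", "BB", "SO"]);
```

For the batter career idea, each cell would plot one point per season, so a three-stat matrix already shows nine views of the same career in the space of a single chart.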

The Sunburst is part of a category of relationship-oriented displays available in d3, with the common goal of displaying relationships within a data set. In the case of the Sunburst, the data is presented in a more hierarchical visual form (versus a treemap, for example), while providing immediate insights into the interrelationship between entities within the data.

Once again, the d3 esthetics are most impressive, but the key is that the underlying data is not distorted or misrepresented by the beautiful display. We are quickly able to see which sub-elements flow up to a primary element, as well as which of the larger categories dominate the data, and thus the display.

I am certain that this display could apply to many different baseball stats; my job is to find where it makes the most sense from a viewer's perspective.
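
The reason the Sunburst does not distort the data is that each element's slice of the circle is proportional to its share of its parent's total. Here is a rough sketch of that calculation in plain JavaScript. It is my own illustration rather than d3's layout code; the `sumValue` and `layoutArcs` names, the tree shape, and the home run numbers are all hypothetical.

```javascript
// Total value of a node: its own value for a leaf, or the sum of its children.
function sumValue(node) {
  if (!node.children) return node.value;
  return node.children.reduce((sum, child) => sum + sumValue(child), 0);
}

// Assign each node an angular span (in radians) inside its parent's span,
// sized by the node's share of the parent's total. Depth gives the ring.
function layoutArcs(node, startAngle, endAngle, depth, out) {
  out.push({ name: node.name, depth, startAngle, endAngle });
  if (!node.children) return out;
  const total = sumValue(node);
  let angle = startAngle;
  for (const child of node.children) {
    const share = total === 0 ? 0 : sumValue(child) / total;
    const span = (endAngle - startAngle) * share;
    layoutArcs(child, angle, angle + span, depth + 1, out);
    angle += span;
  }
  return out;
}

// Made-up home run totals rolled up by league and team (not real data).
const tree = {
  name: "MLB",
  children: [
    { name: "AL", children: [{ name: "NYY", value: 30 }, { name: "BOS", value: 10 }] },
    { name: "NL", children: [{ name: "ATL", value: 40 }] },
  ],
};
const arcs = layoutArcs(tree, 0, 2 * Math.PI, 0, []);
```

In this toy example the two leagues each get half the circle because their totals are equal, and within the AL half the first team takes three quarters of the arc. Any baseball stat that rolls up cleanly (player to team to division to league) could be laid out the same way.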

Lastly, for this post, is the bullet chart, created by visualization guru Stephen Few, and since added to a number of graphing toolkits and libraries. I have used these for several years within Excel, and was pleased to see their availability in d3, for they provide a compact, intuitive look at target-based data.

By target-based, I mean that we can set a target level – for example, a .900 OPS – and see how specific players measure up to that goal. Bullet charts provide the ability to show the target, actual performance, and projected performance, all framed by relative performance levels (e.g., poor, average, good). Here's the d3 example:

Many baseball applications exist for the bullet chart – we could create a pitcher dashboard showing how a single pitcher compares to his career and league averages across a variety of statistical categories – WHIP, ERA, SO/9, etc.

The same would apply for batters, where we could rate them versus their career numbers, league averages, and so on, using batting average, OPS, OBP, and any number of other stats.

Once again, we have the potential for a "small multiples" type of display, where all the information the viewer needs is provided in a single page or screen view.
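
To show what sits behind a single bullet, here is a small sketch of the data one bar needs and how the actual value is read against the target and the performance bands. The object layout and the `bandFor` and `summarize` functions are my own hypothetical names rather than d3's format, and apart from the .900 OPS target mentioned above, the band cutoffs and player numbers are invented for illustration.

```javascript
// One bullet: qualitative bands, a target, and actual/projected values.
const opsBullet = {
  title: "OPS",
  bands: [
    { label: "poor", upTo: 0.7 },    // cutoffs are made up for illustration
    { label: "average", upTo: 0.8 },
    { label: "good", upTo: 1.0 },
  ],
  target: 0.9,      // the .900 OPS goal
  actual: 0.85,     // hypothetical player, season to date
  projected: 0.88,  // hypothetical full-season projection
};

// Find the first band whose upper bound covers the value;
// anything beyond the last cutoff falls into the last band.
function bandFor(bullet, value) {
  const band = bullet.bands.find(b => value <= b.upTo);
  return band ? band.label : bullet.bands[bullet.bands.length - 1].label;
}

// The facts a viewer reads off one bullet at a glance.
function summarize(bullet) {
  return {
    band: bandFor(bullet, bullet.actual),
    meetsTarget: bullet.actual >= bullet.target,
    gapToTarget: bullet.target - bullet.actual,
  };
}

const opsSummary = summarize(opsBullet);
```

A pitcher or batter dashboard would simply be a list of these objects, one per stat, drawn one under another. For stats where lower is better, such as ERA or WHIP, the band order and the target comparison would need to be flipped.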

Well, that's it for now, but I'm certain to blog on other chart types within d3, as there are many more. Give it a look for yourself, even if just to understand the possibilities – d3 examples

 


d3 – Another Great JavaScript Charting Tool

d3 is a relatively new JavaScript charting tool from Mike Bostock, one of the creators of the outstanding Protovis project, which is no longer under active development. But d3 has picked up the mantle, and made a number of significant improvements over Protovis, in my estimation.

Foremost among the improvements for me is the introduction of several new chart types, one of which I want to focus on here.

The Calendar View provides an ingenious way to show datasets that would typically be displayed in a line chart, due to the sheer volume of data points. The d3 calendar view helps us display daily data by year, month, and day, while simultaneously displaying results in a weekly grid. In addition, the color coding provides a quick and intuitive way to see patterns in the data. Here's an example, from the d3 site:

http://mbostock.github.com/d3/ex/calendar.html

The d3 example is built using stock market data, but the same approach can be used for any daily results – in a baseball context, this could be runs scored by game, winning (or losing) margin in each game, or even the number of pitches thrown per game by the starting pitcher.
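
The heart of the weekly grid is a simple mapping from a date to a cell: the day of the week picks the row and the week of the year picks the column. Here is a minimal sketch of that mapping in plain JavaScript. It is my own illustration, not the code from the d3 example; the `calendarCell` name is hypothetical, and I've assumed weeks start on Sunday and used UTC dates to sidestep time zone surprises.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Map a date to its position in a one-year calendar grid:
// row = day of week (0 = Sunday), col = week index counted from January 1.
function calendarCell(date) {
  const year = date.getUTCFullYear();
  const jan1 = new Date(Date.UTC(year, 0, 1));
  // Whole days elapsed since January 1 of the same year.
  const dayOfYear = Math.round((date - jan1) / MS_PER_DAY);
  const row = date.getUTCDay();
  // Shift by January 1's weekday so that a new column starts every Sunday.
  const col = Math.floor((dayOfYear + jan1.getUTCDay()) / 7);
  return { year, row, col };
}

// Example: a game played on June 15, 2010.
const gameCell = calendarCell(new Date(Date.UTC(2010, 5, 15)));
```

Each game's result (runs scored, margin of victory, pitch count) would then set the color of its cell, with off days left blank.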

I'll look at some of the other d3 chart types soon, and begin incorporating them into the Collections page on the site.

 
