As I wait for the review process for my book to complete so I can queue up the Kindle version, I thought it would be a good time to share some of the philosophy behind the book, while taking a further look at the rationale for some of the chart selections. I’ll also start with why Microsoft Excel was my primary tool for creating the charts (wait – aren’t you an open source champion?) and how it helped make the book a reality.
For those of you new to the subject, my book is titled MLB Pennant Races, 1901-1968: A Visual Analysis of Baseball’s Pennant Races and endeavors to put a new, highly visual spin on an old topic. I see on a daily basis how much of an impact data visualization is having, and noticed that baseball visualization has not kept pace. So it became clear to me that a book (or books) was needed that could help close this gap, and turn a wealth of data into meaningful graphics. I knew I could do this, but what would be the best tool to actually create a book? Could it really be Excel? Absolutely.
For those of you who don’t use Excel on a regular basis (my day job calls for multiple hours a day in Excel), it really is a powerful tool for all kinds of analysis, and yes, even charting. Now here’s the rub – Excel’s default charting selections aren’t so good (albeit much improved in Excel 2013), and in fact can be absolutely grotesque on occasion, particularly with respect to improper scaling for bar charts. However, with a few well practiced tweaks combined with lessons learned from Excel gurus, I can do darn near anything with Excel charts. As a frequent user of Tableau, not to mention open source beauties like Protovis and D3, I still find Excel to provide a great combination of data management coupled with charting capabilities.
Now if I were going to create a single chart, or even a small set of charts, Excel might not be my first choice. However, when the need is for 136 identical dashboard pages composed of multiple charts, where only the data is changing, then Excel is tough to beat. The trick is to use pivot tables with the proper ‘slicers’, enabling a single data source to be used for many individual seasons. So 68 seasons for each of two leagues can all feed from a single data source and then populate existing chart templates. This way, I need just a handful of charts that can be used many times over to create 136 unique instances.
Here’s a look at one of the pivot tables with accompanying slicers that allow me to select by season, league, and division (as needed) to automatically update the values in the pivot table.
Similarly, I set up the primary pennant race chart to update using the same sort of slicers for season, league, etc. If I were truly an Excel genius, I’m sure I could have had a single set of slicers that would have updated everything, but it was still quite easy using the pair. This is how the slicers look for the main chart:
One of the reasons this all works so well in Excel is due to the formulas I used. In some cases, these were very simple, perhaps just dividing the contents of one cell by another. In other cases, the logic becomes more complex, involving sorting results based on the order of finish, or by team nickname rather than the city (think Dodgers, not Brooklyn or Los Angeles). Excel provides a range of formulas that let advanced users do virtually anything with the data. If everything is done right, these formulas are set up one time, and then work hundreds of times behind the scenes to get the right data into each chart, all incumbent on the slicer selections.
By now some of you regular Excel users will have noticed that many of the charts I used aren’t standard issue Excel charts. Absolutely true, but this leads me into a discussion of how to use Excel even when the chart type doesn’t exist. Take, for example, the dotplots pictured below.
How the heck did we create those in Excel, without having a standard chart type that even comes close to that look? Simple – we created in-cell charts by using the Excel REPT formula, combined with a couple other values to tweak the scaling for each category. This basically involves repeating a value (in our case, a space) a selected number of times based on the data value. We then choose a shape, a font size, and a font color, and use our data value (and some sort of factor value) to display each dot further to the right (higher values) or to the left (lower values). This is a great trick to learn in Excel, as it gives you another tool when bar charts are less appropriate, which is quite often the case. Visually, dotplots are often superior because we don’t need to fix the scale at zero; this allows us to ‘zoom’ into a narrow range based on the actual data values. In this case, they work far better than bar charts (trust me, I tried those first) and are visually cleaner as well (less ink).
There are some other chart types that are not native to Excel (horizon charts, box plots, and advanced sparklines) but which can be added to Excel, courtesy of the Sparklines for Excel tool, created by Fabrice Rimlinger. This is an essential add-in if you wish to create some great looking graphics when you have limited space to work with (for instance, on a dashboard). I have been an advocate of this project for more than two years, and look forward to more future use. These charts are also typically in-cell, which makes it very easy to re-size the charts to suit your application. Here’s a view of some horizon charts:
That’s it for now; perhaps I’ll dive into a bit more of the formula detail in a future post.