A New Book Resolution for 2016

As we enter a new year, I find myself eager to create a new book that explores the world of baseball data using a wide array of data visualization approaches. This idea has been in my head for several years at least, and has found partial fulfillment in my previously published pennant races book. However, I wish to tackle something broader that will touch a number of baseball categories as well as multiple data visualization approaches.

The working title for the book is ‘Baseball Grafika’, grafika being the Czech and Polish word for graphics, a word which still conveys the intent of the book regardless of language. If all goes well, the book will be available early in the 2016 baseball season, and will cover the following topics:

  • Franchise player networks
  • Trade pattern networks
  • Hall of Fame connection network
  • Franchise location maps
  • Player birthplace maps
  • Pennant race charts
  • Standings charts
  • Career trajectory graphs
  • Baseball dashboards

Fortunately, much work has been done over the last several years on at least a few of these topics, so we’re not starting from scratch, but this will still be a considerable, yet rewarding, challenge. Updates to come.

Network Data Adventures with Gephi

Posting to the blog has become a luxury recently, what with a summer full of youth baseball, some organizational changes at work, summer home projects, and of course the upcoming Gephi book. I’ve learned that one has to be especially good at finding synergies between projects in order to get everything done. So it is with creating new work in Gephi while writing the book: any new project will necessarily double as material for the book.

Earlier this year, I created a host of network visuals, one per franchise, showing the relationships between all players who suited up for a given major league baseball team. This data made for some interesting visuals that were fun to explore. What the graphs didn’t do was provide visual cues about how the players could be grouped – by decade, position, birthplace, and so on. So the logical evolution was to take this idea and extend it as an example of how to use partitioning and clustering to visually segment a network graph.

Recently I began playing with this idea by looking at a few of these examples, and have included some in one of the book chapters. I’ll use some slightly different cases here to avoid redundancy, but the principles are identical. I’ll walk through an example of how we can extract intelligence from a network graph in a few easy steps, using the Boston Red Sox from 1901 through 2013.

  1. Start with the base graph, having used a layout algorithm to arrange it in some fashion. I used the ARF approach for this example.
  2. Size the nodes in the graph using some criterion, such as the number of games played as a catcher. This will help users to quickly spot the dominant players at that position.
  3. Color the nodes using a categorical variable like decades. In this case, the color will reflect the first decade a player suited up for the Red Sox.
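For readers who prepare node attributes outside Gephi, steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the player ids, seasons, and game counts are hypothetical placeholders, and the helper names `first_decade` and `scale_size` are my own, not anything Gephi provides.

```python
def first_decade(year):
    """Return the decade (e.g. 1990) in which a given season falls."""
    return (year // 10) * 10


def scale_size(value, max_value, min_size=5.0, max_size=50.0):
    """Linearly map a value in [0, max_value] to a node size."""
    if max_value <= 0:
        return min_size
    return min_size + (max_size - min_size) * value / max_value


# Hypothetical rows: (player id, first season with the team, games caught).
# These numbers are placeholders, not real statistics.
players = [
    ("catcher_a", 1997, 1400),
    ("catcher_b", 1971, 990),
    ("infielder_c", 1982, 0),
]

max_games = max(games for _, _, games in players)

# One record per node: "size" drives step 2, "decade" drives step 3.
nodes = [
    {
        "id": pid,
        "decade": first_decade(year),
        "size": scale_size(games, max_games),
    }
    for pid, year, games in players
]
```

Players with no games at the position fall back to the minimum size, so the dominant catchers stand out while everyone else stays visible in the graph.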

In sequence, here are the three graphs:

redsox_1

Kind of dull – nothing but a lot of identical nodes and their connections. Let’s apply sizing based on the number of games played as a catcher:

redsox_2

Now that pops a few things! We have some easy starting points to work from. How about coloring the nodes by decade to see if that adds to the story:

redsox_3

Hmmm. Maybe this gives us some additional insight as well. Certain decades are split amongst multiple catchers, while in other cases we have a single dominant player. Of course we would want to allow the user to identify each of these cases (for example, the large green node at the top left is Jason Varitek) through some labeling or interactivity.

So you get the idea of how a couple of simple tweaks can change the way we view a graph. I’ll be using a similar approach in the book to help readers create powerful stories with their own data.

Mastering Gephi Book Update

Lest anyone think of me as a full-time author, rest assured that the likes of J.K. Rowling are not trembling in fear. Even if I had the ability to conjure up creative plots, I type too damn slow to make it as a full-time literary lion. Fortunately, I don’t have to depend on my keyboarding skill (or lack thereof) as a full-time pursuit. Which brings me around to my topic – the current book I’m authoring on Gephi.

For those not exposed to networks and network analysis, Gephi is a French-based open source project that makes it possible for all sorts of users (including moi) to create interesting graphs from connected datasets. By connected I am referring to data where the individual nodes are connected in some way, shape, or form. This could be anything from movie actor databases to Facebook friend networks to baseball player connections. Anyone with a spreadsheet full of data and a bit of effort and persistence can use Gephi to create cool looking graphs that also tell a story of some sort.
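To make "connected data" a little more concrete, here is a minimal Python sketch that turns a roster table into an edge list. It assumes two players are connected when they appear on the same roster in the same season, which is just one possible definition, and the roster rows are hypothetical. The output uses `Source` and `Target` column headers, which Gephi's spreadsheet import recognizes for edge tables.

```python
import csv
import io
from itertools import combinations

# Hypothetical roster rows: (season, player). Placeholder names only.
roster = [
    (1901, "Player A"),
    (1901, "Player B"),
    (1901, "Player C"),
    (1902, "Player A"),
    (1902, "Player D"),
]


def teammate_edges(rows):
    """Return sorted, de-duplicated (player, player) pairs who shared a season."""
    by_season = {}
    for season, player in rows:
        by_season.setdefault(season, set()).add(player)
    edges = set()
    for players in by_season.values():
        # Sorting first keeps each pair in a consistent order.
        for a, b in combinations(sorted(players), 2):
            edges.add((a, b))
    return sorted(edges)


def to_edge_csv(edges):
    """Serialize edges as CSV text with Source and Target columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buf.getvalue()


edges = teammate_edges(roster)
csv_text = to_edge_csv(edges)
```

Saving `csv_text` to a file gives a spreadsheet that can be imported as an undirected edge table; node attributes such as position or birthplace would go in a separate nodes table.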

My job in writing the book is to help people make sense of all the features and capabilities within Gephi, some of which are a bit complex to master. In the process, I get to learn more about the theory behind network analysis, and with it terms such as contagion, diffusion, clustering, and homophily. It’s really fascinating if you’re into understanding how people and institutions interact, how contagion processes function, or how product adoption can be affected by the structure of a network. My higher math skills are not good enough to be at an academic level with this stuff, so I have to compensate with some logic and visual acuity.

Anyhow, here’s some of the stuff that Gephi can create:

7344OS_cover_01

7344OS_cover_02

7344OS_cover_03

I’m hoping that one of these images will serve as the book cover come publishing time, which should be sometime this fall. In the meantime, I have six more chapters to write (of 10 total), and will have the added joy of working through chapter edits where others catch the mistakes I’ve made.

2014: What’s Next?

As we prepare to enter a new year, I’ve been doing some thinking about what I can create for 2014. I’ve already committed myself to a companion book to my recently completed pennant race book, with the new volume to cover the 1969 through 2013 seasons. So that’s a given, and should actually be a bit easier than the first volume, now that the basic template has been created. I want to create something that goes even further with the visualization and baseball marriage, and have come up with an idea to do just that.

Pennant Race Book is Here!

After multiple iterations, periodic delays, and last-minute additions, my first pennant race book is finally available. MLB Pennant Races, 1901-1968: A Visual Analysis of Baseball’s Pennant Races is available through Amazon. I’m still working through the Kindle version formatting, and hope to have that available before the end of the year. I’m also working on getting a PDF version available through this site, and possibly a version for Nook and iPad as well.

Anatomy of a Book

As I wait for the review process for my book to complete so I can queue up the Kindle version, I thought it would be a good time to share some of the philosophy behind the book, while taking a further look at the rationale for some of the chart selections. I’ll also start with why Microsoft Excel was my primary tool for creating the charts (wait – aren’t you an open source champion?) and how it helped make the book a reality.

For those of you new to the subject, my book is titled MLB Pennant Races, 1901-1968: A Visual Analysis of Baseball’s Pennant Races and endeavors to put a new, highly visual spin on an old topic. I see on a daily basis how much of an impact data visualization is having, and noticed that baseball visualization has not kept pace. So it became clear to me that a book (or books) was needed that could help close this gap, and turn a wealth of data into meaningful graphics. I knew I could do this, but what would be the best tool to actually create a book? Could it really be Excel? Absolutely.

For those of you who don’t use Excel on a regular basis (my day job calls for multiple hours a day in Excel), it really is a powerful tool for all kinds of analysis, and yes, even charting. Now here’s the rub – Excel’s default charting selections aren’t so good (albeit much improved in Excel 2013), and in fact can be absolutely grotesque on occasion, particularly with respect to improper scaling for bar charts. However, with a few well-practiced tweaks combined with lessons learned from Excel gurus, I can do darn near anything with Excel charts. Even as a frequent user of Tableau, not to mention open source beauties like Protovis and D3, I still find that Excel provides a great combination of data management and charting capabilities.

Now if I were going to create a single chart, or even a small set of charts, Excel might not be my first choice. However, when the need is for 136 identical dashboard pages composed of multiple charts, where only the data is changing, then Excel is tough to beat. The trick is to use pivot tables with the proper ‘slicers’, enabling a single data source to be used for many individual seasons. So 68 seasons for each of two leagues can all feed from a single data source and then populate existing chart templates. This way, I need just a handful of charts that can be used many times over to create 136 unique instances.
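The single-source idea is not specific to Excel. As a rough analogy only (this is not how the workbook is built), the Python sketch below filters one table by season and league, much as a slicer would, and feeds each slice into the same page template. The standings rows are hypothetical placeholders, and `slice_rows` and `render_page` are names invented for this illustration.

```python
# Hypothetical standings rows; teams and records are placeholders, not real results.
ROWS = [
    {"season": 1901, "league": "AL", "team": "Team A", "wins": 80, "losses": 56},
    {"season": 1901, "league": "AL", "team": "Team B", "wins": 70, "losses": 66},
    {"season": 1901, "league": "NL", "team": "Team C", "wins": 90, "losses": 49},
    {"season": 1902, "league": "AL", "team": "Team A", "wins": 83, "losses": 53},
]


def slice_rows(rows, season, league):
    """Act like a pair of slicers: keep only the selected season and league."""
    return [r for r in rows if r["season"] == season and r["league"] == league]


def render_page(rows, season, league):
    """Fill one fixed 'template' (here, plain text) from the sliced rows."""
    lines = [f"{season} {league}"]
    for r in slice_rows(rows, season, league):
        lines.append(f"{r['team']}: {r['wins']}-{r['losses']}")
    return "\n".join(lines)


SEASONS = range(1901, 1969)  # 1901 through 1968 inclusive: 68 seasons
LEAGUES = ("AL", "NL")

# One page per (season, league) pair, all built from the same source table.
pages = {(s, lg): render_page(ROWS, s, lg) for s in SEASONS for lg in LEAGUES}
```

The template is written once and reused for every selection, which is where the 68 × 2 = 136 pages come from.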

Here’s a look at one of the pivot tables with accompanying slicers that allow me to select by season, league, and division (as needed) to automatically update the values in the pivot table.

Similarly, I set up the primary pennant race chart to update using the same sort of slicers for season, league, etc. If I were truly an Excel genius, I’m sure I could have had a single set of slicers that would have updated everything, but it was still quite easy using the pair. This is how the slicers look for the main chart:

One of the reasons this all works so well in Excel is the set of formulas I used. In some cases, these were very simple, perhaps just dividing the contents of one cell by another. In other cases, the logic becomes more complex, involving sorting results based on the order of finish, or by team nickname rather than the city (think Dodgers, not Brooklyn or Los Angeles). Excel provides a range of formulas that let advanced users do virtually anything with the data. If everything is done right, these formulas are set up one time, and then work hundreds of times behind the scenes to get the right data into each chart, all dependent on the slicer selections.
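The same three kinds of logic (a simple division, a sort by order of finish, and a nickname lookup) can be sketched in Python. This is an illustration rather than a translation of the actual workbook formulas: the records are hypothetical placeholders, winning percentage is assumed to be wins divided by decisions, and only the Dodgers mapping comes from the example above.

```python
# City-to-nickname lookup; only the Dodgers example comes from the text.
NICKNAMES = {"Brooklyn": "Dodgers", "Los Angeles": "Dodgers"}


def win_pct(wins, losses):
    """The 'one cell divided by another' case: wins over total decisions."""
    games = wins + losses
    return wins / games if games else 0.0


def order_of_finish(rows):
    """Rank teams by winning percentage and label them by nickname."""
    ranked = sorted(
        rows, key=lambda r: win_pct(r["wins"], r["losses"]), reverse=True
    )
    return [
        {
            "place": place,
            # Fall back to the city when no nickname is on file.
            "label": NICKNAMES.get(r["city"], r["city"]),
            "pct": round(win_pct(r["wins"], r["losses"]), 3),
        }
        for place, r in enumerate(ranked, start=1)
    ]


# Hypothetical records, not real standings.
standings = [
    {"city": "Brooklyn", "wins": 60, "losses": 40},
    {"city": "City B", "wins": 75, "losses": 25},
    {"city": "City C", "wins": 50, "losses": 50},
]

table = order_of_finish(standings)
```

Ties are not handled here; a fuller version would need a tie-breaking rule before assigning places.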

By now some of you regular Excel users will have noticed that many of the charts I used aren’t standard issue Excel charts. Absolutely true, but this leads me into a discussion of how to use Excel even when the chart type doesn’t exist. Take, for example, the dotplots pictured below.

How the heck did we create those in Excel, without having a standard chart type that even comes close to that look? Simple – we created in-cell charts by using the Excel REPT function, combined with a couple of other values to tweak the scaling for each category. This basically involves repeating a value (in our case, a space) a selected number of times based on the data value. We then choose a shape, a font size, and a font color, and use our data value (and some sort of factor value) to display each dot further to the right (higher values) or to the left (lower values). This is a great trick to learn in Excel, as it gives you another tool when bar charts are less appropriate, which is quite often the case. Visually, dotplots are often superior because we don’t need to fix the scale at zero; this allows us to ‘zoom’ into a narrow range based on the actual data values. In this case, they work far better than bar charts (trust me, I tried those first) and are visually cleaner as well (less ink).
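The REPT trick translates directly to any language with string repetition. Here is a minimal Python sketch of the idea; the `rept` and `dot_cell` helpers, the 40-character width, and the sample values are all assumptions for illustration, not the scaling factors used in the book.

```python
def rept(text, count):
    """Mimic Excel's REPT: repeat text a whole number of times."""
    return text * max(0, int(count))


def dot_cell(value, low, high, width=40, dot="●"):
    """Place a dot by padding with spaces in proportion to the value.

    The scale runs from low to high rather than from zero, which is what
    lets the plot 'zoom' into a narrow range of data values.
    """
    span = high - low
    if span <= 0:
        return dot
    clamped = min(max(value, low), high)
    offset = round((clamped - low) / span * width)
    return rept(" ", offset) + dot


# Hypothetical values on a 40-to-60 scale, one text 'cell' per category.
for label, value in [("Team A", 58), ("Team B", 50), ("Team C", 43)]:
    print(f"{label} |{dot_cell(value, 40, 60)}")
```

Displayed in a monospaced font, higher values push the dot further right, just as the padded spaces do inside an Excel cell.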

There are some other chart types that are not native to Excel (horizon charts, box plots, and advanced sparklines) but which can be added to Excel, courtesy of the Sparklines for Excel tool, created by Fabrice Rimlinger. This is an essential add-in if you wish to create some great looking graphics when you have limited space to work with (for instance, on a dashboard). I have been an advocate of this project for more than two years, and look forward to using it even more. These charts are also typically in-cell, which makes it very easy to re-size the charts to suit your application. Here’s a view of some horizon charts:

That’s it for now; perhaps I’ll dive into a bit more of the formula detail in a future post.

Hitting the Homestretch

The last 10 days have been exceptionally productive in getting my first visual pennant race book (1901-1968) together, with the final content pages being completed earlier this evening. While that was extremely gratifying in itself, now comes the part where publishers typically do the work – the front and back matter. Title pages, table of contents, acknowledgments, preface, and introduction are all essential pieces in creating a finished product. And guess what? Since I’m the publisher for my own book this time around, I get to figure this all out on my own. Fun.

Fortunately, it isn’t rocket science, as every book follows a general framework that I can learn from and mimic to the best of my ability. Of course, there are other little details when you elect to create two versions of a book, one for print and one for Kindle and other e-readers. Creating bookmarks for each and every one of 175 pages is just one of the steps I’ll need to take for the e-version, but the goal is to make the book as easy to use and polished as possible, so this step is essential.

I’ve previously shared earlier versions of the season content. Here’s a glimpse of how the summary sections will appear:

The goal is to have both versions available by the 20th of this month, with the companion book (1969-2013 pennant races) likely to appear in March 2014. Just in time for the holidays, so maybe I can actually spend a couple weeks without creating charts, copying formulas, and building PDF files. Or I could get an early start on downloading the 2013 data I need for volume two. Just don’t tell my family what I’m up to.

Gephi Book is Coming Soon!

It’s been more than a month since I posted, but lest any of you suspect me of getting lazy, I’ve been busy with two book projects, plus the usual summer assortment of activities. Blogging, tweeting, and Facebook posting have taken a backseat for a stretch as I tweak formulas and layouts for one book (baseball pennant races), and submit rewrites for chapters on the other book (Gephi and network visualization).

The past week is a good case in point as I submitted eight chapter rewrites in less than a week, as the publisher is pushing (nicely) to have the book available in September. For anyone interested in the topic (I am personally fascinated with networks and what they reveal about a variety of subjects), here’s a link to the book’s page at Packt.

It’s pretty exciting to be part of this whole publishing process, and to be implementing suggestions from a group of reviewers whom I’ve never met, but who are obviously passionate about both Gephi and the broader subject of network visualization. Their constructive criticism and honest feedback are making this book many times better than it would have been if I were working alone through the process. Once the book is complete, I’ll offer more detail and insight into the people behind the book.

We’re now entering the layout and design phase of the publishing cycle, which can be challenging for books such as this that combine text with a lot of images. Given the hundreds of books of this type that Packt has produced, I’m confident the final layout will look great, and we’ll have produced a book that helps introduce new users to the exciting world of Gephi and network graphs.

Meanwhile, I’m back on the pennant race book, and still holding to a 2013 publishing date (albeit later in the year than originally intended). If the 2013 season data is available in time, I may be able to include it in the book while still publishing before the December holiday season. Who knows, it might make a nice Christmas gift for that baseball fan on your list!

Pennant Race Book Sample

For someone who’s never written a book, trying to get two books out in the same year has proven to be a unique challenge. I have to say I’m grateful that one of them has the structure and support of an established publisher, which provides me with a good bit of guidance as well as a timetable to work from. On the other hand, self-publishing the other volume allows me to stretch out a bit, make more frequent revisions, and eventually get to the book I envisioned 12 months ago.

So, on at least the fourth format revision (who’s counting?), I believe the main section of the pennant race book is now set, and only requires the updating of each season’s data to feed the template. Sometimes time away from a project leads to better solutions, as it did in this case, with some new formulas, improved graphics, and a greater degree of automation. I like the results, and hope that others will as well.

Here’s a quick look, and I’ve also provided an attached file (.pdf) if you wish to download a few seasons and get a feel for what I’m trying to accomplish. In the near future, I’ll have a legend page that will explain all of the charts you see on each page. Trust me, they do make sense if you know what it is you’re seeing!

Much more to come over the next 2-3 months as I try to get the book launched this summer/fall.

Writing a Book is Hard Work – duh!

As I continue to create my baseball visualization book, my admiration for authors grows daily. There is a lot of discipline needed to keep writing without veering off course. A day or two here and there is fine for recharging or creating some new visuals, but one can’t stray for long. Especially when the target release date is two months away, and time must be cleared for editing, proofing, etc.

Basically, this is my excuse for not blogging more often in recent weeks, even though there’s plenty I would like to talk about. As I get closer to publishing, the plan is to share more info, and I’ll probably park a chapter or two on the site as .PDFs and solicit any feedback. Thanks!
