Recapping 2017

Observers of this blog will note that posts were scarce in 2017 – in fact this is the only one, and it’s being completed in 2018! This is the result of a variety of causes, including external projects, busy schedules, and focus that was shifted in other, unrelated directions. Still, 2017 was not without its moments.

For starters, I managed to create three data visualization courses for Packt:

Learning Data Visualization

Data Visualization Techniques

Advanced Data Visualization

Retrosheet data for the 2016 and 2017 seasons has also been downloaded, and is in the update process as we speak, which will enable some new visualization work (and perhaps a new book title) in 2018. Soon, annual season data from the Baseball-Databank and Sean Lahman will be available as well.

I’m also in the process of launching a new site at jazzgraphs.com, where I’ll use network visualizations to uncover the complex web of relationships between jazz musicians, labels, and recordings. Posters and a book are in the plans for 2018, so stay tuned.

Wishing all a happy and prosperous 2018, and I promise more content to come this year!


Baseball Grafika Book: Excel Dashboards

Baseball stats are ideally suited for display using a wide variety of charts, network graphs, and other visualization approaches. This is true whether we are using spreadsheet tools such as Microsoft Excel or OpenOffice Calc, data mining tools like Orange, RapidMiner or R, network analysis software such as Gephi and Cytoscape, or web-based visualization tools like D3 or Tableau Public. The sheer scope and variety of available baseball statistics can be brought to life using any one of these or countless other tools.

This is why I felt the need to create a book merging the rich statistical and historical data in the baseball archive with the advanced analytic and visual capabilities of the aforementioned tools. With a bit of good luck and perseverance, the book will be published in late April, coinciding with the early stages of the 2016 baseball season, under the title Baseball Grafika. This series of articles will share a few pieces from the book, which is still undergoing additions and revisions at this stage. I hope these will provide some insight into how I view the possibilities for visualization, and perhaps spark your interest in how other datasets could benefit from a similar approach.

One of the chapters of the book deals with the creation of dashboards in Excel that allow us to distill large datasets into a single page of summary information. Here’s an example of a single pennant race, and how its unique story can be told using an array of charts, tables, and graphics.

AL_1967_5

Now that we’ve seen an entire dashboard, we’ll look at the component pieces and how they were built. As a reminder, this is all created in Excel, which is often maligned as a visualization tool. Used well, Excel can produce highly effective visualizations, although deploying them to the web is not practical. In the book, I walk through how to create this dashboard using Excel, taking readers through all the steps needed to create formulas, charts, text summaries, and more.

Creating flexible, powerful data displays in Excel frequently involves the use of pivot tables and slicers (filters) that allow for data manipulation. Building charts on top of these tools permits maximum flexibility. Done effectively, this means we can create a template that can be used over and over, with only the source data changing according to our slicer selections. Here’s an example pivot table with slicer options:

team_pivot

The slicer selections allow us to choose the data elements from our base dataset that are to be displayed in a pivot table. From there, named ranges and formulas can be used to select the data programmatically and feed it into charts that do not depend on any additional manual intervention. One chart, used over and over, makes it simple to display new data with a single click of a slicer button.

Named ranges can be used extensively to automate the dashboard to a high degree, using native Excel functionality. Here’s a screenshot showing a name being defined in Excel:

Excel_name_range

A virtually unlimited number of named ranges can be created and then used as references in Excel cell formulas, making it easy to populate cells, tables, or charts with updated information.

Each of the following sections of the final dashboard is populated using one or more named ranges, based in most cases on pivot table data. All that is required in the dashboard is a simple formula to grab the right data based on the slicer selection.

First, we create a basic text summary recapping each season, which is then pulled into the top section of the dashboard:

AL_1967_1

This is followed by the pennant race section of the dashboard, which includes both the pennant race charts and a table of season-ending standings information. One pivot table and its references populate the chart, while a second pivot table provides the table data, with cell-level formulas performing calculations.

AL_1967_2

Our third section makes use of the wonderful Sparklines for Excel add-in. Our dashboard benefits from the use of horizon and variance charts, as well as box plots. In between, we’re able to add some additional Excel cell calculations to display metric values.

AL_1967_3

The final section of the dashboard takes advantage of some cell formulas to create dotplots displaying relative values within a category. This allows readers to see who was higher or lower in a specific measure, maximizing space along the way, which is often critical when building dashboards.

AL_1967_4

The book will provide much more, including tutorials on creating this type of dashboard, in addition to other visual displays of baseball information. Ultimately, the goal is to share some of my approaches and hope that they drive others to create their own unique approaches, all in the interest of advancing the discipline of baseball data visualization.

Future posts will examine other ways we can explore our baseball data. Text mining, statistical distributions, interactive charts, historical maps, and network graphs will be among our future topics. See you soon.


A New Book Resolution for 2016

As we enter a new year, I find myself eager to create a new book that explores the world of baseball data using a wide array of data visualization approaches. This idea has been in my head for several years at least, and has found partial fulfillment in my previously published pennant races book. However, I wish to tackle something broader that will touch a number of baseball categories as well as multiple data visualization approaches.

The working title for the book is ‘Baseball Grafika’, grafika being the Czech and Polish word for graphics, a word which still conveys the intent of the book regardless of language. If all goes well, the book will be available early in the 2016 baseball season, and will cover the following topics:

  • Franchise player networks
  • Trade pattern networks
  • Hall of Fame connection network
  • Franchise location maps
  • Player birthplace maps
  • Pennant race charts
  • Standings charts
  • Career trajectory graphs
  • Baseball dashboards

Fortunately, much work has been done over the last several years on at least a few of these topics, so we’re not starting from scratch, but this will still be a considerable, yet rewarding, challenge. Updates to come.


Mastering Gephi Book Update

Lest anyone think of me as a full-time author, rest assured that the likes of J.K. Rowling are not trembling in fear. Even if I had the ability to conjure up creative plots, I type too damn slow to make it as a full-time literary lion. Fortunately, I don’t have to depend on my keyboarding skill (or lack thereof) as a full-time pursuit. Which brings me around to my topic – the current book I’m authoring on Gephi.

For those not exposed to networks and network analysis, Gephi is a French-based open source project that makes it possible for all sorts of users (including moi) to create interesting graphs from connected datasets. By connected I am referring to data where the individual nodes are connected in some way, shape, or form. This could be anything from movie actor databases to Facebook friend networks to baseball player connections. Anyone with a spreadsheet full of data and a bit of effort and persistence can use Gephi to create cool looking graphs that also tell a story of some sort.

My job in writing the book is to help people make sense of all the features and capabilities within Gephi, some of which are a bit complex to master. In the process, I get to learn more about the theory behind network analysis, and with it terms such as contagion, diffusion, clustering, and homophily. It’s really fascinating if you’re into understanding how people and institutions interact, how contagion processes function, or how product adoption can be affected by the structure of a network. My higher math skills are not good enough to be at an academic level with this stuff, so I have to compensate with some logic and visual acuity.

Anyhow, here’s some of the stuff that Gephi can create:

7344OS_cover_01

7344OS_cover_02

7344OS_cover_03

I’m hoping that one of these images will serve as the book cover come publishing time, which should be sometime this fall. In the meantime, I have six more chapters to write (of 10 total), and will have the added joy of working through chapter edits where others catch the mistakes I’ve made.


Mastering Gephi Book Scheduled for Fall 2014

It’s official – I’ll be writing a follow-up Gephi book for Packt Publishing, with a release date of later this year – probably around October. This one will be a more substantial effort, with an emphasis on performing higher level network analysis using Gephi. I’m already busy doing research on some of the theoretical aspects of network graphs so I will be prepared to apply different ideas using the Gephi tools.

If you’re not familiar with what Gephi can do, here’s a recent example I created featuring all 48 Miles Davis studio recordings, with the musicians who played on each album. Click to go to the interactive version.

miles-davis-network

Based on the discussions I’ve been having with some generous members of the Gephi groups at Facebook and LinkedIn, there is a definite need to explore some of the deeper topics not typically addressed in either the documentation or the forums. My first Gephi book was intended as a more introductory level volume, and did not cover any of the more advanced topics in detail. The new book will spend lots of time with concepts such as filtering, graph statistics, dynamic networks, clustering, contagion, and graph aesthetics among others, in addition to further exploration of selecting the right layout. Helping users navigate through some of the less intuitive parts of Gephi will also receive ample attention.

In addition to the writing, this also means many hours spent working in Gephi so I can understand all the nuances that need to make it into the book. Fortunately, I love the software and what it has helped me create in my own section of the universe (mostly baseball networks), but there is so much more to explore in this rapidly evolving discipline. I anticipate some synergy as new graphs are created that may well wind up in the book, or as supporting resources.

This book is intended for intermediate to advanced users of Gephi, and for brave newcomers. For a more basic introduction to Gephi, my first book for Packt is available here.

More to come as I take this exciting journey deep into the world of network graphs and Gephi.


2014: What’s Next?

As we prepare to enter a new year, I’ve been doing some thinking about what I can create for 2014. I’ve already committed myself to a companion book to my recently completed pennant race book, with the new volume to cover the 1969 through 2013 seasons. So that’s a given, and should actually be a bit easier than the first volume, now that the basic template has been created. I want to create something that goes even further with the visualization and baseball marriage, and have come up with an idea to do just that.


Pennant Race Book is Here!

After multiple iterations, periodic delays, and last-minute additions, my first pennant race book is finally available. MLB Pennant Races, 1901-1968: A Visual Analysis of Baseball’s Pennant Races is available through Amazon. I’m still working through the Kindle version formatting, and hope to have that available before the end of the year. I’m also working on getting a PDF version available through this site, and possibly a version for Nook and iPad as well.


Anatomy of a Book

As I wait for the review process for my book to complete so I can queue up the Kindle version, I thought it would be a good time to share some of the philosophy behind the book, while taking a further look at the rationale for some of the chart selections. I’ll also start with why Microsoft Excel was my primary tool for creating the charts (wait – aren’t you an open source champion?) and how it helped make the book a reality.

For those of you new to the subject, my book is titled MLB Pennant Races, 1901-1968: A Visual Analysis of Baseball’s Pennant Races and endeavors to put a new, highly visual spin on an old topic. I see on a daily basis how much of an impact data visualization is having, and noticed that baseball visualization has not kept pace. So it became clear to me that a book (or books) was needed that could help close this gap, and turn a wealth of data into meaningful graphics. I knew I could do this, but what would be the best tool to actually create a book? Could it really be Excel? Absolutely.

For those of you who don’t use Excel on a regular basis (my day job calls for multiple hours a day in Excel), it really is a powerful tool for all kinds of analysis, and yes, even charting. Now here’s the rub – Excel’s default charting selections aren’t so good (albeit much improved in Excel 2013), and in fact can be absolutely grotesque on occasion, particularly with respect to improper scaling for bar charts. However, with a few well practiced tweaks combined with lessons learned from Excel gurus, I can do darn near anything with Excel charts. As a frequent user of Tableau, not to mention open source beauties like Protovis and D3, I still find Excel to provide a great combination of data management coupled with charting capabilities.

Now if I were going to create a single chart, or even a small set of charts, Excel might not be my first choice. However, when the need is for 136 identical dashboard pages composed of multiple charts, where only the data is changing, then Excel is tough to beat. The trick is to use pivot tables with the proper ‘slicers’, enabling a single data source to be used for many individual seasons. So 68 seasons for each of two leagues can all feed from a single data source and then populate existing chart templates. This way, I need just a handful of charts that can be used many times over to create 136 unique instances.
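For readers who think more easily in code than in pivot tables, here is a rough Python sketch of the same idea: one data source, one fixed template, and a slicer-style selection that decides what the template shows. It is not how the workbook itself is built. The row values, team names, and the function names slice_rows and render_page are all made up for illustration.

```python
# Hypothetical source rows: (season, league, team, wins, losses).
# The values are invented for illustration, not real standings.
ROWS = [
    (1901, "AL", "Team A", 80, 56),
    (1901, "AL", "Team B", 70, 66),
    (1901, "NL", "Team C", 90, 49),
    (1902, "AL", "Team A", 75, 61),
]


def slice_rows(rows, season, league):
    """Mimic a slicer selection: keep only rows for one season and league."""
    return [r for r in rows if r[0] == season and r[1] == league]


def render_page(rows, season, league):
    """Fill one fixed text 'template' with whatever the selection returns."""
    selected = slice_rows(rows, season, league)
    lines = [f"{season} {league}"]
    # Show teams with the most wins first.
    for _, _, team, wins, losses in sorted(selected, key=lambda r: -r[3]):
        lines.append(f"{team}: {wins}-{losses}")
    return "\n".join(lines)


# 68 seasons (1901 through 1968) for each of two leagues gives 136 pages.
selections = [(s, lg) for s in range(1901, 1969) for lg in ("AL", "NL")]

if __name__ == "__main__":
    print(len(selections))
    print(render_page(ROWS, 1901, "AL"))
```

The point of the sketch is that render_page never changes; only the (season, league) pair passed to it does, which is the role the slicers play in the workbook.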

Here’s a look at one of the pivot tables with accompanying slicers that allow me to select by season, league, and division (as needed) to automatically update the values in the pivot table.

Similarly, I set up the primary pennant race chart to update using the same sort of slicers for season, league, etc. If I were truly an Excel genius, I’m sure I could have had a single set of slicers that would have updated everything, but it was still quite easy using the pair. This is how the slicers look for the main chart:

One of the reasons this all works so well in Excel is the formulas I used. In some cases, these were very simple, perhaps just dividing the contents of one cell by another. In other cases, the logic becomes more complex, involving sorting results based on the order of finish, or by team nickname rather than the city (think Dodgers, not Brooklyn or Los Angeles). Excel provides a range of formulas that let advanced users do virtually anything with the data. If everything is done right, these formulas are set up one time, and then work hundreds of times behind the scenes to get the right data into each chart, all dependent on the slicer selections.
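To make those three kinds of logic a little more concrete, here is a small Python sketch of a simple division, a sort by order of finish, and a nickname lookup. It is only an analogy for the Excel formulas, not a transcription of them. The NICKNAMES table, the function names, and the sample records are assumptions for illustration.

```python
# Hypothetical lookup from (league, city) to team nickname.
NICKNAMES = {
    ("NL", "Brooklyn"): "Dodgers",
    ("NL", "Los Angeles"): "Dodgers",
}


def win_pct(wins, losses):
    """The 'one cell divided by another' case: wins over games played."""
    games = wins + losses
    return wins / games if games else 0.0


def order_of_finish(league, standings):
    """Sort (city, wins, losses) rows best-first and label them by nickname.

    Cities missing from NICKNAMES fall back to the city name itself.
    """
    ranked = sorted(
        standings, key=lambda row: win_pct(row[1], row[2]), reverse=True
    )
    return [
        (NICKNAMES.get((league, city), city), round(win_pct(w, l), 3))
        for city, w, l in ranked
    ]


if __name__ == "__main__":
    # Invented records, not real standings.
    sample = [("Brooklyn", 60, 40), ("Other City", 70, 30)]
    print(order_of_finish("NL", sample))
```

As with the workbook formulas, the logic is written once and then reused unchanged for every season and league selection.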

By now some of you regular Excel users will have noticed that many of the charts I used aren’t standard issue Excel charts. Absolutely true, but this leads me into a discussion of how to use Excel even when the chart type doesn’t exist. Take, for example, the dotplots pictured below.

How the heck did we create those in Excel, without having a standard chart type that even comes close to that look? Simple – we created in-cell charts by using the Excel REPT formula, combined with a couple other values to tweak the scaling for each category. This basically involves repeating a value (in our case, a space) a selected number of times based on the data value. We then choose a shape, a font size, and a font color, and use our data value (and some sort of factor value) to display each dot further to the right (higher values) or to the left (lower values). This is a great trick to learn in Excel, as it gives you another tool when bar charts are less appropriate, which is quite often the case. Visually, dotplots are often superior because we don’t need to fix the scale at zero; this allows us to ‘zoom’ into a narrow range based on the actual data values. In this case, they work far better than bar charts (trust me, I tried those first) and are visually cleaner as well (less ink).
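The REPT trick can be mimicked outside Excel, which may help show what the formula is doing. The Python sketch below is a minimal stand-in, assuming a text cell about 30 characters wide and a scale that runs from the lowest to the highest value in the category rather than from zero. The function names rept and dot_cell, the width, and the sample values are all hypothetical. The letter "o" stands in for whatever shape and font you would pick in the cell.

```python
def rept(text, count):
    """Rough stand-in for Excel's REPT: repeat text count times."""
    return text * max(0, int(count))


def dot_cell(value, low, high, width=30, dot="o"):
    """Pad with spaces so the dot sits further right for higher values.

    The scale runs from low to high, not from zero, which is what lets
    the dotplot 'zoom' into a narrow range of values.
    """
    if high == low:
        offset = 0
    else:
        offset = round((value - low) / (high - low) * width)
    return rept(" ", offset) + dot


if __name__ == "__main__":
    # Invented values for one category, not real statistics.
    values = {"Team A": 0.262, "Team B": 0.243, "Team C": 0.255}
    low, high = min(values.values()), max(values.values())
    for team, value in values.items():
        print(f"{team} |{dot_cell(value, low, high)}")
```

The width divided by the value range plays the part of the "factor value" mentioned above: it converts a data value into a number of leading spaces.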

There are some other chart types that are not native to Excel (horizon charts, box plots, and advanced sparklines) but which can be added to Excel, courtesy of the Sparklines for Excel tool, created by Fabrice Rimlinger. This is an essential add-in if you wish to create some great looking graphics when you have limited space to work with (for instance, on a dashboard). I have been an advocate of this project for more than two years, and look forward to more future use. These charts are also typically in-cell, which makes it very easy to re-size the charts to suit your application. Here’s a view of some horizon charts:

That’s it for now; perhaps I’ll dive into a bit more of the formula detail in a future post.


Hitting the Homestretch

The last 10 days have been exceptionally productive in getting my first visual pennant race book (1901-1968) together, with the final content pages being completed earlier this evening. While that was extremely gratifying in itself, now comes the part where publishers typically do the work – the front and back matter. Title pages, table of contents, acknowledgments, preface, and introduction are all essential pieces in creating a finished product. And guess what? Since I’m the publisher for my own book this time around, I get to figure this all out on my own. Fun.

Fortunately, it isn’t rocket science, as every book follows a general framework that I can learn from and mimic to the best of my ability. Of course, there are other little details when you elect to create two versions of a book, one for print and one for Kindle and other e-readers. Creating bookmarks for each and every one of 175 pages is just one of the steps I’ll need to take for the e-version, but the goal is to make the book as easy to use and polished as possible, so this step is essential.

I’ve previously shared earlier versions of the season content. Here’s a glimpse of how the summary sections will appear:

The goal is to have both versions available by the 20th of this month, just in time for the holidays, with the companion book (1969-2013 pennant races) likely to appear in March 2014. Maybe I can actually spend a couple of weeks without creating charts, copying formulas, and building PDF files. Or I could get an early start on downloading the 2013 data I need for volume two. Just don’t tell my family what I’m up to.


Gephi Book Now Available!

I’m pleased to announce that my first book has been published (thanks to all at Packt Publishing!) and is now available online.

Network Graph Analysis and Visualization with Gephi provides a gentle introduction to the world of network graph visualization using Gephi, a powerful open source tool. In this post, I’ll walk you through a few examples from the book to illustrate how you can begin creating your own network graphs with Gephi.

Before diving into any specific examples, I want to give you an idea of what the book covers, so here’s the Table of Contents:

  • Preface
  • Chapter 1: Installing Gephi
  • Chapter 2: Creating Simple Network Graphs
  • Chapter 3: Exploring Additional Layout Options
  • Chapter 4: Creating a Gephi Dataset
  • Chapter 5: Exploring Plugins
  • Chapter 6: Advanced Features
  • Chapter 7: Deploying Gephi Visualizations
  • Appendix: Network Visualization Resources

While this book makes no claim to covering everything you can do with Gephi (not even close!), it does provide the reader with a broad and accessible overview, while also addressing some of the basic concepts and terminology of network graph analysis.

Here are a few excerpts from a companion article for the book; you can also download a sample chapter from the book page at Packt.

“Gephi is a versatile and powerful tool that will help you create simple network visualizations quickly, while also providing the capabilities to build complex graphs based on large datasets. In this article, you will learn some of the fundamentals of Gephi and network visualization, which will rapidly empower you to create your own graphs…”

“Network graphs are essentially based on the construct of nodes and edges. Nodes represent points or entities within the data, while edges refer to the connections or lines between nodes. Individual nodes might be students in a school, or schools within an educational system, or perhaps agencies within a government structure…”

“Network graphs are drawn through positioning nodes and their respective connections relative to one another. In the case of a graph with 8 or 10 nodes, this is a rather simple exercise, and could probably be drawn rather accurately without the help of complex methodologies. However, in the typical case where we have hundreds of nodes with thousands of edges, the task becomes far more complex…”

“Gephi is an ideal tool for users new to network graph analysis and visualization, as it provides a rich set of tools to create and customize network graphs. The user interface makes it easy to understand basic concepts such as nodes and edges, as well as descriptive terminology like neighbors, degrees, repulsion, and attraction. New users can move as slowly or as rapidly as they wish, given Gephi’s gentle learning curve…”
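The node, edge, neighbor, and degree vocabulary in these excerpts can be sketched in a few lines of Python. This is a minimal illustration of the concepts and has nothing to do with Gephi's own internals or file formats. The names in the edge list and the functions build_neighbors and degrees are invented for the example.

```python
from collections import defaultdict

# Hypothetical edge list: each pair is one connection between two nodes.
EDGES = [("Ann", "Ben"), ("Ann", "Cal"), ("Ben", "Cal"), ("Cal", "Dee")]


def build_neighbors(edges):
    """Map every node to the set of nodes it shares an edge with."""
    neighbors = defaultdict(set)
    for a, b in edges:
        # Edges are treated as undirected, so record both directions.
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)


def degrees(neighbors):
    """A node's degree is simply its number of neighbors."""
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


if __name__ == "__main__":
    print(degrees(build_neighbors(EDGES)))
```

An edge list like this is close in spirit to the two-column spreadsheet many people start from when preparing data for a network graph.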

So if you or anyone you know is interested, navigate to the book’s page, where you’ll find more information, including a sample chapter, as well as links to a number of book sellers. Thanks, and happy visualizing!
