Major League Baseball Trade Networks, Part 1

One topic that has long fascinated me as network graph material is trade data between major league baseball (MLB) teams. I previously created a static visualization showing activity at a macro level, i.e., the number of trades between teams over a 100+ year period. Yet I wanted to do something more: make it interactive, so users could sift through the data for their favorite teams and understand trade patterns through a visual representation. Today, after weeks of tinkering with this topic, I finally have something to share, and I'll walk through how it was created and how to engage with it online. If you want to play with it before reading further, visit the Tigers trade network.

Here’s what we’ll wind up with:

Tigers trade network

My tools of choice in this endeavor will be familiar to anyone working with baseball data, network graphs, or perhaps both, although I haven't seen many instances of the combination. The trade data can be found at Retrosheet, as part of a seemingly boundless array of baseball data, both statistical and historical in nature. Gephi, the open source network analysis tool, is again my choice for creating the network structure from the raw data, and Sigma.js is once more the tool for web implementation. Mix in a bit of Excel and PowerPoint for good measure, and we have all the tools necessary to create a pretty cool (IMHO) finished graph.

So let's get started. Our first step is to go to the Retrosheet site and download trade data, found at Retrosheet transactions. Be aware that there is much more than trade data in this dataset; free agent transactions, releases, and many other transaction types are available. My approach is to grab the entire dataset, which I can then load into a MySQL database for filtering and matching to other baseball data from both Retrosheet and the Lahman archives. For our example, only trades will be used; this leaves open the future possibility of examining free agent signings or other transaction types.
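Since I'm not going to dwell on the load step, here's a minimal sketch of how it might look. The table definition below covers only the fields used later in this post – the actual Retrosheet transaction file carries more columns, and the file name, delimiter, and column order are assumptions you should verify against Retrosheet's documentation:

CREATE TABLE trades2015 (
  PrimaryDate   VARCHAR(8),   -- YYYYMMDD
  Season        SMALLINT,
  TransactionID INT,
  Player        VARCHAR(8),   -- Retrosheet player id, e.g. aardd001
  Type          VARCHAR(2),   -- 'T' = trade, 'F' = free agent signing, 'Fg' = granted, 'R' = release
  TeamFrom      VARCHAR(3),
  TeamTo        VARCHAR(3)
);

-- assumes a comma-delimited file named tran.txt in the current directory
LOAD DATA LOCAL INFILE 'tran.txt'
INTO TABLE trades2015
FIELDS TERMINATED BY ',';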

Once I have the data in MySQL (I’m purposely skipping over this process), the coding steps begin. This was a very iterative process as I gradually figured out how Gephi would play with the output data, but I won’t bore you with my multiple missteps. Instead, let’s have a look at the code snippets, and I’ll explain their usage and the thought process behind them. We’ll start with a view of the code, created within the (free) Toad for MySQL tool. In creating this code, we need to understand how Gephi (or other network analysis tools) work. At the risk of over-simplifying, Gephi only needs nodes and edges. Nodes will represent the players or teams in our visual, while edges will show the linkages within a single trade – who was traded for whom, and which players moved together from one organization to another.

Node creation is simple – we just grab all players involved in a trade, and do likewise for all teams. Here’s the code for players:

SELECT t.Player as Player, CONCAT(m.nameFirst, ' ', m.nameLast) as Name, count(*) as transactions

FROM trades2015 t
INNER JOIN Master m
ON t.Player = m.retroID

WHERE t.Type = 'T' and (t.TeamFrom > 'A' OR t.TeamTo > 'A')
GROUP BY t.Player, CONCAT(m.nameFirst, ' ', m.nameLast)

All we’re doing here is creating a node size for each player, based on the number of trades they are involved in.

For Teams, the logic is a bit more complex; since team names have changed from season to season, we need to join on both team and season to get the correct name assignments. We also want to capture both sides of each transaction – teams trading players away and teams receiving them – which we do using a UNION ALL query.

SELECT b.Team AS Id, b.Name As Label, SUM(b.transactions) as Size
FROM
(SELECT t.TeamFrom as Team, te.name as Name, count(*) as transactions

FROM trades2015 t
INNER JOIN Teams te
ON te.teamID = t.TeamFrom and t.Season = te.yearID

WHERE t.Type = 'T' and t.TeamFrom > 'A'
GROUP BY t.TeamFrom, te.name

UNION ALL

SELECT t.TeamTo as Team, te.name as Name, count(*) as transactions

FROM trades2015 t
INNER JOIN Teams te
ON te.teamID = t.TeamTo and t.Season = te.yearID

WHERE t.Type = 'T' and t.TeamTo > 'A'
GROUP BY t.TeamTo, te.name) b
GROUP BY b.Team, b.Name

After running the queries, we have results that can be posted into Excel or other spreadsheet software, where a tab-delimited file can be saved for use in Gephi. Our file data looks like this:

Id Label Size
aardd001 David Aardsma 4
aaroh101 Hank Aaron 1
aased001 Don Aase 1
abadf001 Fernando Abad 1
abbae101 Ed Abbaticchio 1
abbeb101 Bert Abbey 1
abbof101 Fred Abbott 1
abboj001 Jim Abbott 2

and for the team entries:


Id Label Size
OAK Oakland Athletics 936
PH4 Philadelphia Athletics 6
PHA Philadelphia Athletics 355
PHI Philadelphia Blue Jays 28
PHI Philadelphia Phillies 1445
PHI Philadelphia Quakers 3
PIT Pittsburg Alleghenys 9
PIT Pittsburgh Pirates 1416

This is all Gephi requires for displaying nodes – an Id, a Label, and a Size. Even the label and size are not required fields, but they do make things easier if done in advance. So far, so good.
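As an aside, if you'd prefer to skip the spreadsheet step entirely, MySQL can write the tab-delimited file itself. Here's a sketch for the player nodes, assuming the MySQL account has the FILE privilege and the server can write to the path shown; note that INTO OUTFILE produces no header row, so the Id/Label/Size line still has to be added by hand:

SELECT t.Player as Id, CONCAT(m.nameFirst, ' ', m.nameLast) as Label, count(*) as Size
FROM trades2015 t
INNER JOIN Master m ON t.Player = m.retroID
WHERE t.Type = 'T' and (t.TeamFrom > 'A' OR t.TeamTo > 'A')
GROUP BY t.Player, CONCAT(m.nameFirst, ' ', m.nameLast)
INTO OUTFILE '/tmp/player_nodes.tsv'
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

Next we'll move on to the somewhat more involved process of creating edge files.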

As I progressed deeper into this project, it became evident to me that there were four different types of edges to display. The first two were obvious and easy – players being traded to a team, or from a team. Yet I also wanted to see the other players involved in each transaction, which necessitated the addition of two more edge types – traded with other players, and traded for other players. Note that in many cases just two or three of these might come into play, and for many prominent players, we'll have none at all. Thus, the likes of an Al Kaline or Ted Williams will not be found in any of these graphs, as they remained with a single team for their entire careers.

Here’s the final edge code I wound up with to create the four categories of trades to be displayed in a graph. Gephi requires three edge attributes – a source value, a target value, and an edge type. The edge type must be either undirected or directed; for our graph, all edges will be directed, since we intend to show the bi-directional movements within each transaction. The first bit of code is for instances where a player was traded from a team:

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.TeamFrom AS Source, tr.Player as Target,
CASE WHEN tr.Type = 'T' THEN 'Trade' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Release' END as Category,
CONCAT(m.nameFirst, ' ', m.nameLast, ' ', CASE WHEN tr.Type = 'T' THEN 'Traded' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Released' END, ' on ', tr.PrimaryDate, ' from ', t.name) AS Label, 'Directed' as Type, 'Traded From' as CategoryDetail

FROM trades2015 tr
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Teams t ON tr.TeamFrom = t.teamIDretro and t.yearID = tr.season

WHERE tr.type = 'T'

Note the legacy code covering free agency and releases, rendered moot by the WHERE clause. Those transaction types will have to wait for another set of graphs. In a similar fashion, we have code for trades where a player comes to a team:

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.Player AS Source, tr.TeamTo as Target,
CASE WHEN tr.Type = 'T' THEN 'Trade' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Release' END as Category,
CONCAT(m.nameFirst, ' ', m.nameLast, ' ', CASE WHEN tr.Type = 'T' THEN 'Traded' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Released' END, ' on ', tr.PrimaryDate, ' to ', t.name) AS Label, 'Directed' as Type, 'Traded To' as CategoryDetail

FROM trades2015 tr
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Teams t ON tr.TeamTo = t.teamIDretro and t.yearID = tr.season

WHERE tr.type = 'T'

Next, it's time to create linkages between players from the same transaction, starting with those moving in the same direction (traded with). Note the extra WHERE conditions restricting the results to pairs of distinct players moving from the same team; without them, the self-join on TransactionID would pair every player in a transaction with every other player – and with himself.

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.Player AS Source, tr2.Player AS Target,
CASE WHEN tr.Type = 'T' THEN 'Trade' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Release' END as Category,
CONCAT(m.nameFirst, ' ', m.nameLast, ' ', CASE WHEN tr.Type = 'T' THEN 'Traded' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Released' END, ' on ', tr.PrimaryDate, ' with ', m2.nameFirst, ' ', m2.nameLast) AS Label, 'Directed' as Type,
'Traded With' as CategoryDetail

FROM trades2015 tr
INNER JOIN trades2015 tr2
ON tr.TransactionID = tr2.TransactionID
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Master m2 ON tr2.player = m2.retroID

WHERE tr.type = 'T'
AND tr.Player <> tr2.Player        -- don't pair a player with himself
AND tr.TeamFrom = tr2.TeamFrom     -- same direction of movement = traded with

Note the need to duplicate the Master table in the code, since we now require two player names to populate the Source and Target fields in Gephi. The same holds true for our last snippet, where players are traded for one another; this time the WHERE clause keeps only pairs moving in opposite directions.

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate, tr.Player AS Source, tr2.Player AS Target,
CASE WHEN tr.Type = 'T' THEN 'Trade' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Release' END as Category,
CONCAT(m.nameFirst, ' ', m.nameLast, ' ', CASE WHEN tr.Type = 'T' THEN 'Traded' WHEN tr.Type = 'F' THEN 'Free Agent Signing' WHEN tr.Type = 'Fg'
THEN 'Free Agent Granted' WHEN tr.Type = 'R' THEN 'Released' END, ' on ', tr.PrimaryDate, ' for ', m2.nameFirst, ' ', m2.nameLast) AS Label, 'Directed' as Type,
'Traded For' as CategoryDetail

FROM trades2015 tr
INNER JOIN trades2015 tr2
ON tr.TransactionID = tr2.TransactionID
INNER JOIN Master m ON tr.player = m.retroID
INNER JOIN Master m2 ON tr2.player = m2.retroID

WHERE tr.type = 'T'
AND tr.Player <> tr2.Player      -- distinct players only
AND tr.TeamFrom = tr2.TeamTo     -- opposite directions = traded for

Each of these bits of code outputs results, which are then copied and pasted into our edges spreadsheet. Here are five rows from each of our four trade categories:

Season TransactionID PrimaryDate Source Target Category Label Type CategoryDetail
2010 62908 20100731 KCA ankir001 Trade Rick Ankiel Traded on 20100731 from Kansas City Royals Directed Traded From
2010 60709 20100831 TEX ariaj001 Trade Joaquin Arias Traded on 20100831 from Texas Rangers Directed Traded From
2010 62264 20101118 COL barmc001 Trade Clint Barmes Traded on 20101118 from Colorado Rockies Directed Traded From
2010 72627 20101217 TBA bartj001 Trade Jason Bartlett Traded on 20101217 from Tampa Bay Rays Directed Traded From
2010 72622 20100709 TEX beavb001 Trade Blake Beavan Traded on 20100709 from Texas Rangers Directed Traded From

2010 62908 20100731 ankir001 ATL Trade Rick Ankiel Traded on 20100731 to Atlanta Braves Directed Traded To
2010 60709 20100831 ariaj001 NYN Trade Joaquin Arias Traded on 20100831 to New York Mets Directed Traded To
2010 62264 20101118 barmc001 HOU Trade Clint Barmes Traded on 20101118 to Houston Astros Directed Traded To
2010 72627 20101217 bartj001 SDN Trade Jason Bartlett Traded on 20101217 to San Diego Padres Directed Traded To
2010 72622 20100709 beavb001 SEA Trade Blake Beavan Traded on 20100709 to Seattle Mariners Directed Traded To

2010 62908 20100731 ankir001 blang001 Trade Rick Ankiel Traded on 20100731 for Gregor Blanco Directed Traded For
2010 62908 20100731 ankir001 chavj001 Trade Rick Ankiel Traded on 20100731 for Jesse Chavez Directed Traded For
2010 62908 20100731 ankir001 collt001 Trade Rick Ankiel Traded on 20100731 for Tim Collins Directed Traded For
2010 60709 20100831 ariaj001 franj004 Trade Joaquin Arias Traded on 20100831 for Jeff Francoeur Directed Traded For
2010 72627 20101217 bartj001 figuc001 Trade Jason Bartlett Traded on 20101217 for Cole Figueroa Directed Traded For

2010 62908 20100731 ankir001 farnk001 Trade Rick Ankiel Traded on 20100731 with Kyle Farnsworth Directed Traded With
2010 66840 20101219 betay001 greiz001 Trade Yuniesky Betancourt Traded on 20101219 with Zack Greinke Directed Traded With
2010 72622 20100709 beavb001 luekj001 Trade Blake Beavan Traded on 20100709 with Josh Lueke Directed Traded With
2010 72622 20100709 beavb001 smoaj001 Trade Blake Beavan Traded on 20100709 with Justin Smoak Directed Traded With
2010 62908 20100731 blang001 chavj001 Trade Gregor Blanco Traded on 20100731 with Jesse Chavez Directed Traded With
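One optional refinement: since all four edge queries share the same column layout, they can be stitched together with UNION ALL and run as a single export rather than four copy-and-paste passes. Here's a stripped-down sketch of the idea using just the 'Traded From' and 'Traded To' legs, with the Category CASE and label-building CONCAT logic omitted for brevity; the 'Traded With' and 'Traded For' legs would bolt on the same way:

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate,
       tr.TeamFrom AS Source, tr.Player AS Target,
       'Directed' AS Type, 'Traded From' AS CategoryDetail
FROM trades2015 tr
WHERE tr.Type = 'T'

UNION ALL

SELECT tr.Season, tr.TransactionID, tr.PrimaryDate,
       tr.Player, tr.TeamTo,
       'Directed', 'Traded To'
FROM trades2015 tr
WHERE tr.Type = 'T';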

We have now successfully prepared the data for Gephi. In our next post, I’ll examine the process starting with the Gephi data import phase. Thanks for reading!


Exploratory DataViz Part 2

Having discussed some of Exploratory’s cool features in a prior post, I thought it would be fun to continue the exploration using JSON data as a starting point. I happen to have a fair amount of JSON on hand, thanks to a series of network graphs produced using Gephi and sigma.js, so why not put it to use with Exploratory and start creating a new dataviz?

If you have previously worked with JSON, you're no doubt aware that it can be a bit fickle – miss a bracket or brace in one place and the entire file will fail to load. However, knowing that my JSON has been successful in producing network graphs (see here for examples), I figured it was worth a shot with Exploratory.

To begin, choose the local import option, select json as the file type, and point it to your local file. Give it a name, run the process, and cross your fingers! After a few seconds, I've got my results, and Exploratory has done a good job categorizing the data:

exploratory_2.1

Since this is network data, we have nodes and edges, as well as additional attributes such as color or size. Exploratory has picked up those groupings: first the edges, then the nodes.

exploratory_2.2

Finally, the attribute values:

exploratory_2.3

Since we’re satisfied with the import, we can move on to the summary data, which in this case doesn’t make a whole lot of sense. No matter, let’s see what can be done with some charts and analysis.

exploratory_2.4

To start with, we have x and y values associated with each node, which sounds like a perfect candidate for a scatter plot. We add the x value to the x-axis (how convenient was that!), the y value to the y-axis, node size as the Size attribute, and finally the Eccentricity attribute for color. FWIW, eccentricity is not a measure of flakiness, but rather the maximum distance from a given node to any other node in the graph. This is where the six degrees of separation (or Kevin Bacon, take your pick) concept comes into play; an eccentricity value of 6 equates to 6 degrees of distance. Here's our result:

exploratory_2.5

Not bad, eh? We can also hover over each node to see who it is (after adding Id to the Label field):

exploratory_2.6

We still have a lot of activity in a limited space, so now let’s use a simple filter (see the command line at top) to grab the top 50 values, and see the results:

exploratory_2.7

Now let’s create a new branch to explore further. I would like to sort my dataset using the Betweenness Centrality attribute, but there’s one problem – it’s a character value at the moment, so it doesn’t sort numerically. No matter, we can fix that easily using the Mutate command to convert the variable type. This can be seen in the right margin, where Exploratory conveniently stores all actions. Now we can sort our values in descending order to understand who is most influential in the network (at least by this measure). FYI – Betweenness Centrality tells us which nodes others must pass through most frequently to connect elsewhere within the network. Typically, but not always, it is someone centrally located within a network; sometimes it may be a less influential character (Pedro Borbon in this case) who connects more distant groups to one another.

exploratory_2.9

So there you have it, another quick walk-through with Exploratory. Before I sign off, here’s the live scatter plot you can play with via the Exploratory server. Be sure to use the simple zoom features to traverse the chart!


Open Source Data Viz: Exploratory

It’s absolutely a great time to be alive and involved in data viz, courtesy of the wealth of exceptional open source projects. Several recent open source discoveries are currently on my radar, and worthy of further exploration. Over the next few weeks I’ll examine a few of these options, using baseball data (of course) to illustrate the possibilities within each application. Specifically, we’ll take a look at Trelliscope, bokeh, rbokeh, and Exploratory, and provide some insight and examples into how each of these projects function. This post will focus on Exploratory, an exciting new tool from Kan Nishida.

Exploratory is another R-based application that leverages a multitude of R capabilities while providing its own intuitive interface. While still in beta testing, Exploratory appears to have a very bright future as a powerful visualization tool that allows non-coders to tap into the enormous power of R. The ability to harness a considerable portion of the R language through Exploratory’s GUI is a powerful option for those (like me) with limited R experience and expertise.

Exploratory has a very clean, intuitive interface that may feel a little unusual to long-time R users accustomed to multiple panes and workspaces. Yet beneath the surface, it possesses considerable power, as we’ll see in this tutorial. To start our process, we’ll need a data frame, a familiar object for R users. Let’s begin by examining our data frame options.

First up, we can load a local source file in a variety of formats:
exploratory_local
Some of the usual suspects are here – text and Excel files, but we also have the ability to load json data as well as some of the more prominent statistical formats including SAS and SPSS data. Very cool. We’ll come back to this later.

Now let’s see the remote options:

exploratory_rscript

Great! Not only can we gain direct access to MySQL databases (a huge plus for me), Exploratory also provides access to a diverse range of options including Twitter search, MongoDB, and web scraping. We're going to look at some specific examples later, but for now, here's a glimpse of the MySQL data import window:

exploratory_mysql

As with the entire app, the design is clean and intuitive. In a bit, I’m going to load details into this window so we can test the MySQL functionality.

A third import option is the ability to access any existing R scripts you may have previously created:

exploratory_rscript

I'm not going to spend a lot of time here, since I don't have a lot (any?) of personal scripts. However, for seasoned R coders, this seems like a great feature.

Now let’s walk through some of Exploratory’s capabilities using a MySQL connection. The MySQL setup is really easy – just fill in your database connection parameters and you’re good to go. Here’s what it looks like for this example, with a few fields grayed out for security reasons.

MySQL connection
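Along with the server parameters, the import takes a SQL query that defines the data frame. I won't reproduce my exact query here, but something along these lines – a sketch against the Lahman Batting and Master tables, with a derived age column of the sort the filter example further down relies on – gives the general idea:

SELECT b.playerID, b.yearID, b.teamID, b.AB, b.R, b.HR,
       b.yearID - m.birthYear AS age   -- rough age, good enough for filtering
FROM Batting b
INNER JOIN Master m ON b.playerID = m.playerID
WHERE b.yearID = 2015;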

Once the connection is established, Exploratory will display the initial rows in the dataset. If we click the Run button, our data is pulled into a Summary view, where every variable in the data is summarized. This is a great way to see if our data looks as expected, and allows us to determine if the correct variable type (integer, date, etc.) is associated with each field.

exploratory_summary

If everything looks good, we can move on to the Table option, which will resemble the MySQL view we just saw. No surprises here:

exploratory_table

If we're satisfied so far, then it's time to move on to the fun aspects of Exploratory. For me, this starts with viewing data using the Charts selection. As of this writing, there are 10 chart options (two are actually mapping selections for geo data) including bars, scatter plots, box plots, heatmaps, and more. The ease with which we can plot our data is a real strength of Exploratory. Here I've chosen a couple of stat fields (at bats (AB) and runs (R)) to illustrate the scatter plot functionality.

exploratory_chart

The charts are clean and attractive, and provide some additional options. For scatter plots, labels can be added via a simple check box. This permits me to add hover labels, as seen below:

exploratory_chart_label

Pretty nice so far, don’t you think? But as the old commercials used to say ‘wait, there’s more’. The considerable power of R lies beneath the surface, enabling statistical testing, filtering, data manipulation, and so much more. Here’s a glimpse of just a handful of available options for working with your data:

exploratory_options

Let's select a filter option, where we'll reduce the data to look only at players age 30 or greater. Another great aspect of Exploratory is its exposition of R code: we can use the built-in menu commands while viewing the actual R code behind them. Experienced R users can enter functions directly in a text box, and less experienced coders (like me) can learn on the fly by seeing the output.

exploratory_filter

Now we see the same scatter plot populated with players 30 and older.

Another great feature is the ability to create branches within a project. This facilitates going down multiple paths within one workspace, rather than having to retrace our steps or rerun charts each time something changes. All we need to do is click the branch button, and a new tab is created for us. Very simple and intuitive, as is virtually everything in Exploratory.

exploratory_branch

In this instance, we’ve elected to run a correlation on the chart variables in our main flow, while we create a new box plot in our branch.

exploratory_branch_chart

I’ve been very impressed thus far with Exploratory, and have barely scratched the surface. My next step will be to create some real content that can be shared in a post or via some new visualizations on the site. I love the ease of accessing my data via MySQL, and immediately having the ability to create plots, filter data, and run statistical explorations.


Baseball Grafika Book: Excel Dashboards

Baseball stats are ideally suited for display using a wide variety of charts, network graphs, and other visualization approaches. This is true whether we are using spreadsheet tools such as Microsoft Excel or OpenOffice Calc, data mining tools like Orange, RapidMiner or R, network analysis software such as Gephi and Cytoscape, or web-based visualization tools like D3 or Tableau Public. The sheer scope and variety of available baseball statistics can be brought to life using any one of these or countless other tools.

This is why I felt the need to create a book merging the rich statistical and historical data in the baseball archive with the advanced analytic and visual capabilities of the aforementioned tools. With a bit of good luck and perseverance, the book will be published in late April, coinciding with the early stages of the 2016 baseball season, under the title Baseball Grafika. This series of articles will share a few pieces from the book, which is still undergoing additions and revisions at this stage. I hope these will help provide some insight into how I view the possibilities for visualization, and perhaps generate your interest for how other datasets could benefit from a similar approach.

One of the chapters of the book deals with the creation of dashboards in Excel that allow us to distill large datasets into a single page summarizing information. Here's an example of a single pennant race, and how its unique story can be told using an array of charts, tables, and graphics.

AL_1967_5

Now that we’ve seen an entire dashboard, we’ll look at the component pieces and how they were built. As a reminder, this is all created in Excel, which is often maligned as a visualization tool. Used well, Excel can produce highly effective visualizations, although deploying them to the web is not practical. In the book, I walk through how to create this dashboard using Excel, taking readers through all the steps needed to create formulas, charts, text summaries, and more.

Creating flexible, powerful data displays in Excel frequently involves the use of pivot tables and slicers (filters) that allow for data manipulation. Building charts on top of these tools permits maximum flexibility. Done effectively, this means we can create a template that can be used over and over, with only the source data changing according to our slicer selections. Here’s an example pivot table with slicer options:

team_pivot

The slicer selections allow us to choose the data elements from our base dataset that are to be displayed in a pivot table. From there, named ranges and formulas can be used to select the data programmatically and feed it into charts that are not dependent on any additional manual intervention. One chart, used over and over, makes it simple to display new data with a single click of a slicer button.

Named ranges can be used extensively to automate the dashboard to a high degree, using native Excel functionality. Here's a screenshot showing a name being defined in Excel:

Excel_name_range

A virtually unlimited number of named ranges can be created and then used as references in Excel cell formulas, making it easy to populate cells, tables, or charts with updated information.

Each of the following sections of the final dashboard is populated using one or more named ranges, based on pivot table data in most cases. All that is required in the dashboard is a simple formula to grab the right data based on the slicer selection.

First, we create a basic text summary recapping each season, which is then pulled into the top section of the dashboard:

AL_1967_1

This is then followed by the pennant race section of the dashboard, including both the pennant race charts as well as a table of season-ending standings information. One pivot table and its references populate the chart, while a second pivot is used to provide the table data, with cell-level formulas performing calculations.

AL_1967_2

Our third section makes use of the wonderful Sparklines for Excel add-in. The dashboard benefits from the use of horizon and variance charts, as well as box plots. In between, we're able to add some additional Excel cell calculations to display metric values.

AL_1967_3

The final section of the dashboard takes advantage of some cell formulas to create dot plots displaying relative values within a category. This allows readers to see who was higher or lower in a specific measure while maximizing space along the way, which is often critical when building dashboards.

AL_1967_4

The book will provide much more, including tutorials on creating this type of dashboard, in addition to other visual displays of baseball information. Ultimately, the goal is to share some of my approaches and hope that they drive others to create their own unique approaches, all in the interest of advancing the discipline of baseball data visualization.

Future posts will examine other ways we can explore our baseball data. Text mining, statistical distributions, interactive charts, historical maps, and network graphs will be among our future topics. See you soon.


CartoDB MLB Birthplace Map

Finally, thanks to the brilliance of the CartoDB platform and abetted by the beautiful Stamen Design watercolors theme, I have a map that tracks the debut of thousands of major league ballplayers from 1871 to 2013 (2014 data will be added at a future date). This is one I've been cooking up for a while, but couldn't get to as a top priority, given that it required some late nights fixing geo codes for hundreds of towns in places like the Dominican Republic, Puerto Rico, Japan, and Venezuela. All that was finally completed, giving me a dataset with a high degree of integrity – probably 99% accurate.

Have a look at the finished map – going to full screen mode will let you appreciate it even more:

This is the first in what could become a series, as the same information could be displayed in a variety of other formats such as bubbles, choropleth (filled maps), or clusters.

What to say about CartoDB? It’s absolutely brilliant in both concept and execution, and the founders seem willing to make strategic modifications on the fly. For now, I’m working with the free version (limited data capacity), but in time, may want to step up, given the capabilities of the software.

Here’s a look at what I’m talking about, so you can get a feel for the user interface – very clean and easy to navigate. First, the entire window for the current project:

birthplace_carto_map_20141122

CSS styling is also available for those wishing to tweak their maps, with the wizards providing the initial styling:

carto_css_20141122

You can even limit your data using the available SQL window, a great option for users (like myself) who are well acquainted with SQL:

carto_sql_20141122
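For instance, restricting the map to a single era takes only a one-line filter in that window. The table and column names below are illustrative rather than the actual ones from my dataset:

SELECT *
FROM mlb_birthplaces                      -- hypothetical table name
WHERE debut_year BETWEEN 1901 AND 1950;   -- hypothetical column name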

Finally, a simple toggle at the top of the window lets you move seamlessly between the map and data views. Here's a quick look at the data for this project:

carto_data_20141122

I should mention that working with the data is just as easy as styling the map or using the wizards. I have been able to quickly change string values to dates, and to geo-reference the data using the latitude and longitude fields in my text file. Anyone with experience working with Excel or any number of database platforms knows that converting field types is often very challenging, and sometimes comes with the risk of losing the data in that field. Not so with CartoDB, as it easily converted the date values to timestamps suitable for the torque (timeline) mapping wizard.

You should be seeing more work from me using CartoDB, and it won’t be limited to just baseball data.


Sorry MicroStrategy: Back to Excel

A few recent posts have documented my explorations with the new desktop software from MicroStrategy, clearly designed to compete with the likes of Tableau and Excel. One of the chief advantages of the MS offering is the free price point (as in $0), versus the much pricier Excel and, especially, Tableau. As usual, I had to perform my due diligence, as is the case with every new tool I get my hands on.

The MS Analytics offering does have a lot to recommend it, as I noted in my previous posts. However, after a few weeks of intensive exploration, I have come to the conclusion that it isn't a good fit for what I'm currently attempting to do, which is to create graphics for a pair of upcoming books. While it is perhaps easier to manage the data compared to Excel, the structure is a bit too rigid, and the chart options are also too structured and limiting for my current needs. It's still useful, but not so much in the current context.

Which brings me back to Excel. For all its faults, Excel is still very powerful, and most importantly for me, very flexible. I can hack my way into almost any sort of chart, aided by the likes of Jon Peltier, Chandoo, and Fabrice Rimlinger. So Excel it is, at least for this project.

Now that I know what I'm doing, it's high time to get back to work and deliver these books! Look for the 1901-68 pennant races in December, with the 1969-2013 races to follow in early 2014.


Saiku Treemaps and Sunbursts

I’m becoming a real advocate for the Saiku Server tool, the best open source OLAP product I have experienced, and one that keeps getting better. One of the great strengths of Saiku is the use of the terrific Protovis visualization library, which enables users to quickly toggle from a crosstab view to a diverse range of charts. Now, with version 2.6, the Saiku developers have added a few new features that make the tool even more powerful. I’ll walk through these in the following paragraphs, and show a few examples along the way.

The first new features are two visualization types that are becoming increasingly popular through tools such as Protovis and d3 – Sunbursts and Treemaps. These two chart types have a common theme, in that they display multiple levels of data within a single chart. In the case of a treemap, this is done using a rectangular format composed of smaller internal rectangles based on the proportions of respective data elements. A sunburst uses a similar approach, but with a circular framework that fans out from the center to encompass multiple layers of data, laid out based on proportions, just as with the treemap.

This will be much easier to explain with pictures, so let's do that. In this example, I have set up a crosstab with home runs by franchise for the seasons 2006 through 2010, filtered by the American League. To make it easier to comprehend, I've also sorted by the number of home runs by franchise over this period. This will provide enough data for the chart to be interesting while not overwhelming us visually. Each team will have a rectangle for each season sized by the number of home runs hit in that particular year. As with some stacked bar charts, the total for each season will add up to 100%, regardless of the raw numbers. Here we are:

You should get a pretty good idea of the numbers based on individual rectangle sizes, and if you don’t, Saiku rides to the rescue with some nice mouseover titles that tell you exactly what it is you’re seeing.
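If you'd like to sanity-check the proportions, the numbers behind this crosstab boil down to something like the following against the Lahman Teams table – a sketch only, since Saiku itself issues MDX against the cube rather than SQL:

SELECT te.franchID, te.yearID, SUM(te.HR) AS home_runs
FROM Teams te
WHERE te.yearID BETWEEN 2006 AND 2010
  AND te.lgID = 'AL'
GROUP BY te.franchID, te.yearID
ORDER BY te.franchID, te.yearID;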

Now let’s see how the same data works with the sunburst chart:

Gets a little crowded compared to the treemap, due to the limitations of the circular layout. Let’s reduce the size of our data by including just the 2008-2010 seasons. This should help us to take advantage of one of the strengths of the sunburst diagram – the ability to size each level of the diagram. So in the case where one season has a higher number of home runs hit than the next season, the chart will reflect this visually. Recall that the treemap did not do this, instead providing a fixed 100% for each season, regardless of the raw numbers.

Our improved chart:

Even with the reduced dataset, we have some issues, mainly with the long titles of some of the franchise names. This could be edited outside of Saiku using Inkscape or Adobe Illustrator, or perhaps we can tinker with the CSS inside of Protovis to shrink the font size. Nonetheless, I hope this has illustrated the basic premise behind sunburst charts, and their ability to display multiple layers of information in a single chart. Bear in mind they may not always be the optimal chart for displaying this sort of data, especially in a static chart.

The folks at Saiku promise a lot more to come, and based on the heretofore swift evolution of the product, I have little reason to doubt them. Remember, if you wish to play with this yourself, simply log in at the VBP Lab, using the credentials labuser/user, to access the tool.


MicroStrategy Analytics Update

A few days ago, I stumbled across the free analytics offerings from MicroStrategy, as detailed in my last blog post. As you may recall, I had begun dabbling with the desktop version (there is also the online Analytics Express, with some slight differences), and promised to report back with further insights into the strengths and weaknesses of the tool. So here I am, in front of a Friday night fire (contained within our fireplace) in chilly Detroit, after having spent much of the day working with Analytics Desktop, or AD, as I’ll refer to it for the remainder of the post.

On balance, I've been favorably impressed with AD, although I admit to a selfish motivation to take advantage of what AD can do. Given that I am in the midst of preparing a couple of highly visual baseball books to come out later this year and in early 2014, I saw an opportunity to tap into AD to create some of the visuals for the books. So I was sincerely hoping that I would like it, and that it could make certain portions of the books far less laborious to create than they would be in Excel or other tools.

So with that out of the way, let’s walk through an analysis of the strengths and weaknesses of AD, using examples whenever possible. Let’s start by getting the weaknesses out of the way, and then move on to the longer list of strengths AD brings to the table.

Weaknesses and shortcomings:

  1. The big one – AD uses Flash for all charts and dashboards within the app. Given that Flash is on the way out as a technology, this seems a curious choice. Perhaps the folks at MicroStrategy had a team of Flash developers sitting around, making it easier to launch the product quickly. Clearly, javascript has taken over from Flash in the data viz universe, for a multitude of reasons, so using Flash is not going to wow anyone who’s familiar with d3, Protovis, or a handful of other open source libraries.

  2. Next is the use of a java server to run the app – based on the default port (8082), this feels like a Tomcat instance built into the application. This means that it takes some time to launch the app and have it load in your browser. To AD’s credit, it has run flawlessly on my machine, and was a breeze to install. Still, the combination of a java server and Flash may feel a bit awkward to a desktop user familiar with offerings from Tableau or other data viz vendors. It certainly will for the Excel crowd.

  3. For someone familiar with d3, or for that matter Tableau or Excel, AD will feel a bit constrained in terms of options; for example, it is a bit trickier to get colors to do what you want (a right click will get you the nearly meaningless Flash option settings). As someone with extensive Excel and Tableau experience, this is the most challenging element for me. Do not expect the same sort of capability from AD, although there are some options for customizing your charts and tables. In some ways, this makes the app feel outdated – modern visualization tools provide a great deal of flexibility by comparison.

Those are my three major observations after a few days of use, with number three covering a wide range of options that are either not available or that have been hard coded into the app, thus restricting or limiting your ability to create the chart of your dreams. Now, on to the positive stuff:

  1. Connecting to my data was an absolute breeze, at least after jumping through the Windows ODBC hoops (32 bit vs 64 bit). Once I figured out the correct ODBC executable on my machine, I was off to the races, and had a database connection within seconds. I use MySQL, so of course I needed the correct driver installed locally, but no issues there. After these steps were complete, I was able to view all my database tabs, and then write some simple SQL code in AD to bring back the data I needed. Once connected, here’s what I saw:

  2. The GUI is easy to navigate as well, showing the available dashboards, including samples to get you started, as shown here:

  3. Charts are very attractive, and can be easily re-sized within a dashboard; in fact, charts are easier to re-size here than in a comparable Tableau dashboard, where you need to set up frames in addition to the charts. Here are some example charts that may wind up in one of the books:

  4. A number of options are available for each chart, using a menu-driven approach (definitely not javascript here!)

  5. AD also has a ‘Page By’ option which is great for cases where you want to replicate the same charts or tables across multiple instances of a variable. In my case, this could be by team or by season. Set the page up once, set the page by variable, and you instantly get the same charts populated with data specific to the individual page. Pretty slick! Tableau has a similar feature, and you can use Excel pivot table functionality in the same way, although I find the AD approach to be more powerful and simple.

  6. Exporting to a PDF or image (.png) format is also very simple. From my perspective, the image export is excellent, as it captures the entire panel view of multiple charts as one image.

  7. Finally, it’s very easy to create multiple panels within a single page layout, as well as to add additional layout pages. These features make for a cleaner, easier to navigate feel versus the many tabs used in Tableau or Excel.

That's my take for now – all in all, AD is a great addition to the data viz toolkit, even with the Flash-based limitations. It won't get you to some of the same places you can go with Excel, Tableau, or especially d3, but it is at least their equal in data handling and dashboard layout possibilities. If you work with .csv, Excel, or database data, give it a try!


A New Analytics Tool from MicroStrategy

Being the analysis geek that I am, I’m always on the lookout for anything new in the data analysis and visualization space. New insights, techniques, people, datasets, tools, etc. always intrigue me and help keep things fresh. Any time I come across something new makes it a rewarding day, especially when it leads to something that can help me with my baseball analysis.

So today was one of those days, when I stumbled across a new tool courtesy of a new blog I also discovered. The tool comes from MicroStrategy, one of the mainline Business Intelligence (BI) players in the industry; it's called Analytics Desktop and, quite remarkably, it's free (as in $0.00!). So of course I had to evaluate the latest (released October 2013) addition to this interesting space.

My previous impression of MicroStrategy was tepid at best, given my familiarity with their large scale BI installations for major corporations, including one I had previously used in the corporate world. It was highly structured, felt inflexible, and churned out canned reports that took too long to run. In short, I saw it as a dinosaur app, even several years ago, competing with the likes of Cognos and other major BI players in the world of operational reporting.

The new tool is clearly designed to compete with Tableau and other nimble, visually-oriented BI vendors who understand that the business analysis space has evolved (far) beyond applications that are controlled by the IT folks. Many of the old apps wind up gathering dust or are used to generate dull reports that shed little insight into the needs of the business. Good analysts have been bypassing those tools for years by dropping ad hoc data into Excel or (more recently) Tableau, where they can at least create some decent charts and tables without having to wage battles with unfriendly BI servers.

Enough of the background – what’s the early verdict? In a word, impressive! I’m only a couple hours in, but have managed to complete the installation, configure the connection to my databases, and begin playing with data. The user interface is easy to navigate, the charts are clean and nicely styled, and the ability to work with filters, pages, sorting, and more appears to make this a very powerful app. While some of my favorite chart types are not here – horizon charts, bullet charts, and a couple others – my first impressions are hugely favorable.

I’m going to give a more complete review soon, after I’ve had time to work through some of the ideas I had been intending for Excel. In the meantime, here’s a quick look at a dashboard I created using team level data.

Pretty sweet, isn’t it? The ability to combine multiple chart and data elements into a dashboard is one of the strengths of Analytics Desktop, and one that I expect to tap into in the coming weeks. Much more to come on this one.


Saiku Toolbar Tutorial Part 1

I recently posted on Saiku-Server, the new tool for the VBP Analysis Lab. In that post, I walked you through some of the many cool features, including a number of charting options. Now I want to shift the focus to the Saiku toolbar, which affords us many more options. Given how much the toolbar offers, I'll cover it in two tutorials.

We’ll walk through a series of screenshots of the toolbar, highlighting each button as we go, with a brief explanation of the functionality.

First up is the new query button:

This option allows users to open a new query window (tab) without having to close any existing query windows, similar to web browsers or Excel. Any actions taken in the new window will be completely independent of existing queries; you can set new filters, rows, columns, sorts, etc.

Next comes the open query option, which lets you open a previously created query that is now stored in the Saiku repository.

Here’s what you’ll see when you select this option:

Our third button is the save query function. Simply put, this will save an existing query to the repository for future use.

You’ll see a similar looking dialog window that allows you to save your query for future use.

Next up is the run query function. This will run the existing query, useful if you don't have the automatic execution option set.

The automatic execution button is a toggle that lets you run queries automatically – or not. If you elect not to use it, then the run query button will become your best friend.

The next three buttons are used to manipulate your display options. In order from left to right, they are the Toggle fields, Toggle sidebar, and Hide parents buttons.

Toggle fields will hide the rows, columns, and filters from view, providing more vertical workspace for your queries. Here are before and after looks.

Toggle sidebar will hide the folders in the left sidebar, giving you more horizontal space. Again, here’s a look after we chose to toggle the sidebar:

Finally, the hide parents option lets you choose how to view your layout in the case where you have multiple levels within a hierarchy. Here are before and after views:

The Non-empty button is used to hide missing results; in other words, rows and columns within a hierarchy that would return empty cells are removed from the results, leading to a cleaner crosstab. Once again, let’s examine what happens when we decide to show the empty rows in the dataset, and see why we probably want to use the non-empty function most of the time.

That’s it for the first tutorial. In the second part, I’ll cover a range of buttons that are oriented toward exporting the data as well as viewing the MDX (Multidimensional Expressions) language behind the query. See you then!
