Updated Pennant Race Charts

The 2020, 2021, and 2022 MLB pennant race charts using Retrosheet data have been updated on the Exploratory Server: https://exploratory.io/dashboard/kc2519/Pennant-Races-1901-current-aUu5vDT1EW. All seasons from 1901-2022 are now available using the simple parameter selection (just make sure it’s set to the interactive mode).

Here’s a screenshot:

Views of the 2022 National League pennant races by division

Meanwhile, I’m struggling with some JSON output for my traditional version, so no updates there yet.

FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

Bad Trades, Red Sox Edition

This is the first in a series of posts where I take a look at notoriously one-sided baseball trades, using the baseball trade networks published on this site earlier in 2022. I won’t necessarily rank these deals in any sort of order; rather I will pick out a few from the network trade graphs and provide some analysis and context for some of the most notorious transactions.

If you haven’t seen the trade networks previously, here’s a link.

The networks were built using data from Retrosheet and Neil Paine, loaded into Gephi, a network analysis and visualization tool, and ultimately pushed to the web where I could finish styling the graphs. Graph nodes (the circles in the networks) are sized based on the total future WAR (Wins Above Replacement) accrued by the teams involved in the trade. All values must occur at the major league level (MLB), so players involved in the deal who don’t reach the MLB level with their new team will have a zero value. Only the cumulative WAR value while playing for the new team is included; we are not calculating WAR once a player leaves one of the teams involved in the transaction.

Finding a bad trade by scanning the networks is more an art than a science; the key is to look for large nodes (indicating a lot of future WAR value), and then dissecting the trade to see how much value each team received. The other alternative is if we already know the player(s) we are looking for; in these cases we can perform a simple search to find the trade. Here’s a classic example that Red Sox fans would love to forget – trading future Hall of Famer Jeff Bagwell for journeyman reliever Larry Anderson. Let’s go to the Red Sox trade network and search for Jeff Bagwell.

Red Sox trade network

Typing in Jeff Bagwell locates him quickly within the trade network. Note that even if a player is involved in multiple trades to or from the same team (rare but possible) the search will locate each transaction. Here’s the Bagwell transaction, showing his player node and future WAR value connected to the transaction node; every player involved in that transaction will be connected to the trade node, as long as there is some future WAR value. If a player in the trade did not play in the majors for the receiving team, they will not be reflected in the graph. Here’s a view of Jeff Bagwell relative to the trade:

Jeff Bagwell transaction

We can also click on the transaction node to see the value provided to each team by all of the players involved in the trade, again assuming they spent time with the team and were not limited to the minor leagues. Clicking on that node will display the respective WAR values in the sidebar on the left of the screen:

WAR values of the trade

Here’s where we get to the details of the trade, and specifically the direct benefits accrued to each team. The Red Sox received 1.1 future WAR from Larry Anderson; to put this in perspective, we might expect this sort of value for an average player for a single season. The Astros, on the other hand received an incredible 93.8 WAR from Jeff Bagwell, or close to 6 WAR per season for 16 years! That is a Hall of Fame level performance, and it eventually led to his selection to Cooperstown in 2017. Here’s a profile that mentions the one-sided trade.

While we have the Red Sox network open, let’s see if there are any other disastrous transactions (other than the cash sale of Babe Ruth to the Yankees, technically not a trade). After scanning the network, we find this one from 1928:

Transaction 59324 – Buddy Myer

This one is clearly not a Bagwell-level disaster, but was still quite negative for the Red Sox, with a WAR differential of 30 points. The primary villain here is Buddy Myer, a solid infielder who hit .300 or better seven times for the Senators. Not a major star, but the owner of a very nice career, including leading the American League in batting average in 1935.

Let’s try to find one more before closing this piece, this time favoring the Red Sox. We zero in on this deal:

Transaction 59403 – Jimmie Foxx

The Red Sox netted nearly 45 future WAR value while surrendering just 0.1; most of the benefit was generated by slugging future Hall of Famer Jimmie Foxx, but they also received a nice three season contribution from pitcher Johnny Marcum. Note how we also removed nodes not involved in the trade by clicking on the edges icon on the bottom left of the display area; this makes it easier to focus on the details.

Feel free to try your hand at finding more of these one-sided deals in the Red Sox or any other trade networks. I’ll be back with some other teams before long. Thanks for reading!

FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

First 10 WAR Trade Networks Published!

The first 10 WAR (Wins Above Replacement) Trade Networks are now available for exploring! This initial group includes nine team networks and one overall graph with all teams included. Here’s a list of the 10 graphs:

Each of these and any upcoming WAR trade networks can be found on this page.

Let’s walk through how the graphs work, using the Detroit Tigers network as an example. We’ll begin with an anatomy of the graph display:

As the image shows, the primary focus will be the main graph area in the center of the window. This is where all nodes (transactions, teams, and players) will reside, connected by edges based on common relationships. Transaction nodes will vary in size based on the total value of a trade with the largest nodes indicating a trade that created significant future WAR for one or both teams. Team and player nodes are set to constant sizes so that the initial visual focus will be on the transaction nodes. The size differences become more noticeable when we zoom in to the network. More on that shortly.

Edges are also sized based on WAR value; this is where we see the value provided to a team and by specific players. Edge sizes (weights) will be more easily seen when we zoom in to the network.

On the left are some graph controls to assist in navigating the graph. We can zoom in using the slider control or the plus/minus buttons adjacent to the slider. Zooming can also be done with a mouse scroll if you prefer that option. The fisheye lens can be toggled on or off and can be used to highlight certain areas of the graph by hovering over a selected region. Finally, the edges button will enable showing or hiding edges and connected nodes. This is useful when you wish to reduce surrounding nodes and focus on specific transactions. We can also pan the graph by dragging it using a mouse – this is helpful in centering a network or viewing specific regions of the graph.

At the upper left of the window is a color legend for each node type, and hidden on the left (not shown in our image) is an information pane that will show specifics about the network. More on that in a bit.

Now let’s examine the information window – this is what makes the network truly powerful. When the network is first displayed or the browser window is refreshed the information pane displays information about the graph (open it by clicking on the arrows icon at the top left):

You can see the simple overview of the graph, the source data, and what it aims to accomplish. Here’s an enlarged version for easier reading:

If we zoom in and select a specific transaction the pane displays the relevant details for that selection:

Now we have the details for the transaction – the season, teams, and players involved. Here’s the enlarged view:

You can do this for any transaction in a graph, or you could choose to select a team or player to see how they fit into the network. The possibilities are nearly endless and it’s a fun way to understand the relationships between teams, players, and trades.

We’ll do more exploring of the networks in upcoming posts; I’ll also be adding more teams until we have a complete set of trade networks. In the meantime, feel free to explore the graphs to learn more about the best (and worst) trades your favorite team has made over the last 120 years. Enjoy, and thanks for reading!

FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

2019 Batting Explorer Updates Complete

Welcome to Visual-baseball.com! I’ve just completed an update where data for the 2019 season is now part of the Batting Explorer 2010-2019 interactive visualization. This is a tool where we have every batter in a given season depicted via a baseball card type of framework, showing their key stats for the season, as well as the positions played in the field (by number of games), using a visual of a baseball diamond. It’s a highly interactive way to filter through multiple seasons worth of data by team, player, position, and more.

So while you’re waiting for on the field action to start, have a look at the Batting Explorer to answer some of the questions on your baseball mind. For example, here’s a quick look at all of the left fielders who played at least 110 games at the position in 2019 (by using the filters pill at the top left):

We can take the same results and sort it by the number of home runs, to see who the power hitters in the group were for 2019, by sorting from high to low:

Now we see Kyle Schwarber and Juan Soto at the top of the display. Let’s look at Schwarber’s details by simply hovering over his card:

We now have a pop-up within Schwarber’s card telling a mini-story about his batting stats for the 2019 season. Here’s a closer look:

From this, we learn that Schwarber hit 38 home runs, batted .250, and had an OPS (on-Base + Slugging) of .871, as well as multiple other details. Additionally, you likely noticed the “View the full stats at Baseball-Reference.com” pop-up tag. To get there, simply click on Schwarber’s card, and you’ll be transported (in a new tab) to his page at Baseball-Reference:

Pretty cool, right? Give it a try, or pick your own filters, look at specific teams and seasons, and so on. Here’s the Batting Explorer page, with every decade back to 1900-1910 available for your curiosity.

More updates to come soon on the 2019 data, and thanks for reading!FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

Updating Player Networks – Part 3

In Part 1 of this series, we looked at how to generate node and edge data for all players within a single franchise’s history. Part 2 examined how we could take that data and create a network using Gephi, adding graph statistical measures along the way. In this, the final part of the series, our focus is on moving the graph beyond Gephi and on to the web, where users can interact with the data and interrogate the player network using sigma.js software. So let’s pick up with the process of moving the network from Gephi  to sigma.js.

Recall our basic network structure in Gephi, which looks like this:

One of our goals when we export the graph to the web is to enable user interaction, so the above graph becomes a bit less intimidating. As a reminder, this is at most a moderate sized network; the need to provide interactive capabilities becomes even greater for large networks.

There are a few ways we can create files suitable for web deployment using Gephi. In this case, the choice is to use the simple sigma.js export plugin located at File > Export > Sigma.js template. Selecting this option will provide a set of options similar to this:

This template allows for a modest level of customization, including network descriptions, titles, author info, and other attributes relevant to the network. When all fields are filled to your satisfaction, click on the OK button to save the template. Your network will be saved to the location specified in the blank space at the top of the template window (grayed out in this case). A word of caution is in order here – if you make some custom entries to the template, and then make adjustments to your network, be sure to specify a new location to save the generated files. Otherwise, the initial set will be overwritten. This is especially critical if you have gone behind the scenes to customize colors, fonts, and other display attributes. More on that capability in a moment.

Once the template is complete and the OK button is clicked, a set of folders and files is generated that can then easily be copied to the web. Here’s a view of the created file structure:

These files and sub-folders are all housed within a single folder named ‘network’. If you wish to tinker with your graph in Gephi, rename the network folder to something else prior to exporting a second (or 3rd or 4th time). This will help keep you sane. 🙂

Without going into great detail here, let’s talk about the key files:

  • data.json stores all of your graph data, including positioning attributes, statistics created in Gephi, plus node and edge details
  • config.json contains many of the primary graph settings that can be easily edited for optimal web display. It’s quite easy to go through a trial and error process, since the file is so small. Simply make changes, then refresh your browser to see the result.
  • index.html has a few basic settings relevant to web display, most notably the title information that the browser will use

Within the css folder are .CSS files where you can make changes to many display attributes. This is typically where you will adjust fonts and font sizes, as well as some colors. The js folder has javascript files that can be edited to a certain degree, although caution is recommended if you’re not a javascript guru. Finally, the images folder contains any relevant image files to be used for web display, such as logos.

Alright, now that we have had a brief view of the technical details, let’s have a look at the network graph in the browser. Note that this is still a bit experimental at this stage; I’m attempting to customize each graph based on the official team colors or close variations in the color family.

To see some of the interactive functionality, let’s select a specific player. Simply type Ted Williams (the greatest Red Sox batter of all time) in the search box, and view the results:

Now we see only the direct connections (a 1st degree ego network) for Ted Williams (270 degrees in this case), as well as a wealth of statistical information previously calculated in Gephi, seen in the right panel. At the bottom of the panel are hyperlinks where any one of the 270 connections may be clicked, allowing us to view their network. As you can see, sigma.js quickly provides great interactivity for graph viewers.

Even better, we can scroll in to the network at any time:

Hovering on a node generates a pop-up title for that node, as seen for Ted Williams in this instance. We also begin to see the names of other prominent players at this zoom level. Additional zooming will reveal more player titles – a great way to embed information without making the original graph visually chaotic by displaying all titles at every level.

For the current web version of this graph, click here. I’ll try to keep this version active, even if I make improvements to the final network. Once again, thanks for reading!FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

Updating Player Networks – Part 2

In our previous post, we looked at how to acquire and load our baseball player data into Gephi. In this second installment, the focus will be on creating a player network graph in Gephi, and customizing many settings to deliver a network graph we can export to the web. Player networks are used to detail the connections between all players who are connected to one another in some fashion. In this instance, it is based on players having played for the same team in one or more common seasons. So let’s begin with the process of creating the graph using our raw data from the first installment.

Importing .csv data into Gephi is quite simple – we create individual node and edge files (as we showed in the previous post), and use the Gephi import functions to pull the data in. I always start with the node file, since it will typically have additional information not included in the edges file. After importing the node data, I then import the edge data, which gives us the information to form our initial graph. If we were to start with the edge file, Gephi will create our node data automatically, and we will not have the detail needed for our graph. This approach may work for simple graphs, but not for our current case.

Once both data files have been imported, we can begin thinking about what we want form our graph. Here are several questions we might pose:

  • How will we use color?
  • What sort of layout will be best?
  • Which measures should we calculate?
  • How should we depict node sizes?

In many cases, the answers to these questions come about through trial and error. We may have some ideas going into the process, but invariably, there will be modifications along the way. So be patient, and be willing to experiment as you create network graphs. The graph you will see in this post went through many of these modifications, which I won’t take the time to detail. Instead, this post will detail my final choices, along with some explanations for why these choices were made. So let’s take a walk through the various facets of the visualization.

Layout

While a network will retain the same underlying structure from a statistical point of view (degrees, centrality, eccentricity, etc.) regardless of our layout choices, it is still important to select a layout that will visually represent the underlying patterns in the network. Otherwise, we could just as well deliver a spreadsheet with all of the network statistics. So layout selection is critical, and often involves an iterative process.

For the baseball network graphs I built in 2014, I eventually settled on the ARF layout algorithm, which ran quickly and created an attractive circular network graph display using the player connection data. Alas, there is no ARF algorithm available for Gephi 0.9.2, so I required a different approach for the updates. Ultimately, this led to a 2-step approach using a pair of layout algorithms – OpenOrd followed by Force Atlas 2. OpenOrd is especially effective at creating a quick layout from large datasets, although with far less precision than some other force-directed approaches. Still, it is a great tool for creating a general understanding of the structure of a network very quickly. Force Atlas 2, is the near opposite of OpenOrd – a very precise approach that can be tweaked easily using the various settings in Gephi. It is ideal for putting the finishing touches on what OpenOrd started.

Here are the settings I eventually settled on for Force Atlas 2, after much trial and error:

Force_Atlas_2

Some of the more important things to note here are the Scaling and Gravity settings. I reduced the scaling to 0.5 so the network would display appropriately in a single window without the need for scrolling. The Gravity setting was increased to 2.5 to force nodes slightly toward the center of the display. The LinLog mode and Prevent Overlap options are also selected in order to make this particular graph more visually effective. For other graphs, I have used the Dissuade Hubs option, forcing large nodes to the perimeter of the graph; in this case, that was not an ideal choice.

Color

The use of color is also important within a network graph display. Color can be used to highlight nuances in the data that distinguish one or more nodes relative to another group of nodes. Often we use color to visually represent clusters within the graph, as grouped using the modularity classes statistic or some similar input. In the case of this series of graphs (ultimately one graph per team), I made a decision to use the official team colors to differentiate each graph. Thus my initial graph for the Boston Red Sox would be based on the two primary hex colors for the current team (these colors do change over time for many teams).

Here are the Red Sox primary colors:

c8102e_Color_Hex_-_2018-06-10_09.15.48 0c2340_Color_Hex_-_2018-06-10_09.16.33

After capturing current team colors in a spreadsheet for easy reference, I used the color-hex.com site to select complementary colors for the Red Sox graph. Using complementary colors allows me to differentiate clusters in the graph while remaining true to the original concept of employing team colors for each graph. So instead of a wide range of colors one would normally see in a Gephi output, I was able to input the complementary colors for each group. Thus, one team color could be used for the graph background, while the other color (and it’s complements) could be used for the graph structure (nodes & edges). We’ll share the effect later in this post.

Statistics

Graph statistics are critical to the full understanding of the structure of a network. While we can view a graph and begin to understanding the general structure of a network, the various statistics will aid and reinforce our initial visual comprehension. Gephi provides a nice range of statistical measures to choose from:

  • Eccentricity (the number of steps needed to traverse the network)
  • Centrality – betweenness, eigenvector, closeness, harmonic closeness (various measures of importance of an individual node)
  • Clustering coefficient (to discern cliques in the network)
  • Number of triangles (a friends of friends measure)
  • Modularity Class (clusters)
  • Degrees (the number of connections)

Sizing

Node sizing is another key element of effective graph design. In this case, there were a few options I could pursue for node sizing – the number of seasons played (I used this in the 2014 graphs), one of the various centrality measures we calculated, or the number of degrees (connections) an individual player possesses. After computing each of these statistics, I eventually decided to use the number of degrees as a representation of influence in the graph. Visually, I want to show how many other players a single individual is related to, and using node size is an effective means of doing so.

Summary

Our final graph in Gephi is shown below; the eventual web-based version will differ slightly and include additional functionality, but that’s for another post.

red_sox_20180610

Next Post

My third and final post in this series will address exporting this graph to the web using the sigma.js plugin, and making some additional customization to the web version. Thanks for reading, and see you soon!

 

 

 

 

 

 FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

A Network Graph of MLB All-Stars

With the Major League Baseball All-Star game this week, I was suddenly possessed by an urge to create a network graph of all the players who have been selected for the game from 1933-2015. The goal, as with most of my network graph efforts, was to see where the data took me, and what stories it might tell. Along for the ride was my trusty companion Gephi, version 0.9.1 in this instance. Data for this exercise comes, as it so often does, from the Lahman baseball database, which happens to have a nice table with all the necessary all-star information.

The challenge inherent in this data, as with many temporal datasets, is to create an interesting graph that isn’t entirely driven by the time element, in this case as baseball seasons. Some otherwise excellent layout algorithms tend to turn this sort of data into long, worm-like displays that are not visually appealing. I could just as easily use a timeline if that were the goal of the visualization. So in an effort to balance aesthetics with the underlying data, I finally settled on the radial axis layout in Gephi. With this layout, we have the ability to create multiple axes radiating from the center of the graph, grouped by something meaningful. In this case, that turns out to be the modularity class, a sort of clustering mechanism that groups nodes together based on common or similar characteristics.

After some trial and error, I wound up with 13 distinct classes, a nice manageable number for this type of display. The end result can be interpreted as some sort of exotic, colorful starfish, or perhaps as a multi-colored fireworks display. In any event, I believe it tells an interesting story in a visually appealing manner, and allows for understanding the common threads within each cluster of players. Here’s the complete graph layout:

allstar_graph_base

I’ll spend the rest of the post with some quick analyses of each group, and then point you to the entire graph in interactive form so you can discover your own patterns and learn more about the all-star connections of individual players. I’ll also provide a quick overview for how to read the sidebar output for the graph when you interact with the data.

Our 13 clusters (you can think of them as cohorts) tell us some interesting things about the history of all-star participants. Let’s walk through each of the 13 (numbered 0 through 12) to learn more. The clusters begin with 0 (in green) at the upper left and move counter-clockwise around the graph. Each is rank ordered from small to large radiating out from the center, so the member with the most years as an all-star will be at the tip of each group. That individual will serve as our focal point in each of the following screenshots, followed by a brief overview of other members of the cohort.

First up is our Cohort 0, headlined by Miguel Cabrera, with 10 selections through 2015. Obviously, this would appear to be a cohort of current or recent all-stars based on Cabrera’s appearance. We can easily navigate the graph to see if that’s the case. Who else is prominent in the group? Yadier Molina, Matt Holliday, and Robinson Cano, to name a few, all big name stars for most of their careers. How about at the low end of the spectrum, players with a single all-star selection? Here’s where we find the likes of Billy Butler, R.A. Dickey, and Melky Cabrera. All long-tenured, noteworthy players, but certainly not in the same category as the first group.

allstar_0

Cohort 1 takes us on some time travel, with Johnny Mize as the representative star, also with 10 all-star selections and Hall of Fame membership as well. Joining Mize in the group are Bobby Doerr, Vern Stephens, Joe Gordon, and Bob Feller, all Hall of Famers with the exception of Stephens. At the other end of the cohort, each with one appearance, are Oscar Grimes, Red Barrett, and Nick Etten, among others. Based on the career arcs of the stars in this group, we could characterize it as primarily a 1940s-based cohort, certainly with overlap into the surrounding decades.

allstar_1

Derek Jeter is our icon for Cohort 2, so it figures to be a group that immediately precedes the Cabrera-led Cohort 0. Perhaps the focus here will be on stars from the early 2000s, at the center point of Jeter’s long career. Mariano Rivera, Albert Pujols, and David Ortiz are among the top stars here, confirming the hypothesis that this group is largely post-2000 in nature. Among the lesser knowns with a single selection each are Gil Meche, Joe Crede, and Ryan Ludwick.

allstar_2

Cohort 3 is headed up by the legendary Stan Musial, whose career covered the entirety of the 40s and 50s. Given that Cohort 1 was largely concentrated on players from the 40s, we might anticipate more of a skew towards 1950 and beyond. We’ll see in a moment if that’s true. Next to Musial we have Ted Williams and Warren Spahn, two more whose careers spanned both decades, so perhaps we have players here with greater longevity compared to the Mize cohort. Let’s go a bit deeper, where we find Roy Campanella, Larry Doby, and Robin Roberts, all with career pinnacles primarily in the 50s. So while there will certainly be connections across the two groups, Cohort 3 does appear to span more of the 1950s compared to Cohort 1.

allstar_3

With Cohort 4 we see a very large group fronted by all-time hits leader Pete Rose. So we could be focused on the 1960s or 1970s here; Rod Carew, Reggie Jackson, and Mike Schmidt, Hall of Famers all, are included, so the 1970s would seem to be the dominant theme. Perhaps we shouldn’t be surprised at the size of this group, as expansion in the 1960s afforded more players the opportunity to become an all-star. A few interesting figures can be found at the single game end of the radian – Bob Horner, Kent Hrbek, and Lonnie Smith, all with at least momentary brushes with greatness, but good enough to qualify for just one all-star nod apiece.

allstar_4

Joe DiMaggio is the lead for our next group, joined by the likes of Mel Ott, Bill Dickey, and Joe Medwick. The skew is toward the late 1930s and beyond; many members of this cohort would have had limited all-star game opportunities, as the game originated only in 1933. As proof of this, we find Hall of Famers Heinie Manush, Goose Goslin, and Kiki Cuyler at the low end of the radian, each with just a single all-star credit.

allstar_5

Hall of Famer Al Kaline heads up Cohort 6, so we know we’re in either the 50s or 60s, or more likely, a bit of both decades. Along with Kaline we have Mickey Mantle, Yogi Berra, and Ernie Banks, each of who had multiple appearances covering both decades. At the more modest end of the group we find Rocky Bridges, Bob Cerv, and perhaps surprisingly, the slugger Joe Adcock, each with just a single season as all-stars.

allstar_6

Cohort 7 is our one group that’s difficult to explain. Our graph modularity settings forced a small cohort of just nine players; Hank Aaron with 21 seasons, and eight others with a single season each. Consider this one a bit of a fluke.

allstar_7

The great Willie Mays leads the relatively small Cohort 8, joined by both Brooks and Frank Robinson, as well as Roberto Clemente. This would indicate a cohort of players who began in the 1950s and perhaps peaked in the 60s. Lower down the list this group features lesser known players like Tito Francona, Dick Howser, and Joey Jay, all one season all-stars.

allstar_8

Cohort 9 is a very large group led by Barry Bonds, Ivan “Pudge” Rodriguez, and Ken Griffey. Here we have three stars with illustrious careers launched around 1990 and extending into the new millenium. At the other extreme we find one-timers such as Jay Buhner, Mark Grudzelianek, and Lance Johnson.

allstar_9

Cohort 10 is a mid-sized group headed up by Alex Rodriguez, and featuring Manny Ramirez, John Smoltz, and Scott Rolen. This would appear to be a very similar group (if a few years later) to the prior cohort, and we should expect to find a great number of crossover connections between the two, as they each cover players from similar time periods.

allstar_10

Down to the final two groups! Cohort 11 is led by Cal Ripken, Ozzie Smith, and Roger Clemens, all major impact players in both the 1980s and 90s. This is another group featuring players who were quite talented but dented the all-star ranks just a single time. Among these were outfielders Jesse Barfield and Kevin Bass, and pitcher Teddy Higuera.

allstar_11

With Cohort 12, we see a large group led by 18-time all-star Carl Yastrzemski, supported by Johnny Bench, Tom Seaver, and Harmon Killebrew. These are all players who were at their productive peaks in the late 1960s through mid-1970s, an era where the National League was dominating the annual game. Chuck Hinton, Jerry Lumpe, and Joe Azcue are among those who can claim a single trip to the all-star game as a career highlight.

allstar_12

Quick note on the sidebar – you’ll see a few measures which I won’t go into too deeply here; there are 3 centrality measures (influence within the network), eccentricity (the number of steps to traverse the network, think six degrees of Kevin Bacon), and size, reflecting the number of seasons as an all-star. The key part of the sidebar lies in the listing of all connected players to the one currently selected, with numbers indicating the number of games as co-all-stars. Use these links to navigate through the network quickly. It’s fun!

So that’s it for our brief analysis. Now it’s time to explore for yourself by opening the MLB All-Star Network visualization. Be patient as the data loads; once it has cached the graph should be fairly fast at zooming, panning, and allowing you to explore to your heart’s content.

FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

Exploratory DataViz Part 2

Having discussed some of Exploratory’s cool features in a prior post, I thought it would be fun to continue the exploration using JSON data as a starting point. I happen to have a fair amount of JSON on hand, thanks to a series of network graphs produced using Gephi and sigma.js, so why not put it to use with Exploratory and start creating a new dataviz?

If you have previously worked with JSON, you’re no doubt aware that it can be a bit fickle – miss a bracket or brace in one place and the entire file fails to load a visualization. However, knowing that my JSON has been successful in producing network graphs (see here for examples), I figured it was worth a shot with Exploratory.

To begin, start with the local import option, selecting the json option, and pointing it to your local file. Give it a name, run the process and cross your fingers! After a few seconds, I’ve got my results, and Exploratory has done a good job categorizing the data:

exploratory_2.1

Since this is network data, we have nodes and edges, as well as any additional attributes, such as color or size. Exploratory has picked up those groupings, first the edges, and now the nodes.

exploratory_2.2

Finally, the attribute values:

exploratory_2.3

Since we’re satisfied with the import, we can move on to the summary data, which in this case doesn’t make a whole lot of sense. No matter, let’s see what can be done with some charts and analysis.

exploratory_2.4

To start with, we have x and y values associated with each node, which sounds like a perfect candidate for a scatter plot. We add the x value to the x-axis (how convenient was that!), the y value to the y-axis, node size as the Size attribute, and finally the Eccentricity attribute for color. FWIW, eccentricity is not a measure of flakiness, but rather the distance between the most remote points in a graph. This is where the six degrees of separation (or Kevin Bacon, take your pick) concept comes into play; an eccentricity value of 6 equates to 6 degrees of distance. Here’s our result:

exploratory_2.5

Not bad, eh? We can also hover over each node to see who it is (after adding Id to the Label field):

exploratory_2.6

We still have a lot of activity in a limited space, so now let’s use a simple filter (see the command line at top) to grab the top 50 values, and see the results:

exploratory_2.7

Now let’s create a new branch to explore further. I would like to sort my dataset using the Betweenness Centrality attribute, but there’s one problem – it’s a character value at the moment, so it doesn’t sort numerically. No matter, we can fix that easily using the Mutate command to convert the variable type. This can be seen in the right margin, where Exploratory conveniently stores all actions. Now we can sort our values in descending order to understand who is most influential in the network (at least by this measure). FYI – Betweenness Centrality tells us which nodes others must pass through most frequently to connect elsewhere within the network. Typically, but not always, it is someone centrally located within a network; sometimes it may be a less influential character (Pedro Borbon in this case) who connects more distant groups to one another.

exploratory_2.9

So there you have it, another quick walk-through with Exploratory. Before I sign off, here’s the live scatter plot you can play with via the Exploratory server. Be sure to use the simple zoom features to traverse the chart!

FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather

Open Source Data Viz: Exploratory

It’s absolutely a great time to be alive and involved in data viz, courtesy of the wealth of exceptional open source projects. Several recent open source discoveries are currently on my radar, and worthy of further exploration. Over the next few weeks I’ll examine a few of these options, using baseball data (of course) to illustrate the possibilities within each application. Specifically, we’ll take a look at Trelliscope, bokeh, rbokeh, and Exploratory, and provide some insight and examples into how each of these projects function. This post will focus on Exploratory, an exciting new tool from Kan Nishida.

Exploratory is another R-based application that leverages a multitude of R capabilities while providing its own intuitive interface. While still in beta testing, Exploratory appears to have a very bright future as a powerful visualization tool that allows non-coders to tap into the enormous power of R. The ability to harness a considerable portion of the R language through Exploratory’s GUI is a powerful option for those (like me) with limited R experience and expertise.

Exploratory has a very clean, intuitive interface that may feel a little unusual to long-time R users accustomed to multiple panes and workspaces. Yet beneath the surface, it possesses considerable power, as we’ll see in this tutorial. To start our process, we’ll need a data frame, a familiar object for R users. Let’s begin by examining our data frame options.

First up, we can load a local source file in a variety of formats:
exploratory_local
Some of the usual suspects are here – text and Excel files, but we also have the ability to load json data as well as some of the more prominent statistical formats including SAS and SPSS data. Very cool. We’ll come back to this later.

Now let’s see the remote options:

exploratory_rscript

Great! Not only can we gain direct access to MySQL databases (a huge plus for me), Exploratory also provides access to a diverse range of option including Twitter search, MongoDB, and web scraping. We’re going to look at some specific examples later, but for now, here’s a glimpse of the MySQL data import window:

exploratory_mysql

As with the entire app, the design is clean and intuitive. In a bit, I’m going to load details into this window so we can test the MySQL functionality.

A third import option exists in the availability to access any existing R scripts you may have previously created:

exploratory_rscript

I’m not going to spend a lot of time here, due to the fact that I don’t have a lot (any?) of personal scripts. However, for seasoned R coders, this seems like a great feature.

Now let’s walk through some of Exploratory’s capabilities using a MySQL connection. The MySQL setup is really easy – just fill in your database connection parameters and you’re good to go. Here’s what it looks like for this example, with a few fields grayed out for security reasons.

MySQL connection

Once the connection is established, Exploratory will display the initial rows in the dataset. If we click the Run button, our data is pulled into a Summary view, where every variable in the data is summarized. This is a great way to see if our data looks as expected, and allows us to determine if the correct variable type (integer, date, etc.) is associated with each field.

exploratory_summary

If everything looks good, we can move on to the Table option, which will resemble the MySQL view we just saw. No surprises here:

exploratory_table

If we’re satisfied so far, then it’s time to move on to the fun aspects of Exploratory. For me, this starts with viewing data using the Charts selection. As of this writing, there are 10 chart options (two are actually mapping selections for geo data) including bars, scatter plots, box plots, heatmaps, and more. For me, this is a real strength of Exploratory; the ease with which we can see plots of our data is great! Here I’ve chosen a couple stat fields (at bats (AB) and runs (R)) to illustrate the scatter plot functionality.

exploratory_chart

The charts are clean and attractive, and provide some additional options. For scatter plots, labels can be added via a simple check box. This permits me to add hover labels, as seen below:

exploratory_chart_label

Pretty nice so far, don’t you think? But as the old commercials used to say ‘wait, there’s more’. The considerable power of R lies beneath the surface, enabling statistical testing, filtering, data manipulation, and so much more. Here’s a glimpse of just a handful of available options for working with your data:

exploratory_options

Let’s select a filter option, where we’ll reduce the data to look only at players age 30 or greater. One of the other great aspects of Exploratory is it’s exposition of R code. We can use the built in menu commands while viewing the actual R code. For experienced R users, the functions can be entered directly in a text box, and for us less experienced coders, we can learn on the fly by seeing the output.

exploratory_filter

Now we see the same scatter plot populated with players 30 and older.

Another great feature is the ability to create branches within a project. This facilitates going down multiple paths within one workspace, rather than having to retrace our steps or rerun charts each time something changes. All we need to do is click the branch button, and a new tab is created for us. Very simple and intuitive, as is virtually everything in Exploratory.

exploratory_branch

In this instance, we’ve elected to run a correlation on the chart variables in our main flow, while we create a new box plot in our branch.

exploratory_branch_chart

I’ve been very impressed thus far with Exploratory, and have barely scratched the surface. My next step will be to create some real content that can be shared in a post or via some new visualizations on the site. I love the ease of accessing my data via MySQL, and immediately having the ability to create plots, filter data, and run statistical explorations.

FacebooktwitterlinkedinrssFacebooktwitterlinkedinrssby feather
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmailby feather