I've previously posted on Orange, the wonderful open-source project for statistical analysis and visualization. My latest foray uses a variety of charts to examine patterns of team success, as measured by wins in a season. We'll start with scatterplots.
Scatterplots, for the uninitiated, are two-dimensional charts that show the relationship between two variables, say hits on one axis and runs on the other. They are a great tool for quickly spotting correlations (e.g., are more hits consistently associated with more runs over the course of a season?). In Orange, we have the luxury of adding third and fourth variables to the picture, using the color and the size of each marker. Here's an example, using walks (BB) and home runs (HR) on the axes, with wins setting the size and runs the color of the markers.
Note that the sizes of the circles are not proportional to the number of wins; they are scaled across the range of the data. This is helpful in spotting patterns, but could be considered a distortion if one adheres closely to Edward Tufte's advice that the size of a graphical element be directly proportional to the quantity it represents.
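To make the difference concrete, here is a minimal Python sketch comparing range-based scaling with strictly proportional scaling. I don't know the exact formula Orange uses internally, so the function names and the 4-to-20 marker size limits are my own assumptions; the win totals are the ones quoted in this post.

```python
def range_scaled_sizes(values, min_size=4.0, max_size=20.0):
    """Map values linearly onto [min_size, max_size] using the data's own range.

    The size limits are illustrative assumptions, not Orange's settings.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: give every marker the midpoint size.
        return [(min_size + max_size) / 2.0 for _ in values]
    span = hi - lo
    return [min_size + (v - lo) / span * (max_size - min_size) for v in values]


def proportional_sizes(values, max_size=20.0):
    """Make each size directly proportional to its value (assumes positive values)."""
    hi = max(values)
    return [v / hi * max_size for v in values]


# 2010 win totals mentioned in this post.
wins = {"Pirates": 57, "Mariners": 61, "Orioles": 66, "Giants": 92}

scaled = range_scaled_sizes(list(wins.values()))
proportional = proportional_sizes(list(wins.values()))

for team, s, p in zip(wins, scaled, proportional):
    print(f"{team:10s} wins={wins[team]:3d}  range-scaled={s:5.2f}  proportional={p:5.2f}")
```

With range scaling, the Giants' marker comes out five times the size of the Pirates' marker, even though 92 wins is only about 1.6 times 57. That exaggeration is what makes the pattern easy to see, and also what a strict reading of Tufte would object to.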
The scatterplot lets us spot some easy correlations. Notice that virtually all of the teams with high walk totals (x-axis) are also big winners: the Yankees, Rays, Braves, and Red Sox all won 89 or more games for the season. So apparently, with the exception of the Diamondbacks, walks were a good predictor of wins (we're not accounting for the pitching side of the equation here, which derailed the Diamondbacks).
On the y-axis, home runs also appear to have a positive relationship with wins, although perhaps a slightly weaker one than walks. And woe to any team sitting low on both axes; this is where the worst teams in baseball wound up in 2010, including the 57-win Pirates, the 61-win Mariners, and the 66-win Orioles. A lack of on-base ability, coupled with a lack of home run power, seems to be a strong predictor of failure, even without knowing anything about these teams' pitching.
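If you'd rather put a number on "a positive relationship" than eyeball it, the Pearson correlation coefficient is one common choice. The sketch below is a plain-Python version; the function name is my own, and the sample walk and win totals are made-up placeholders, not the actual 2010 figures.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT real 2010 team totals).
sample_walks = [420, 470, 510, 560, 620]
sample_wins = [62, 75, 71, 88, 95]

print(f"walks vs. wins: r = {pearson_r(sample_walks, sample_wins):.2f}")
```

Running the same function on home runs and wins would give a rough way to check whether walks really do track wins more closely than home runs do.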
Let's change the axes for another view; we'll focus on defense in this case, putting errors on the x-axis and double plays on the y-axis.
Notice any patterns? Our underachieving teams are heavily skewed to the lower right of the chart, committing lots of errors while turning few double plays. The pattern is perhaps less obvious than in our first example, but it is intuitive. We also see a pair of teams, the Tigers and Braves, who appear to have overcome some of their proneness to errors by turning a lot of double plays. At the other extreme, we see the Giants, with very few double plays but also the fewest errors in MLB, reaching 92 wins in spite of their low run-scoring ability. Obviously, pitching helped, but it appears their defense provided relatively error-free support to their mound corps.
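One way to make "lower right of the chart" less subjective is to split each axis at its median and label the quadrant each team falls in. The sketch below does that; the median split is my own choice rather than anything Orange does, and the team names and totals are placeholders, not real data.

```python
from statistics import median


def quadrant(x, y, x_mid, y_mid):
    """Name the chart quadrant for a point, given the split value on each axis."""
    horizontal = "right" if x > x_mid else "left"
    vertical = "upper" if y > y_mid else "lower"
    return f"{vertical} {horizontal}"


# Placeholder (errors, double plays) pairs for illustration only.
teams = {
    "Team A": (130, 125),
    "Team B": (75, 160),
    "Team C": (100, 140),
    "Team D": (120, 170),
}

errors_mid = median(e for e, _ in teams.values())
dp_mid = median(dp for _, dp in teams.values())

positions = {
    name: quadrant(errors, dps, errors_mid, dp_mid)
    for name, (errors, dps) in teams.items()
}

for name, position in positions.items():
    print(f"{name}: {position}")
```

In this layout, "lower right" means more errors than the median and fewer double plays than the median, which is where the underachievers cluster in the chart above.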
Finally, we'll take one more view – home runs (HR) versus home runs allowed (HRA).
Once again, there appears to be a fairly clear distinction, with the lower half of the display littered with the weaker teams that tended to surrender more homers than they hit themselves. Teams with a large positive home run differential (upper left of the chart) tended to have high win totals, although there were a few teams (Phillies, Rangers, Rays) with low differentials who still fared well, due to their strong pitching corps.
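The home run differential itself is simply home runs hit minus home runs allowed. Here is a small sketch that computes it and ranks teams by it; the function name is mine, and the teams and totals are placeholders rather than actual 2010 numbers.

```python
def hr_differential(hr, hra):
    """Home runs hit minus home runs allowed; positive means the team out-homered opponents."""
    return hr - hra


# Placeholder (team, HR, HRA) rows for illustration only.
teams = [
    ("Team A", 210, 150),
    ("Team B", 120, 180),
    ("Team C", 160, 155),
]

# Sort from the largest positive differential to the largest negative one.
ranked = sorted(teams, key=lambda row: hr_differential(row[1], row[2]), reverse=True)

for name, hr, hra in ranked:
    print(f"{name}: HR={hr}, HRA={hra}, differential={hr_differential(hr, hra):+d}")
```

Adding a wins column to each row would make it easy to see which teams, like the Phillies, Rangers, and Rays above, won a lot despite a small differential.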
These are just a few examples created with Orange; I'll get these and many others out to the Collections page on the site as time permits. In summary, scatterplots, regardless of the tool used to build them, are a great way to quickly see patterns in the underlying data and how they relate to a dependent variable such as wins.