Thursday, April 30, 2009

Scoring Chances: Part I of Many

For those who don't know, Dennis King tracked scoring chances for the Oilers this past season and posted his results at mc79hockey.com after the games. This was outstanding work, and many others and I will surely be using this information for weeks to come.

Scott at Gospel of Hockey compiled this data on a few occasions; I'm using his data from mid-season and the final sums in this post. This link will generate the NHL.com player stats for the games in question, automatically skipping over the handful of games that Dennis missed. As usual, change the values in the URL to look at different segments of the season.

So to pick a big apple from the bottom of the tree, I'm taking a look at the effect of players on the quality of the scoring chance on the Oiler goal. Dennis used a binary system, meaning he either marked down an opportunity as a scoring chance or not at all; there was no grading of the quality of the scoring chance. This is the way that every team tracks them, as far as I know, and I think that this is wise.

So if we presume that Dennis was fair and consistent, and the players on the ice were not having any impact at all on the quality of scoring chance against, then we can model what we expect to see, which is:

  • Over the first half of the season, using the top 20 players by ice time (Stortini being the cutoff point), the average will be 11.9 goals against per 100 scoring chances against. And we should see a dispersion of results (measured here with sample standard deviation) of about 3.1.
  • We in fact see a standard deviation of 2.9. So, check.
  • Over the second half of the season, the average will be 12.2 goals against per 100 scoring chances against. And we should see a dispersion of results (measured here with sample standard deviation) of about 2.8.
  • We in fact see a standard deviation of 3.2. Check.
  • Over the season as a whole, the average will be 12.1 goals against per 100 scoring chances against. And we should see a dispersion of results (measured here with sample standard deviation) of about 2.0.
  • We in fact see a standard deviation of 1.5. Check.
  • The players' rates of 'goals against per scoring chance against' should not repeat from the first half of the season to the second. As a convenient measure of this I used Pearson correlation.
  • We in fact see a small negative relationship between the results from the front half and the back half. So no repeatability at all. Check.
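For anyone who wants to kick the tires on the luck-only model, here is a minimal sketch in Python. It is not the spreadsheet behind the numbers above. It assumes every one of the 20 players faces the same number of chances against per half season (150, a made-up placeholder, since the real counts differ by ice time) and that every chance goes in with the same probability (0.12, rounded from the averages above). The function names are mine.

```python
import math
import random
import statistics


def simulate_half(rng, n_players, chances, p_goal):
    """Goals against per 100 chances for each player in one half season."""
    rates = []
    for _ in range(n_players):
        goals = sum(1 for _ in range(chances) if rng.random() < p_goal)
        rates.append(100.0 * goals / chances)
    return rates


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either list has no spread."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def luck_only_model(seasons=500, n_players=20, chances=150, p_goal=0.12, seed=1):
    """Average first-half spread and half-to-half correlation over many fake seasons.

    chances and p_goal are hypothetical placeholders, not Dennis's actual counts.
    """
    rng = random.Random(seed)
    sds, corrs = [], []
    for _ in range(seasons):
        first = simulate_half(rng, n_players, chances, p_goal)
        second = simulate_half(rng, n_players, chances, p_goal)
        sds.append(statistics.stdev(first))  # sample standard deviation
        corrs.append(pearson(first, second))
    return statistics.fmean(sds), statistics.fmean(corrs)


mean_sd, mean_corr = luck_only_model()

# Binomial theory for a single player's rate per 100 chances: about 2.65 here.
theory_sd = 100.0 * math.sqrt(0.12 * 0.88 / 150)

print(round(mean_sd, 2), round(theory_sd, 2), round(mean_corr, 3))
```

With skill switched off entirely, the simulated spread sits close to the binomial figure and the half-to-half correlation averages out near zero, which is the pattern the checks above are looking for.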

For the season as a whole the results are shown below:


By way of example, for every 100 EV scoring chances that the opposition had when Smid was on the ice, 12 goals were scored. He was actually fantastically lucky in the first half of the season with a GA rate of only 4 per 100 chances, and hit a stretch of snake eyes in the back half of the season resulting in that rate quadrupling. Get enough buggers rolling the dice and that sort of thing is bound to happen to one of them. If we divided up the games by odd and even game numbers, a similar thing would be there for someone else, though it's impossible to say who it would be, because randomness is random.

As you can see, the players are grouped tightly together. And the variation between them is small and accounted for entirely by expected chance variation.

I could have posted 9 more charts as generated by the random model, and they look identical in form, though obviously some players have better luck in some simulated seasons than others. And if I had done that, I expect that about 1 in 10 people would have been able to pick out the real chart from the random clones.

And this statistic doesn't repeat at all, likely because it is both honestly measured and almost entirely luck, or near enough to pure luck that it would be extremely difficult to hear the tiny skill component squeaking through the noise.

Now if I had been playing defense for the Oilers last season this wouldn't be the case: the quality of the scoring chances against would be through the roof, and it would skew this whole picture. But I didn't play for the Oilers; that was done by practised and trained professional hockey players.

As another check for homerism and bias, we expect the EV scoring chances per 100 shots-directed-at-net (Corsi+) to be similar both for and against. And they are:
35.6 For.
35.7 Against.

Damn.

So we should be good to go. This seems like a reasonable starting point, a check to make sure that the world is still round. Though it may well seem obvious to some and unbelievable to others.

And a couple of random thoughts to tag onto the end of this rambling post, based on my sense of it after kicking at this stuff a bit over the past couple of days:
  • Mike Babcock was right, possession is everything.
  • It's probably fairer to use 'shots directed at net while on the ice', instead of time, as a leveling tool, especially when comparing players on different teams. This goes for any even strength statistic.

Monday, April 27, 2009

Projecting: Andrew Cogliano

It's roster projection time in Oiler land - a time when many fans get to dream big about the off-season. Every day that Marian Hossa goes without a contract extension, Jay Bouwmeester goes without signing elsewhere, or Jaromir Jagr goes without promising an NHL return to the Penguins is a day that these sorts of home-run pipe dreams are still, at their core, possible.

And so we dream.

Over at LT's the other day, the suggestion was made about a trade for Jaroslav Halak. His merits notwithstanding, a key feature of all of the proposals tossed out there was that the Edmonton Oilers cannot go forward with all of Hemsky, Gagner, Cogliano, and O'Sullivan in their top six. Hemsky's the franchise today, Gagner is the franchise tomorrow, and O'Sullivan just got here so of course that makes #13 the go-to guy for any and all trade proposals. At LT's blog I argued that it is vital for Edmonton, optics-wise, to show and to prove to other players in the NHL that they can hold onto their young stars... re-build Edmonton's reputation one player at a time, so to speak. I stand by that stance and suggest that unless it's a package for someone truly top flight, players-of-the-future like Andrew Cogliano need to stay Oilers.

But is Andrew Cogliano really capable of being an impact player in the NHL?

To attempt an answer to this question I set the "impact player" bar for 2008/2009 at 0.9 points per game, a convenient figure that keeps Ales Hemsky's season just in the conversation and Todd White's just out. There were 33 players who scored at this pace or better this season, ranging from Ovechkin at the top to Eric Staal at the bottom, so I think it's a fair assessment of the true impact players of this league.

This is where a smart person would probably have stopped - I think it's fair to say at first blush that Andrew Cogliano's ceiling is lower than Eric Staal's - but I decided to continue my little experiment anyway. Next up is a figure representing the 21 year old seasons of every 2008/2009 impact player, save two:*

The numbers along the left represent the total number of points each player scored, or would have scored via Desjardins, at the NHL level during their 21 year old season. The length of the bar represents how many players fit into each particular point range. For example, two players (Crosby and Malkin) scored between 100 and 120 points during their 21 year old season.

What is clear is that most players who went on to become impact players by my definition scored at least 40 points over a prorated 82 NHL games at age 21. Our boy Cogliano doesn't appear too far off the pace with 38, but just FYI the mean of these numbers was 57 points (and the median was 53, if you'd rather not let Crosby and Malkin pull up the average score).

The statistical logic I am about to use is a little bit backwards and I hope someone with more knowledge can correct me if it leads to a false conclusion. If we pretend that the average number of points put up by every impact player ever was not in fact 57 points but Andrew Cogliano's 38 points instead... and assume that this 31 player group is just one 31 player group out of thousands of 31 player groups I could have looked at... (and then do some math)... the conclusion that would be drawn is that 57 points is too high an average for these players to be accounted for by the ordinary variation you'd get because every group of 31 would have a different average. I.e., if 38 points is the average age 21 total for EVERY impact player in history, the supremely high 57 point average of the 2008/2009 group can NOT be explained by chance alone.
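For the curious, the "(and then do some math)" step reads like a one-sample test of a mean, and a rough sketch of that looks like the Python below. This is my reading of the logic, not necessarily the calculation that was actually run. The spread of the 31 point totals isn't given in this post, so the standard deviations tried here are purely hypothetical placeholders; the only real inputs are the 57 point group average, the 38 point pretend average, and the 31 players.

```python
import math


def t_statistic(sample_mean, null_mean, sample_sd, n):
    """How many standard errors the group average sits from the pretend average."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - null_mean) / standard_error


# sample_sd values are hypothetical: the real spread is not reported in the post.
for hypothetical_sd in (15, 25, 35):
    t = t_statistic(sample_mean=57, null_mean=38, sample_sd=hypothetical_sd, n=31)
    print(hypothetical_sd, round(t, 2))
```

The wider the real spread of point totals, the smaller the statistic, so the strength of the conclusion depends on a number that would have to come from the actual 31 seasons.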

Does this mean that Cogs is doomed? Not necessarily. The 7 players that fall in the 20-40 range are Alexander Semin, Patrik Elias, Zach Parise, Daniel Sedin, Henrik Sedin, Mike Richards, and Mike Cammalleri. This isn't the worst company to be a part of, though these guys are obviously the cream of the crop when it comes to <40 point seasons at age 21.

Thoughts? Has this shed any light at all on guessing a projection? Can anyone see important flaws or suggest a better way of going at the problem? Thank you for anything that helps the process!

*Players excluded were Mike Green for being a (freakish) defenceman and Martin St. Louis who was playing in the ECAC(!!) at age 21.

Wednesday, April 08, 2009

The Poetry of Logical Ideas

The title of this blog post comes from an Albert Einstein definition of math. I don't think that this was Einstein's best quote on the subject though; this was:
Mathematics are well and good but nature keeps dragging us around by the nose.
That is an important thing to remember, methinks.

I like knowing what makes things tick, and that includes sports. And if we truly understand how something works, then we can quantify it, express it in numbers. If something is beyond the grasp of math, then it is beyond the grasp of human comprehension, and I have trouble believing that hockey and baseball are really that complicated.

In sports in general, and the NHL and MLB in particular, chance plays a large role in the game. We always called it puck luck when I was growing up, and I think we all agree that the score on the clock in a hockey game is often not a fair reflection of the play in the game. Stuff happens. Also, I think we all sense that over the course of a season the bounces tend to even out for players and for teams. But how much do they settle out? Is 82 games enough for the bounces to even out completely? If not, how much noise is left?

In this part of the internet, we all watch a lot of NHL hockey, and spend a lot of time talking about it. And a lot of us like to use statistics to back up our observations, because it usually makes for the most compelling argument. In doing so, along the way we sometimes find out things that we weren't expecting. We notice things in the analysis of stats that make us watch for them in the games, and we notice things in the games and strive to find a way to measure them using the data we have available.

But trying to understand hockey better with statistics is a doomed venture if the role of chance in the game is not accounted for. And for all the time we spend on this hobby, it's worth understanding how luck works, at least as well as we can. Call it 'chance variation' if you're a stickler for terminology.

So in this post I'm going to take a run at explaining the super duper simple binomial distribution. It's dead easy, and with this you should be able to follow the reasoning of most of the Oilogosphere's statzis, and it should be all the math you need to truly understand the vast majority of Bill James' more complex work as well, for the many here who like MLB.

YAHTZEE

What are the chances of rolling a pair of sixes with one throw of the dice in Yahtzee? No more and no less, just a pair?

If you look at the dice in the picture, they've been lined up on a table that has the numbers 1 through 5 written on it, based on how far away from you they landed. And we would calculate the chances of rolling two sixes (with a six in the first position and a six in the third position) this way:

  • chances of rolling a six with the die in the first position: 1 in 6, i.e. 1/6
  • chances of rolling something other than a six with the die in the second position: 5/6
  • chances of rolling a six with the die in the third position: 1/6
  • chances of rolling something other than a six with the die in the fourth position: 5/6
  • chances of rolling something other than a six with the die in the fifth position: 5/6

Multiply those together and you have your odds, easy as beans:

1/6 * 5/6 * 1/6 * 5/6 * 5/6 = 125/7776

which is .016, or 1.6%, or roughly 61:1 odds against (about 1 chance in 62)

To make it easier on the eyes we'll say that p = 1/6. And the chance of rolling a non-six is therefore 1-p. So the equation can be written as a neat and tidy:
p^2 * (1-p)^3
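If you'd rather let a computer do the multiplying, here is a small Python sketch of that same product (p squared times 1-p cubed), using exact fractions so nothing gets rounded along the way. The variable names are just mine.

```python
from fractions import Fraction

p = Fraction(1, 6)                     # chance of a six on one die
specific_order = p**2 * (1 - p)**3     # six, non-six, six, non-six, non-six

print(specific_order)                  # exact fraction
print(float(specific_order))           # same thing as a decimal
```

This is still only the chance for one particular arrangement of the dice on the table, with the sixes in the first and third spots.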

At my house we always played Monopoly with $500 in the middle of the board as a Free Parking square windfall. We also scored Scrabble with extra points for words within words, which made counting a bitch; I wouldn't recommend that. A lot of families have variations on board game rules, but I don't think that anyone is mad enough to have a Yahtzee rule like "you can only count a pair of sixes if they are the nearest and third nearest dice from you". If you did, well then you're done here, the rest of us have to do a smidge more math.

Next we figure out how many possible orders these dice could have landed in.
  • We look at the first die, the six in the first position ... it could have been positioned in four other spots, so five total.
  • And the next die, the three, it could land in any of the four remaining spots that are left, i.e. not occupied by the first six.
  • And the next die, the other six, it could land in any of the three remaining spots that are now left.
  • And the next die, the two, it could land in any of the two remaining spots that are now left.
  • And the last die has one spot left to go in.

So the number of possible ways that these dice could be arranged is 5 * 4 * 3 * 2 * 1 = 120
You can also write that as 5! to save your typing fingers. And if you really want to get your geek on, call that five factorial.

If you don't believe me, get 5 different coins and see how many different ways you can arrange them.

Now since we don't give a damn, for Yahtzee purposes, which six is which, we divide that 120 by 2. Or more properly 2!

And since we also don't care which order the non-six dice are lying in, we divide by 3!. 3! is 3 x 2 x 1 = 6. Because there are six ways to arrange three items. Again, grab three coins if you don't believe me, though hopefully you learned your lesson farting about with five coins, that probably took a while.
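If shuffling coins around the kitchen table doesn't appeal, a few lines of Python will do the counting by brute force. This is just a sketch: the coin names are stand-ins I made up for five different objects, and "x" stands for any non-six die.

```python
from itertools import permutations
from math import factorial

# Five different objects: every ordering counts separately.
coins = ["penny", "nickel", "dime", "quarter", "dollar"]
all_orders = list(permutations(coins))
print(len(all_orders), factorial(5))

# Two sixes and three non-sixes: orderings that look the same collapse together.
dice = ["6", "6", "x", "x", "x"]
distinct_patterns = set(permutations(dice))
print(len(distinct_patterns), factorial(5) // (factorial(3) * factorial(2)))
```

The first count is the 120 from 5!, and the second is what is left after dividing out the 2! ways to swap the sixes and the 3! ways to shuffle the non-sixes.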

So, finally, the chances of rolling exactly one pair of sixes with the throw of five dice:

5!/(3! * 2!) * p^2 * (1-p)^3

Which works out to .161, or 16.1% if you prefer.
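As a sanity check that doesn't lean on the formula at all, this Python sketch simply lists every possible roll of five dice and counts the ones showing exactly two sixes. Note that, like the formula, it doesn't care what the other three dice are doing, so a roll with two sixes and three fours still counts.

```python
from fractions import Fraction
from itertools import product

# Every possible outcome of five six-sided dice, in order.
rolls = list(product(range(1, 7), repeat=5))

# Keep only the rolls with exactly two sixes, in any positions.
hits = sum(1 for roll in rolls if roll.count(6) == 2)

prob = Fraction(hits, len(rolls))
print(hits, len(rolls), round(float(prob), 3))
```

Counting and the formula land on the same number, which is about as good a check as you can ask for.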

Congrats, you've just derived the binomial probability equation from first principles. And if you can get your head around this, it's all easy breezy downhill coasting from here on out.

A final thing, to make it more general, just in case you're ever playing some strange foreign version of Yahtzee with four 12-sided dice.

We'll call the number of dice 'n', and the number we're looking to test (in this case it was a pair, or two), we'll call that 'k'.

n!/((n-k)! * k!) * p^k * (1-p)^(n-k)
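Written out as a small Python function, the general version looks something like this. It's a sketch, and the function name is my own; math.comb does the n!/((n-k)! * k!) part and needs Python 3.8 or newer.

```python
from math import comb


def binomial_pmf(n, k, p):
    """Chance of exactly k hits in n independent tries, each with hit chance p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


# The Yahtzee case from above: exactly two sixes with five ordinary dice.
print(round(binomial_pmf(5, 2, 1 / 6), 3))

# The strange foreign version: exactly two twelves with four 12-sided dice.
print(binomial_pmf(4, 2, 1 / 12))
```

Add up the results for every k from 0 to n and you get 1, since the dice have to show some number of sixes.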

Since you probably don't want to be doing this much arithmetic, you can go to a site like this; you have to scroll down a little bit to get to the calculator. Then punch in your n, k and p, and click calculate. It is conventional to word it as 'n choose k given p'. Or '5 choose 2 given 1/6' for this specific example.

So, did anyone make it this far?