Wednesday, April 08, 2009

The Poetry of Logical Ideas

The title of this blog post comes from an Albert Einstein definition of math. I don't think that this was Einstein's best quote on the subject, though. This was:
Mathematics are well and good but nature keeps dragging us around by the nose.
That is an important thing to remember, methinks.

I like knowing what makes things tick, and that includes sports. And if we truly understand how something works, then we can quantify it, express it in numbers. If something is beyond the grasp of math, then it is beyond the grasp of human comprehension, and I have trouble believing that hockey and baseball are really that complicated.

In sports in general, and the NHL and MLB in particular, chance plays a large role in the game. We always called it puck luck when I was growing up, and I think we all agree that the score on the clock in a hockey game is often not a fair reflection of the play in the game. Stuff happens. Also, I think we all sense that over the course of a season the bounces tend to even out for players and for teams. But how much do they settle out? Is 82 games enough for the bounces to even out completely? If not, how much noise is left?

In this part of the internet, we all watch a lot of NHL hockey, and spend a lot of time talking about it. And a lot of us like to use statistics to back up our observations, because it usually makes for the most compelling argument. In doing so, along the way we sometimes find out things that we weren't expecting. We notice things in the analysis of stats that make us watch for them in the games, and we notice things in the games and strive to find a way to measure them using the data we have available.

But trying to understand hockey better with statistics is a doomed venture if the role of chance in the game is not accounted for. And for all the time we spend on this hobby, it's worth understanding how luck works, at least as well as we can. Call it 'chance variation' if you're a stickler for terminology.

So in this post I'm going to take a run at explaining the super duper simple binomial distribution. It's dead easy, and with this you should be able to follow the reasoning of most of the Oilogosphere's statzis, and it should be all the math you need to truly understand the vast majority of Bill James' more complex work as well, for the many here who like MLB.


What are the chances of rolling a pair of sixes with one throw of the dice in Yahtzee? No more and no less, just a pair?

If you look at the dice in the picture, they've been lined up on a table that has the numbers 1 through 5 written on it, based on how far away from you they landed. And we would calculate the chances of rolling two sixes (with a six in the first position and a six in the third position) this way:

  • chances of rolling a six with the die in the first position: 1 in 6, i.e. 1/6
  • chances of rolling something other than a six with the die in the second position: 5/6
  • chances of rolling a six with the die in the third position: 1/6
  • chances of rolling something other than a six with the die in the fourth position: 5/6
  • chances of rolling something other than a six with the die in the fifth position: 5/6

Multiply those together and you have your odds, easy as beans:

1/6 * 5/6 * 1/6 * 5/6 * 5/6 = 125/7776

which is .016, or 1.6%, or about 1 in 62
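If you want to check that multiplication without a calculator, here is a minimal Python sketch. The variable names are just mine for illustration, and exact fractions are used so nothing gets lost to rounding.

```python
from fractions import Fraction

p_six = Fraction(1, 6)    # chance that a given die shows a six
p_other = 1 - p_six       # chance that it shows anything else (5/6)

# six, non-six, six, non-six, non-six, in exactly that order
specific_order = p_six * p_other * p_six * p_other * p_other

print(specific_order)          # 125/7776
print(float(specific_order))   # roughly 0.016
```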

To make it easier on the eyes we'll say that p = 1/6. And the chance of rolling a non-six is therefore 1-p. So the equation can be written as a neat and tidy:
p^2 * (1-p)^3

At my house we always played Monopoly with $500 in the middle of the board as a Free Parking square windfall. We also scored Scrabble with extra points for words within words, which made counting a bitch; I wouldn't recommend that. A lot of families have variations on board game rules, but I don't think that anyone is mad enough to have a Yahtzee rule like "you can only count a pair of sixes if they are the nearest and third nearest dice from you". If you did, well then you're done here, the rest of us have to do a smidge more math.

Next we figure out how many possible orders these dice could have landed in.
  • We look at the first die, the six in the first position ... it could have been positioned in four other spots, so five total.
  • And the next die, the three, it could land in any of the four remaining spots that are left, i.e. not occupied by the first six.
  • And the next die, the other six, it could land in any of the three remaining spots that are now left.
  • And the next die, the two, it could land in any of the two remaining spots that are now left.
  • And the last die has one spot left to go in.

So the number of possible ways that these dice could be arranged is 5 * 4 * 3 * 2 * 1 = 120
You can also write that as 5! to save your typing fingers. And if you really want to get your geek on, call that five factorial.

If you don't believe me, get 5 different coins and see how many different ways you can arrange them.
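Or, if shuffling coins sounds like too much work, let a computer do the arranging. This is only a sketch: the labels are placeholders I made up for the five dice, and all that matters is that the five items are different from one another.

```python
from itertools import permutations
from math import factorial

# Placeholder labels for five distinguishable dice (hypothetical names)
dice = ["first six", "three", "second six", "two", "last die"]

# Every possible ordering of the five dice across the five spots
orderings = list(permutations(dice))

print(len(orderings))   # 120
print(factorial(5))     # 120, i.e. 5 * 4 * 3 * 2 * 1
```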

Now since we don't give a damn, for Yahtzee purposes, which six is which, we divide that 120 by 2. Or more properly 2!

And since we also don't care which order the non-six dice are lying in, we divide by 3!. 3! is 3 x 2 x 1 = 6. Because there are six ways to arrange three items. Again, grab three coins if you don't believe me, though hopefully you learned your lesson farting about with five coins, that probably took a while.
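Here is the same dividing-out done by brute force, as a rough sketch. The "S" and "X" letters are my own shorthand for "six" and "not a six". Python happily treats the two S's as different items, which is exactly the double counting we are dividing away.

```python
from itertools import permutations
from math import factorial

# S = a six, X = anything else: five dice, exactly two sixes
pattern = "SSXXX"

# permutations() treats each letter as a distinct item, so it yields
# all 120 orderings; putting them in a set collapses the look-alikes.
all_orderings = list(permutations(pattern))
distinct_orderings = set(all_orderings)

print(len(all_orderings))        # 120
print(len(distinct_orderings))   # 10
print(factorial(5) // (factorial(2) * factorial(3)))   # 10
```

So there are ten places a pair of sixes can sit among five dice.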

So, finally, the chances of rolling exactly one pair of sixes with the throw of five dice:

5!/(3! * 2!) * p^2 * (1-p)^3

Which works out to .161, or 16.1% if you prefer.
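If you'd rather not take the algebra on faith, the number can be checked by listing every possible roll of five dice and counting the ones with exactly two sixes. A minimal sketch, with names of my own choosing:

```python
from itertools import product
from fractions import Fraction

# Every possible outcome of throwing five six-sided dice (6^5 of them)
rolls = list(product(range(1, 7), repeat=5))

# Count the rolls that contain exactly two sixes, no more and no less
exactly_two_sixes = sum(1 for roll in rolls if roll.count(6) == 2)

probability = Fraction(exactly_two_sixes, len(rolls))

print(exactly_two_sixes, len(rolls))   # 1250 7776
print(float(probability))              # roughly 0.161
```

That 1250 is just the ten arrangements times the 125 ways the three non-six dice can land.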

Congrats, you've just derived the binomial probability equation from first principles. And if you can get your head around this, it's all easy breezy downhill coasting from here on out.

A final thing, to make it more general, just in case you're ever playing some strange foreign version of Yahtzee with four 12-sided dice.

We'll call the number of dice 'n', and the number we're looking to test (in this case it was a pair, or two), we'll call that 'k'.

n!/((n-k)! * k!) * p^k * (1-p)^(n-k)

Since you probably don't want to be doing this much arithmetic, you can go to a site like this; you have to scroll down a little bit to get to the calculator. Then punch in your n, k and p, then click calculate. It is conventional to word it as 'n choose k given p'. Or '5 choose 2 given 1/6' for this specific example.
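And if you'd rather have the computer do it on your own machine, the general equation fits in a few lines of Python. Treat this as a sketch: the function name is one I made up, and math.comb (the 'n choose k' part) needs Python 3.8 or newer.

```python
from math import comb

def binomial_probability(n, k, p):
    """Chance of exactly k hits in n tries, each with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# '5 choose 2 given 1/6': exactly one pair of sixes with five dice
pair_of_sixes = binomial_probability(5, 2, 1/6)
print(round(pair_of_sixes, 3))   # 0.161

# The strange foreign version: exactly two of a given face
# with four 12-sided dice, '4 choose 2 given 1/12'
foreign_pair = binomial_probability(4, 2, 1/12)
print(round(foreign_pair, 3))    # 0.035
```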

So, did anyone make it this far?


Blogger Jonathan Willis said...

And all of the statistics/probability courses I ever took come rushing back.

I'd honestly forgotten that 5! was 5 factorial.

4/08/2009 11:25 pm  
Blogger Matt said...

Ha, so far so good Vic. That's Math 20 or 30 in AB (I took them blended together, so I'm not too sure), Grade 11 or 12 for most anyway.

Though I'm an engineer, I got lost in math somewhere between there and here.

At any rate, hope you keep up the primers. My own interest in the stats (in this specific mathematical sense) is re: the question about something unusual or unexpected, the old, "Pffft, what are the odds that it was just random good/bad luck?" When you're looking at a specific question, whether the answer to that question is 2% or 0.02% is an awfully damn useful thing to have a handle on.

4/09/2009 12:02 am  
Blogger Vic Ferrari said...

Yeah, that's the next step Matt, and a very small one.

If you went to that link and entered these numbers you'd get the probability of two sixes rolled, the same as in my post. You'd also be given the probability of two-or-fewer-sixes-rolled and of two-or-more-sixes-rolled.

So the probability of the L.A. Kings being an average EV shooting team in terms of ability, but just unlucky this year ... it's 1726 choose 116 given .08, enter in those numbers and the program spits out the chances of them scoring that few goals, or fewer, by chance alone at about 3%.
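For anyone who would rather run that themselves, here is a rough Python sketch of the 'that few or fewer' calculation. It just adds up the binomial probability for every goal total from 0 up to k. The function names are made up for illustration, and doing the arithmetic in logs (via lgamma) is my own choice to keep the factorials from blowing up with numbers this big, not necessarily what the linked calculator does.

```python
from math import exp, lgamma, log

def log_binomial_probability(n, k, p):
    """Natural log of the chance of exactly k hits in n tries."""
    # lgamma(x + 1) is log(x!), so this is log of n!/((n-k)! * k!)
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)

def prob_k_or_fewer(n, k, p):
    """Chance of k or fewer hits in n tries, by chance alone."""
    return sum(exp(log_binomial_probability(n, i, p)) for i in range(k + 1))

# Kings at EV: 1726 shots, 116 goals, assumed true shooting rate of .08
kings = prob_k_or_fewer(1726, 116, 0.08)
print(f"{kings:.1%}")
```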

Can't ignore the forest, and JLikens has already done the same for all thirty teams and plotted it in a graph. Startling stuff.

4/09/2009 12:27 am  
Blogger Scott said...

Thanks for doing this Vic. I'll admit to being a bit slow with mathematics and I honestly found that first equation confusing on the first read-through but I think that I've got a handle on it now. My working year is almost through so the summer should be a good time to work through some things. If you have a specific project in mind that requires some grunt work, let me know and I'll be happy to help. I know that I still have the old sheet you passed along from an LT thread months back and I should have something together with that by the end of April anyway. Thanks again for the primer.

4/09/2009 2:44 am  
Blogger B.C.B. said...

I think I got most of this. I am one of those that hasn't done math in years, and am trying to relearn stats for hockey. I have to say your teaching tools are better than the University of Phx online tutorials. Thanks for using Yahtzee, and not an abstract model.

I personally am mostly interested in set theory and the possibilities of using math to explain the ontological structures of hockey (I even used your site as an eventual name of event of stat). But I always find your work interesting.

4/09/2009 9:40 am  
Blogger Schitzo said...

Excellent primer, Vic.

4/09/2009 11:37 am  
Blogger Earl Sleek said...

So, did anyone make it this far?

I would say that I'm about n!/(n+p)! x r^2 from finishing this post.

I should have paid more attention to my Yahtzee professor in high school.

4/09/2009 2:27 pm  
Blogger Vic Ferrari said...

Scott: I look forward to it. And I'm glad you got this, I know you don't have a math background, so you're the target audience really.

This is the tool we use to answer Matt's question "What are the chances of them happening by coincidence alone?"

And although it's simple, it takes a lot of computer horsepower to run binomial probabilities. So until recently people have used other statistical tools to approximate the same thing.

So if we know that the general population behaves almost identically to our soulless clone teams in the parallel universe, then we know that EV shooting%, at the team level, is overwhelmingly a product of chance alone.

Streak patterns will show this in compelling fashion as well, but that's for another day.

BTW: That link I gave doesn't like really big numbers. This one can handle more:

Just change the values of n, k and p in the url.

That url is for the Oilers late season surge. Had Oilers management read and understood this post, and checked those numbers themselves ... it would have been a different summer methinks. And a different winter for the Oilers as a consequence.

4/09/2009 2:33 pm  
Blogger Vic Ferrari said...


Where did I lose you?

4/09/2009 2:35 pm  
Blogger Earl Sleek said...

No, I'm with you so far. I'm just playing idiot.

I am looking forward to another installment, though -- I'm probably not ready to apply the Yahtzee lesson back to the sport of hockey on my own yet.

4/09/2009 2:39 pm  
Blogger JLikens said...

Good post. Very instructive.

And that binomial application you've created at timeonice -- the one that you've linked to -- is quite useful.

I figured that I'd use it to look at the Pens EV shooting stats. Last I checked, the Pens had the best EV S% in the league, and according to the playershots application at timeonice, it's 10% exactly.

Using values of n=1707 k=171 and p=0.08, the probability of having a shooting percentage that good or better by chance alone is well under one percent.
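For those without the link handy, the 'that good or better' side can be sketched in Python the same way, by adding up the chance of every goal total from k up to n. The function name is hypothetical, and the log arithmetic is just one way to keep the big factorials manageable, not a claim about how the timeonice application works.

```python
from math import exp, lgamma, log

def prob_k_or_more(n, k, p):
    """Chance of k or more hits in n tries, by chance alone."""
    total = 0.0
    for i in range(k, n + 1):
        # log of n!/((n-i)! * i!) * p^i * (1-p)^(n-i)
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_term)   # vanishingly small terms simply add 0.0
    return total

# Pens at EV: 1707 shots, 171 goals, assumed true shooting rate of .08
pens = prob_k_or_more(1707, 171, 0.08)
print(f"{pens:.2%}")
```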

I'm not sure how to account for this, although leading/trailing effects might be relevant. The Bruins were shooting around 11% over the first half of the year and some of that had to do with the fact that they had tended to play a large portion of their games while ahead.

4/09/2009 4:13 pm  
Blogger Schitzo said...

JLikens, why are you using p=0.08 in your analysis? Is that league-wide EV S%, or pulled from somewhere else?

4/10/2009 10:54 am  
Blogger JLikens said...

"JLikens, why are you using p=0.08 in your analysis? Is that league-wide EV S%, or pulled from somewhere else?"

For one, 0.08 is the probability that Vic used in his comment about the Kings (see above).

Looking at the numbers from behindthenet, it appears that the league average 5-on-5 shooting percentage is somewhere in the area of 0.084 (with the % for 4-on-4 play being similar). But the team shooting percentages at timeonice are consistently lower than those listed at behindthenet (possibly because the numbers at BTN don't exclude empty netters, but I'm not sure). So I figured that 0.08 was a reasonable approximation.

4/10/2009 11:11 am  
Blogger mc79hockey said...

I'm pretty sure that you're right about the numbers at BTN not excluding empty netters.

4/10/2009 12:20 pm  
Blogger PunjabiOil said...

You forgot to end the post with an obligatory,

"Make sense, no LT?"

4/11/2009 10:20 am  
Blogger Black Dog said...

Oh Vic.

Almost made it, have to reread and reread again. I got to the final formula and then my head exploded.

I was excellent at math until it got even a tiny bit complicated.


Ok will reread it tomorrow - I'm almost there.


4/17/2009 1:34 pm  
Blogger Showerhead said...

A little late to the party but I am on board with the math and want to see where this is going. Shooting percentages are binomial, what else is? PP%, PK%, Sv%...?

Also, does this mean that hypothesis testing is right around the corner? I think that's what you're getting at with the LA Kings example and I think that to the stats inclined (or those who you teach to be), hypothesis testing would be a relatively clear way to ask how likely these things are to happen by chance alone.

Also, if you keep this up, I don't have to forget everything I learned in Stats 1&2 this year :)

4/26/2009 11:53 am  
