Saturday, July 18, 2009


As most of you know, I am a goldsmith. You may not know that after the Pronger trade Oilers blogger Lain Babcock commissioned me to make eight coins. Each coin was to have the head of Joffrey Lupul on one side and Ladislav Smid on the other. He kept one for himself and sold the other seven to other Oiler bloggers. I'm still surprised that Dennis bought one, by the way.

Due to cheap materials and the hand crafted nature of these items, they weren't fair to use in a coin toss. Some tended to flip more Lupuls, others to flip more Smids.

Anyhow, shortly after their manufacture the game of Lupulsmid was born. If you click on the link you'll see the rules explained. It's a head-to-head coin flipping game which is always played for money because, like horse racing and NFL football, it's pretty tedious without wagering in play.

On that link is also a table showing the weighting of the coins, which I tested before shipment. There is also a table of the history of head to head matches between the eight owners of the coins. And the final results, expressed as winning percentage, of the eight flippers. This will be different each time you load the page.

The cumulative results from 1000 parallel universes are also shown in the bottom table on lupulsmid.html, as linked above.

The question is: if we didn't know the weighting of the coins, could we figure it out from the results? In other words, how accurately can you calculate the quality of competition in Lupulsmid using just results?

Since I know the quality of competition effect for each coin flipper (remember I made and tested the coins) simple arithmetic yields this:

But if we don't know the weighting of the coins, it's trickier to do.

Let's try the Desjardins methodology:

Looking at the big sample for now, from 1000 parallel universes:

From Lain's point of view, he played 300 matches against Tyler. And Tyler had a winning percentage of .542, so we multiply 300 * .542 and get 162.6.

He played 100 matches against slipper. And slipper had a winning percentage of .524, so we multiply 100 * .524 and get 52.4.

Do the same for the other five guys that Lain played against, divide by the total number of games, and voila! His averaged competition was 49.5%, or slightly weaker coins than average. So his Desjardins Lupulsmid QualComp is -.5%.
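For anyone who wants to follow the arithmetic, here is a minimal Python sketch of the simple calculation: a match-weighted average of the opponents' winning percentages, expressed relative to .500. The function name and dictionary layout are illustrative assumptions. Only the Tyler and slipper rows (the 162.6 and 52.4 products above) are filled in, so the printed value is not Lain's actual QualComp; the other five opponents would need to be added from the results table.

```python
def simple_qualcomp(matches_vs, win_pct):
    """Match-weighted average of opponents' winning percentages, minus .500.

    matches_vs: {opponent: number of matches played against him}
    win_pct:    {opponent: that opponent's overall winning percentage}
    """
    total_matches = sum(matches_vs.values())
    weighted_sum = sum(n * win_pct[opp] for opp, n in matches_vs.items())
    return weighted_sum / total_matches - 0.5


# Partial data: only these two opponents are quoted in the text.
# The remaining five opponents are deliberately left out rather than guessed.
matches_vs = {"Tyler": 300, "slipper": 100}
win_pct = {"Tyler": 0.542, "slipper": 0.524}

partial = simple_qualcomp(matches_vs, win_pct)
print(f"QualComp over these two opponents only: {partial:+.4f}")
```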

The complete list is as follows:

The red bit is just Desjardins' methodology parlayed through. So we look at Desjardins' results and think "Damn! Rivers is better than I thought, he just foolishly chose to play too many matches against the guys with good coins (Tyler and Lain). Maybe we should bump Rivers' coin weighting guesstimate up a touch and run Desjardins' methodology again."

And the same for everyone else. And we take the results and adjust again, over and over until it doesn't seem to be making a difference. Or at least not enough difference to have a material effect on Lupulsmid wagering.
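The exact update rule isn't spelled out here, so the following Python sketch is one plausible reading of "parlaying through", not a definitive implementation: start with each flipper's raw winning percentage as his rating, add his QualComp (computed from the opponents' current ratings) to his raw winning percentage, and repeat until the ratings stop moving. The three flippers A, B and C and their match records are hypothetical, as are the function names and the stopping tolerance.

```python
def qualcomp(player, matches, ratings):
    """Match-weighted average opponent rating, relative to .500."""
    opponents = matches[player]
    total = sum(opponents.values())
    return sum(n * ratings[opp] for opp, n in opponents.items()) / total - 0.5


def parlayed_ratings(matches, win_pct, tol=1e-9, max_iter=1000):
    """Repeat the QualComp adjustment until it stops making a difference."""
    ratings = dict(win_pct)  # first guess: raw winning percentage
    for _ in range(max_iter):
        updated = {p: win_pct[p] + qualcomp(p, matches, ratings) for p in matches}
        biggest_change = max(abs(updated[p] - ratings[p]) for p in matches)
        ratings = updated
        if biggest_change < tol:
            break
    return ratings


# Hypothetical schedule: matches[a][b] = matches played between a and b.
matches = {
    "A": {"B": 60, "C": 20},
    "B": {"A": 60, "C": 60},
    "C": {"A": 20, "B": 60},
}
# Hypothetical records: A won 45 of 80, B won 63 of 120, C won 32 of 80.
win_pct = {"A": 45 / 80, "B": 63 / 120, "C": 32 / 80}

ratings = parlayed_ratings(matches, win_pct)
for name, rating in ratings.items():
    print(name, round(rating, 4))
```

In practice you would stop well before a tolerance this tight, since the later passes change nothing that matters for wagering.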

So you can see that Desjardins' method worked well, and parlayed Desjardins is damn near spot on. That's a Pearson correlation of .96 for the simple Desjardins method.

On the smaller samples, one season at a time, it's going to vary a bit from one parallel universe to the next, and the average correlation between simple Desjardins and actual is .93 over the 100 samples, with almost all of them between .90 and .96.

Parlayed Desjardins averages .99 correlation for the individual universes, with the staggering majority being .98 or better.
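For reference, the correlations quoted above are ordinary Pearson coefficients between the known QualComp and the estimated one. A minimal Python sketch of that calculation follows; the two short lists are made-up placeholder values purely to show the call, not figures from the Lupulsmid tables.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Placeholder values only: known vs. estimated QualComp for four flippers.
known_qualcomp = [0.030, 0.010, -0.005, -0.020]
estimated_qualcomp = [0.020, 0.012, -0.004, -0.015]

print(round(pearson(known_qualcomp, estimated_qualcomp), 3))
```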

Now the methodologies used by both Desjardins and Willis for their NHL Quality of Competition metrics don't lend themselves to being parlayed through. But just the simple metrics they use give a damn good indication, and I can't think of any good reason that you'd need a finer measure.

Willis' results for the 07/08 season are here, by the way, and they correlate very strongly with Desjardins' numbers for 07/08 on a team by team basis, as you can easily check for yourself.


Blogger said...

are you surprised at how well the inductive method matched the deductive one?

7/18/2009 10:50 pm  
Blogger Vic Ferrari said...

Which are you qualifying as deductive and inductive, Sunny?

I would think that using the known coin weightings is surely deductive.

Using the results to determine the "quality of competition" and parlaying it through is also deductive, as well as empirical. And should yield the exactly correct result, less the noise created by randomness.

Of course the randomness affects the individual flipper's results more than the quality of competition, because it is a smaller sample.

That's the central theme, really. QualComp can be measured in several ways: Willis' method rates players by points/game and marks head-to-head moments by goals; Desjardins' uses on-ice plus/minus less off-ice plus/minus as the player valuator and every second of icetime as a head-to-head marker; mudcrutch used shots +/- as a player quality measure and shots as the head-to-head markers.

All yield very similar results because:
a.) All are fairly reasonable measures of a player's quality.
b.) All are run against enough opponents that the luck associated with these measures is largely washed away.

This is evident here with just 7 opponents. In accounting for team qualcomp you'd have 29 opponents, and player qualcomp you'd have hundreds.

And using any sensible measure of coin flipper quality (using his previous results) to correct for his quality of competition with a once-through adjustment is also deductive and empirical, no? Just not very thorough.

And in all of these 1000 trials, using this analogy, the estimation of qualcomp is a dramatic improvement over no adjustment at all, and in the vast majority is very strongly associated with the known qualcomp for the model.

I'm not sure what you were driving at, or what value these categorizations hold, or even if I've selected the right categories.

7/20/2009 3:28 pm  
Blogger said...

hi vic,

No you pretty much nailed what I was thinking, which is that sample size is everything. Your excellent thought experiment here aside (where you were able to run 1000 trials, etc), I was also curious if you thought that in reality, Gabe's QoC numbers for individual players do generally have enough sample size to not only be meaningful, but also not be misleading. (I haven't spent much time looking at his QoC stuff.)

Sounds like you do think so, and if that's the case I certainly agree that using QoC is better than making no adjustment, particularly if we also apply our own individual HDOS (Healthy Dose of Skepticism) factor onto each number. I.e. - perhaps not nitpick about small differences between two numbers, but perhaps pay attention when one number is way on the high or low side.

As for the slightly philosophical realm you went into about inductive vs. deductive, lol, I certainly didn't mean to spark that, but it's an interesting topic for sure. While definitely we use deductive methods in empirical analysis, the two methods are different philosophically imo.

Creating a formula in which truth is known because it has to be because we created it (e.g., we create a roulette table in which we know the odds are 37-to-1 because we created 38 equal pockets) is a little different than being able to infer but not ensure truth (e.g., we look at the results of spins on a roulette wheel but are never able to view the wheel itself).

The latter will always be subject to SOME amount of randomness (no matter how large or small). Though the latter is also almost certainly more applicable and more important to learning about things in our life, since we rarely know anything for "sure".

It's the whole falsification thing - i.e. seeing only black crows throughout your life doesn't prove that "all crows are black", but seeing one non-black crow is enough to disprove it.

Etc. I love this topic. Have you read much Popper?

7/20/2009 4:12 pm  
Blogger Vic Ferrari said...

I've never read Popper, sunny. I've not read much Philosophy at all, perhaps I should, it's just never really appealed.

To my mind the language of logic is math, though certainly the narrative carries more weight with most audiences. Or perhaps more correctly; the analogy or metaphor carries more weight.

People like black and white answers I think. Perhaps that explains the popularity of the black swan metaphor. And even more than that they like answers that confirm their original bias.

Google Albert's paper on streakiness in baseball, it's terrifically honest stuff. Some complex math in places, but not as complicated as it feels, the shorthand of mathematicians just makes it seem that way.

As you probably suspect, the staggering majority of 'streakiness' in baseball players' batting averages can be accounted for by random chance. There is very little difference between the distribution expected by random chance on a simple model and the actual distribution. But there is some, ever so slight. There is very little difference between actual and expected in the general population.

A few months afterwards Bill James wrote an article on streakiness, a simplified version of some of the same methodology. He's a wonderful writer, he coins the term "batting temperature" and he uses two famous baseball players in his study, Brett and Schmidt I think, though I could be wrong on that. He also fairly points out the limitations of his study. But the damage is already done, the narrative is too powerful.

There is an online chat at Alan Reifman's old site, which seems to be missing now, where Reifman asks Albert about the hot streak of a Yankee player, a guy who has seemingly come out of nowhere to hit the hell out of the ball over the last two months of the season.

99% of the people who read James' article would bet that it was just coincidence.

Albert responded that he thinks the player has probably improved over this stretch, he's probably not been hitting much leather either, but likely something has changed in his swing mechanics.

That's not a contradiction. He thinks that the terrific A's winning streak of about a decade ago WAS, most likely, nothing more than coincidence. That's not a contradiction either, because the universe needed a team to have a streak like that.

And since I have no particular direction with this ramble, I'll just cut it off here.

7/21/2009 2:08 pm  
Blogger Scott Reynolds said...

Hi Vic,

Thanks for the article, fantastic stuff. I was curious about your thoughts regarding the quality of competition statistic. I have long thought that the values are generally useful on a team by team basis. This as opposed to the league-wide rankings. So, "Shawn Horcoff faced the most difficult competition among forwards on the Oilers" would have more value than "Shawn Horcoff's QC was 0.04 which ranks him ??th in the league." Do you think the results you've come to here give some validity to this second approach?

7/22/2009 1:37 pm  
Blogger Vic Ferrari said...


No, not at all. Desjardins' QualComp uses a player value based on how he did at EV+/- relative to his teammates. So any comparison of players from different teams is going to be meaningless.

This unless you determine the difficulty of each team's schedule in terms of team EV+/- and then factor it back in. Even then, it would be dodgy I think, especially if you don't do the sked difficulty calc well.

7/22/2009 2:31 pm  
Blogger Scott Reynolds said...

Thanks for the answer Vic. What about the method that Willis started using? From what I recall he's organizing players based around pts/gm and there is no inherent comparison to teammates.

7/23/2009 9:29 am  
Blogger Olivier said...

Lovely stuff.

Have you had a look at Quality of Teammates?

What washes out in qualcomp, as you so clearly demonstrated, may stay when it comes to qualteam... It seems to me that Qualteam's reliance on +/- exposes, say, a guy with rotten luck on a given year. He can sink his teammates' qualteam, and vice versa.

I mean, on the Habs, I look at D'Agostini with his 890 on-ice save% (the team as a whole stood at 922), and I wonder what you guys think. Am I just missing the point?

7/23/2009 12:08 pm  
