Thursday, July 20, 2006

Bill James and the Pythagorean Expectation

This has always seemed madass to me. Too clean and not rooted in reality. But when smart buggers on the internet are lured in ... it gives reason for a closer look.

Interesting guy, this Bill James. I spent a couple of hours googling and reading on this fellow. I've also read a non-math interview with him in a dentist office waiting room magazine, presumably SI, and it was fascinating stuff. Clearly an insightful guy who knows and loves the game. And a while ago I heard an interview with him on the radio. He presented himself as very humble, but not regular humble, try-like-hell humble. I think I have a pretty good feel for the guy, I like him.

But for now, the Pythagorean thing. I'm pressed for time so this will be brief, but I'll get back to it for those interested (if site stats mean anything, then there is a surprising number of people interested in this sort of thing. Whoda thunk it?). So back to the point:

All Bill James is trying to tell us with the Pythagorean theorem for baseball is that shit happens. That even though we humans are spiritual beasts by nature, a lot of the magic associated with the game ... the 'clutchness' and 'finding ways to win' ... doesn't matter as much as we think. No Hell below us, above us only sky.

So if you take a season for any team, say the Blue Jays from last year, the one team I give a half a shit about ... then maybe, just maybe, they lost so many 1-run games in the first half of the season because they were simply unlucky. Forget about the mystery and magic.

And if you believe that they were just unlucky, that runs are scored when they are scored and runs surrendered when they are surrendered ... then if you randomly shuffle the Runs-For in games against the Runs-Against in games, and do it thousands of times, you would expect that the Jays would have a second half winning percentage of .532 on average. They in fact had a winning percentage of .531 in the second half.
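For anyone who wants to try the shuffle themselves, here is a minimal Python sketch of the idea. It is not the script used for this post: the function name, the made-up scores and the handling of tied pairings (they are simply thrown out, since a real game can't end level) are all assumptions for illustration.

```python
import random


def shuffled_win_pct(runs_for, runs_against, n_sims=10_000, seed=0):
    """Winning percentage when runs-for are randomly re-paired with runs-against.

    Wins and losses are pooled over all simulated seasons. Tied pairings
    are dropped, which is an assumption of this sketch.
    """
    if len(runs_for) != len(runs_against):
        raise ValueError("need one runs-for and one runs-against per game")
    rng = random.Random(seed)          # fixed seed so reruns give the same answer
    shuffled = list(runs_for)          # copy, so the caller's list is untouched
    wins = losses = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)          # runs-against stay put, runs-for move
        for rf, ra in zip(shuffled, runs_against):
            if rf > ra:
                wins += 1
            elif rf < ra:
                losses += 1
    if wins + losses == 0:
        raise ValueError("every simulated game was a tie")
    return wins / (wins + losses)


# Made-up scores for a ten-game stretch, purely illustrative.
example_rf = [4, 7, 2, 5, 1, 9, 3, 6, 0, 4]
example_ra = [3, 8, 2, 1, 5, 4, 6, 2, 3, 5]
print(round(shuffled_win_pct(example_rf, example_ra), 3))
```

Feed it a team's real game-by-game runs-for and runs-against and the number that comes back is the "shit happens" winning percentage.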

The reason that they are so close? Pure coincidence. Surely there are too many variables (injury, trades, luck within games themselves, changes in difficulty of schedule and so on) for it to really be that accurate.

The curious thing. Bill James, marketeer extraordinaire, found a best fit equation for this real random distribution. And for baseball, with the number of runs being scored in games ... his simple Pythagorean expectation equation: Win% = 1/(1 + (RA/RF)^2) captured the imagination of baseball fans everywhere. He could have done much better, but the simplicity of this grabs us. Something about it makes sense for no reason at all. There is an inexplicable soundness to it. It's the ratio baby!

I mean sure you could rewrite it as: Win% = RF^2/(RF^2 + RA^2) ... but "it's the sum of squares baby!" just doesn't have the same cachet. He could have used literally dozens of methods to approximate the "shit happens" curve a smidge more accurately. But he would have lost his audience in doing so. A shame, because the message was much more important than the best fit curve, and the former seems to have been lost on most.
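If you want to check the arithmetic, the formula above fits in a few lines of Python. A minimal sketch: the function names and the run totals are made up for illustration, and the exponent is left as a parameter only so that James' 2 is visible as a choice.

```python
def pythagorean_expectation(runs_for, runs_against, exponent=2.0):
    """Expected winning percentage from season run totals (ratio form)."""
    if runs_for <= 0 or runs_against <= 0:
        raise ValueError("run totals must be positive")
    return 1.0 / (1.0 + (runs_against / runs_for) ** exponent)


def sum_of_squares_form(runs_for, runs_against):
    """The same thing rewritten as RF^2 / (RF^2 + RA^2)."""
    return runs_for ** 2 / (runs_for ** 2 + runs_against ** 2)


# Illustrative totals only, not any real team's season.
rf, ra = 800, 700
print(round(pythagorean_expectation(rf, ra), 3))
print(round(sum_of_squares_form(rf, ra), 3))
```

Both lines print the same number, which is the whole point of the rewrite: the ratio form and the sum-of-squares form are one formula.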

In any case this simple curve that he created to try and match the real one ... it's near as dammit. It would have predicted a .530 winning % for the Jays in the second half using the example above. So there isn't much to quibble about.

Or you could use the same "shit happens" thinking and build it up rationally as a few have on the net. But then the math gets so heavy that you lose your readers. BTW: Hardcore baseball nuts should read the article from a kid at Brown University (good school btw, and sorry but I didn't save the link) applying the Weibull distribution to James' original "shit happens" premise. It requires some abstract thinking, because nobody actually wins a baseball game 4.3 to 3.2, at least not without the official scorer being ridiculed ;) but it makes more basic sense. And he leaves himself open to take it further. Mad shit is that, but that alone made this worth digging into.

Interesting cat, this Mr James.

21 Comments:

Anonymous Big T said...

The guy really is such an incredible writer who's just very passionate about his subject. I'm no writer, but I'm pretty confident that that passion is the key to writing a best seller.

Of all teams, his favourite team is the Kansas City freakin' Royals. That alone should tell you a lot about the guy. As a sabermetrician, loving KC must be an excruciating kind of hell.

He's a weird cat, who I'd agree, tries to be more humble than he actually is. From reading some of his stuff, he actually seems to have a lot of contempt for those not in the know - ie: sabermetrics.


T

7/20/2006 5:12 pm  
Anonymous lowetide said...

I don't really know how anyone else feels about Bill James, but he opened my eyes to so many things. All kinds of stuff like platoon advantages and k/w ratios and comparables.

Mostly though he showed me how to enjoy, as an adult, the game my father taught me as a child. He gave insight into things that kept the game compelling.

Many years later we can see that the logic is flawed and passed by others with bigger brains and better computers.

But for my generation, Bill James represents some very important things and opened a window of learning for me that I remain grateful for these many years later.

7/20/2006 5:26 pm  
Blogger Andy Grabia said...

I have been looking at Pythagorean Wins for hockey, and was going to do some stuff on it in a few weeks. James himself was more of a writer than a statistician. If you want to read a fantastic book, pick up "The Numbers Game" by Alan Schwarz. It is an easy and accessible history of numbers in baseball.

7/20/2006 6:41 pm  
Blogger Vic Ferrari said...

You've missed the point entirely Lowetide.

My bad.

Bill can see the forest for the trees, like few of us can

7/20/2006 6:42 pm  
Blogger Andy Grabia said...

Sorry, I was all over the place with that. I'm on my way out. I just wanted to express glee over this post.

7/20/2006 6:42 pm  
Anonymous Big T said...

Michael Lewis mentioned something in 'Moneyball' which I think sums up what LT and Vic are both saying about Bill James:

"[He'd rather leave] an honest mess for others to clean up than a tidy lie for them to admire."


He saw the forest, despite not having the ability to properly analyze the trees. He opened up people's eyes to a different way of looking at the game even though he never could fully explain what he saw.

That's the legacy of Bill James. That he could inspire so many to look both differently and analytically at the game they had all loved for so long.

I personally don't spend a lot of time thinking about statistics in baseball. And despite a great deal of baseball fans not giving two shits about sabermetrics, most of the big questions regarding what happens in the game have been answered. All that's left is refinement, so almost all of the low hanging fruit has already been picked.

Hockey, on the other hand, has yet to go through such a change in perspective. And to me, it's pretty damn exciting to think of what's yet to be discovered about our game.


T

7/20/2006 7:43 pm  
Anonymous lowetide said...

Vic:

There are huge gaps we can see now in James' math. His GW-RBI, ERA above team average, Hitters Won-Loss, many of these things have been improved upon, and GW-RBI is basically useless and contributed to what was a horrible decade of nonsense from MLB announcers.

But what a treasure chest of real, honest to goodness logic! Range factor, linear weights (which is actually Pete Palmer, although I don't know if anyone ever read his tome, NOT actually readable), runs created, comparables, k/w as it relates to the following season, approximate value, approximate trade value, established performance levels, power/speed, brock6, platoon advantage, his railing against the waste that was stolen bases without the percentage, quality starts, similarity scores.

And his work with rookies in the mid to late 1980s is some of the most fun reading I have ever done.

But all of those things have been done better by now I'd think, and that is just with very basic following of baseball.

7/20/2006 9:09 pm  
Blogger Andy Grabia said...

And despite a great deal of baseball fans not giving two shits about sabermetrics, most of the big questions regarding what happens in the game have been answered.

On offense maybe. But defense is a whole new ballgame. Peter Gammons speculated in a great article last year that GMs like Bean and Epstein had already moved on from things like OBP, and were instead focusing on defense.

7/21/2006 1:33 am  
Blogger Andy Grabia said...

Oops. *Beane.

7/21/2006 1:34 am  
Blogger Vic Ferrari said...

Lowetide:

Ya, the guy was a pioneer for sure. Blessed with a rational mind and the ability to see what mattered.

More than anything he seems to have opened the eyes of a bunch of people. If there were no Bill James ... would there be a Moneyball? Would anyone read this blog? I dunno.

7/21/2006 9:34 am  
Anonymous lowetide said...

Vic:

The best thing about him? He told you it was just math. Early on he wrote something like "I ran ten thousand seasons on a computer and it looked like shit so I decided that a minor league triple was worth .85 of a MLB triple and then ran ten thousand more seasons and it looked about right."

A man can NEVER dumb it down enough for the masses, but James got close. :-)

7/21/2006 10:39 am  
Blogger Vic Ferrari said...

LT:

Good stuff. And on a related note ... To write the script I did to compile the original post here, I looked at doing the pure math, to see what points each team would have expected to garner if the runs-scored and runs-against all just happened at random. So the runs-against stay the same for games 1 thru 162 ... and the actual runs-for get all mixed up randomly in every possible combination. And it would have been an absolute bitch because there are a zillion permutations.
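For what it's worth, the average over every possible shuffle can be had without enumerating the permutations. In a uniformly random shuffle each runs-for total is equally likely to land against each runs-against total, so the expected number of wins is just the count of (runs-for, runs-against) pairs where runs-for is bigger, divided by the number of games. A minimal Python sketch, with a hypothetical function name, made-up scores, and tied pairings simply dropped (an assumption, since a real game can't end level):

```python
def exact_shuffle_win_pct(runs_for, runs_against):
    """Expected wins over expected decided games, averaged over all shuffles.

    Uses linearity of expectation: compare every runs-for total with
    every runs-against total instead of listing the permutations.
    """
    if len(runs_for) != len(runs_against):
        raise ValueError("need one runs-for and one runs-against per game")
    wins = sum(rf > ra for rf in runs_for for ra in runs_against)
    losses = sum(rf < ra for rf in runs_for for ra in runs_against)
    if wins + losses == 0:
        raise ValueError("every possible pairing is a tie")
    return wins / (wins + losses)


# Made-up four-game example: 6 winning pairs, 10 losing pairs.
print(exact_shuffle_win_pct([1, 4, 7, 2], [3, 3, 5, 6]))  # 0.375
```

This only gives the average. The spread of results from season to season still needs the simulation (or heavier math).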

So I looked to see if anybody had already done this online. And found that article from the kid at Brown University who had done the same thing but in a less obvious way. Which was cool. Seemed a little quirky though.

Then I googled for all the scores in all the games in a MLB season. And downloaded them for 2005 from www.retrosheet.org.

Then I just googled for a script to randomly distribute the runs-for around. (So if the Jays scored 4 runs in Game 1 ... that might get randomly moved to game 134 ... and if the Jays scored 7 runs in game 134 ... they'd get bumped to another random destination (say game 155) ... and the Jays runs-for for game 155 would get moved randomly somewhere else. and so on and so on.)

I cut and pasted a VBA macro from the net (I have no idea how it works, but it does).

Ran a thousand simulated seasons a few separate times ... sim to sim the expected winning % moved around a smidge. So I bumped it to 10,000 sim seasons and it never changed from sim to sim. Declared that good enough. 10,000 seems to be a bit of a magic number that way with baseball I guess. :-) Looped it through for the other 29 teams ... and voila.

Took about 20 minutes start to finish. Only thing that buggered me up was that Cincy and somebody else played an extra game in there (why?) so I just deleted a game arbitrarily.

It's stunning how close James' Pythagorean Expectation and my random-scores thing worked out on the whole. Even the spread of results (standard deviation) was virtually identical.

The impressive thing isn't that Bill James came up with a best fit curve for "shit happens" results for baseball at the current level of run scoring, and with a simple little formula. Or that he gave it a catchy and profound sounding nickname ("Pythagorean Expectation" ... good Christ :D ). The impressive thing is that James watched a boatload of baseball games, ignored all the stuff that the commentators, players and managers were telling us back then ... and thought, "it's just shit happening man, no magic".

7/21/2006 11:42 am  
Anonymous Big T said...

Andy;

You should check out baseballprospectus.com - they have done a ton of stuff regarding defense. It's pretty impressive. You're right in that there is certainly more work to be done, but the big stuff has been done.


T

7/21/2006 12:51 pm  
Blogger Andy Grabia said...

I do read BP. In fact, I'm reading "Baseball Between The Numbers," a collection of essays by their staff, right now.

7/21/2006 2:10 pm  
Blogger Andy Grabia said...

The reason that they are so close? Pure coincidence. Surely there are too many variables (injury, trades, luck within games themselves, changes in difficulty of schedule and so on) for it to really be that accurate.

I am intrigued by this. Isn't the point of the Pythagorean Expectation that it really isn't that random? That what matters is run differential? No one is going to ignore that shit happens, not me, and certainly not James, but isn't the point that generally speaking, I repeat, generally speaking, over a 162 game season, the best way to look at how a team is going to perform is by looking at their run differential?

Here is a short little thing I did on James over at SportsMatters. The fact that he has about 150 new stats devised but hasn't written on them yet blows my mind. So too does the fact that he is close to developing an actual Hustle Factor.

7/21/2006 2:26 pm  
Blogger mudcrutch79 said...

No one is going to ignore that shit happens, not me, and certainly not James, but isn't the point that generally speaking, I repeat, generally speaking, over a 162 game season, the best way to look at how a team is going to perform is by looking at their run differential?

Yes but it's because shit happens that that's preferable to looking at wins.

7/21/2006 2:40 pm  
Blogger Andy Grabia said...

Right. I just wasn't clear that that was how Vic meant it.

7/21/2006 5:51 pm  
Anonymous lowetide said...

One final note here: Bill James tackled it ALL. I'm not going to go into details on it, but James laid out a stunning look at race and how it affects the development of rookies (pages 68-71, 1987).

Those pages alone told me he had one hell of an editor, I don't care how good a writer he was.

7/21/2006 7:06 pm  
Blogger mudcrutch79 said...

Funny LT - he's famous for not liking to have editors look at his work.

I've probably linked to this before but there's a fantastic site that did Abstracts of the Abstracts: www.baseballanalysts.com. They're located on the left hand bar. Rich Lederer had some great interviews with James.

I don't have the 1987 Abstract unfortunately. I'll have to look for it on ebay.

7/22/2006 1:01 am  
Anonymous lowetide said...

MC:

First 3 paragraphs:

Let me say, before I start this, that nobody likes to write about race. We would all prefer to be colorblind. I was doing these studies, and I had a code in it for player's race, and while I was studying how one group of players developed over time compared to another group, I thought I would do a run of black players against white players, fully expecting that it would show nothing in particular or nothing beyond the outside range of chance, and I would file it away and never mention that I had looked at the issue at all.

In the black/white study, there were 54 rookies in each group--54 non-duplicating white rookies and in each case the one black player who was most similar.

The results were astonishing.

In 44/54 cases, the black player went on to have a better major-league career. In only 10/54, or 18%, did the white player play more games or surpass his black counterpart in most of the major categories.

The black players appeared in 48% more games!

The black players had 66% more hits!

94% more triples.

66% more home runs.

44% more stolen bases.

As rookies, Gus Bell and Hank Aaron were excellent comps.

7/22/2006 1:07 pm  
Anonymous lowetide said...

It should read:

400% more stolen bases.

7/22/2006 1:08 pm  
