Being a Dork, Perl, and Basketball Stats 

For the past year or so, I've been dorking around with some of my friends on the idea of a basketball statistic that attempts to measure what a player brings to a team. You know, take his points, his assists, rebounds, blocks, etc., and throw them all into one big number. It's been a fun diversion, and an excuse to think about math and some web programming again.
The idea is based on the work done at sonicscentral.com towards something that's been called Points Created (an attempt to parallel Bill James' baseball stat "Runs Created"). It's not perfect, but I've had some dorky fun, and, quite frankly, the ratings have come out moderately ok.
Recently, when I realized I could dynamically update this from the web rather than doing it via Excel, I set out to create a Perl script that would enable me to run it, have it grab the latest stats from the invaluable dougstats.com, and then generate the stats for everybody in the NBA.

A few hours later, I had something working.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use CGI;
my $query = CGI->new;
print $query->header;

The basics: the hash-bang, and includes for LWP (to get the data over the web) and CGI (so I can pass in parameters).

my %Data;
my @row;
my @PlayerStats;
my $PlayerName;
my $PlayerStatsString;
my $url = "http://www.dougstats.com/05-06RD.txt";

my $PointsCreated = 0;
my $PCperG = 0;
my $PCper48 = 0;

Here we set up all of the variables. A hash to contain the player data. Arrays for handling a row of data and a row of player statistics. Scalars for the player name, the string of text representing the data, the URL to get the data, and then some internal values for calculating statistics that aren't in the downloaded data.

my $sort = $query->param('sort');

my $stats = get($url);
die "Couldn't get data" unless defined $stats;

@row = split(/\n/, $stats);

shift @row;

Here we get the data. We grab the sort parameter (so I can determine which value to sort on -- more on that later). We go out and get the data (or die, if we can't get it). We split the data on newlines into rows in the array, so each array element is a full row of text. Finally, we shift off the top row, since it's the category text and we don't want that in our stats.
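
A minimal sketch of the split-and-shift step on its own, using a made-up three-line string in place of the downloaded file (the sample rows and the @rows and $header names are just for illustration, and the real file has many more columns). It also tolerates Windows-style line endings, which the script above does not bother with.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the text returned by get($url).
my $stats = "Player Team GP Min\nalpha,a xxx 10 300\nbeta,b yyy 12 250\n";

# One array element per line; \r? also strips a carriage return if present.
my @rows = split /\r?\n/, $stats;

# The first row holds the column headings, so pull it off the front.
my $header = shift @rows;

print "Dropped header: $header\n";
print "Data rows left: ", scalar(@rows), "\n";
```

Note that split drops trailing empty fields, so the final newline in the file does not leave an empty row at the end of the array.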

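foreach (@row) {
    # Skip blank lines so they don't produce an empty player entry.
    next unless /\S/;

    # First field is the player name; everything after it is the stat line.
    ($PlayerName, $PlayerStatsString) = split(/\s+/, $_, 2);
    @PlayerStats = split(/\s+/, $PlayerStatsString);

    # Defensive rebounds = total rebounds [11] - offensive rebounds [10].
    my $DefRebs = $PlayerStats[11] - $PlayerStats[10];

    $PointsCreated = $PlayerStats[18] + (0.75 * $PlayerStats[12])
        + (1.03 * ((0.75 * $PlayerStats[10]) + (0.25 * $DefRebs)
        + $PlayerStats[13] + (0.5 * $PlayerStats[15]) - $PlayerStats[14]
        - (0.71 * ($PlayerStats[5] - $PlayerStats[4]))));

    # Guard against division by zero for players with no games [2] or minutes [3].
    $PCperG  = $PlayerStats[2] ? $PointsCreated / $PlayerStats[2] : 0;
    $PCper48 = $PlayerStats[3] ? ($PointsCreated / $PlayerStats[3]) * 48 : 0;

    $Data{$PlayerName} = [@PlayerStats, $DefRebs, $PointsCreated, $PCperG, $PCper48];
}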

delete $Data{"Player"};

Ok - here's where some of the magic happens. I iterate through each row of data, and split the row into components: the player name and then the combined player stats. Then I split the player stats into individual stat buckets. I build some of the intermediate stats that aren't in the dataset—defensive rebounds, and then the Points Created and Points Created per Game and per 48 minutes.
Toss everything into a big hash, with the hash key set as the player name, and just make sure there's not an element that is the row of column headers (the delete line). I could probably toss this last line ...
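
To make the data structure concrete, here is a small sketch of the same name-plus-stats split feeding a hash of array references. The two rows are invented, and the %data, $rest and @fields names are illustrative rather than taken from the script.

```perl
use strict;
use warnings;

# Hypothetical rows: a name followed by a handful of made-up fields.
my @rows = ("alpha,a xxx 10 300 40", "beta,b yyy 12 250 55");

my %data;
for my $row (@rows) {
    # The limit of 2 keeps everything after the name as one string.
    my ($name, $rest) = split /\s+/, $row, 2;

    # Break the remainder into individual fields.
    my @fields = split /\s+/, $rest;

    # Store a fresh anonymous array under the player's name.
    $data{$name} = [@fields];
}

# Pull one value back out: the third field stored for "alpha,a".
print $data{'alpha,a'}[2], "\n";
```

Keying on the player name makes lookups easy, with the catch that two players sharing a name in the source file would overwrite each other.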

I won't get into the details of the Points Created formula right now, but there's some (limited) intelligence behind those coefficients. Basically, it's an attempt to quantify how many possessions a player creates or loses, turn that into points, and then add in the points the player actually scored to come up with a final total. I've been working on a more refined version with some other folks that better integrates assists and the fact that not all hoops are created equal.
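
For anyone squinting at the array indexes, here is a sketch of the same arithmetic with named inputs. The stat names are assumptions about what each column holds: only games, minutes, and the two rebound columns are evident from the script itself, and the rest (points, assists, steals, blocks, turnovers, field goals made and attempted) are a best guess at the file layout. The points_created sub and the sample numbers are made up for illustration.

```perl
use strict;
use warnings;

# Same coefficients as the script, applied to named stats (names assumed).
sub points_created {
    my (%s) = @_;
    my $def_reb   = $s{total_reb} - $s{off_reb};
    my $missed_fg = $s{fga} - $s{fgm};

    # Net possessions gained or lost, per the weights in the script.
    my $possessions = 0.75 * $s{off_reb} + 0.25 * $def_reb
                    + $s{stl} + 0.5 * $s{blk} - $s{to}
                    - 0.71 * $missed_fg;

    # Points scored, plus credit for assists, plus possessions turned into points.
    return $s{pts} + 0.75 * $s{ast} + 1.03 * $possessions;
}

# Invented stat line, not a real player.
my $pc = points_created(
    pts => 100, ast => 20, off_reb => 10, total_reb => 30,
    stl => 5,   blk => 4,  to      => 8,  fgm       => 40, fga => 90,
);
printf "Points Created: %.2f\n", $pc;
```
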
Quite frankly, that's about it. The rest of the script is just output, dumping the data in a simple table to the screen, and throwing in some links to allow some basic sorting. If you check out the Points Created display (or the possibly improved adjusted Points Created), you can see the results of the work.
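
The output code isn't shown here, so the following is only a guess at the general shape of the sorting step: validate the sort parameter, sort the player names by the chosen column in descending order, and print the rows. The data, the two-column layout, and the $col and @ranked names are all hypothetical, and it prints tab-separated text rather than an HTML table to keep the sketch short.

```perl
use strict;
use warnings;

# Hypothetical data: name => [games, points created].
my %data = (
    'alpha,a' => [10, 250.5],
    'beta,b'  => [12, 310.0],
    'gamma,c' => [ 8, 120.25],
);

# Stand-in for $query->param('sort'); never trust it as an index unchecked.
my $sort = '1';
my $col  = (defined $sort && $sort =~ /^\d+$/ && $sort < 2) ? $sort : 1;

# Numeric sort, highest value first, on the chosen column.
my @ranked = sort { $data{$b}[$col] <=> $data{$a}[$col] } keys %data;

for my $name (@ranked) {
    print join("\t", $name, @{ $data{$name} }), "\n";
}
```

Falling back to a default column when the parameter is missing or out of range means a bad link just shows the default ordering rather than an error.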

The basics: both metrics say that LeBron James has created the most overall points this season. The adjusted method has Allen Iverson edging out James for PC/G, whereas the original has James edging out Iverson. The adjusted method likes point guards a lot more than the original method. Part of me thinks it likes them too much, but what do I know.

In summary: I'm a dork, but not a big enough dork to do this stuff as anything more than a part-time hobby.