Skip to main content

Posts

Showing posts from 2013

Probability and Cumulative Dice Sums

A Stupid and Strange Way of Looking at Sports Power Ratings that could be Smart and Useful

As I've mentioned previously , a common method used in sports for estimating game outcomes known as log5 can be written \[p = \frac{p_1 q_2}{p_1 q_2+q_1 p_2}\] where \(p_i\) is the fraction of games won by team \(i\) and \(q_i\) is the fraction of games lost by team \(i\). We're assuming that there are no ties. What's the easiest way to derive this estimate? Here's one argument. Assume team \(i\) has a probability \(p_i\) of beating an average team (a team that wins half its games). Now imagine that this means for any given game the team has some "strength" sampled from [0,1] with median \(p_i\) and that the stronger team always wins. Thus, the probability that team 1 beats team 2 is \[ p = \int_0^1 \int_0^1 \! \mathrm{Pr}(p_1 > p_2) \, \mathrm{d} p_1 \mathrm{d} p_2 .\] This looks complicated, but but with probability \(p_1\) team 1 is stronger than an average team and with probability \(p_2\) team 2 is stronger than an average team. From this perspecti

Building a Personal Supercomputer

It's time for a workstation upgrade; here's what I've assembled. The massive case has plenty of space for additional drives to store the extreme amount of data that nearly every sport is now collecting together with the associated video footage. The case, power supply and motherboard allow up to 3 additional video cards to use as GPU units as your analytical needs demand (and budget can handle). Cooler Master Cosmos II - Ultra Full Tower Computer Case An absolutely massive case - plenty of room for an E-ATX motherboard, large power supply, multiple video cards and hard drives. PC Power & Cooling 1200W Silencer MK III Series Modular Power Supply An exceptionally large power supply, but platinum rated and plenty of room to spare allows it to operate nearly silenty, plus it'll handle any additional video cards you'll add later for GPU computing. ASUS Rampage IV Black Edition LGA 2011 Extended ATX Intel Motherboard Space for 4 video cards,

The Good, the Bad and the Weird: Duels and the Gentleman's Draw

As I mentioned in the previous article "The Good, the Bad and the Weird: Duels, Truels and Utility Functions" , a classic probability puzzle involves a 3-way duel (called a "truel" ). A, B and C are to fight a three-cornered pistol duel. All know that A's chance of hitting his target is 0.3, C's is 0.5, and B never misses. They are to fire at their choice of target in succession in the order A, B, C, cyclically (but a hit man loses further turns and is no longer shot at) until only one man is left unit. What should A's strategy be? There's a subtle issue involved in these types of problems in that we don't know how each participant values each outcome. If we allow duelists to deliberately miss there are \(2^3-1=7\) possible outcomes; each person may or may not be shot and at least one person will not be shot. Even if deliberate missing isn't allowed, there are still 3 possible outcomes. A, for example, could conceivably value B winning m

A Man, a Plan, a Cat, a Canal - Panama!

A man, a plan, a cat, a master, a minae, a lunula, a tora, a vela, a cola, a belga, a tola, an uta, a dona, a tiara, an aura, a kata, a mana, a nipa, a cabana, a law, a harem, a taces, a liard, a gnat, a mut, a dawed, a bura, a nim, a levator, an eh, a trams, a sall, a vacates, an up, a garb, a keets, a serin, a sored, a patamar, a gater, a lees, a strep, a lama, a pat, a lame, a haded, a wag, a naan, a kalian, a fer, a yaws, a remit, a keep, a loom, a neep, a ha, an ai, a retailor, a cabals, a subah, a slits, a pated, a callet, a xu, a faena, a tups, a sati, a rapider, a reps, a rapid, a korat, a dens, a sled, a tical, a pipages, a rats, a seeks, a defi, a wasts, a peracid, a nomas, a rosarian, a bran, a nonet, a gama, a sac, a scrota, a lab, a cain, an aid, a statin, a reel, a tiff, a haj, a talas, a nares, a mares, a laics, an allot, a til, an arena, a nits, a palace, a fanum, a yarer, a daub, a basts, a masas, a called, a slay, a galoots, a senega, a ton, a span, a mum, a dracaenas

The Good, the Bad and the Weird: Duels, Truels and Utility Functions

In the excellent (and highly recommended) book "Fifty Challenging Problems in Probability with Solution" , Frederick Mosteller poses "The Three-Cornered Duel": A, B and C are to fight a three-cornered pistol duel. All know that A's chance of hitting his target is 0.3, C's is 0.5, and B never misses. They are to fire at their choice of target in succession in the order A, B, C, cyclically (but a hit man loses further turns and is no longer shot at) until only one man is left unit. What should A's strategy be? This is problem 20 in Mosteller's book, and it also appears (with an almost identical solution) in Larsen & Marx "An Introduction to Probability and Its Applications" . Mosteller's solution: A is naturally not feeling cheery about this enterprise. Having the first shot he sees that, if he hits C, B will then surely hit him, and so he is not going to shoot at C. If he shoots at B and misses him, then B clearly shoots the

Solving a Bar Bet Using at Most Three Different Operations

A bar bet as presented in the YouTube video  The HARDEST Puzzle Yet!  involves starting with three of the same number from 0 through 9, then adding mathematical operations that result in an evaluation of 6. For example, if we start with three of the number 6, one solution could be \[ 6+6-6=6 .\] I'll demonstrate a method for solving this bar bet puzzle starting with three of any number, say \( N \), which involves using at most three different mathematical operations (although some of these may be used many, many times). If \( 0 \leq N \leq 2 \) we have the solutions \begin{eqnarray} (0! + 0! + 0!)! &= 6 \\ (1! + 1! + 1!)! &= 6 \\ 2+2+2 &= 6. \end{eqnarray} If \( N\geq 3\), concatenate the three numbers. Repeatedly applying the square-root operation we'll eventually end up with a result \( x \) with \( 3 \leq x < 9\).  If we now take the greatest integer \(\lfloor x \rfloor\) we have an integer \( n \) with \( 3 \leq n \leq 8 \). If we can exhibit solution

Solving Sir Arthur Eddington's Zoo Puzzle using Dirac Matrices

Sir Arthur Eddington posed this difficult logic puzzle to the readers of Caliban's ( Hubert Phillips ) newspaper puzzle columns, and famously (or infamously) provided a solution using Dirac matrices . Sir Eddington's zoo puzzle: I took some nephews and nieces to the Zoo, and we halted at a cage marked Tovus Slithius, male and female.  Beregovus Mimsius, male and female.  Rathus Momus, male and female.  Jabberwockius Vulgaris, male and female.  The eight animals were asleep in a row, and the children began to guess which was which. "That one at the end is Mr Tove." "No, no! It's Mrs Jabberwock," and so on. I suggested that they should each write down the names in order from left to right, and offered a prize to the one who got most names right. As the four species were easily distinguished, no mistake would arise in pairing the animals; naturally a child who identified one animal as Mr Tove identified the other animal of the same species a

Solving the TopSpin Puzzle using GAP

TopSpin is an oval-track permutation puzzle that was made by Binary Arts; similar puzzles are made and sold by other manufacturers. Here's the Binary Arts TopSpin . It's not a difficult puzzle to solve if you play around with it for a few hours and figure out how to generate various permutations. It's more interesting (and difficult) if you observe that the turntable has a distinguishable top and bottom. This suggests an interesting question - can you invert the turntable while keeping the numbers in the track in the same order? The answer is, perhaps surprisingly, yes. Here's one way to find a sequence of operations that produces precisely this outcome. GAP (Groups, Algorithms and Programming) is a freely available programming language that specializes in computational group theory, and it's perfect for solving permutation puzzles. Here's my GAP code for TopSpin. Label the top of the turntable with 21 and the bottom with 22. Flipping the turntable genera

A Slightly Less Pointless Solution to Le Monde puzzle #824

Here's Le Monde puzzle #824 : Show that, for any integer \(y\), \[(\sqrt{3}-1)^{2y}+(\sqrt{3}+1)^{2y}\] is an integer multiple of a power of two. Solution: Consider \[f(n) = (-1+\sqrt{3})^{n}+(-1-\sqrt{3})^{n}\] and observe that the two bases are the roots of the quadratic \(x^2 + 2x - 2\), hence \( f(n) \) obeys the recursion \( x_{n+2} = -2 x_{n+1} + 2 x_n \) with \( x_0=2\) and \(x_1=-2 \). It follows that \( f(n) \) is always an even integer.

Asteroids Algebra

Problem: Waldo is playing Asteroids. This game starts with 3 ships and you earn an extra ship for every 10000 points you score. At the end of his first game, Waldo noticed that he had the lowest score possible while averaging exactly 9000 points per ship. At the end of his second game, Waldo noticed that he had the highest score possible while averaging exactly 9000 points per ship. What were his two scores? Solution: Let the number of starting ships be \(S\), points for an extra ship be \(X\) and the player's average scoring rate be \(A\). Furthermore, let \(R = A/X\) and the number of bonus ships earned by the player during the game be \(B\). The number of bonus ships earned by the player during his game is the floor of his final score over the bonus value \(X\). Algebraically, we have \[ B = \lfloor (S+B)\cdot R \rfloor.\] Now let \[ (S+B)\cdot R = \lfloor (S+B)\cdot R \rfloor + f,\] where \( 0 \leq f < 1 \). This gives us \begin{aligned} (S+B)\cdot R &= B + f ,\

Interview Doubler

Problem: Find the smallest positive number that doubles when you move the last digit to the front. Solution: The answer is \( 105263157894736842 \), and the solution is similar to Twisty Temperature . Let this number be \[ x = x_{n}\cdot 10^{n-1} + ... + x_{1}\cdot 10^1 + x_{0} \] with \( 0 < x_{n} < 10 \), then after moving the last digit to the front we get \[ y = x_{0}\cdot 10^{n-1} + (x-x_{0})/10. \] We also have that \( y = 2 x \), so our equation becomes \[ 19\cdot x = x_0 \cdot (10^{n}-1). \] Now 10 is a primitive root modulo 19, and so it follows that \( x \) is integral if and only if \( n \) is of the form \( 18\cdot m \). Also note that we need \( 2 \le x_0 \le 9 \) since we require \( 0< x_{n} < 10\). When \( m=1 \) and \( x_0 = 2\) we get \( x  =105263157894736842 \); when \( m=2\) and \(x_0 = 2\) we get \[ 2 (10^{36}-1)/19 = 105263157894736842105263157894736842.\] That this number indeed doubles when you move the last digit to the front can be verified

Twisty Temperature

Problem: When Waldo recently did a conversion of a positive integral Celsius temperature \( c = 275 \) to its Fahrenheit equivalent \( f \) (which turned out to be \( 527\) ), he noticed to his amazement that he could have simply moved the last digit of \( c \) to the front to obtain \( f \). Doing some intense calculations he failed to discover the next largest such example. Does one exist, and if so, what is it? Solution: Let \( c = x_{n}\cdot 10^{n-1} + ... + x_{1}\cdot 10^1 + x_{0} \) with \( x_{n} > 0 \), then \[ f = x_{0}\cdot 10^{n-1} + (c-x_{0})/10. \] We also have that \[ f = (9/5)\cdot c + 32. \] Notice that in order for \( f \) to be integral \( c \) must be divisible by 5; this implies that \( x_0=5 \) since it cannot equal 0 (since as a number \( f>c \)). Our equation then becomes \[ (9/5)\cdot c + 32 = 5\cdot 10^{n-1} + (c-5)/10 \] implies \[ c = 5\cdot (10^n - 65)/17. \] Now it turns out that 10 is a primitive root modulo 17, and so it follows that \( c \)

Lunchtime Sports Science: Introducing tanh5

As I mentioned in a previous article on ratings systems , the log5 estimate for participant 1 beating participant 2 given respective success probabilities \( p_1, p_2 \) is \begin{align} p &= \frac{p_1 q_2}{p_1 q_2+q_1 p_2}\\ &= \frac{p_1/q_1}{p_1/q_1+p_2/q_2}\\ \frac{p}{q} &= \frac{p_1}{q_1} \cdot \frac{q_2}{p_2} \end{align} where \( q_1=1-p_1, q_2=1-p_2, q=1-p \). Where does this come from? Assume that both participants each played average opposition. In a Bradley-Terry setting, this means \begin{align} p_1 &= \frac{R_1}{R_1 + 1}\\ p_2 &= \frac{R_2}{R_2 + 1}, \end{align} where \( R_1 \) and \( R_2 \) are the (latent) Bradley-Terry ratings; the \( 1 \) in the denominators is an estimate for the average rating of the participants they've played en route to achieving their respective success probabilities. In a Bradley-Terry setting, it's true that the product of the ratings in the entire pool is taken to equal 1. But participants don'

Lunchtime Sports Science: Fitting a Bradley-Terry Model

Power rankings are game rankings that also allow you to estimate the likely outcome if two opponents were to face each other. One of the simplest of these models is known as the Bradley-Terry-Luce model (or commonly, Bradley-Terry). The idea is that each player \( i \) is assumed to have an unknown rating \( R_i \). If players \( i \) and \( j \) compete, the probability that \( i \) wins under this model is expected to be about \[ \frac{R_i}{R_i + R_j}. \] This model is very popular for hockey and other games; one commonly seen version is called KRACH . Let's fit a Bradley-Terry model to the current season of NCAA D1 men's hockey. The Frozen Four starts on Thursday, April 11, so you'll get to see how well your predictions do. You'll need to have R installed. Once R is installed, install the "BradleyTerry2" package that's freely available for R (thanks to Heather Turner and David Firth). To do this, start R and run the following command; you'll h

Lunchtime Sports Science: Cracking a New Sport

This is the first and what will be a series of relatively short pieces on sports analytics. I'll be using a variety of sports for examples, including both team sports and single-player sports, and I'll also make my code available through my GitHub account . Here are my recommended tools. If you're unfamiliar with some of these, don't worry. You'll pick them up as you go along, and they form a powerful suite that will keep you on the cutting edge even as a professional data scientist. Hardware - Ideally you want at least 4GB of RAM for larger data sets, but you'll be able to do high-level analysis with almost any modern computing hardware. Linux operating system - You can certainly do top-notch data analysis using any operating system, but Linux is an excellent (and free) working environment. There are a variety of ways to install and use Linux, but I'd recommend  Ubuntu's Windows installer . This will allow you to easily install Ubuntu alongside W

Solving Recurrences with Difference Equations

Here's an example from Robert Sedgewick's course on analytic combinatorics . Solve the recurrence \[ a_n = 3 a_{n−1} − 3 a_{n−2} + a_{n−3} \] for \( n>2 \) with \( a_0=a_1=0 \) and \( a_2=1 \). Let \( f(n) = a_n \); in the language of difference equations the above becomes simply \[ \frac{{\Delta^3} f}{{\Delta n}^3 } = 0 . \] Immediately, \[ f(n) = c_2 n^2 + c_1 n + c_0 . \] Applying the initial conditions we get \[ c_0 = 0, c_2 + c_1 = 0, 4 c_2 + 2 c_1 = 1, \] and so the solution is \( a_n =  \frac{1}{2} n^2 - \frac{1}{2} n \). Now what if the initial conditions are changed so \( a_1 = 1 \)?

Baseball, Chess, Psychology and Pychometrics: Everyone Uses the Same Damn Rating System

Here's a short summary of the relationship between common models used in baseball, chess, psychology and education. The starting point for examining the connections between various extended models in these areas. The next steps include multiple attempts, guessing, ordinal and multinomial outcomes, uncertainty and volatility, multiple categories and interactions. There are also connections to standard optimization algorithms (neural  nets, simulated annealing). Baseball Common in baseball and other sports, the log5 method provides an estimate for the probability \( p \) of participant 1 beating participant 2 given respective success probabilities \( p_1, p_2 \). Also let \( q_* = 1 -p_* \) in the following. The log5 estimate of the outcome is then: \begin{align} p &= \frac{p_1 q_2}{p_1 q_2+q_1 p_2}\\   &= \frac{p_1/q_1}{p_1/q_1+p_2/q_2}\\ \frac{p}{q} &= \frac{p_1}{q_1} \cdot \frac{q_2}{p_2} \end{align} The final form uses the odds ratio , \( \frac{p}{q} \).

Mining the First 3.5 Million California Unclaimed Property Records

As I mentioned in my previous article  the state of California has over $6 billion in assets listed in its unclaimed property database .  The search interface that California provides is really too simplistic for this type of search, as misspelled names and addresses are both common and no doubt responsible for some of these assets going unclaimed. There is an alternative, however - scrape the entire database and mine it at your leisure using any tools you want. Here's a basic little scraper written in Ruby . It's a slow process, but I've managed to pull about 10% of the full database in the past 24 hours ( 3.5 million out of about 36 million). What does the distribution of these unclaimed assets look like?  Among those with non-zero cash reported amounts: Total value - $511 million Median value - $15 Mean value - $157 90th percentile - $182 95th percentile - $398 98th percentile - $1,000 99th percentile - $1,937 99.9th percentile - $14,203 99.99th percen

Sergey Brin, Please Pick up your Paychecks

The state of California is currently holding over $6 billion  in unclaimed property belonging to millions of people. What type of property and who are the rightful owners? According to California's official unclaimed property website, these assets fall into the following categories: Bank accounts and safe deposit box contents Stocks, mutual funds, bonds, and dividends Uncashed cashier's checks or money orders Certificates of deposit Matured or terminated insurance policies Estates Mineral interests and royalty payments, trust funds, and escrow accounts People forget, people die, people move around. But $6 billion is a staggering amount of money; some of these amounts have to be really large. Let's try to find some interesting examples. This is official California UCP search form . Programmer and database types will notice one problem immediately - no fuzzy string matching . If your name or address was misspelled on the assets, or munged in the recording proce