### Introduction

Data science is a very hot, perhaps the hottest, field right now. Sports analytics has been my primary area of interest, and it's a field that has seen amazing growth in the last decade. It's no surprise that the most common question I'm asked is about becoming a data scientist. This will be a first set of rough notes attempting to answer this question from my own personal perspective. Keep in mind that this is only my opinion and there are many different ways to do data science and become a data scientist.

Data science is using data to answer a question. This could be something as simple as making a plot by hand, or using Excel to take the average of a set of numbers. The important parts of this process are knowing which questions to ask, deciding what information you'd need to answer them, picking a method that takes this data and produces results relevant to your question and, most importantly, knowing how to properly interpret these results so you can be confident that they actually answer your question.

Knowing the questions requires some domain expertise, either yours or someone else's. Unless you're a data science researcher, data science is a tool you apply to another domain.

If you have the data you feel should answer your question, you're in luck. Frequently you'll have to go out and collect the data yourself, e.g. by scraping it from the web. Even if you already have the data, it's common to have to process it to remove bad records, correct errors and put it into a form better suited for analysis. A popular tool for this phase of data analysis is a scripting language, typically something like Python, Perl or Ruby. These are high-level programming languages that are very good at web work as well as manipulating data.
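As a rough illustration of this cleaning step, here is a minimal Python sketch. The column names and the rows in `RAW` are made up for the example (they are not real data), and dropping unparseable rows is only one of several reasonable policies.

```python
import csv
import io

# Hypothetical raw data: stray whitespace, a missing value, a non-numeric entry.
RAW = """team,runs_scored,runs_allowed
 Team A ,853,694
Team B,,688
Team C,seven hundred,693
Team D,700,754
"""

def clean_rows(text):
    """Keep only rows whose numeric fields parse; trim whitespace in names."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            cleaned.append({
                "team": row["team"].strip(),
                "runs_scored": int(row["runs_scored"]),
                "runs_allowed": int(row["runs_allowed"]),
            })
        except ValueError:
            # Missing or non-numeric value: skip the row.
            continue
    return cleaned

rows = clean_rows(RAW)
print(rows)
```

In practice you might prefer to log or repair bad rows rather than silently skip them.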

If you're dealing with a large amount of data, you'll find that it's convenient to store it in a structured way that makes it easier to access, manipulate and update in the future. This will typically be a relational database of some type, such as PostgreSQL, MySQL or SQL Server. These all use the programming language SQL.
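To give a feel for what this looks like, here is a small sketch using SQLite through Python's standard `sqlite3` module. SQLite is swapped in only because it needs no server; the `games` table and its rows are hypothetical, and the same SQL ideas carry over to the databases named above.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (team TEXT, runs_scored INTEGER, runs_allowed INTEGER)"
)

# Hypothetical game results, inserted with parameter placeholders.
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [("Team A", 5, 3), ("Team A", 2, 4), ("Team B", 7, 1)],
)

# Aggregate per team with plain SQL.
totals = conn.execute(
    "SELECT team, SUM(runs_scored), SUM(runs_allowed) "
    "FROM games GROUP BY team ORDER BY team"
).fetchall()
print(totals)
conn.close()
```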

Methodology and interpretation are the most difficult, broadest and most important parts of data science. You'll see methodology referenced as statistical learning, machine learning, artificial intelligence and data mining; these can be covered in statistics, computer science, engineering or other classes. Interpretation is traditionally the domain of statistics, but this is always taught together with methodology.

You can start learning much of this material freely and easily with MOOCs. Here's an initial list.

### MOOCs

#### Data Science Basics

Johns Hopkins: The Data Scientist’s Toolbox. Overview of version control, markdown, git, GitHub, R, and RStudio. Started January 5, 2015. Coursera.

Johns Hopkins: R Programming. R-based. Started January 5, 2015. Coursera.

#### Scripting Languages

Intro to Computer Science. Python-based. Take anytime. Udacity; videos and exercises are free.

Programming Foundations with Python. Python-based. Take anytime. Udacity; videos and exercises are free.

MIT: Introduction to Computer Science and Programming Using Python. Python-based. Class started January 9, 2015. edX.

#### Databases and SQL

Stanford: Introduction to Databases. XML, JSON, SQL; uses SQLite for SQL. Self-paced. Coursera.

#### Machine Learning

Stanford: Machine Learning. Octave-based. Class started January 19, 2015. Coursera.

Stanford: Statistical Learning. R-based. Class started January 19, 2015. Stanford OpenEdX.

### A Bayes' Solution to Monty Hall

For any problem involving conditional probabilities one of your greatest allies is Bayes' Theorem. Bayes' Theorem says that for two events A and B, the probability of A given B is related to the probability of B given A in a specific way.

Standard notation:

The probability of A given B is written $$\Pr(A \mid B)$$, and the probability of B is written $$\Pr(B)$$.

Bayes' Theorem:

Using the notation above, Bayes' Theorem can be written:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\times \Pr(A)}{\Pr(B)}$$

Let's apply Bayes' Theorem to the Monty Hall problem. If you recall, we're told that behind three doors there are two goats and one car, all randomly placed. We initially choose a door, and then Monty, who knows what's behind the doors, always shows us a goat behind one of the remaining doors. He can always do this as there are two goats; if we chose the car initially, Monty picks one of the two doors with a goat behind it at random.

Assume we pick Door 1 and then Monty sho…
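The setup described above can be checked with a short Bayes' Theorem calculation in Python. This is only a sketch: the function names are invented for the example, and the choice that Monty opens Door 3 after we pick Door 1 is an assumption made for illustration.

```python
from fractions import Fraction

DOORS = (1, 2, 3)

def likelihood(opened, car, picked):
    """Pr(Monty opens `opened` | car is behind `car`, we picked `picked`)."""
    if opened == picked or opened == car:
        return Fraction(0)      # Monty never opens our door or the car's door
    if car == picked:
        return Fraction(1, 2)   # two goat doors available, chosen at random
    return Fraction(1)          # only one goat door is left for him to open

def posterior(picked, opened):
    """Pr(car behind each door | Monty opened `opened`), via Bayes' Theorem."""
    prior = Fraction(1, 3)      # car placed uniformly at random
    joint = {car: likelihood(opened, car, picked) * prior for car in DOORS}
    evidence = sum(joint.values())   # Pr(B): Monty opens this door
    return {car: p / evidence for car, p in joint.items()}

# Assumed scenario: we pick Door 1 and Monty opens Door 3.
result = posterior(picked=1, opened=3)
print(result)
```

Using `Fraction` keeps the probabilities exact, so the posterior for staying versus switching can be compared directly.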

### What's the Value of a Win?

In a previous entry I demonstrated one simple way to estimate an exponent for the Pythagorean win expectation. Another nice consequence of a Pythagorean win expectation formula is that it also makes it simple to estimate the run value of a win in baseball, the point value of a win in basketball, the goal value of a win in hockey etc.

Let our Pythagorean win expectation formula be

$$w=\frac{P^e}{P^e+1},$$

where $$w$$ is the win fraction expectation, $$P$$ is the ratio of runs scored to runs allowed (or similar) and $$e$$ is the Pythagorean exponent. How do we get an estimate for the run value of a win? The expected number of games won in a season with $$g$$ games is

$$W = g\cdot w = g\cdot \frac{P^e}{P^e+1},$$

so for one estimate we only need to compute the value of the partial derivative $$\frac{\partial W}{\partial P}$$ at $$P=1$$. Note that

$$W = g\left( 1-\frac{1}{P^e+1}\right),$$

and so

$$\frac{\partial W}{\partial P} = g\frac{eP^{e-1}}{(P^e+1)^2}$$

and it follows $$\frac{\partial W}{\partial P}(P=1) =$$ …

### Solving a Math Puzzle using Physics

The following math problem, which appeared on a Scottish maths paper, has been making the internet rounds. The first two parts require students to interpret the meaning of the components of the formula $$T(x) = 5 \sqrt{36+x^2} + 4(20-x)$$, and the final "challenge" component involves finding the minimum of $$T(x)$$ over $$0 \leq x \leq 20$$. Usually this would require differentiation, but if you know Snell's law you can write down the solution almost immediately. People normally think of Snell's law in the context of light and optics, but it's really a statement about least time across media permitting different velocities. One way to phrase Snell's law is that least travel time is achieved when

$$\frac{\sin{\theta_1}}{\sin{\theta_2}} = \frac{v_1}{v_2},$$

where $$\theta_1, \theta_2$$ are the angles to the normal and $$v_1, v_2$$ are the travel velocities in the two media.

In our puzzle the crocodile has an implied travel velocity of 1/5 in the water …
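As a quick numerical sanity check, the Snell's law condition can be compared against a brute-force minimization of $$T(x)$$. This is a sketch under stated assumptions: the land velocity of 1/4 and the river width of 6 are read off the formula for $$T(x)$$, and the second angle is taken to be 90 degrees on the assumption that the land leg runs along the bank. The function names are made up for the example.

```python
import math

def travel_time(x):
    """T(x) from the puzzle: water leg plus land leg."""
    return 5 * math.sqrt(36 + x**2) + 4 * (20 - x)

def snell_x(v_water=1/5, v_land=1/4, width=6.0):
    """Crossing point from Snell's law, assuming theta_2 = 90 degrees."""
    sin_theta1 = v_water / v_land          # sin(theta_1) / 1 = v_1 / v_2
    tan_theta1 = sin_theta1 / math.sqrt(1 - sin_theta1**2)
    return width * tan_theta1              # x = width * tan(theta_1)

# Brute-force search over 0 <= x <= 20 on a grid of step 0.001.
grid = [i / 1000 for i in range(20001)]
best_x = min(grid, key=travel_time)

print(snell_x(), best_x, travel_time(best_x))
```

If the two values of x agree, the physics shortcut and the calculus-style minimization are telling the same story.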