Skip to main content

Probability and Cumulative Dice Sums

A Very Rough Guide to Getting Started in Data Science: Part I, MOOCs

Introduction


Data science is a very hot, perhaps the hottest, field right now. Sports analytics has been my primary area of interest, and it's a field that has seen amazing growth in the last decade. It's no surprise that the most common question I'm asked is about becoming a data scientist. This will be a first set of rough notes attempting to answer this question from my own personal perspective. Keep in mind that this is only my opinion and there are many different ways to do data science and become a data scientist.

Data science is using data to answer a question. This could be doing something as simple as making a plot by hand, or using Excel to take the average of a set of numbers. The important parts of this process are knowing which questions to ask, deciding what information you'd need to answer it, picking a method that takes this data and produces results relevant to your question and, most importantly, how to properly interpret these results so you can be confident that they actually answer your question.

Knowing the questions requires some domain expertise, either yours or someone else's. Unless you're a data science researcher, data science is a tool you apply to another domain.

If you have the data you feel should answer your question, you're in luck. Frequently you'll have to go out and collect the data yourself, e.g. scraping from the web. Even if you already have the data, it's common to have to process the data to remove bad data, correct errors and put it into a form better suited for analysis. A popular tool for this phase of data analysis is a scripting language; typically something like Python, Perl or Ruby. These are high-level programming languages that very good at web work as well as manipulating data.

If you're dealing with a large amount of data, you'll find that it's convenient to store it in a structured way that makes it easier to access, manipulate and update in the future. This will typically be a relational database of some type, such as PostgreSQL, MySQL or SQL Server. These all use the programming language SQL.

Methodology and interpretation are the most difficult, broadest and most important parts of data science. You'll see methodology referenced as statistical learning, machine learning, artificial intelligence and data mining; these can be covered in statistics, computer science, engineering or other classes. Interpretation is traditionally the domain of statistics, but this is always taught together with methodology.

You can start learning much of this material freely and easily with MOOCs. Here's an initial list.

MOOCs

Data Science Basics

Johns Hopkins: The Data Scientist’s Toolbox. Overview of version control, markdown, git, GitHub, R, and RStudio. Started January 5, 2015. Coursera.

Johns Hopkins: R Programming. R-based. Started January 5, 2015. Coursera.

Scripting Languages

Intro to Computer Science. Python-based. Take anytime. Udacity; videos and exercises are free.


Programming Foundations with Python. Python-based. Take anytime. Udacity; videos and exercises are free.



MIT: Introduction to Computer Science and Programming Using Python. Python-based. Class started January 9, 2015. edX.



Databases and SQL

Stanford: Introduction to Databases. XML, JSON, SQL; uses SQLite for SQL. Self-paced. Coursera.


Machine Learning

Stanford: Machine Learning. Octave-based. Class started January 19, 2015. Coursera.


Stanford: Statistical Learning. R-based. Class started January 19, 2015. Stanford OpenEdX.



Comments

  1. Hello,
    The Article on A Very Rough Guide to Getting Started in Data Science is nice .it give detail information about data science .Thanks for Sharing the information about it.
    big data scientist

    ReplyDelete

Post a Comment

Popular posts from this blog

Notes on Setting up a Titan V under Ubuntu 17.04

I recently purchased a Titan V GPU to use for machine and deep learning, and in the process of installing the latest Nvidia driver's hosed my Ubuntu 16.04 install. I was overdue for a fresh install of Linux, anyway, so I decided to upgrade some of my drives at the same time. Here are some of my notes for the process I went through to get the Titan V working perfectly with TensorFlow 1.5 under Ubuntu 17.04. Old install: Ubuntu 16.04 EVGA GeForce GTX Titan SuperClocked 6GB 2TB Seagate NAS HDD + additional drives New install: Ubuntu 17.04 Titan V 12GB / partition on a 250GB Samsung 840 Pro SSD (had an extra around) /home partition on a new 1TB Crucial MX500 SSD New WD Blue 4TB HDD + additional drives You'll need to install Linux in legacy mode, not UEFI, in order to use Nvidia's proprietary drivers for the Titan V. Note that Linux will cheerfully boot in UEFI mode, but will not load any proprietary drivers (including Nvidia's). You'll need proprietary d

Mixed Models in R - Bigger, Faster, Stronger

When you start doing more advanced sports analytics you'll eventually starting working with what are known as hierarchical, nested or mixed effects models . These are models that contain both fixed and random effects . There are multiple ways of defining fixed vs random random effects , but one way I find particularly useful is that random effects are being "predicted" rather than "estimated", and this in turn involves some "shrinkage" towards the mean. Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my hockey GitHub repository ): model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id) fit <- glmer(model, data=g, verbose=TRUE, family=poisson(link=log) ) The fixed effects are year , field (home/away/neutral), d_div (NCAA division of the defense), o_div (NCAA division of the offense) and game_length (number of overtime

A Bayes' Solution to Monty Hall

For any problem involving conditional probabilities one of your greatest allies is Bayes' Theorem . Bayes' Theorem says that for two events A and B, the probability of A given B is related to the probability of B given A in a specific way. Standard notation: probability of A given B is written \( \Pr(A \mid B) \) probability of B is written \( \Pr(B) \) Bayes' Theorem: Using the notation above, Bayes' Theorem can be written:  \[ \Pr(A \mid B) = \frac{\Pr(B \mid A)\times \Pr(A)}{\Pr(B)} \] Let's apply Bayes' Theorem to the Monty Hall problem . If you recall, we're told that behind three doors there are two goats and one car, all randomly placed. We initially choose a door, and then Monty, who knows what's behind the doors, always shows us a goat behind one of the remaining doors. He can always do this as there are two goats; if we chose the car initially, Monty picks one of the two doors with a goat behind it at random. Assume we pick Door 1 an