Exploring Data Sculptures at the MIT Museum


This past weekend, budding data scientists got to try their hand at communicating data through storytelling.  Fewer bar charts, more paper and glue.  It's easy to use Excel to create a graph, but is there a better way to grab readers' attention?

Rahul Bhargava, a Research Scientist at the MIT Center for Civic Media, tried to get people thinking about using stories to convey numerical data. Visitors were encouraged to make quick mock-ups on one of three topics.  For example, a one-pager gave some statistics showing the rapid rise in the cost of higher education.  The images above show some of the results.

The workshop took place at the Idea Hub at The MIT Museum in Cambridge, Massachusetts.  The Idea Hub hosts a different topic each weekend day.


What is Overfitting, Anyway?

Kaggle.com runs data science competitions.  One of the first things you learn is to avoid overfitting.

Overfitting is when you create a model that performs very well on a certain set of data but does much worse when you test it on "real" data.  In this case, real data means data that you don't have the answers for, or (in Kaggle's case) data that you will be scored on.
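
To make that concrete, here is a minimal NumPy sketch (the library, the degree-15 polynomial, and the noisy sine data are all my own illustrative choices, not anything from Kaggle): the model fits the training points almost perfectly but typically does noticeably worse on fresh data drawn the same way.

    # Sketch of overfitting: a very flexible model nails the training data
    # but does much worse on new data from the same source. All choices
    # here (degree, noise level, sample sizes) are arbitrary illustrations.
    import numpy as np
    from numpy.polynomial import polynomial as P

    rng = np.random.default_rng(0)

    def make_data(n):
        x = rng.uniform(-1, 1, n)
        y = np.sin(3 * x) + rng.normal(0, 0.2, n)   # true signal plus noise
        return x, y

    x_train, y_train = make_data(20)
    x_test, y_test = make_data(200)

    coeffs = P.polyfit(x_train, y_train, deg=15)    # a very flexible model

    def mse(x, y):
        return np.mean((P.polyval(x, coeffs) - y) ** 2)

    print("train MSE:", mse(x_train, y_train))      # very small
    print("test  MSE:", mse(x_test, y_test))        # typically much larger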

I will describe a classic mistake in machine learning that I made early on involving cross-validation. It is described in this post by Greg Park, and the gory details are covered in the excellent book Elements of Statistical Learning; in particular, read Section 7.10.2 (Second Edition).

Cross-validation is the term used to describe the process of dividing your data into two sets: training and test. The idea is to train, or determine, your model from the training data, then check (test) it on the test data. Pretty simple, right? What could go wrong?
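
As a concrete sketch of a single train/test split (assuming scikit-learn and some made-up data; none of this comes from any particular competition):

    # Sketch of a basic train/test split: fit on the training portion only,
    # score on the held-out portion. The data and model are placeholders.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))                    # 500 samples, 10 features (made up)
    y = (X[:, 0] + rng.normal(0, 0.5, 500) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)   # train on the train set only
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))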

Suppose I had 1000 different signals that I was using to make predictions. Depending on the field you are in, these go by different names: predictors, features, and so on. If you are trying to predict the stock market, they could be things like detecting a head-and-shoulders pattern, or two down days followed by an up day. They are things that you think might be predictive. In machine learning, they are generally called features.

Suppose I do the following:

  1. Choose the best features from the 1000 using the entire data set
  2. Use cross-validation to fit those features and get a model

I have just overfit my data!  Why?  This classic mistake is sometimes called survivorship bias in economics.  Since I used the entire data set to pick the best features, some of those 1000 features will look good by random chance, even if they are completely uncorrelated with the outcome you are trying to predict.  Then, in step 2, those same features will still look good, and, bingo, I have overfit my model.  Note that this happens even if you use cross-validation in step 1 to pick the best features.
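
Here is a small sketch of that mistake, assuming scikit-learn and pure-noise data I made up for illustration: even though the 1000 features carry no signal at all, selecting the "best" of them on the full data set before cross-validating produces scores that look comfortably better than chance.

    # The leaky (wrong) procedure: feature selection sees the entire data
    # set, including the rows that will later serve as test folds.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1000))             # 1000 features of pure noise
    y = rng.integers(0, 2, 100)                  # random labels: nothing to learn

    X_best = SelectKBest(f_classif, k=20).fit_transform(X, y)   # uses ALL the data
    scores = cross_val_score(LogisticRegression(), X_best, y, cv=5)
    print("leaky CV accuracy:", scores.mean())   # typically well above 0.5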

The correct thing to do is:

  1. Select a cross-validation split
  2. Choose the best features using the train set (Do not look at the test set!)
  3. Train your model on the train set.  You may evaluate its performance at the end on the test set, but you cannot use the test set in any way to guide the training.

If you are familiar with k-fold cross-validation, the same idea applies, but you must repeat the feature selection and training (steps 2 and 3) separately for each fold.
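
For comparison, here is a sketch of the correct procedure under the same assumptions (scikit-learn, the same pure-noise data as above): wrapping the feature selection in a Pipeline means it is re-fit on the training portion of each fold only, and the cross-validated accuracy falls back to roughly chance level.

    # The correct procedure: feature selection lives inside the Pipeline,
    # so it only ever sees each fold's training split.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1000))             # 1000 features of pure noise
    y = rng.integers(0, 2, 100)                  # random labels

    pipeline = Pipeline([
        ("select", SelectKBest(f_classif, k=20)),   # fit on each fold's train split only
        ("clf", LogisticRegression()),
    ])
    scores = cross_val_score(pipeline, X, y, cv=5)  # selection happens inside each fold
    print("honest CV accuracy:", scores.mean())     # hovers around 0.5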

Notes:

Boston area Computer Science meetup

Those in the Boston area might be interested in the activities of this meetup:

Theoretical Computer Science Problem-Solving (Cambridge, MA)

A joint effort to learn, discuss and tackle the fundamental problems and theorems of Theoretical Computer Science, including Program Synthesis, Theory of Computation…

Check out this Meetup Group →