Kaggle.com runs data science competitions. One of the first things you learn is to avoid overfitting.
Overfitting is when you build a model that performs very well on one particular set of data, but does much worse when you test it on “real” data. Here, real data means data that you don’t have the answers to, or, in Kaggle’s case, data that you will be scored on.
I will describe a classic mistake in machine learning that I made early on involving cross-validation. It is described in this post by Greg Park. The gory details are covered in the excellent book The Elements of Statistical Learning; in particular, read section 7.10.2 (Second Edition).
Cross-validation is the term used to describe the process of dividing your data into two sets: training and test. The idea is to train, or determine, your model from the train data, then check (test) it on the test data. Pretty simple, right? What could go wrong?
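For concreteness, here is what a single split might look like in scikit-learn. This is just a minimal sketch on made-up random data; the 80/20 split and the library choice are my own illustrative assumptions.

```python
# A minimal sketch of a single train/test split, assuming scikit-learn.
# The data here is random noise, just to show the mechanics.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))    # 500 samples, 10 features
y = rng.integers(0, 2, size=500)  # binary labels

# Hold out 20% of the rows as the test set; training only ever sees X_train/y_train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```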
Suppose I had 1000 different signals that I was using to make predictions. Depending on the field you are in, these go by different names: predictors, features, and so on. If you are trying to predict the stock market, they could be things like detecting a head-and-shoulders pattern or two down days followed by an up day, things like that. They are signals that you think might be predictive. In machine learning, they are generally called features.
Suppose I do the following:
1. Choose the best features from the 1000 using the entire data set
2. Use cross-validation to fit a model on those features
I have just overfit my data! Why? This classic mistake is sometimes called survivorship bias in economics. Since I used the entire data set to pick the best features, some of those 1000 features will look good purely by random chance, even if they are completely uncorrelated with the results you are looking for. Then in step 2, those features will still look good, and, bingo, I have overfit my model. Note that this happens even if you use cross-validation in step 1 to pick the best features.
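To see how bad this can get, here is a sketch of the wrong procedure on purely random data (again assuming scikit-learn; the selector and model are arbitrary illustrative choices). The labels have nothing to do with the features, so an honest estimate should sit around 50% accuracy, yet selecting features on the full data set first typically pushes the cross-validated score well above that.

```python
# The WRONG way, sketched on random data: feature selection sees the whole data set.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))  # 50 samples, 1000 pure-noise features
y = rng.integers(0, 2, size=50)  # labels unrelated to X

# Step 1 (the mistake): pick the 10 "best" features using ALL of the data.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Step 2: cross-validate a model on those pre-selected features.
scores = cross_val_score(LogisticRegression(), X_selected, y, cv=5)
print(scores.mean())  # typically far above 0.5 -- an optimistic, overfit estimate
```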
The correct thing to do is:
1. Select a cross-validation split
2. Choose the best features using only the train set (do not look at the test set!)
3. Train your model on the train set. You may evaluate its performance at the end on the test set, but you cannot use the test set in any way to guide the training. (These three steps are sketched in code below.)
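On the same random data, following these steps removes the leak. Again, this is just a sketch with the same illustrative selector and model:

```python
# The correct order for a single split: the selector is fit on the training rows only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))  # 50 samples, 1000 pure-noise features
y = rng.integers(0, 2, size=50)  # labels unrelated to X

# Step 1: pick the split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: choose features using ONLY the train set.
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)

# Step 3: train on the train set; evaluate once at the end on the test set.
model = LogisticRegression().fit(selector.transform(X_train), y_train)
print(model.score(selector.transform(X_test), y_test))  # hovers around chance, as it should
```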
If you are familiar with k-fold cross-validation, these same steps apply, but you must repeat steps 1-3 for each fold.
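In scikit-learn, one convenient way to get that per-fold repetition for free is to wrap the feature selector and the model in a Pipeline and hand the whole thing to the cross-validation routine, so the selection is re-fit inside each training fold. A sketch, using the same kind of random data as above:

```python
# k-fold done correctly: the selector lives inside the pipeline, so each fold
# repeats the feature selection on its own training portion only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))  # pure-noise features
y = rng.integers(0, 2, size=50)  # labels unrelated to X

pipeline = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())  # now close to 0.5 on pure noise, as expected
```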