What is Overfitting, Anyway?

Kaggle.com runs data science competitions.  One of the first things you learn is to avoid overfitting.

Overfitting is when you create a model that performs very well on a certain set of data, but when you go to test the model on “real” data, it does much worse.  In this case, real data means data that you don’t have the answer to, or (in Kaggle’s case), data that you will be scored on.

I will describe a classic machine learning mistake that I made early on, involving cross-validation. It is also described in this post by Greg Park, and the gory details are covered in the excellent book Elements of Statistical Learning; in particular, read Section 7.10.2 (Second Edition).

Cross-validation is the term used to describe the process of dividing your data into two sets: training and test. The idea is to train, or determine, your model from the training data, then check (test) it on the test data. Pretty simple, right? What could go wrong?

Suppose I had 1000 different signals that I was using to make predictions. Depending on the field you are in, these might go by different names, such as predictors or features. If you are trying to predict the stock market, they could be things like detecting a head-and-shoulders pattern, or two down days followed by an up day. They are things that you think might be predictive. In machine learning, they are generally called features.

Suppose I do the following:

  1. Choose the best features from the 1000 using the entire data set
  2. Use cross-validation to fit those features and get a model

I have just overfit my data!  Why?  This classic mistake is sometimes called survivorship bias in economics.  Since I used the entire data set to pick the best features, by random chance some of those 1000 features will look good, even if they are completely uncorrelated with the result you are trying to predict.  And then in step 2, those features will still look good – and, bingo, I have overfit my model.  Note that this happens even if you use cross-validation in step 1 to pick the best features.

The correct thing to do is:

  1. Select a cross-validation split
  2. Choose the best features using the train set (Do not look at the test set!)
  3. Train your model on the train set.  You may evaluate its performance at the end on the test set, but you cannot use the test set in any way to guide the training.

If you are familiar with k-fold cross-validation, these same steps apply, but you must repeat steps 1-3 for each fold.
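
To make the difference concrete, here is a minimal sketch in Python. I am using scikit-learn purely for illustration (the post itself does not depend on it); the point is that the feature selection step is refit inside every fold, so the held-out fold never influences which features survive.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# 500 samples, 1000 candidate features, only 10 of which actually matter
X, y = make_classification(n_samples=500, n_features=1000, n_informative=10,
                           random_state=0)

# WRONG: selecting features on the full data set lets the test folds leak in.
# X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)

# RIGHT: selection and model live in one pipeline, refit inside every fold.
model = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y,
                         cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print("honest CV accuracy: %.3f" % scores.mean())

If you instead ran the selection on the full data set first (the commented-out line), the cross-validated score would look suspiciously good even when the surviving features were pure noise.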


Optimal Diets

When I read that I should eat food X because it has a lot of Vitamin Q, I always wonder if you could eat all the foods that someone recommends. This food here because it has a lot of Vitamin A, this other one because it has a ton of Vitamin K, and so on. How do you know that that new food isn’t going to overload your diet in some other dimension, like too much sugar, fat, etc?

Well, let’s see how math can help answer that question.
We are going to look at an optimization method called Linear Programming and see if we can figure out an Optimal Diet.

If you want to follow along and try out some of the programs for yourself, they are available at https://gitlab.com/dgrunberg/optimal-diet.  You will need to download the USDA data separately, links are provided in the README file.

DISCLAIMER: Do not follow any diets proposed below without consulting your doctor.  You’ll probably get sick. Or die of a kale overdose.
Nutrition Data
First we have to start with a source of data.  The USDA has conveniently collected nutritional information on 8,000 foods for us at the website  www.ars.usda.gov.  They provide the data on how much of various nutrients are in a 100g serving of each food.  The data is provided in a text file format that is fairly easy to parse with your favorite programming language.

We have massaged the data into the following form, which will make it easier for us to discuss algorithms:

Let
\[
Q=[q_{ij}]
\]
where \(q_{ij}\) is the amount of nutrient \(j\) that a serving of food \(i\) contains.
Then a diet can be represented by a vector \(x=[x_i]\) where each value is the number of servings of that food that we are consuming (per day). This is what mathematicians call thinking abstractly: we can call a string of numbers \(x\) a diet without a second thought.
Then
\[
Q^T x
\]
represents the vector of nutrients in the diet.
Now one more thing – our diet has certain restrictions. Let’s start with minimums: each nutrient may have a minimum amount that we must consume. So, let \(b\)
be the vector of minimum amounts of each nutrient that we require (assume for the moment that every nutrient has a minimum). Then we can write:
\[
Q^T x \geq b
\]
If only some of the nutrients have minimums, we would only include the rows of \(Q^T\) that correspond to them in the inequality.
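
If you want to build \(Q\) yourself from the USDA download, here is a rough sketch of the idea. It assumes the legacy caret-delimited SR text files (FOOD_DES.txt and NUT_DATA.txt, with fields split on ^ and text wrapped in ~); the exact file names and field positions may differ in the version you download, so treat this as a starting point rather than the code from the repository.

import numpy as np

def parse_sr(path):
    # Each line: fields separated by '^', text fields wrapped in '~'
    rows = []
    with open(path, encoding="latin-1") as f:
        for line in f:
            rows.append([x.strip("~") for x in line.rstrip("\r\n").split("^")])
    return rows

foods = {r[0]: r[2] for r in parse_sr("FOOD_DES.txt")}   # food id -> description
nut_data = parse_sr("NUT_DATA.txt")                       # food id, nutrient id, amount per 100 g, ...

food_ids = sorted(foods)
nutrient_ids = sorted({r[1] for r in nut_data})
food_index = {fid: i for i, fid in enumerate(food_ids)}
nutr_index = {nid: j for j, nid in enumerate(nutrient_ids)}

# Q[i, j] = amount of nutrient j in 100 g of food i
Q = np.zeros((len(food_ids), len(nutrient_ids)))
for row in nut_data:
    fid, nid, amount = row[0], row[1], row[2]
    Q[food_index[fid], nutr_index[nid]] = float(amount)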

Linear Programming

Linear Programming is an optimization problem (and solution) that looks like:
Find an x that minimizes
\[
c^T x
\]
subject to:
\[
\begin{align}
Ax &\leq b\\
x &\geq 0
\end{align}
\]

There are numerous ways to access a Linear Programming algorithm – e.g. MATLAB or Octave. In the code provided with this post, we will be using the scipy package scipy.optimize.linprog. What happens when we try it?
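
Before looking at results, here is a toy sketch of how the pieces get handed to linprog. The numbers are made up (three foods, two nutrients) rather than the USDA data, and only minimums are encoded; since linprog expects upper-bound constraints, \(Q^T x \geq b\) goes in as \(-Q^T x \leq -b\).

import numpy as np
from scipy.optimize import linprog

# Made-up mini problem: 3 foods, 2 nutrients (calories, protein in g) per serving
Q = np.array([[200.0,  5.0],
              [150.0,  8.0],
              [300.0,  2.0]])
b = np.array([2000.0, 50.0])   # daily minimums: 2000 kcal, 50 g protein
c = np.ones(Q.shape[0])        # objective: minimize total servings

# Q^T x >= b becomes -Q^T x <= -b; x >= 0 is the default bound.
# method="highs" needs a recent SciPy; drop it to use the default solver.
res = linprog(c, A_ub=-Q.T, b_ub=-b, bounds=(0, None), method="highs")
print(res.x, res.fun)

In the real problem you would also add upper-bound rows (sodium, saturated fat, cholesterol) and an equality constraint to pin calories at 2000.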
Results
My first attempt was to minimize total calories while meeting the RDAs for various nutrients.
It gave the result below, which has 1103 calories. You could fill in the rest of your calories with sugar, as there is no maximum on sugar calories.

food  portions  food-description
10062 1.18  Pork chop
11458 10.42 Spinach
01082 16.23 Milk

Looks appetizing, doesn’t it? Well, as you look into the solution, you see that Vitamin D ends up being a problem, which contributes to the large milk dose.
In any case, I decided to switch to making the total calories fixed at 2000. I needed to find something else to minimize, so I chose total servings. There are some other things that could be interesting, such as price, but we would need a database of food prices to pursue that. Maybe a future post.

The first try with a 2000-calorie diet gives this:

food portions food-description
11090 0.84 Broccoli, raw
11147 0.02 Chard, swiss, raw
16028 1.66 Beans, kidney, all types, mature seeds, cooked, boiled, without salt
20051 2.90 Rice, white, medium-grain, enriched, cooked
18075 0.22 Bread, whole-wheat, commercially prepared
01128 0.33 Egg, whole, cooked, fried
01082 13.33 Milk, lowfat, fluid, 1% milkfat, with added vitamin A and vitamin D
12062 0.80 Nuts, almonds, blanched
11357 1.00 Potatoes, white, flesh and skin, baked
15121 2.78 Fish, tuna, light, canned in water, drained solids

Seems reasonable, except for all the milk. When I tried to limit the milk servings, the optimizer could not find a solution. Here is the result if you temporarily eliminate the Vitamin D requirement:

food portions food-description
11090 0.83 Broccoli, raw
11124 0.89 Carrots, raw
16028 5.76 Beans, kidney, all types, mature seeds, cooked, boiled, without salt
20051 0.53 Rice, white, medium-grain, enriched, cooked
18075 2.08 Bread, whole-wheat, commercially prepared
01270 0.42 Cheese, cheddar, sharp, sliced
15061 2.25 Fish, perch, mixed species, cooked, dry heat
12062 0.71 Nuts, almonds, blanched

Except for all the beans, this looks like it is getting closer to palatable. The problem is that when the optimizer tries to replace the milk with tuna, which also has Vitamin D, it runs into the saturated fat limit. Ah, the problems that constraints give optimizers.

One more try, then I will let you try things out on your own. On a list of high Vitamin D foods, I found salmon and almond milk. Let’s add those as options and see what we get. Here is the full program output so you can see it for yourself:

food id    portions   food-description
11090          0.85  Broccoli, raw
11124          0.73  Carrots, raw
16028          6.00  Beans, kidney, all types, mature seeds, cooked, boiled, without salt
20051          0.77  Rice, white, medium-grain, enriched, cooked
18075          1.62  Bread, whole-wheat, commercially prepared
01270          0.70  Cheese, cheddar, sharp, sliced
15061          1.39  Fish, perch, mixed species, cooked, dry heat
12062          0.45  Nuts, almonds, blanched
15086          1.15  Fish, salmon, sockeye, cooked, dry heat

Calories by category:
Protein                              676      33.2%
Fat                                  585      28.7%
Carbs, less fiber                    775      38.1%
TOTAL                               2035

nutrient      units       min       max     value name                   major contributor
203 PROCNT        g     50.00      0.00    168.99 Protein              (Beans, kid:    52.0)
204 FAT           g      0.00     65.00     65.00 Total lipid (fat)    (Cheese, ch:    23.8)
205 CHOCDF        g      0.00      0.00    250.74 Carbohydrate, by dif (Beans, kid:   136.9)
208 ENERC_KCAL kcal      0.00      0.00   2228.44 Energy               (Beans, kid:   762.3)
291 FIBTG         g     25.00      0.00     57.11 Fiber, total dietary (Beans, kid:    38.4)
301 CA           mg   1300.00      0.00   1300.00 Calcium, Ca          (Cheese, ch:   501.1)
303 FE           mg     18.00      0.00     23.13 Iron, Fe             (Beans, kid:    13.3)
304 MG           mg    420.00      0.00    645.05 Magnesium, Mg        (Beans, kid:   252.1)
305 P            mg   1250.00      0.00   2532.93 Phosphorus, P        (Beans, kid:   828.3)
306 K            mg   4700.00      0.00   4700.00 Potassium, K         (Beans, kid:  2430.9)
307 NA           mg    500.00   1500.00   1500.00 Sodium, Na           (Bread, who:   737.3)
309 ZN           mg     11.00      0.00     16.32 Zinc, Zn             (Beans, kid:     6.0)
312 CU           mg      0.90      0.00      2.61 Copper, Cu           (Beans, kid:     1.3)
315 MN           mg      2.30      0.00      8.79 Manganese, Mn        (Bread, who:     3.5)
317 SE           ug     55.00      0.00    140.90 Selenium, Se         (Bread, who:    41.6)
320 VITA_RAE     ug    900.00   9000.00    900.00 Vitamin A, RAE       (Carrots, r:   607.5)
323 TOCPHA       mg     15.00      0.00     18.08 Vitamin E (alpha-toc (Nuts, almo:    10.8)
324 VITD         IU    800.00      0.00    800.00 Vitamin D            (Fish, salm:   771.1)
401 VITC         mg     90.00      0.00     90.00 Vitamin C, total asc (Broccoli, :    76.1)
404 THIA         mg      1.20      0.00      2.23 Thiamin              (Beans, kid:     1.0)
405 RIBF         mg      1.30      0.00      1.85 Riboflavin           (Beans, kid:     0.3)
406 NIA          mg     16.00      0.00     29.25 Niacin               (Fish, salm:    11.7)
410 PANTAC       mg      5.00      0.00      6.53 Pantothenic acid     (Fish, salm:     1.5)
415 VITB6A       mg      1.70      0.00      2.61 Vitamin B-6          (Fish, salm:     1.0)
417 FOL          ug    400.00      0.00   1018.39 Folate, total        (Beans, kid:   780.3)
418 VITB12       ug      2.40      0.00      8.83 Vitamin B-12         (Fish, salm:     5.1)
430 VITK1        ug    120.00      0.00    161.19 Vitamin K (phylloqui (Broccoli, :    86.7)
601 CHOLE        mg      0.00    300.00    300.00 Cholesterol          (Fish, perc:   160.0)
606 FASAT         g      0.00     20.00     18.60 Fatty acids, total s (Cheese, ch:    13.6)

The Netflix Prize: Singular Values for $1 Million

Data science is hot. Really hot 1.

What can you do with data science? This post deals with one particular application, called collaborative filtering. It’s called collaborative filtering because it is a way of filtering, or processing, data from multiple sources (which are collaborating together) to do something useful. Doesn’t make sense to you? Me neither. For now, just think of collaborative filtering as a way of generating recommendations for users.

The concept of a recommender system was first brought to widespread attention by the announcement of the Netflix Prize. In 2006, Netflix announced a competition to produce a better movie rating prediction system than their current one, at the time called Cinematch. The idea is that Netflix would like to predict how well its users might like individual movies, so that it could recommend movies to them. The more accurately it could predict users’ ratings of movies, the better for its business. The grand prize was $1 million.

Data science competitions have become quite popular now. They are a way for people to learn about and demonstrate (sometimes) their proficiency in the field. The website http://kaggle.com is a prime example of this.

Back to the Netflix Prize. Each movie that is rated by a user is given a 1 to 5 star rating, 5 being the best. One way to measure the accuracy of a movie rating prediction system is to compute the Root-Mean-Square Error, or RMSE, of its predictions. That means you add up the squares of the errors of all its predictions, divide by the number of predictions, and take the square root. This is a common metric, or yardstick (meter-stick?), for errors. Cinematch had an RMSE of 0.9514 on the test data, which means that, roughly speaking, its predictions were within 0.9514 stars of a user’s actual rating on average. The contest offered $1 million to a team that could improve that accuracy by at least 10%, meaning an RMSE of 0.8572 or better.

The actual contest data looked like this. There are 480,189 users and 17,770 movies. If everyone rated every movie, there would be 480,189 x 17,770 (about 8.5 billion) ratings, but of course, not everyone watched or rated every movie. They give you 100,480,507 ratings (1-5 each), which is about 1% of all the possible ratings. They then use about 1.4 million ratings (the test data) that have NOT been provided to you to determine the winners – you provide your estimates for those 1.4 million entries, and the RMSE of your estimates is your final score. There were some other details: a separate set of about 1.4 million ratings (also withheld, the qualifying data) was used for interim submissions and the public leaderboard. Keeping the final test set separate prevented people from gaming the system by repeatedly submitting to learn what some of the true answers were.

How would you go about solving this problem? At first glance, it might seem impossible. How could you predict how much someone would like a movie that they have never seen or rated before?

However, let’s discuss some ways that we could say things about how one viewer, let’s call her Mary, would rate a particular movie, say The Call of the Wild. That is, what should the (Mary, Call of the Wild) pair be rated?

First, some notation. Let \(A\) be the matrix of ratings, with 0 for no rating, otherwise 1-5 for the number of stars. Each row represents a single user, and each column represents a single movie. Let \(a_{ij}\) be the entry for user \(i\) and movie \(j\). Let there be \(m\) users and \(n\) movies. Also, it will be convenient to use an indicator variable \(\delta_{ij}\) to indicate if a (user, movie) pair has a rating:

\begin{align}
A =& \left[ a_{ij} \right] \\
\delta_{ij} =& \begin{cases} 1 & \text{user } i \text{ rated movie } j \\
0 & \text{otherwise}
\end{cases}
\end{align}

Let’s start with the simplest possible system. If we didn’t know anything, what should we try to guess for that pair? We might guess the average rating that all users gave to all movies.
That would be

\[
\text{average rating}=\frac{\sum\limits_{i=1}^m \sum\limits_{j=1}^n a_{ij}} {\sum\limits_{i=1}^m \sum\limits_{j=1}^n \delta_{ij} }
\]

But maybe we can do better. If Mary rated every movie she saw as 5 stars, we might predict the (Mary, Call of the Wild) pair as 5 stars. Likewise, if everyone who saw The Call of the Wild rated it 5 stars, we might predict that pair as 5 stars as well. The equation for predicting each movie’s rating as its average rating is:

\[
\text{average rating for movie k}=M_k=\frac{\sum\limits_{i=1}^m a_{ik}} {\sum\limits_{i=1}^m \delta_{ik} }
\]

It turns out that if you predict each rating as the average of each movie’s rating, you will get an RMSE of 1.0540.
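
Here is a small numpy sketch of those two baselines – the global average and the per-movie average \(M_k\) – on a made-up ratings matrix, with 0 meaning unrated. (In the real contest you would of course measure the RMSE on held-out ratings, not the ones you averaged over.)

import numpy as np

# Toy ratings matrix: rows are users, columns are movies, 0 = not rated
A = np.array([[4, 0, 5, 3],
              [0, 2, 4, 0],
              [5, 1, 0, 4]], dtype=float)
rated = A > 0                                   # the delta_ij indicator

global_mean = A[rated].mean()                   # one number for every prediction
movie_mean = A.sum(axis=0) / rated.sum(axis=0)  # M_k, the per-movie average

def rmse(pred, truth, mask):
    err = (pred - truth)[mask]
    return np.sqrt((err ** 2).mean())

print(rmse(np.full(A.shape, global_mean), A, rated))
print(rmse(np.broadcast_to(movie_mean, A.shape), A, rated))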

But consider this. Suppose Mary liked movies A, B, and C and hated D. And Joe also liked A, B, and C, but hated D. Maybe Mary would like E the same as Joe would. In table form, suppose the data were:






        A  B  C  D  E
Mary    5  4  5  1  -
Joe     5  4  5  1  2
Other   2  2  2  5  -

Or, in our matrix format
\[
A=
\begin{bmatrix}
5 & 4 & 5 & 1 & 0\\
5 & 4 & 5 & 1 & 2\\
2 & 2 & 2 & 5 & 0\\
\end{bmatrix}
\]

where we have set unrated entries to 0.

This would allow us to do even better. There are a number of ad hoc ways that this type of reasoning can be performed, and we will visit some of those methods in a future post.

For now, we would like to see how a more general approach using mathematics could be developed.

Singular Value Decomposition is a method of decomposing, or describing, a matrix

\[A=USV’\]

where \(U, V\) have orthonormal columns and \(S\) is diagonal with non-negative entries. The way to think of this decomposition is that the rows of \(U\) tell us how much each user likes a particular “feature” and the rows of \(V\) (the columns of \(V’\)) tell us how much of each “feature” each movie has. The weights in the diagonal of \(S\) give us the overall importance of those features. What are these “features”, you ask? We don’t know – it’s what the SVD has determined from the data. To get a feel for how this works, imagine that the features are action, good acting, famous actors, and so on. You can see how different movies would have different amounts of these features, and how different viewers would weight these characteristics differently. The beauty of this approach is we do not have to try to determine what the features are – the SVD does it for us.

A key fact that we will use: the best rank-k approximation to a matrix \(A\) is given by
\[
A_k=U_kS_kV_k’
\]
where the subscripted matrices keep only the k largest singular values (and the corresponding columns of \(U\) and \(V\)). This rank-k approximation is best in the Frobenius-norm sense, which should remind us of an RMSE approximation.
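
A quick numpy check of this fact on a small random matrix: the Frobenius error of the rank-k truncation equals the norm of the discarded singular values, which is the Eckart–Young theorem.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep the k largest singular values

# Best possible Frobenius error = norm of the dropped singular values
print(np.linalg.norm(A - A_k, "fro"))
print(np.sqrt((s[k:] ** 2).sum()))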

However, we have a small problem. If we had ratings at every position, then choosing a model \(A_m=USV’\) close to our \(A\) would be exactly this rank-k approximation. Unfortunately, there are a number of 0’s in the \(A\) matrix (for unrated entries), and we only want to count the error at the rated positions. So we want to minimize (we have left off the constant that divides the total squared error by the number of entries, but that does not affect the optimization)

\[
E=\sum_{i=1}^m\sum_{j=1}^n\left(A-USV’\right)_{ij}^2\,\delta_{ij}
\]
where \(\delta_{ij}\) is 1 if there is a rating at position \((i,j)\), and 0 otherwise.

But it seems we are close. See my post on how to solve complex problems [post ref here].

Let us define the SVD k-rank approximation algorithm as
\[
\operatorname{SVD}_k[A] \triangleq U_kS_kV_k’ \quad \text{where } S_k \text{ has been reduced to k-largest singular values}
\]
Then consider the following algorithm to generate a series of linear models \(X^{(k)}\)
\begin{align}
X^{(0)} =& \operatorname{SVD}_k[A] \\
X^{(n+1)} =& \operatorname{SVD}_k \left[ \begin{cases} a_{ij} & \text{where }a_{ij} > 0 \\
X^{(n)}_{ij} & \text{elsewhere}
\end{cases}
\right]
\end{align}

Let’s analyze what is happening here. If one of the models \(X^{(n)}\) fit the data \(A\) exactly, then the next model \(X^{(n+1)}\) would be the same – the \(X^{(n)}\) would be a fixed point of the algorithm.
The convergence of this algorithm is not guaranteed theoretically, but in practice it converges.
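
Here is a minimal numpy sketch of the iteration on the toy ratings matrix from above, with \(k=2\). It illustrates the idea only – for the full Netflix matrix you would need the large-scale machinery discussed next.

import numpy as np

def svd_k(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[5, 4, 5, 1, 0],
              [5, 4, 5, 1, 2],
              [2, 2, 2, 5, 0]], dtype=float)
rated = A > 0
k = 2

X = svd_k(A, k)                        # X^(0): unrated entries start out as 0
for _ in range(200):                   # X^(n+1): refit with unrated entries filled from X^(n)
    X = svd_k(np.where(rated, A, X), k)

print(np.round(X, 2))                  # the first row's last entry is the model's guess for Mary's rating of E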

A Big Matrix
One problem you will immediately run into if you try the above solution is that \(A\) has about 8,500,000,000 entries. Most SVD solvers won’t like this. However, all is not lost. There is a method, called the Arnoldi method, for computing the SVD (eigenvalues, really, but that helps us do an SVD) that does not require the actual matrix \(A\), but instead only requires computing \(Av_i\) for a (small) series of vectors \(v_i\). That is, the Arnoldi method gives you a \(v_i\) and you give it back \(Av_i\), never exposing the full matrix \(A\). Reliable code to do this exists in ARPACK, which has been (and can be) easily integrated into various languages, like C/C++ or Python.
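
As one concrete Python possibility – a sketch only, not the code I used – scipy.sparse.linalg.svds uses ARPACK under the hood and will accept a LinearOperator that only knows how to compute \(Av\) and \(A^T u\). The sparse random matrix below just stands in for the ratings.

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import LinearOperator, svds

# Stand-in for the ratings matrix: 10,000 x 2,000 with ~1% of entries filled
ratings = sparse_random(10000, 2000, density=0.01, format="csr", random_state=0)

# svds only ever asks the operator for A @ v and A.T @ u, never for A itself
op = LinearOperator(ratings.shape,
                    matvec=lambda v: ratings @ v,
                    rmatvec=lambda u: ratings.T @ u)

U, s, Vt = svds(op, k=20)          # 20 largest singular values/vectors via ARPACK
print(np.sort(s)[::-1])            # print them in descending order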

 

Notes:

  1. Davenport, Thomas H., and D.J. Patil. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review, October 2012.

Linear Solve, Part 1

Note: This article assumes you know about matrix and vector math. If not, you might want to peruse an introduction to the subject [note to self: write one].

A set of linear equations in 2 unknowns \(x_1, x_2\)

\begin{align}
b_1& =a_{11} x_1 + a_{12} x_2 \\
b_2& =a_{21} x_1 + a_{22} x_2 \\
b_3& =a_{31} x_1 + a_{32} x_2
\end{align}
can be written as
\begin{align}
Ax=b
\end{align}
where
\begin{equation}
A =\begin{bmatrix}
a_{11}& a_{12}\\
a_{21}& a_{22}\\
a_{31}& a_{32}\\
\end{bmatrix}
\end{equation}
\begin{equation}
x =\begin{bmatrix}
x_1\\
x_2
\end{bmatrix}
\end{equation}

\begin{equation}
b =\begin{bmatrix}
b_1\\
b_2\\
b_3
\end{bmatrix}
\end{equation}

In general, let \(A\) be an \(m \times n\) matrix (that means \(m\) rows and \(n\) columns). Then we are given \(A\) and \(b\) and are asked what values of \(x\) “work” here. There are a number of things that “work” could mean, and various configurations, depending on the sizes of \(A\), \(b\), and \(x\):

  • There might be no solution, i.e. no value of \(x\) makes \(Ax=b\). We might want to know what value of \(x\) makes \(Ax\) as close as possible to \(b\). This is the famous least squares problem, or linear regression.
  • There might be many solutions; we would be interested in characterizing those solutions.
  • There might be exactly one solution; we would be interested in finding it.

Note: In MATLAB or Octave, this would be called solving for x, using the command x=A\b
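
For completeness, the numpy equivalents are np.linalg.lstsq for the least-squares case and np.linalg.solve for a square, nonsingular system. A small sketch with made-up numbers:

import numpy as np

# Overdetermined: 3 equations, 2 unknowns -> least-squares solution
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([7.0, 8.0, 9.0])
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)

# Square and nonsingular: exactly one solution
A2 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
b2 = np.array([5.0, 6.0])
print(np.linalg.solve(A2, b2))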

The linear equation \(Ax=b\) has been called the “most important equation in the world” 1. Strang’s book is an excellent reference for learning linear algebra – no need for another book. The free online course, MIT Introduction to Linear Algebra, taught by Gilbert Strang, uses it.

Want to use linear equations for approximations? See Using Linear Equation for Approximations.

 

Notes:

  1. Strang, Gilbert. Introduction to Linear Algebra, Fifth Ed. Wellesley: Wellesley-Cambridge Press, 2016.