All my homies hate LASSO
I fell out of love with LASSO. It's not you, it's me, baby :(
💡 lessons from a macro HF PM on building 150+ forecasting models
🏰 (lack of) invariance of regularisation under transformation/change of basis
(this post is meant to be free, so if you RT my tweet for this post I'll be thankful and give you access 😄)
The intuition behind regression regularisation is super simple.
- Standard OLS coefficients might be too big (large variance)
- We introduce a penalty term (bias)
- And we hope to get a lower mean squared error
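To make that trade concrete, here's a minimal sketch with scikit-learn (toy data, made-up penalty strengths): Ridge's L2 penalty shrinks every coefficient towards zero, while LASSO's L1 penalty can push some of them to exactly zero.

```python
# Minimal sketch of the two classic penalties (toy data, arbitrary alphas).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=2.0, size=n)  # noisy target

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks every coefficient
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can zero coefficients out entirely

print("sum |coef|  OLS / Ridge / Lasso:",
      *(np.abs(m.coef_).sum().round(2) for m in (ols, ridge, lasso)))
print("exact zeros in Lasso:", int((lasso.coef_ == 0).sum()))
```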
But we need to go deeper. LASSO is considered a sparse estimation method, while Ridge is considered a dense one. That sounds obvious, but I rarely ever thought about it in that framework before. Rarely did I ever stop to interrogate myself: does the data generating process I'm modelling actually look sparse, or dense?
Check out this exchange between @choffstein and Vivek about using LASSO/Ridge/PCA/PLS to estimate stock returns:
> Now, let’s talk about why those perform poorly: they all try to collapse the feature space. As a general rule, trying to collapse the feature space underperforms methods that use the full range of features. Linear Ridge, on the other hand, will share loadings across collinear signals. If you have collinear signals, you don’t want to collapse them into one, which PCR and PLS effectively do, or to push one out, which LASSO effectively does. Linear Ridge shares the loading between collinear features. And that helps when you’re predicting noisy variables, and of course, stock returns are noisy.
Effectively, what Vivek was arguing is that stock returns are more likely to be represented by a dense model than a sparse one. And if this is true, collapsing the feature space or dropping variables is likely to be very detrimental.
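As a quick illustration of Vivek's point, here's a toy simulation (all numbers made up for the demo): a dense DGP where two collinear signals each carry a loading of 0.5, and the target is very noisy, like stock returns.

```python
# Toy simulation of the argument above: a dense DGP with two collinear signals.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
n = 500
common = rng.normal(size=n)
x1 = common + 0.1 * rng.normal(size=n)  # two noisy copies of the same signal
x2 = common + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=2.0, size=n)  # very noisy target

# Ridge tends to split the loading roughly evenly across the collinear pair...
print("Ridge:", Ridge(alpha=50.0).fit(X, y).coef_.round(2))
# ...while Lasso tends to concentrate it on one and push the other towards zero.
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))
```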
Do I have an opinion on this issue? Not really; to be honest, I've barely done any work on stock return forecasting, so what do I know.
Regardless, in this article we will try to answer these questions:
❓ When is LASSO effective and when does it fail?
❓ Is Ridge good enough to be your L̶o̶r̶d̶ and Saviour?
Most of the time, my instinct is to go for whichever regularisation method has the better error profile, etc. If there's one thing I took away from writing this article, it's not to do that. And as we will see later in this article, not thinking deeply about the data generating process (DGP) is definitely going to be detrimental to our estimation task.