We are at the halfway mark! Six weeks left until the end of DSI.
I missed the blog update for Week 5, but these last two weeks were spent on Project 3, where the goal was to predict housing prices using the Ames housing dataset found on Kaggle. Compared to the first two projects, this one was by far the hardest, requiring us to try many types of regression models and to squeeze every ounce of the knowledge we had gained from the lessons in class.
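To give a flavour of the kind of workflow the project involved, here is a minimal sketch of a regularised regression pipeline. This is not the actual project code: it assumes scikit-learn and substitutes a tiny synthetic dataset for the real Ames data so it runs on its own.

```python
# Sketch of a house-price regression workflow (synthetic stand-in data,
# not the real Ames dataset from Kaggle).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 200
X = np.column_stack([
    rng.uniform(500, 4000, n),    # living area (sq ft) -- hypothetical feature
    rng.integers(1, 5, n),        # number of bedrooms  -- hypothetical feature
])
# Price as a noisy linear function of the features
y = 50_000 + 120 * X[:, 0] + 10_000 * X[:, 1] + rng.normal(0, 20_000, n)

# Hold out a test set, fit a ridge regression, and score it
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print(round(r2_score(y_test, model.predict(X_test)), 3))
```

The real project, of course, involved far messier feature engineering and model comparison than this toy version suggests.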
We also had Garrett come in as a guest lecturer to talk about Kaggle competitions (and how to optimise to win them). He knows his stuff – after all, his Kaggle rank is impressive! But after learning about ensemble models and how the top Kagglers spend so much time tuning their parameters just to improve their score by 0.0001, I came away with a slightly negative view of participating in Kaggle competitions. The main question at the back of my mind was, “what is the real, tangible benefit of your model beating the last one by just 0.0001? Would that help the company make more money? Would it help diagnose a problem that much quicker, relative to the time spent tuning the model?” I am interested in practical applications with observable results in the real world, not bragging rights about how a model is ‘highly predictive’ to the nth decimal place. So I think, for me at least, doing Kaggle competitions will be more of a fun hobby than anything else.
A quick whirlwind summary of what we have been through after six weeks:
- Foundations, beginning with statistics and programming in Python
- EDA, data cleaning and manipulation
- Databases (SQL)
- Classical statistical models in regression and classification
- Web scraping and APIs
It is amazing how much depth each of these topics has – to be honest, I think every single one of them merits further study and revision for a better understanding, but a bootcamp will not give you much time to do so. I will probably have to revisit them after the course ends to consolidate what I have learnt.