Made with Python and CatBoost
This project was for a machine learning applications course. It was completed within a week and was the first ML model I'd worked on alone. The project goal was the same for everyone: given a dataset of real music lyrics, classify each song into one of three genres (Rock, Hip Hop, or Pop).
I needed to build a classifier, but I had no domain knowledge of lyric composition. There may well be beautiful patterns that sort lyrics neatly into genres, but I didn't know any of them, so useful feature extraction from the lyrics was hard. Simple quantities like lyric length or verse count vary too much to be enough on their own, and with limited time to work on this, it would have been difficult to gain a deeper knowledge of lyrical composition.
The challenge here was in processing the lyrics. Each song's lyrics were one (long) text feature, and text, like other categorical data, is tricky to convert into numerical features for standard regression methods. That's when I began to wonder if I really needed to work that hard converting the categorical data to numerical myself. Surely there must be something already out there that handles text features easily.
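For context, the manual route I was hoping to avoid would look something like TF-IDF vectorization. This is a sketch using scikit-learn's TfidfVectorizer; the toy lyrics here are made up for illustration, not from the course dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy lyrics standing in for the real dataset
lyrics = [
    "we will rock you all night long",
    "drop the beat and spit the rhyme",
    "baby love me one more time",
]

# Each lyric becomes a sparse numeric vector of word weights,
# with one column per unique word in the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lyrics)
print(X.shape)  # (n_songs, n_unique_words)
```

Workable, but it means hand-tuning tokenization, vocabulary size, and n-grams yourself before any model ever sees the data.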
While exploring some ML libraries, I came across CatBoost. CatBoost is a gradient boosting library, and it was especially interesting for its focus on working with categorical data. Digging deeper into the CatBoost docs (which was a challenge in itself), I found that CatBoost actually has built-in support for text features.
Implementing CatBoost on the lyrics was then super easy:
# Build the model ('lyrics' here is the text column's name)
model = CatBoostClassifier(iterations=100, loss_function='MultiClass')
model.fit(X_train, y_train, text_features=['lyrics'])
The model's accuracy: ~68-70%
This accuracy turned out to be really good, comparatively speaking. But I didn't realize how good when I submitted it. In fact, I had fumbled my first submission: to my embarrassment, I had tested the model on the holdout set rather than the testing set. So when I submitted for a second time at 3 AM the day before the deadline, I was really just hoping I had at least done ok.
Draft 1 accuracy:
From: ML Professor
The ABSOLUTE distance between your predicted accuracy and the real accuracy is: 0.125
This is quite a lot.
When I submitted my first draft for testing, it turned out I had made quite a few mistakes. CatBoost's docs are extensive but light on the finer details of its capabilities, which can make it hard to understand what CatBoost can do and how. Regardless, after the poor results of the first submission, I had to decide on my next approach.
CatBoost was not explicitly taught in the course. The professor encouraged us to use libraries outside those he had demonstrated in class, so I was immediately eager to find something new. I don't think I could carefully explain to you how gradient boosting works, but I didn't need to for the context of the project. Still, there I was, staring at my broken CatBoost model, wondering if I should throw it all out the window and use something more tried and true, like Naive Bayes with K-Fold cross-validation. Instead, I chose to keep going with CatBoost. I believed that if I researched it more, I'd understand how to properly implement the text feature.
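For comparison, the "tried and true" fallback I was weighing would have looked roughly like this: a TF-IDF vectorizer feeding a Naive Bayes classifier, scored with K-Fold cross-validation. The data here is made up; the real pipeline would run on the course dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, three songs per genre (illustrative only)
lyrics = [
    "we will rock you", "loud guitars tonight", "heavy drums and solos",
    "spit the rhyme", "drop the beat now", "cold flow on the mic",
    "baby love me", "one more time", "dance all night long",
]
genres = ["Rock"] * 3 + ["Hip Hop"] * 3 + ["Pop"] * 3

# Vectorize inside the pipeline so each CV fold fits its own vocabulary
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 3-fold cross-validation: one accuracy score per fold
scores = cross_val_score(pipeline, lyrics, genres, cv=3)
```

Perfectly reasonable, but it would have meant abandoning everything I'd learned about CatBoost so far.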
Draft 2 accuracy:
From: ML Professor
The ABSOLUTE distance between your predicted accuracy and the real accuracy is: 0.001
You have a very good accuracy, congratulations!
For the second draft, I looked further into CatBoost, beyond the official docs. After watching this very helpful video, I was able to find and fix the mistakes I had made earlier. Checking the accuracy of the model, this time it looked much more promising!
After submitting it for the final time, I found out during a later class that the submission with the highest accuracy was one that used CatBoost. Apparently, only one person used CatBoost, so I guess we can infer whose submission that was ;)