We’re a group of tech enthusiasts who are studying Machine Learning together. Our backgrounds are diverse, and so are our interests, ranging from robotics and NLP to finance. We meet once a week for a whole day and, well, we study.
It took about a month to understand, apply and finish the course material on data pre-processing, classification and clustering. So we were all set to dive right into our first Kaggle challenge: predicting the survivors of the Titanic. I wrote about it earlier.
NaN. Not our favorite.
We downloaded the dataset and examined it. Of course, we had some prior knowledge that helped us select features: we knew that women and children, as well as the upper class, were much more likely to have survived, so sex, age and fare looked like the most important features to us. But we didn’t want to exclude any features, because there might have been correlations our machines could figure out much better than we ever could. So we decided to keep everything except the obviously irrelevant features, such as Passenger ID.
It took about six hours in total, four to five of which went into pre-processing and figuring things out. We clearly made some mistakes and probably didn’t choose the best model in the end (since our knowledge and skills are limited to some basic classification and clustering). One of the problems we had was replacing NaN within multi-class categorical data: since NaN is a floating-point value rather than a regular category, it was a hassle to get rid of. We lost quite a lot of time on data pre-processing and small problems like that. Once the dataset was ready to train, we tried different things, experimented with tweaking parameters (like the number of neighbours in KNN, or increasing C in svm.SVC) and compared algorithms.
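For anyone fighting the same NaN battle: a minimal sketch of the kind of cleanup we mean, using pandas on a toy frame with made-up values (the column names mirror the Titanic dataset, but the rows here are hypothetical). The usual trick is to fill categorical gaps with the most frequent value and numeric gaps with the median:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Titanic columns (hypothetical values).
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S", np.nan, "Q"],
    "Age": [22.0, np.nan, 30.0, np.nan, 28.0, 19.0],
})

# Categorical column: replace NaN with the most frequent value (the mode).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Numeric column: replace NaN with the median of the known values.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df["Embarked"].tolist())  # → ['S', 'C', 'S', 'S', 'S', 'Q']
print(int(df.isna().sum().sum()))  # → 0 missing values left
```

The point is that NaN never survives a comparison (`np.nan == np.nan` is False), so filtering it out by equality checks fails silently; `fillna`/`isna` are the tools that actually see it.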
We also analyzed the features themselves in multiple ways and came up with some ideas for feature engineering. Finally, we were ready to submit our models. The accuracy was far from perfect, but not too bad for newbies, I guess.
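To give a flavour of what feature engineering means here, two ideas that are popular on this dataset (shown on hypothetical sample rows; I’m not claiming these are exactly the ones we submitted): combining SibSp and Parch into a family size, and extracting the honorific from the Name column.

```python
import pandas as pd

# Tiny sample with Titanic-style columns (hypothetical rows).
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],   # siblings/spouses aboard
    "Parch": [0, 0],   # parents/children aboard
})

# Idea 1: family size = relatives aboard + the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Idea 2: the honorific between the comma and the period
# ("Mr", "Miss", "Master", ...) encodes sex, age group and status.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.")

print(df[["FamilySize", "Title"]])
```

Derived columns like these often carry more signal than the raw ones: a classifier sees "family of five" directly instead of having to learn the SibSp + Parch interaction on its own.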
Our first milestone. We’re ready for the next.