Extensive compilation of topics on machine learning and data science
Overview of models
- Scikit-Learn (or sklearn) library
- Overview of k-NN (sklearn’s documentation)
- Overview of Linear Models (sklearn’s documentation)
- Overview of Decision Trees (sklearn’s documentation)
- Overview of algorithms and parameters in the H2O documentation
Feature preprocessing
- Preprocessing in Sklearn
- Andrew Ng on gradient descent and feature scaling
- Feature Scaling and the effect of standardization for machine learning algorithms
Feature generation
- Discover Feature Engineering, How to Engineer Features and How to Get Good at It
- Discussion of feature engineering on Quora
Feature extraction from text
Bag of words
Word2vec
- Word2vec tutorial
- Tutorial on word2vec usage
- Text Classification With Word2Vec
- Introduction to Word Embedding Models with Word2Vec
NLP Libraries
Feature extraction from images
Pretrained models
Finetuning
- How to Retrain Inception’s Final Layer for New Categories in TensorFlow
- Fine-tuning Deep Learning Models in Keras
Stack and packages
- Basic SciPy stack (IPython, NumPy, pandas, Matplotlib)
- Jupyter Notebook
- Stand-alone Python t-SNE package
- Libraries to work with sparse CTR-like data: LibFM
- Libraries to work with sparse CTR-like data: LibFFM
- Another tree-based method: RGF
- Python distribution with all-included packages: Anaconda
- Blog “datas-frame” (contains posts about effective Pandas usage)
- Vowpal Wabbit repository
- XGBoost repository
- LightGBM repository
- Interactive demo of simple feed-forward Neural Net
- Framework for Neural Nets: Keras
- Framework for Neural Nets: PyTorch
- Framework for Neural Nets: TensorFlow
- Framework for Neural Nets: MXNet
- Framework for Neural Nets: Lasagne
- Example from sklearn with different decision surfaces
- Arbitrary order factorization machines
Visualization tools
Validation
Classification metrics
- Evaluation Metrics for Classification Problems: Quick Examples + References
- Decision Trees: “Gini” vs. “Entropy” criteria
- Understanding ROC curves
Ranking
- Learning to Rank using Gradient Descent – the original paper on a pairwise method for AUC optimization
- Overview of further developments of RankNet
- RankLib (implementations of the two papers above)
- Learning to Rank Overview
Clustering
Hyperparameter tuning
- Tuning the hyper-parameters of an estimator (sklearn)
- Optimizing hyperparameters with hyperopt
- Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python
Matrix Factorization
t-SNE
- Multicore t-SNE implementation
- Comparison of Manifold Learning methods (sklearn)
- How to Use t-SNE Effectively (distill.pub blog)
- t-SNE homepage (Laurens van der Maaten)
- Example: t-SNE with different perplexities (sklearn)
Feature interactions
- Facebook Research’s paper about extracting categorical features from trees
- Example: Feature transformations with ensembles of trees (sklearn)
Ensembling
- Kaggle ensembling guide at MLWave.com (overview of approaches)
- StackNet: a computational, scalable and analytical meta modelling framework (by KazAnova)
- Heamy: a set of useful tools for competitive data science (including ensembling)
Kaggle past solutions
- http://ndres.me/kaggle-past-solutions/
- https://www.kaggle.com/wiki/PastSolutions
- http://www.chioka.in/kaggle-competition-solutions/
- https://github.com/ShuaiW/kaggle-classification/