Automatic model selection: H2O AutoML
In this post, we will use H2O AutoML for automatic model selection and tuning. This is an easy way to get a well-tuned model with minimal effort spent on model selection and hyperparameter tuning.
We will use the Titanic dataset from Kaggle and apply some feature engineering to the data before running H2O AutoML.
Load Dataset
# Handle table-like data and matrices
import numpy as np
import pandas as pd
# get titanic & test csv files as a DataFrame
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
Feature Engineering
The first step is to drop two features. We remove 'Cabin' and 'Ticket' because they would require more involved feature engineering, which is beyond the scope of this post.
train.pop('Cabin')
test.pop('Cabin')
train.pop('Ticket')
test.pop('Ticket')
We extract the passenger title from the Name feature and group the titles into four categories.
dataset_title = [i.split(',')[1].split('.')[0].strip() for i in train['Name']]
train['Title'] = dataset_title
train['Title'].head()
dataset_title = [i.split(',')[1].split('.')[0].strip() for i in test['Name']]
test['Title'] = dataset_title
test['Title'].head()
# Convert to categorical values Title
train["Title"] = train["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train["Title"] = train["Title"].map({"Master":'0', "Miss":'1', "Ms":'1', "Mme":'1', "Mlle":'1', "Mrs":'1', "Mr":'2', "Rare":'3'})
test["Title"] = test["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
test["Title"] = test["Title"].map({"Master":'0', "Miss":'1', "Ms":'1', "Mme":'1', "Mlle":'1', "Mrs":'1', "Mr":'2', "Rare":'3'})
train.pop('Name')
test.pop('Name')
Filling missing values
We fill missing values with the mean value for numerical features and the most frequent value for categorical features.
train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(test['Age'].mean())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Embarked'] = test['Embarked'].fillna(test['Embarked'].mode()[0])
train['Fare'] = train['Fare'].fillna(train['Fare'].mean())
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())
Mean target encoding
For several features, we add a column containing the mean of the target (Survived) for each value of that feature, computed on the training data.
means = train.groupby('Age').Survived.mean()
train['Age_mean_target'] = train['Age'].map(means)
test['Age_mean_target'] = test['Age'].map(means)
means = train.groupby('Pclass').Survived.mean()
train['PClass_mean_target'] = train['Pclass'].map(means)
test['PClass_mean_target'] = test['Pclass'].map(means)
means = train.groupby('Title').Survived.mean()
train['Title_mean_target'] = train['Title'].map(means)
test['Title_mean_target'] = test['Title'].map(means)
means = train.groupby('Embarked').Survived.mean()
train['Embarked_mean_target'] = train['Embarked'].map(means)
test['Embarked_mean_target'] = test['Embarked'].map(means)
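One caveat with this approach: a value that appears only in the test set (which is likely for Age, since the two frames were imputed with their own means) has no entry in the training lookup and maps to NaN. A simple fallback, not part of the original post, is to fill those gaps with the global training mean of Survived:
# Values seen only in the test set have no lookup entry and map to NaN;
# fall back to the overall survival rate from the training data
global_mean = train['Survived'].mean()
mean_target_cols = ['Age_mean_target', 'PClass_mean_target', 'Title_mean_target', 'Embarked_mean_target']
for col in mean_target_cols:
    test[col] = test[col].fillna(global_mean)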
Log transformation for Fare
train["Fare"] = train["Fare"].apply(log1p)
test["Fare"] = test["Fare"].apply(log1p)
Convert numerical feature to categorical
def num2cat(x):
    return str(x)
train['Pclass_cat'] = train['Pclass'].apply(num2cat)
test['Pclass_cat'] = test['Pclass'].apply(num2cat)
train.pop('Pclass')
test.pop('Pclass')
Family size feature
We extract the family size for each passenger.
train['Family'] = train['SibSp'] + train['Parch'] + 1
test['Family'] = test['SibSp'] + test['Parch'] + 1
train.pop('SibSp')
test.pop('SibSp')
train.pop('Parch')
test.pop('Parch')
Getting Dummies from all other categorical features
We apply one-hot encoding to all remaining categorical features.
for col in train.dtypes[train.dtypes == 'object'].index:
    for_dummy = train.pop(col)
    train = pd.concat([train, pd.get_dummies(for_dummy, prefix=col)], axis=1)
for col in test.dtypes[test.dtypes == 'object'].index:
    for_dummy = test.pop(col)
    test = pd.concat([test, pd.get_dummies(for_dummy, prefix=col)], axis=1)
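Note that encoding the train and test frames separately can leave them with mismatched dummy columns whenever a category appears in only one of the two. A minimal way to realign them (this step is an addition, not in the original notebook) is to reindex the test frame on the training feature columns:
# Align test columns with the training features (everything except the target),
# creating any missing dummy columns as zeros and dropping extras
feature_cols = [c for c in train.columns if c != 'Survived']
test = test.reindex(columns=feature_cols, fill_value=0)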
Model selection and tuning
This is the core of this post. We will use H2O AutoML for model selection and tuning.
import h2o
from h2o.automl import H2OAutoML
h2o.init()
H2O cluster uptime: 22 secs
H2O cluster version: 3.16.0.3
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_unknownUser_ogu663
H2O cluster total nodes: 1
H2O cluster free memory: 25.14 Gb
H2O cluster total cores: 32
H2O cluster allowed cores: 32
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 3.6.4 final
We load the train and test data on H2O and select the training features and target feature.
htrain = h2o.H2OFrame(train)
htest = h2o.H2OFrame(test)
x = htrain.columns
y = 'Survived'
x.remove(y)
# For classification, the target column must be converted to a factor
htrain[y] = htrain[y].asfactor()
#htest[y] = htest[y].asfactor()
For the AutoML function, we just specify how long we want to train for and we’re set. For this example, we will train for 120 seconds.
aml = H2OAutoML(max_runtime_secs=120)
aml.train(x=x, y=y, training_frame=htrain)
lb = aml.leaderboard
print(lb)
print('Generate predictions...')
test_y = aml.leader.predict(htest)
test_y = test_y.as_data_frame()
model_id | auc | logloss |
StackedEnsemble_AllModels_0_AutoML_20180119_093938 | 0.878817 | 0.399037 |
StackedEnsemble_BestOfFamily_0_AutoML_20180119_093938 | 0.878523 | 0.400079 |
GBM_grid_0_AutoML_20180119_093938_model_0 | 0.877174 | 0.419855 |
GBM_grid_0_AutoML_20180119_093938_model_3 | 0.877033 | 0.414869 |
DRF_0_AutoML_20180119_093938 | 0.874259 | 0.535281 |
GBM_grid_0_AutoML_20180119_093938_model_2 | 0.872272 | 0.418382 |
GBM_grid_0_AutoML_20180119_093938_model_1 | 0.871849 | 0.419048 |
GLM_grid_0_AutoML_20180119_093938_model_0 | 0.868148 | 0.416836 |
DeepLearning_0_AutoML_20180119_093938 | 0.866239 | 0.419353 |
XRT_0_AutoML_20180119_093938 | 0.866177 | 0.786824 |
[14 rows x 3 columns]
In 120 seconds, AutoML trained 14 models, including Gradient Boosting (GBM), Random Forest (DRF), Extremely Randomized Trees (XRT), GLM, and Deep Learning models. It also stacked these models into ensembles to get a better AUC score.
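The leader model is available directly on the AutoML object, so you can inspect or persist it with the standard H2O calls (a minimal sketch; the target directory below is just an example):
# Show details of the leading model (the stacked ensemble at the top of the leaderboard)
print(aml.leader)
# Save it to disk for later scoring
model_path = h2o.save_model(model=aml.leader, path="./models", force=True)
print(model_path)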
This is very powerful, saves a lot of time when first deciding on a model and its parameters, and can point you in the right direction.
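Finally, to turn the predictions into a Kaggle submission, here is a minimal sketch (assuming the usual PassengerId/Survived format; the 'predict' column holds the class label returned by H2O, and the file name is just an example):
# Build the submission file from the predicted class labels
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': test_y['predict'].astype(int)
})
submission.to_csv('titanic_automl_submission.csv', index=False)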