May Merkle-Tan (MMT) -- Sept 2018¶
Rewrite, in Python, of the general analysis described in http://rpubs.com/hengrumay/Predict_PerformanceMode.

General task and analysis descriptions are available at the link above.
• Synopsis :
Wearable devices that monitor physical activity are on the rise and provide a wealth of useful information. Apart from measuring quantity of activity, assessing the manner in which an activity is performed could also improve remote human activity monitoring. The Weight Lifting Exercise Dataset (Velloso, Bulling, Gellersen, Ugulino, Fuks, 2013) provides a means to derive a “proof-of-concept” in decoding the mode of weight lifting performance. Data was acquired from sensors on the belt, forearm, arm, and dumbbell of 6 participants as they performed barbell lifts correctly and incorrectly in 5 different ways. Further information is available from http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises.

Some differences:

- removed extreme data points (cf. Rcode)
- data-scaling (cf. Rcode)
- includes helper-functions for Plots, ML assessments as separate py-file

This version:

- no explicit grid-search | hyperparameter tuning involved...

Requirements :¶
python --version : Python 3.6.4

Version of Modules used:¶
ipython==6.2.1
ipython-genutils==0.2.0
nose==1.3.7
pycosat==0.6.3
tqdm==4.19.1.post1
matplotlib==2.2.2
seaborn==0.9.0
numpy==1.14.3
numpydoc==0.7.0
pandas==0.23.3
python-dateutil==2.6.1
scipy==1.0.0
scikit-learn==0.19.1

Preliminaries:¶

- Indicate if saving generated figures¶

Choose savefig = True to run without displaying figures in the notebook; this generates high-res PDF or PNG files of the figures in figFolder.

Otherwise, savefig = False displays the relevant figures in the notebook.

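import os
import subprocess

# Map the accepted responses (case-insensitive) to a boolean flag
save_fig = {'yes': True, 'no': False}

Response = input('Save generated figures in this analysis? \n\nRespond with Yes|No ..... :  ')

# Normalize the response; anything other than Yes|No defaults to not saving
savefig = save_fig.get(Response.strip().lower(), False)

if savefig:
    print('\n\n >>> Saving figures ......')

    # Same effect as `mkdir -p figFolder`, without relying on a shell command
    os.makedirs('figFolder', exist_ok=True)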

Save generated figures in this analysis? 

Respond with Yes|No ..... :  No

IMPORT/LOAD libraries and set Settings :¶

## Display Settings: 

from IPython.display import display, HTML

display(HTML(data= """
                    <style>
                        div#notebook-container    { width: 80%; }
                        div#menubar-container     { width: 80%; }
                        div#maintoolbar-container { width: 80%; }
                        
                        .output_png {
                                     display: table-cell;
                                     text-align: center;
                                     vertical-align: middle;
                                    }
                    </style>
                   """
            )
       )

## IMPORT/LOAD libraries and set Settings :

# import subprocess
import os
import glob

## Plotting -------------------------------------------------------------------------------------
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
#from matplotlib.backends.backend_pdf import PdfPages

import seaborn as sns
sns.set_style(style = 'white')
sns.set_context("notebook", 
                font_scale=1.125, 
                rc={"lines.linewidth": 2.5})

## Arrays | ETL | DataFrames -------------------------------------------------------------------
import numpy as np
np.seterr(all='ignore')

import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
# pd.set_option('display.max_colwidth',30)
pd.set_option('display.width', 1800)

from datetime import datetime #as dt
from dateutil.parser import parse

# ----------------------------------------------------------------------------------------------    

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

    
# ML|Stats-related -----------------------------------------------------------------------------
# from scipy import stats

from sklearn import preprocessing # for scaling

#from sklearn.linear_model import LogisticRegression
#from sklearn.neural_network import MLPClassifier
# from sklearn import tree 
from sklearn import ensemble 

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import matthews_corrcoef 
from sklearn import model_selection  # replaces sklearn.cross_validation (deprecated in 0.18, removed in 0.20)

# ### ML modeling -- load helper functions 
import MMT_MLStatsPlotFuncs_PredictExCat_v0 as MMTfuncs

Load DATA Sources¶

http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises

# DATA Source | http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises

trainURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

df_train = MMTfuncs.ReadRAWdataFromSource(trainURL)
# df_train.head()#.T 
df_train.info()
# df_train.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19622 entries, 0 to 19621
Columns: 159 entries, user_name to classe
dtypes: datetime64[ns](1), float64(94), int64(26), object(38)
memory usage: 23.8+ MB

df_test = MMTfuncs.ReadRAWdataFromSource(testURL)
# df_test.head()#.T 
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Columns: 159 entries, user_name to problem_id
dtypes: datetime64[ns](1), float64(124), int64(30), object(4)
memory usage: 24.9+ KB

General Exploration -- brief summary:¶

- Missing Data & Value Counts etc.¶

## check NaNs...
df_train.isna().sum()

user_name                       0
raw_timestamp_part_1            0
raw_timestamp_part_2            0
cvtd_timestamp                  0
new_window                      0
num_window                      0
roll_belt                       0
pitch_belt                      0
yaw_belt                        0
total_accel_belt                0
kurtosis_roll_belt          19216
kurtosis_picth_belt         19216
kurtosis_yaw_belt           19216
skewness_roll_belt          19216
skewness_roll_belt.1        19216
skewness_yaw_belt           19216
max_roll_belt               19216
max_picth_belt              19216
max_yaw_belt                19216
min_roll_belt               19216
min_pitch_belt              19216
min_yaw_belt                19216
amplitude_roll_belt         19216
amplitude_pitch_belt        19216
amplitude_yaw_belt          19216
var_total_accel_belt        19216
avg_roll_belt               19216
stddev_roll_belt            19216
var_roll_belt               19216
avg_pitch_belt              19216
stddev_pitch_belt           19216
var_pitch_belt              19216
avg_yaw_belt                19216
stddev_yaw_belt             19216
var_yaw_belt                19216
gyros_belt_x                    0
gyros_belt_y                    0
gyros_belt_z                    0
accel_belt_x                    0
accel_belt_y                    0
accel_belt_z                    0
magnet_belt_x                   0
magnet_belt_y                   0
magnet_belt_z                   0
roll_arm                        0
pitch_arm                       0
yaw_arm                         0
total_accel_arm                 0
var_accel_arm               19216
avg_roll_arm                19216
stddev_roll_arm             19216
var_roll_arm                19216
avg_pitch_arm               19216
stddev_pitch_arm            19216
var_pitch_arm               19216
avg_yaw_arm                 19216
stddev_yaw_arm              19216
var_yaw_arm                 19216
gyros_arm_x                     0
gyros_arm_y                     0
gyros_arm_z                     0
accel_arm_x                     0
accel_arm_y                     0
accel_arm_z                     0
magnet_arm_x                    0
magnet_arm_y                    0
magnet_arm_z                    0
kurtosis_roll_arm           19216
kurtosis_picth_arm          19216
kurtosis_yaw_arm            19216
skewness_roll_arm           19216
skewness_pitch_arm          19216
skewness_yaw_arm            19216
max_roll_arm                19216
max_picth_arm               19216
max_yaw_arm                 19216
min_roll_arm                19216
min_pitch_arm               19216
min_yaw_arm                 19216
amplitude_roll_arm          19216
amplitude_pitch_arm         19216
amplitude_yaw_arm           19216
roll_dumbbell                   0
pitch_dumbbell                  0
yaw_dumbbell                    0
kurtosis_roll_dumbbell      19216
kurtosis_picth_dumbbell     19216
kurtosis_yaw_dumbbell       19216
skewness_roll_dumbbell      19216
skewness_pitch_dumbbell     19216
skewness_yaw_dumbbell       19216
max_roll_dumbbell           19216
max_picth_dumbbell          19216
max_yaw_dumbbell            19216
min_roll_dumbbell           19216
min_pitch_dumbbell          19216
min_yaw_dumbbell            19216
amplitude_roll_dumbbell     19216
amplitude_pitch_dumbbell    19216
amplitude_yaw_dumbbell      19216
total_accel_dumbbell            0
var_accel_dumbbell          19216
avg_roll_dumbbell           19216
stddev_roll_dumbbell        19216
var_roll_dumbbell           19216
avg_pitch_dumbbell          19216
stddev_pitch_dumbbell       19216
var_pitch_dumbbell          19216
avg_yaw_dumbbell            19216
stddev_yaw_dumbbell         19216
var_yaw_dumbbell            19216
gyros_dumbbell_x                0
gyros_dumbbell_y                0
gyros_dumbbell_z                0
accel_dumbbell_x                0
accel_dumbbell_y                0
accel_dumbbell_z                0
magnet_dumbbell_x               0
magnet_dumbbell_y               0
magnet_dumbbell_z               0
roll_forearm                    0
pitch_forearm                   0
yaw_forearm                     0
kurtosis_roll_forearm       19216
kurtosis_picth_forearm      19216
kurtosis_yaw_forearm        19216
skewness_roll_forearm       19216
skewness_pitch_forearm      19216
skewness_yaw_forearm        19216
max_roll_forearm            19216
max_picth_forearm           19216
max_yaw_forearm             19216
min_roll_forearm            19216
min_pitch_forearm           19216
min_yaw_forearm             19216
amplitude_roll_forearm      19216
amplitude_pitch_forearm     19216
amplitude_yaw_forearm       19216
total_accel_forearm             0
var_accel_forearm           19216
avg_roll_forearm            19216
stddev_roll_forearm         19216
var_roll_forearm            19216
avg_pitch_forearm           19216
stddev_pitch_forearm        19216
var_pitch_forearm           19216
avg_yaw_forearm             19216
stddev_yaw_forearm          19216
var_yaw_forearm             19216
gyros_forearm_x                 0
gyros_forearm_y                 0
gyros_forearm_z                 0
accel_forearm_x                 0
accel_forearm_y                 0
accel_forearm_z                 0
magnet_forearm_x                0
magnet_forearm_y                0
magnet_forearm_z                0
classe                          0
dtype: int64

## check NaNs...
df_test.isna().sum()

user_name                    0
raw_timestamp_part_1         0
raw_timestamp_part_2         0
cvtd_timestamp               0
new_window                   0
num_window                   0
roll_belt                    0
pitch_belt                   0
yaw_belt                     0
total_accel_belt             0
kurtosis_roll_belt          20
kurtosis_picth_belt         20
kurtosis_yaw_belt           20
skewness_roll_belt          20
skewness_roll_belt.1        20
skewness_yaw_belt           20
max_roll_belt               20
max_picth_belt              20
max_yaw_belt                20
min_roll_belt               20
min_pitch_belt              20
min_yaw_belt                20
amplitude_roll_belt         20
amplitude_pitch_belt        20
amplitude_yaw_belt          20
var_total_accel_belt        20
avg_roll_belt               20
stddev_roll_belt            20
var_roll_belt               20
avg_pitch_belt              20
stddev_pitch_belt           20
var_pitch_belt              20
avg_yaw_belt                20
stddev_yaw_belt             20
var_yaw_belt                20
gyros_belt_x                 0
gyros_belt_y                 0
gyros_belt_z                 0
accel_belt_x                 0
accel_belt_y                 0
accel_belt_z                 0
magnet_belt_x                0
magnet_belt_y                0
magnet_belt_z                0
roll_arm                     0
pitch_arm                    0
yaw_arm                      0
total_accel_arm              0
var_accel_arm               20
avg_roll_arm                20
stddev_roll_arm             20
var_roll_arm                20
avg_pitch_arm               20
stddev_pitch_arm            20
var_pitch_arm               20
avg_yaw_arm                 20
stddev_yaw_arm              20
var_yaw_arm                 20
gyros_arm_x                  0
gyros_arm_y                  0
gyros_arm_z                  0
accel_arm_x                  0
accel_arm_y                  0
accel_arm_z                  0
magnet_arm_x                 0
magnet_arm_y                 0
magnet_arm_z                 0
kurtosis_roll_arm           20
kurtosis_picth_arm          20
kurtosis_yaw_arm            20
skewness_roll_arm           20
skewness_pitch_arm          20
skewness_yaw_arm            20
max_roll_arm                20
max_picth_arm               20
max_yaw_arm                 20
min_roll_arm                20
min_pitch_arm               20
min_yaw_arm                 20
amplitude_roll_arm          20
amplitude_pitch_arm         20
amplitude_yaw_arm           20
roll_dumbbell                0
pitch_dumbbell               0
yaw_dumbbell                 0
kurtosis_roll_dumbbell      20
kurtosis_picth_dumbbell     20
kurtosis_yaw_dumbbell       20
skewness_roll_dumbbell      20
skewness_pitch_dumbbell     20
skewness_yaw_dumbbell       20
max_roll_dumbbell           20
max_picth_dumbbell          20
max_yaw_dumbbell            20
min_roll_dumbbell           20
min_pitch_dumbbell          20
min_yaw_dumbbell            20
amplitude_roll_dumbbell     20
amplitude_pitch_dumbbell    20
amplitude_yaw_dumbbell      20
total_accel_dumbbell         0
var_accel_dumbbell          20
avg_roll_dumbbell           20
stddev_roll_dumbbell        20
var_roll_dumbbell           20
avg_pitch_dumbbell          20
stddev_pitch_dumbbell       20
var_pitch_dumbbell          20
avg_yaw_dumbbell            20
stddev_yaw_dumbbell         20
var_yaw_dumbbell            20
gyros_dumbbell_x             0
gyros_dumbbell_y             0
gyros_dumbbell_z             0
accel_dumbbell_x             0
accel_dumbbell_y             0
accel_dumbbell_z             0
magnet_dumbbell_x            0
magnet_dumbbell_y            0
magnet_dumbbell_z            0
roll_forearm                 0
pitch_forearm                0
yaw_forearm                  0
kurtosis_roll_forearm       20
kurtosis_picth_forearm      20
kurtosis_yaw_forearm        20
skewness_roll_forearm       20
skewness_pitch_forearm      20
skewness_yaw_forearm        20
max_roll_forearm            20
max_picth_forearm           20
max_yaw_forearm             20
min_roll_forearm            20
min_pitch_forearm           20
min_yaw_forearm             20
amplitude_roll_forearm      20
amplitude_pitch_forearm     20
amplitude_yaw_forearm       20
total_accel_forearm          0
var_accel_forearm           20
avg_roll_forearm            20
stddev_roll_forearm         20
var_roll_forearm            20
avg_pitch_forearm           20
stddev_pitch_forearm        20
var_pitch_forearm           20
avg_yaw_forearm             20
stddev_yaw_forearm          20
var_yaw_forearm             20
gyros_forearm_x              0
gyros_forearm_y              0
gyros_forearm_z              0
accel_forearm_x              0
accel_forearm_y              0
accel_forearm_z              0
magnet_forearm_x             0
magnet_forearm_y             0
magnet_forearm_z             0
problem_id                   0
dtype: int64

df_train.classe.value_counts()

A    5580
B    3797
E    3607
C    3422
D    3216
Name: classe, dtype: int64

df_train.user_name.value_counts()

adelmo      3892
charles     3536
jeremy      3402
carlitos    3112
eurico      3070
pedro       2610
Name: user_name, dtype: int64

df_train.new_window.value_counts()

no     19216
yes      406
Name: new_window, dtype: int64

Pre-processing:¶

i) Convert ExerciseCategories into NumericalCategories:¶

Define Dict --> helpful for retrieving predicted cases later

Could also use LabelEncoder:

from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()
df_train["classe_num"] = lb_make.fit_transform(df_train["classe"])
df_train.head()

# Convert ExerciseCategories into NumericalCategories -- use Dict later to find predicted cases: 

EXcat=dict()
for i, c in enumerate(df_train.classe.unique()):
    EXcat[c]=i    
    
def cat2num(x):    
    return(EXcat[x])

df_train['classe_num'] = df_train.classe.apply(lambda x: cat2num(x))
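The dictionary built above is useful in both directions: forward to encode the classes, and inverted to decode numeric predictions back into class letters. The following is a minimal sketch on a toy frame; `toy`, `cat2num_map`, and `num2cat_map` are illustrative names and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_train.classe (hypothetical data)
toy = pd.DataFrame({'classe': ['A', 'A', 'B', 'C', 'B', 'E', 'D']})

# Category -> number, in order of first appearance (same idea as EXcat)
cat2num_map = {c: i for i, c in enumerate(toy.classe.unique())}
toy['classe_num'] = toy.classe.map(cat2num_map)

# Number -> category, for decoding model predictions later
num2cat_map = {i: c for c, i in cat2num_map.items()}
decoded = [num2cat_map[n] for n in toy.classe_num]
```

Note that the numbering follows the order of first appearance, whereas `LabelEncoder` sorts the classes, so the two encodings agree only when the classes first appear in sorted order.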

Many column variables of the data had over 50% NaNs --- making it hard to impute.¶

ii) Keep variables with less than 15% NaNs...¶

Many variables have more than 50% NaNs: in the counts above, each affected column is missing 19216 of its 19622 values (about 98%).

# keep variables with less than 15% NaNs... 
tmpCols = df_train.columns[(df_train.isna().sum(axis=0)/len(df_train))<0.15].tolist()
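The same column filter can be tried on a small example to see which columns survive the threshold. This is a sketch on a made-up frame (`toy` and its columns are hypothetical); it uses `isna().mean()`, which gives the per-column NaN fraction directly.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 'b' is mostly NaN, 'c' has a single NaN
toy = pd.DataFrame({
    'a': np.arange(10, dtype=float),
    'b': [np.nan] * 9 + [1.0],
    'c': [np.nan] + [2.0] * 9,
})

# Fraction of NaNs per column
nan_frac = toy.isna().mean()

# Keep columns whose NaN fraction is below the threshold
threshold = 0.15
kept_cols = nan_frac[nan_frac < threshold].index.tolist()
```

Here `a` (0% NaNs) and `c` (10% NaNs) are kept, while `b` (90% NaNs) is dropped.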

iii) A better prediction will (ideally) not rely on 'user' or 'time of the day' etc.¶

Drop these other variables
Create a list of Cols2use:

# a better prediction will (ideally) not rely on user or time of the day etc. 
# -- drop these other variables: 
Cols2use = [c for c in tmpCols if c not in ['user_name',
                                            'raw_timestamp_part_1',
                                            'raw_timestamp_part_2',
                                            'cvtd_timestamp',
                                            'new_window',
                                            'num_window',
                                            ]]

# Cols2use

iv) Subset Data with Cols2use & Exclude extreme values (outliers) :¶

### data Subset with Cols2use -----------------------------------------------------------------------

df_train2 = df_train[Cols2use].copy()

%time df_train2 = MMTfuncs.ExcludeOutliers(Cols2use, df_train, df_train2, sd=4)

CPU times: user 2.6 s, sys: 78.7 ms, total: 2.67 s
Wall time: 2.69 s
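The helper `MMTfuncs.ExcludeOutliers` lives in the separate py-file and is not shown here. As a rough illustration of the idea only, the sketch below replaces values lying more than `sd` standard deviations from their column mean with NaN. The function `mask_outliers` and the toy data are assumptions for illustration, and the actual helper may differ (for example in how it treats non-numeric columns).

```python
import numpy as np
import pandas as pd

def mask_outliers(df, cols, sd=4):
    """Return a copy of df where values more than `sd` standard
    deviations from their column mean are replaced by NaN."""
    out = df.copy()
    for c in cols:
        z = (df[c] - df[c].mean()) / df[c].std()
        out.loc[z.abs() > sd, c] = np.nan
    return out

# Toy column (hypothetical): 100 ordinary values plus one extreme value
toy = pd.DataFrame({'x': [0.0, 1.0] * 50 + [1000.0]})
masked = mask_outliers(toy, ['x'], sd=4)
n_masked = int(masked['x'].isna().sum())
```

Only the extreme value is masked; the input frame is left unchanged, and rows with masked values can be dropped later with `dropna`.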

Some Checks & Visualizations post pre-processing :¶

- Raw and subset data sizes¶

print('RAW_TrainData_shape : ', df_train.shape) 
print('subset_TrainData_shape : ', df_train2.shape) 
print('%_VarColsUsed_from_TrainData :', format(df_train2.shape[1]/df_train.shape[1]*100) +'%')

# ((19622, 160), (19622, 54), #54/160= 0.3375)

RAW_TrainData_shape :  (19622, 160)
subset_TrainData_shape :  (19622, 54)
%_VarColsUsed_from_TrainData : 33.75%

- Checking overall % of NaNs post pre-processing¶

Less than 1% across Vars

## Exclusion of extreme values --> NaNs : Checking overall % of excluded data ----- less than 1% across Vars

def plot_checkNaNs(df_train2):
    (df_train2.isna().sum()/len(df_train2)*100).plot(kind='bar', figsize=(14,4), color='lightblue') 
    plt.title('Percentage of Variable\'s Values as NaNs \n'+ 
              '-- post eliminating extreme values', 
              size=16)
    plt.ylabel('%-tage NaNs')
    plt.show()
    
if not savefig:
    plot_checkNaNs(df_train2)

- Assess variable correlations:¶

Some variables are highly correlated -- as expected for using multiple sensors.
- Typically would drop one of the correlated pairs in modeling;
- Ensemble models might be able to deal with this...
- Could also project dataspace into lower dimensions.

# Doesn't show plot if savefig==True 

MMTfuncs.plotVarCorr_heatmap(df_train2, 
                             Cols2use, 
                             filename='figFolder/TrainData_Variable_correlations.pdf', 
                             savefig=savefig)
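If one did want to drop one variable from each highly correlated pair, the pairs first need to be listed. The sketch below shows one way to do that from the correlation matrix; `correlated_pairs`, the 0.9 threshold, and the toy data are illustrative assumptions and are not used elsewhere in this analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, r) for column pairs with |r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Toy data (hypothetical): 'y' is a linear function of 'x', 'z' is not
x = np.arange(20, dtype=float)
toy = pd.DataFrame({'x': x, 'y': 2 * x + 1, 'z': np.cos(x)})
pairs = correlated_pairs(toy, threshold=0.9)
```

Scanning only the upper triangle avoids reporting each pair twice and skips the diagonal.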

- Assess associations between variables within sub-category by sensor-types:¶

forearmC
armC
beltC
dumbbellC

# Doesn't show plots if savefig==True 

forearmC = MMTfuncs.getSimilarColNames("(.*).(_forearm)", df_train2[Cols2use])
armC = MMTfuncs.getSimilarColNames("(.*).(_arm)", df_train2[Cols2use])
beltC = MMTfuncs.getSimilarColNames("(.*).(_belt)", df_train2[Cols2use])
dumbbellC = MMTfuncs.getSimilarColNames("(.*).(_dumbbell)", df_train2[Cols2use])


## Takes a while to generate the pairplots ....
MMTfuncs.ExploreVariableAssoc_byEXcat(df_train2, forearmC, armC, beltC, dumbbellC,
                                     filename='figFolder/VariableAssociations_', 
                                     savefig=savefig)

- Visualize the range of variables by exercise class-type & corresponding test-data variable¶

MMTfuncs.plotDataRange_byExcat(Cols2use, df_train2, df_test, 
                      filename='figFolder/Variable_BoxplotNRange_TrainTestCSVData.pdf', 
                      savefig=savefig )

Structure data for Modeling:¶

- Drop NaNs¶

df_train3 = df_train2.dropna(how='any').copy()

- Add DataType Column; Merge Train & Test CSV data¶

df_train3['dataCSV'] = 'train'
# df_train3

df_test1 = df_test[Cols2use[:-2]].copy()
df_test1['dataCSV'] = 'test'
# df_test1

dfALL = pd.concat([df_train3, df_test1], 
                  axis=0, sort=False)[Cols2use + ['dataCSV'] ].reset_index(drop=True)

# dfALL.dataCSV.value_counts()
# train    18877
# test        20

- Scale Merged Data prior to modeling¶

### SCALING DATA -- both train/test CSV data

## StandardScaler | MinMaxScaler
# from sklearn import preprocessing
# # make sure variable values.astype(float)

df_scaled0 = dfALL.copy()

stand_scaler = preprocessing.StandardScaler() 
scaled0 = stand_scaler.fit_transform(dfALL[Cols2use[:-2]])

df_scaled0[Cols2use[:-2]] = pd.DataFrame(scaled0)
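`StandardScaler` centers each column on its mean and divides by its standard deviation. The cell above fits the scaler on the merged train and test rows; a common variant, sketched here on made-up arrays, fits on the training rows only and reuses those statistics for the test rows. This is shown as an alternative for comparison, not as what this notebook does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train rows
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std
```

With only 20 test rows against thousands of training rows, the two approaches should give very similar scaled values for this dataset.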

- Retrieve Test | Train Data sets & Define Target `train0_y` & Feature `train0_x` Variables:¶

# Separate Data: 

## ---> split to TRAIN and TEST again...
test0 = df_scaled0[df_scaled0.dataCSV=='test']


train0 = df_scaled0[df_scaled0.dataCSV=='train']

train0_y = train0[Cols2use[-2:] ] 
# train0_y.info()

train0_x = train0[Cols2use[:-2]] 
# train0_x.info()

- Split Train (Target & Features) Dataset into subsets for modeling:¶

training
devtesting
holdout

Sometimes imbalanced data is dealt with using SMOTE for training subset -- but not implemented in this analysis...

(xtrain, xdev, xhold, 
ytrain, ydev, yhold) = MMTfuncs.splitData2TrainTestHoldout(train0_x, 
                                                           train0_y.classe_num, 
                                                           trainSize=0.8, 
                                                           random_state=128, 
                                                           holdout=True, 
                                                           holdoutSize=0.2)

print('Train_dataset -- split proportions','\n',
      '------------------------------------------------')
print('training : ') 
print(ytrain.value_counts()/len(ytrain),'\n',ytrain.value_counts().sum()) 
print('devtest : ')
print(ydev.value_counts()/len(ydev),'\n',ydev.value_counts().sum()) 
print('holdout : ')
print(yhold.value_counts()/len(yhold),'\n',yhold.value_counts().sum())
print('')


# quick check re NaNs --------------
# xtrain.isna().any().any()
# False

Train_dataset -- split proportions 
 ------------------------------------------------
training : 
0.0    0.291951
1.0    0.195797
4.0    0.180461
2.0    0.170156
3.0    0.161636
Name: classe_num, dtype: float64 
 12324
devtest : 
0.0    0.269306
1.0    0.188838
2.0    0.187541
4.0    0.184945
3.0    0.169371
Name: classe_num, dtype: float64 
 3082
holdout : 
0.0    0.288941
1.0    0.204050
4.0    0.177051
2.0    0.173416
3.0    0.156542
Name: classe_num, dtype: float64 
 3852
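The subset sizes printed above are consistent with first setting aside 20% of the rows as holdout and then splitting the remainder 80/20 into training and devtest. The helper `splitData2TrainTestHoldout` is not shown, so the following is only a sketch of such a three-way split using `train_test_split`; the function name, its parameters, and the toy data are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_train_dev_holdout(X, y, dev_size=0.2, holdout_size=0.2, random_state=128):
    """Set aside a holdout subset first, then split the remainder
    into training and devtest subsets."""
    X_rest, X_hold, y_rest, y_hold = train_test_split(
        X, y, test_size=holdout_size, random_state=random_state)
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=dev_size, random_state=random_state)
    return X_train, X_dev, X_hold, y_train, y_dev, y_hold

# Toy data (hypothetical): 100 samples, 3 features, 5 classes
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 5
Xtr, Xdv, Xho, ytr, ydv, yho = split_train_dev_holdout(X, y)
```

Passing `stratify=y` to `train_test_split` would keep the class proportions closer across the three subsets, if that is desired.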

Modeling Assessments :¶

Initial assessments included:
- Logistic_Regressions (with L1/L2 regularization) : sklearn.linear_model.LogisticRegression
- MLPClassifier : sklearn.neural_network.MLPClassifier
- AdaboostedTrees : sklearn.ensemble.AdaBoostClassifier
- RandomForest : sklearn.ensemble.RandomForestClassifier
- GradientboostedTrees : sklearn.ensemble.GradientBoostingClassifier

Current assessments use ensemble Tree models -- similar to previous analysis performed with R (see Rcode):
- RandomForest
- GradientboostedTrees

Default model settings are used.
- Grid-search and hyperparameter tuning, the typical next steps after an initial pass with default settings, are not performed here.

- Initialize Models¶

## Initialize Models -- default parameter settings
randstate = 898
models1= {}

models1['RandomForest'] = ensemble.RandomForestClassifier(random_state=randstate) #, verbose=1) 
# scikit-learn 0.19.1 defaults: n_estimators=10, criterion='gini', max_depth=None, max_features='auto'


models1['gradboostedTrees'] = ensemble.GradientBoostingClassifier(random_state=randstate) #, verbose=1)
# scikit-learn 0.19.1 defaults: learning_rate=0.1, n_estimators=100, subsample=1.0, max_depth=3, max_features=None

- Assess Models with Cross-Validation and Prediction Outcome Metrics:¶

Gradient Boosted Tree models take longer to run ...

The helper function also saves several variables for later use, for both the devtest and holdout predictions:
- Model's corresponding confusion matrix
- Model's scores

C_out1 = MMTfuncs.XV_ScoresCMatYpred2(models1, 
                                      xtrain, ytrain, 
                                      xdev, ydev, 
                                      xhold, yhold,
                                      cv=5)

CrossValidating using -- RandomForest..........................................

Model: RandomForest-------------------------------------------------------------
Avg-Train Score: 0.9828792926991434
Avg-DevTest Score: 0.9244092697893314
Avg-HoldOut Score: 0.9358824376643338

Devtest-set
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       830
        1.0       0.98      0.99      0.98       582
        2.0       0.97      0.97      0.97       578
        3.0       0.99      0.97      0.98       522
        4.0       1.00      1.00      1.00       570

avg / total       0.99      0.99      0.99      3082

Matthews_corrcoef : 0.9832542069008131

holdOut-set
             precision    recall  f1-score   support

        0.0       0.99      0.99      0.99      1113
        1.0       0.97      0.98      0.98       786
        2.0       0.97      0.97      0.97       668
        3.0       0.99      0.98      0.98       603
        4.0       1.00      0.99      0.99       682

avg / total       0.99      0.98      0.98      3852

Matthews_corrcoef : 0.9809190447950571

CrossValidating using -- gradboostedTrees..........................................

Model: gradboostedTrees-------------------------------------------------------------
Avg-Train Score: 0.9635696256141225
Avg-DevTest Score: 0.9386797869428701
Avg-HoldOut Score: 0.9506715903849546

Devtest-set
             precision    recall  f1-score   support

        0.0       0.98      0.99      0.99       830
        1.0       0.94      0.93      0.94       582
        2.0       0.93      0.96      0.94       578
        3.0       0.98      0.96      0.97       522
        4.0       0.99      0.97      0.98       570

avg / total       0.96      0.96      0.96      3082

Matthews_corrcoef : 0.9546507126498572

holdOut-set
             precision    recall  f1-score   support

        0.0       0.98      0.98      0.98      1113
        1.0       0.94      0.95      0.95       786
        2.0       0.94      0.96      0.95       668
        3.0       0.97      0.97      0.97       603
        4.0       0.99      0.96      0.98       682

avg / total       0.97      0.97      0.97      3852

Matthews_corrcoef : 0.9582324972543749
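The helper `XV_ScoresCMatYpred2` is defined in the separate py-file. As a rough sketch of the kind of assessment reported above, the example below runs 5-fold cross-validation on a training subset and then computes the Matthews correlation coefficient on devtest predictions. The synthetic data and variable names are assumptions for illustration; the helper's actual internals may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 3-class data (hypothetical stand-in for the sensor features)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=898)

# 5-fold cross-validated accuracy on the training subset
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training subset, then score the devtest predictions
model.fit(X_train, y_train)
mcc = matthews_corrcoef(y_dev, model.predict(X_dev))

print('Avg CV score:', cv_scores.mean())
print('Devtest Matthews corrcoef:', mcc)
```

The Matthews coefficient ranges from -1 to 1 and takes all cells of the confusion matrix into account, which makes it a useful complement to accuracy when classes are not evenly sized.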

Modeling Outcomes | Visualizations:¶

- Compute | Display Confusion Matrix¶

### Compute confusion matrix
ModelNames = ['RandomForest','gradboostedTrees']
cmapC = [plt.cm.Blues,plt.cm.Greens]

MMTfuncs.ShowConfMatrix(C_out1, ModelNames, train0_y, 
                        cmapC, savefig=savefig)

RandomForest

Normalized confusion matrix
[[9.95e-01 4.49e-03 8.98e-04 0.00e+00 0.00e+00]
 [1.15e-02 9.85e-01 3.82e-03 0.00e+00 0.00e+00]
 [0.00e+00 2.40e-02 9.69e-01 7.49e-03 0.00e+00]
 [0.00e+00 0.00e+00 1.99e-02 9.80e-01 0.00e+00]
 [1.47e-03 2.93e-03 4.40e-03 1.47e-03 9.90e-01]]


gradboostedTrees

Normalized confusion matrix
[[0.98 0.01 0.   0.   0.  ]
 [0.02 0.95 0.02 0.   0.  ]
 [0.   0.03 0.96 0.01 0.  ]
 [0.   0.   0.02 0.97 0.01]
 [0.   0.02 0.01 0.01 0.96]]
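The matrices above appear to be normalized by row, so that each diagonal entry is the recall for that class. Assuming row-wise normalization (the helper `ShowConfMatrix` is not shown), the computation can be sketched on made-up labels as follows.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical)
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

cm = confusion_matrix(y_true, y_pred)           # rows: true, cols: predicted
cm_norm = cm / cm.sum(axis=1, keepdims=True)    # each row sums to 1
```

Dividing by the row sums puts classes with different support on the same scale, which makes the off-diagonal error patterns easier to compare.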

- Display Coefs | Features of Relative Importance¶

if savefig:
    MMTfuncs.ScoresNFeaturesPlot(models1, xtrain, ytrain, plotFeatures=0)
    
else: 
    plotC = ['C0','C2']

    for i, M in enumerate(['RandomForest','gradboostedTrees']):     
        MMTfuncs.ScoresNFeaturesPlot({M:models1[M]}, xtrain, ytrain, 
                                     plotFeatures=1, figsize=(4,8), plotColor=plotC[i])

[Figures: "Features: Top 20 -- Relative Importance", one bar plot per model (RandomForest, gradboostedTrees)]
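Tree ensembles in scikit-learn expose a `feature_importances_` attribute after fitting, which is what a "top N features" plot is typically built from. The sketch below ranks features on synthetic data; the data, the column names, and the choice of 5 features are assumptions, and the plotting helper `ScoresNFeaturesPlot` itself is not reproduced.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical stand-in); column names are made up
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=['f%d' % i for i in range(8)])

model = RandomForestClassifier(n_estimators=20, random_state=898).fit(X, y)

# Pair each importance with its column name and keep the largest ones
importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.nlargest(5)
```

Calling `top_features.sort_values().plot(kind='barh')` would give a horizontal bar plot with the most important feature at the top.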

Predict Categorical Class of Exercise Activities with features from Unlabeled Test CSV dataset:¶

- Verified predictions for 20 exercise activities from Rcode with some hyperparameter assessments¶

ActualTestLabels = ['B', 'A', 'B', 'A', 'A',
                    'E', 'D', 'B', 'A', 'A', 
                    'B', 'C', 'B', 'A', 'E', 
                    'E', 'A', 'B', 'B', 'B']

print('Verified TestCSV EX_Categories : \n', ActualTestLabels)

Verified TestCSV EX_Categories : 
 ['B', 'A', 'B', 'A', 'A', 'E', 'D', 'B', 'A', 'A', 'B', 'C', 'B', 'A', 'E', 'E', 'A', 'B', 'B', 'B']

- Use Trained Models to Predict Test Data & Assess Prediction Accuracy:¶

# Invert the category dict so numeric predictions map back to class letters
num2cat = {v: k for k, v in EXcat.items()}

for M in ['RandomForest','gradboostedTrees']:
    model = models1[M]  # generic name: the loop covers both models
    
    mFit = model.fit(xtrain, ytrain)

    yTest_pred = mFit.predict(test0[Cols2use[:-2]])

    # Decode numeric predictions into exercise-class letters
    yTest_predC = [num2cat[i] for i in yTest_pred]

    print(M + ' -- Predicted EX_Categories : \n', yTest_predC) 
    
    PredictAccuracy = np.sum(np.array(yTest_predC) == np.array(ActualTestLabels))/len(ActualTestLabels)*100
    
    print('Prediction Accuracy : \n', str(PredictAccuracy) + '%')
    print()

RandomForest -- Predicted EX_Categories : 
 ['B', 'A', 'B', 'A', 'A', 'E', 'D', 'B', 'A', 'A', 'B', 'C', 'B', 'A', 'E', 'E', 'A', 'B', 'B', 'B']
Prediction Accuracy : 
 100.0%

gradboostedTrees -- Predicted EX_Categories : 
 ['B', 'A', 'B', 'A', 'A', 'E', 'D', 'B', 'A', 'A', 'B', 'C', 'B', 'A', 'E', 'E', 'A', 'B', 'B', 'B']
Prediction Accuracy : 
 100.0%

Open Saved Figures : ¶

# cwd = os.getcwd()

# for f in [f for f in os.listdir(cwd+'/figFolder/') if f.endswith(".pdf")] : 
#     #print(f)
#     #subprocess.call(['open','figFolder/RandomForest_ConfusionMatrix.pdf'])
#     subprocess.call(['open','figFolder/'+ f])

Postscript notes/thoughts --¶

Caveat: One does not always have the luxury of knowing the actual category labels for the data one wishes to predict. However, with sufficient training, development-testing, and holdout data subsets for assessing and tuning models, one can hopefully develop models appropriate for predictive use.

Typically, default model parameters are used for initial assessments followed by further tuning of hyperparameters using grid-search etc. In this case, the prediction metrics were consistently high for the data assessed, so further tuning was not performed.
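For reference, the grid-search step mentioned above could look roughly like the following with scikit-learn's `GridSearchCV`. This is a sketch on synthetic data; the parameter grid is arbitrary and chosen only to keep the example fast, not a tuned recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical stand-in for xtrain, ytrain)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=0)

# Small illustrative grid; the values are arbitrary
param_grid = {'n_estimators': [10, 30], 'max_depth': [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=898),
                      param_grid, cv=3)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best CV score:', search.best_score_)
```

The fitted `search` object refits the best parameter combination on all of the supplied data by default, so it can be used directly for prediction afterwards.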
