Authors: Daksh Trehan, Sumit Chahal
ML Models in Dataiku:
Dataiku provides several built-in machine learning models that we can train and use for prediction. It also lets you code your own model and use it in the Dataiku flow for predictions, including deep learning models written in TensorFlow.
Here we will train several models to predict the price of used cars and compare their scores. The dataset we will use to train the models has several attributes, such as manufacturer, fuel type, gear type, and the horsepower of the car.
Since the target, the price of a car, is a continuous value, this is a regression problem. For it, we will train KNN, Ridge, Lasso, and Random Forest.
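To make "comparing scores" concrete, here is a minimal sketch of two common regression metrics in plain Python. The source does not say which metric is used, so RMSE and R² are assumptions here, and the prices and the names `model_a` and `model_b` are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in price units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_score(y_true, y_pred):
    """R squared: 1.0 is a perfect fit, 0.0 is no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up prices for illustration only (not taken from the used-cars dataset).
actual = [5000.0, 7000.0, 9000.0, 11000.0]
model_a = [5500.0, 6500.0, 9500.0, 10500.0]  # hypothetical model, off by 500 each time
model_b = [8000.0, 8000.0, 8000.0, 8000.0]   # hypothetical model that always predicts the mean

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(name, "RMSE:", round(rmse(actual, preds), 2), "R2:", round(r2_score(actual, preds), 3))
```

A lower RMSE and a higher R² both indicate a better fit, so on this toy data `model_a` would be preferred over `model_b`.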
Creating ML Model in Dataiku:
We will create an ML model in Dataiku using training data stored in Snowflake. First, we will transform the data as required, then train the models, and finally use the trained models to make predictions.
Steps:
- We have uploaded a dataset of used car prices to Snowflake.
- Connect Dataiku to Snowflake and get access to the USED_CARS database in Dataiku. (For this part, please refer to the following blog link – Dataiku with Snowflake)
- Before training, we will transform the data. First, a data preparation recipe removes every record that has a null value for any feature.
- Next, another data preparation recipe removes the column ‘model’.
- The flow diagram will look like this after two recipes.
- Now we will analyze the attributes and remove noise from them. For example, we merge two child manufacturers that are listed separately into one parent value. We also remove outliers: for categorical attributes, we remove records whose category occurs in very few records, and for continuous attributes, we remove records whose values lie very far from the mean.
- The flow diagram will look like this after these steps.
- Split the dataset into train and test datasets. We will use an 80:20 ratio, which puts 80% of the records in the training dataset. After training the models on it, we will evaluate them on the remaining 20% of the records, which form the test dataset.
- Train three models – KNN, Ridge, and Random Forest – on the training dataset.
- All three models are then used to score the test dataset.
- Finally, the models are used to predict ‘Price’ for the whole dataset.
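In Dataiku these steps are built with visual recipes, so no code is required. Purely to make the logic concrete, the sketch below walks through the same sequence in dependency-free Python: drop nulls, drop ‘model’, merge manufacturers, remove outliers, split 80:20, then fit and score a simple KNN regressor. Everything in it is an assumption for illustration: the records are synthetic, the helper names are hypothetical, the `vw` to `volkswagen` mapping, the `min_count=5` cutoff, and the 3-standard-deviation rule are invented thresholds, and KNN uses only `horsepower` as a feature. It is not how Dataiku implements these recipes.

```python
import random
import statistics
from collections import Counter

def drop_nulls(rows):
    """Keep only records in which every feature has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_column(rows, name):
    """Return copies of the records without the given column."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def merge_categories(rows, column, mapping):
    """Replace child values with their parent value, e.g. 'vw' -> 'volkswagen'."""
    return [{**r, column: mapping.get(r[column], r[column])} for r in rows]

def drop_rare_categories(rows, column, min_count):
    """Drop records whose category appears fewer than min_count times."""
    counts = Counter(r[column] for r in rows)
    return [r for r in rows if counts[r[column]] >= min_count]

def drop_far_from_mean(rows, column, max_z=3.0):
    """Drop records more than max_z standard deviations away from the mean."""
    values = [r[column] for r in rows]
    mean, std = statistics.mean(values), statistics.pstdev(values)
    return [r for r in rows if std == 0 or abs(r[column] - mean) <= max_z * std]

def split_rows(rows, train_ratio=0.8, seed=42):
    """Shuffle reproducibly, then cut into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def knn_predict(train, query, feature, target, k=3):
    """Predict the mean target of the k training records closest on one feature."""
    nearest = sorted(train, key=lambda r: abs(r[feature] - query[feature]))[:k]
    return sum(r[target] for r in nearest) / len(nearest)

def r2_score(y_true, y_pred):
    """R squared: 1.0 is a perfect fit, 0.0 is no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic records invented for this sketch (not the USED_CARS data).
rng = random.Random(0)
brands = ["volkswagen", "vw", "bmw", "audi"]
rows = [
    {"manufacturer": brands[i % 4], "model": f"m{i}", "horsepower": 60 + i,
     "price": 100 * (60 + i) + rng.uniform(-500, 500)}
    for i in range(100)
]
rows.append({"manufacturer": "bmw", "model": "m100", "horsepower": None, "price": 9000.0})
rows.append({"manufacturer": "rarebrand", "model": "m101", "horsepower": 90, "price": 9100.0})
rows.append({"manufacturer": "audi", "model": "m102", "horsepower": 95, "price": 10_000_000.0})

clean = drop_nulls(rows)                                          # 103 -> 102 records
clean = drop_column(clean, "model")
clean = merge_categories(clean, "manufacturer", {"vw": "volkswagen"})
clean = drop_rare_categories(clean, "manufacturer", min_count=5)  # 102 -> 101 records
clean = drop_far_from_mean(clean, "price")                        # 101 -> 100 records

train, test = split_rows(clean)                                   # 80 train, 20 test
predictions = [knn_predict(train, r, "horsepower", "price") for r in test]
score = r2_score([r["price"] for r in test], predictions)
print(len(train), len(test), round(score, 3))
```

Ridge and Random Forest would slot in at the same point as `knn_predict`, each trained on `train` and scored on `test`. Categorical attributes such as manufacturer would need to be encoded as numbers before a distance-based model like KNN could use them.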
The results are as follows: