# CS-433: Machine Learning - Project 2
## Getting Started
### Prerequisites
To run the project, you will need Python 3 with the libraries *numpy*, *matplotlib*, *datetime*, *pandas* and *sklearn*, as well as *Anaconda Navigator* (tested with versions 1.7.2 on Windows 10, 5.7.4 on MacOS 10.14.6 and 1.7.2 on Ubuntu 16.04). All the mentioned libraries ship with Anaconda and were up to date as of 19/12/2019. To install *Anaconda Navigator*, follow *this link* and download the version corresponding to your OS. Make sure the version you choose is at least as recent as the tested versions mentioned above.
### How to Use Jupyter Notebook (skip if familiar with Jupyter Notebook)
To run *Jupyter Notebook*, launch *Anaconda Navigator*. In the window that opens, click Launch under Jupyter Notebook. This should open a file-system navigation page in your default web browser. Navigate down to this project's folder and click on any *.ipynb* file you wish to open. To run a cell, select it with your cursor or the up/down arrow keys and click Run at the top of the page.
## Data Preprocessing
To process the data, run the script *clean_data.py*, or open the Jupyter notebook *data_merge.ipynb* and follow the instructions in the notebook. This will process the data as described in the report. The data is also cleaned when running *run.py*.
## Running the Training and Producing Predictions
Once the data has been processed (see the section above), the tuning and training of the three methods can be performed. Refer to *Ridge_Regression_analysis.ipynb* to tune and train the ridge regression models, to *randomForest_Final.ipynb* and *randomForest_withRandomSearchGridSearch_Final.ipynb* for the random forest models, and to *neural_net.ipynb* to tune and train the neural network model.
To immediately run the best training and produce the best prediction, run *run.py* from the *Anaconda prompt*. The prediction is saved in the file *pred_2018_ridge_regression_best.csv*.
## Structure of the code
The code is split across multiple files. Please also note that the data cleaning must be run before using the Jupyter notebooks, as explained above under Data Preprocessing. Here is the list of files and the purpose of each:
NOTE: beyond running the *run.py* script, we strongly advise the reader to go over the following notebooks: *Ridge_Regression_analysis.ipynb*, *neural_net.ipynb*, *randomForest_withRandomSearchGridSearch_Final.ipynb* and *randomForest_Final.ipynb*. These notebooks encapsulate all the strategies chosen for this project. In particular, the last two notebooks and their methods are not used in *run.py*, so skipping them would mean missing a large part of the project.
- data_merge.ipynb
This file is a Jupyter notebook and should be opened as described in Getting Started: How to Use Jupyter Notebook. It allows you to run the full data cleaning or only part of it; instructions on how to use it are included in the notebook itself.
- Ridge_Regression_analysis.ipynb
This file is a Jupyter notebook and should be opened as described in Getting Started: How to Use Jupyter Notebook. It explains a special pre-processing of the data for ridge regression, which includes reducing the feature space and removing outliers. A brief justification (with graphs and a heatmap) of what drove the pre-processing strategy is provided in the part "Data Exploration". Afterwards, polynomial basis expansion, natural cubic spline expansion and other kernel expansions are tried. For each expansion, the function compute_stats is called to compute the score of the method; see the section on *compute_stats.py* for details.
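For orientation only, a ridge pipeline with a polynomial basis expansion can be set up in scikit-learn along the following lines. This is a hedged sketch rather than the notebook's code; `X`, `y`, the polynomial degree and the alpha grid are all placeholders:

```python
# Illustrative sketch only -- the actual pipeline lives in
# Ridge_Regression_analysis.ipynb; X and y are placeholder arrays.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import RidgeCV

X = np.random.rand(100, 3)   # placeholder features
y = np.random.rand(100)      # placeholder target

# Polynomial basis expansion followed by ridge regression with
# built-in cross-validation over the regularization strength.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=3),
    RidgeCV(alphas=np.logspace(-4, 2, 20)),
)
model.fit(X, y)
y_pred = model.predict(X)
```

A spline or kernel expansion would slot into the same pipeline in place of `PolynomialFeatures`.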
- neural_net.ipynb
This file is a Jupyter notebook and should be opened as described in Getting Started: How to Use Jupyter Notebook. It contains all the information and code to tune and train the neural network model.
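As a point of reference only: since *sklearn* is the only machine learning library listed in the prerequisites, a minimal regression network could be built with its MLPRegressor as sketched below. The layer sizes and iteration count are illustrative assumptions, not the tuned values from the notebook:

```python
# Minimal sketch, not the tuned model from neural_net.ipynb.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.rand(200, 5)   # placeholder features
y = np.random.rand(200)      # placeholder target

# Standardize inputs, then fit a small multi-layer perceptron;
# hidden_layer_sizes and max_iter are typical tuning knobs.
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0),
)
net.fit(X, y)
print(net.predict(X[:5]))
```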
- randomForest_Final.ipynb
This file is a Jupyter notebook and should be opened as described in Getting Started: How to Use Jupyter Notebook. It contains all the information and code to tune and train the random forest, with cross-validation over the hyperparameters *n_estimators*, *max_depths* and *max_features*. Please have a look at the notebook for further information.
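In spirit, that tuning amounts to a validation-set sweep like the hedged sketch below; the value ranges are illustrative, and a second pass over *max_features* would follow the same pattern (as *rf_cv_2f*, described further down, does):

```python
# Illustrative sweep over the hyperparameters named above; the real
# loops and value ranges are in randomForest_Final.ipynb.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = np.random.rand(300, 8)   # placeholder data
y = np.random.rand(300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best = (None, -np.inf)
for n in [50, 100, 200]:                # n_estimators candidates
    for depth in [4, 8, None]:          # max_depth candidates
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth, random_state=0)
        rf.fit(X_tr, y_tr)
        score = rf.score(X_val, y_val)  # R^2 on the validation set
        if score > best[1]:
            best = ((n, depth), score)
print("best (n_estimators, max_depth):", best[0])
```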
- randomForest_withRandomSearchGridSearch_Final.ipynb
This file is a Jupyter notebook and should be opened as described in Getting Started: How to Use Jupyter Notebook. It contains all the information and code to tune and train the random forest using *RandomizedSearchCV* and *GridSearchCV* for cross-validation. Please have a look at the notebook for further information.
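Both are standard scikit-learn utilities; a minimal example of *RandomizedSearchCV* applied to a random forest looks like this (the parameter ranges are illustrative, not the notebook's):

```python
# Generic RandomizedSearchCV usage; parameter ranges are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X = np.random.rand(300, 8)   # placeholder data
y = np.random.rand(300)

param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [4, 8, 16, None],
    "max_features": ["sqrt", "log2", None],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=10,      # number of random configurations to try
    cv=5,           # 5-fold cross-validation per configuration
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

*GridSearchCV* takes the same parameter dictionary but exhaustively evaluates every combination instead of sampling.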
- clean_data.py
This file contains all the functions required to clean and merge the data. The processing is designed for the files used in this project but can be adapted to new data files. The file can also be run as a script to produce the three data sets corresponding to the data from 2016, 2017 and 2018.
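Purely as an illustration of what such a script does, a per-year cleaning pass might look like the sketch below; the file names and the *clean_year* helper are hypothetical, and the real cleaning steps are described in the report:

```python
# Hypothetical sketch of a per-year cleaning pass; the project's
# actual, data-specific processing is in clean_data.py.
import pandas as pd

def clean_year(raw_csv: str, out_csv: str) -> None:
    df = pd.read_csv(raw_csv)
    df = df.dropna()            # drop incomplete rows (illustrative step)
    df = df.drop_duplicates()   # remove duplicate records (illustrative step)
    df.to_csv(out_csv, index=False)

for year in (2016, 2017, 2018):
    clean_year(f"raw_{year}.csv", f"data_{year}.csv")
```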
- compute_stats.py
This script provides a method compute_stats that, given the real targets and the predicted targets, computes the following (a hedged re-implementation is sketched after this list):
- the absolute error, together with its mean, variance, maximum, minimum, first quartile, second quartile (median) and third quartile
- the normalized RMSE
- the $R^2$ score (a.k.a. coefficient of determination)
- finally, if desired, the function can return a box plot of the absolute error to assess the confidence interval of the predictions
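The sketch below re-implements these statistics with numpy and sklearn; note that normalizing the RMSE by the target range is our assumption, and the actual signature lives in *compute_stats.py*:

```python
# Hedged re-implementation of the statistics listed above, not the
# project's compute_stats itself.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

def compute_stats_sketch(y_true, y_pred, boxplot=False):
    abs_err = np.abs(y_true - y_pred)
    stats = {
        "mean": abs_err.mean(),
        "var": abs_err.var(),
        "min": abs_err.min(),
        "max": abs_err.max(),
        "q1": np.quantile(abs_err, 0.25),
        "median": np.quantile(abs_err, 0.50),
        "q3": np.quantile(abs_err, 0.75),
        # Normalized RMSE: here, RMSE divided by the target range (assumption).
        "nrmse": np.sqrt(np.mean((y_true - y_pred) ** 2))
                 / (y_true.max() - y_true.min()),
        "r2": r2_score(y_true, y_pred),
    }
    if boxplot:
        plt.boxplot(abs_err)   # box plot of the absolute error
        plt.show()
    return stats
```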
- basis_expansions.py
This file was provided by Matthew Drury; it provides functions to compute natural cubic splines. The script is distributed under the BSD 3-Clause License:
Copyright (c) 2017, Matthew Drury. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- utils_randomforest.py
- *split_data(X, y, ratio_t, ratio_v)*: splits the data points X and targets y into three sets. *ratio_t* is the proportion of the test set and *ratio_v* is the proportion of the validation set; the defaults give 70% for the training set, 20% for the validation set and 10% for the test set (a scikit-learn approximation is sketched after this list).
- *plot_importance(cols, model)*: plots the features with their corresponding feature importances as computed by the model.
- *pre_processing(X_train, X_val, X_test, norm=False, std=False)*: pre-processes the data by fitting the normalizer and/or the standardizer on the training data, then transforming the validation and test data.
- *prepare_combined_data_forCV(file_names, ratio_t=0.1, ratio_v=0.2, norm=False, std=False)*: prepares the data from the CSV files and returns *X_train, X_val, X_test, y_train, y_val, y_test, cols*, where *cols* holds the feature names.
- *prepare_data_drop(file_names, cols_to_drop, norm=False, std=False)*: a variant of *prepare_combined_data_forCV* that additionally takes a list of columns to drop.
- *prepare_data(file_name, norm=False, std=False)*: prepares the data for training without splitting it into different sets, returning only *X, y, cols*. This method is used for the 2016 and 2017 data, which serve as the training set; the 2018 data is used as the test set.
- *rf_cv(X_train, X_val, y_train, y_val, max_depths, n_estimators)*: performs a cross-validation for the random forest, tuning the hyperparameters *n_estimators* and *max_depths*.
- *rf_cv_2f(X_train, X_val, y_train, y_val, bestD, bestN, max_features_range)*: based on the best *n_estimators* and *max_depths* found by *rf_cv*, performs a second cross-validation over the hyperparameter *max_features*.
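For readers who prefer standard tooling, the 70/20/10 split of *split_data* and the fit-on-train-only scaling of *pre_processing* can be approximated with scikit-learn as below; this is a sketch under those stated assumptions, not the project's implementation:

```python
# Approximation of split_data / pre_processing with scikit-learn;
# the project's own implementations are in utils_randomforest.py.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 8)   # placeholder data
y = np.random.rand(500)

# First carve out the 10% test set, then split the rest into 70/20.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2 / 9, random_state=0)  # 2/9 of 90% = 20% overall

# Fit the standardizer on the training data only, then transform the rest.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```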
## Authors
- Gachoud Sébastien
- Mansat Paul
- Zou Xiaoyan