- Machine Learning Project 2 - EPFL, AY 2019/20
- Authors: Margherita Guido, Daniele Hamm, Michele Vidulis
This folder collects the code used to produce the results for the second project of the EPFL Machine Learning course. We address a problem of unsupervised anomaly detection in time series representing the energy consumption of supermarket buildings. The proposed pipeline is implemented in the notebook `main.ipynb`, whose sections are named consistently with the organization of the report.
The folder contains the scripts used to preprocess the data, generate and select features, train and select models, implement our pipeline (based on clustering and regression), plot the results and produce the output, which consists of the number of anomalies detected in every building. Two types of anomalies are detected during the procedure:
- Week scale resolution anomalies: the number of days in each week that show a severely anomalous pattern is collected.
- Point-wise anomalies: their number is collected for every week.
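To illustrate the shape of this output, the two counts could be aggregated per week as in the minimal sketch below. It assumes that boolean anomaly flags are already available, one per measurement and one per day. The names `week_start` and `weekly_anomaly_counts`, the column names and the Monday-based weeks are hypothetical and are not taken from the code in this folder.

```python
import pandas as pd


def week_start(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Map every timestamp to the Monday (at midnight) of its week."""
    return index.normalize() - pd.to_timedelta(index.dayofweek, unit="D")


def weekly_anomaly_counts(point_flags: pd.Series, day_flags: pd.Series) -> pd.DataFrame:
    """Count flagged measurements and flagged days for every week.

    point_flags: boolean Series indexed by measurement timestamp (assumed input).
    day_flags:   boolean Series indexed by day (assumed input).
    """
    points = point_flags.astype(int).groupby(week_start(point_flags.index)).sum()
    days = day_flags.astype(int).groupby(week_start(day_flags.index)).sum()
    table = pd.DataFrame({"anomalous_days": days, "pointwise_anomalies": points})
    return table.fillna(0).astype(int)


# Synthetic example: two weeks of hourly measurements starting on a Monday.
hours = pd.date_range("2019-01-07", periods=14 * 24, freq="60min")
point_flags = pd.Series(False, index=hours)
point_flags.iloc[[5, 30, 100, 200]] = True  # three flags in week 1, one in week 2

days = pd.date_range("2019-01-07", periods=14, freq="D")
day_flags = pd.Series(False, index=days)
day_flags.iloc[[1, 8, 9]] = True  # one flagged day in week 1, two in week 2

table = weekly_anomaly_counts(point_flags, day_flags)
print(table)
```

Each row of the resulting table corresponds to one week, with one column per anomaly type.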
The code is implemented in Python 3 and the libraries needed to run it are NumPy, pandas and scikit-learn.
Warning: for the code to work, the `/data` folder must be unzipped inside this folder.
A brief description of each file follows.
Notebooks:
- `main.ipynb`
This notebook contains the complete analysis of a whole time series, using the method described in our report. The building (i.e. the time series) to analyse can be set in section `1.2 Load desired building`. The building code can vary from 1 to 105, the total number of buildings retained after data cleaning and preprocessing. All the steps performed during the analysis are carefully described in the notebook and in the report.
The output is a CSV file, `/output/output_building_<building_number>`, listing the weekly counts of both types of anomalies described above for the selected building.
- `main_baseline.ipynb`
This notebook contains the complete analysis of a whole time series, using the baseline method chosen (plain linear regression) as described in the report.
- `feature_selection_process.ipynb`
Notebook used to select the features used in our model. This is achieved by training a Random Forest regressor on different time series. More details can be found in our report. The collected results can be found in `/output/RF_feature_importances`.
- `grid_search_DBSCAN.ipynb`
Notebook used to select the parameter epsilon of the DBSCAN method applied in our pipeline. The selection is done through a grid search.
- `output.ipynb`
Notebook that produces the files contained in the `/output` folder:
    - `output.csv`: CSV file listing the weekly counts of both types of anomalies described above for every building.
    - `errors.csv`: CSV file reporting, for every building:
        - Percentage relative error computed on the non-atypical days
        - Percentage of anomalous measurements on non-atypical days
        - Percentage of anomalous days over the total number of days

To run the analysis on only some buildings, set the chosen indexes in the variable `building_indexes`. To obtain the plots of the clustering and of the final results, set `show_plots = True`.
- `output_analysis.ipynb`
Notebook used to analyse the results. In particular, the percentages of anomalies and of atypical days are computed for every building, in order to assess the quality of our method.
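As a rough illustration of the grid search over the DBSCAN parameter epsilon mentioned for `grid_search_DBSCAN.ipynb`, the sketch below runs scikit-learn's `DBSCAN` for several candidate values and records the number of clusters and the fraction of points labelled as noise. The criterion actually used to pick epsilon is described in the report and is not reproduced here; the function name `dbscan_eps_grid`, the candidate values, `min_samples=5` and the synthetic data are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_eps_grid(X, eps_values, min_samples=5):
    """Run DBSCAN once per candidate eps and summarise each clustering."""
    results = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # DBSCAN marks noise points with the label -1.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        noise_fraction = float(np.mean(labels == -1))
        results.append(
            {"eps": eps, "n_clusters": n_clusters, "noise_fraction": noise_fraction}
        )
    return results


# Synthetic data (assumption): two tight groups of points plus two isolated points.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(50, 2))
group_b = rng.normal(5.0, 0.1, size=(50, 2))
isolated = np.array([[2.5, 2.5], [10.0, -10.0]])
X = np.vstack([group_a, group_b, isolated])

results = dbscan_eps_grid(X, eps_values=[0.05, 0.5, 100.0])
for row in results:
    print(row)
```

Inspecting how the number of clusters and the noise fraction change with epsilon is one possible way to narrow down a sensible range before applying a more specific selection criterion.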
Helper files (every function implemented in the .py files comes with a precise description):
- `building_routine.py`
Contains two functions that execute the whole analysis for a given building. The first is used inside the notebook that generates the output, while the second is used in the feature selection phase.
- `helpers_preprocessing.py`
Functions used to preprocess the dataset, including NaN handling, selection of the time series representing our real source of interest, and formatting of the final dataset.
- `helpers_clustering.py`
Functions used to perform the clustering phase of our model, involving Fourier Transform, PCA and DBSCAN.
- `helpers_modelweek.py`
Construction of the model week used to regularize the time series.
- `helpers_linear_regression.py`
Functions used to build the modified linear regression model described in our report, aimed at detecting anomalies in the considered time series.
- `helpers_postprocessing.py`
Contains the function that creates the output table at the end of the analysis of a time series, as well as the one that computes the error of the model.
- `helpers_plot.py`
Functions used to show the results of our analysis, from the plot of the original time series, to the clustering analysis and anomaly detection results.
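The clustering phase listed under `helpers_clustering.py` combines a Fourier Transform, PCA and DBSCAN. The sketch below shows one way these three steps could be chained with NumPy and scikit-learn. It is not the implementation contained in that file: the function name `cluster_daily_profiles`, the split of the series into daily profiles of 24 samples, the use of the spectrum magnitude, the number of principal components and the values of `eps` and `min_samples` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def cluster_daily_profiles(daily_profiles, n_components=2, eps=0.5, min_samples=5):
    """Cluster daily profiles; rows labelled -1 by DBSCAN are treated as atypical.

    daily_profiles: array of shape (n_days, samples_per_day) (assumed layout).
    """
    # 1. Fourier Transform: magnitude of the real FFT of every day.
    spectra = np.abs(np.fft.rfft(daily_profiles, axis=1))
    # 2. PCA: project the spectra onto a few principal components.
    embedded = PCA(n_components=n_components).fit_transform(spectra)
    # 3. DBSCAN: density-based clustering in the reduced space.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embedded)
    return embedded, labels


# Synthetic data (assumption): 60 days with a daily cycle, 3 of them almost flat.
rng = np.random.default_rng(0)
t = np.arange(24)
typical = 10 + 5 * np.sin(2 * np.pi * t / 24)
profiles = typical + rng.normal(0, 0.1, size=(60, 24))
atypical_days = [10, 25, 40]
profiles[atypical_days] = 2.0 + rng.normal(0, 0.1, size=(3, 24))

embedded, labels = cluster_daily_profiles(profiles, eps=5.0)
print("Days labelled as noise:", np.where(labels == -1)[0])
```

In this toy example the three flat days are too few to form a cluster of their own with `min_samples=5`, so DBSCAN labels them as noise, while the remaining days fall into a single cluster.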