# # Extreme Learning Machine Ensemble with Uncertainties
# The script makes use of the **ELM_ensemble** class, which is modelled on the scikit-learn estimator interface.
# An ensemble of extreme learning machines is fitted to the training data (and to the validation data if the 'V' mode is selected). During training, the out-of-bag samples of each member are stored so that the out-of-bag prediction can be reconstructed later. After fitting, the model predicts on the training data and on the query data, storing the outputs of each run. The residuals on the out-of-bag data are then computed, and a second ELM ensemble is trained on these residuals. The variance of the ensemble prediction represents the model uncertainty, while the predicted residuals estimate the data uncertainty. [See FG poster]
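# The flow described above can be illustrated with a minimal, self-contained sketch. This is *not* the project's **ELM_ensemble** class: the ELM is reduced here to a random hidden layer with a least-squares readout, and all names (*fit_ensemble*, *oob_residuals*, ...) and parameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def fit_elm(X, y, n_hidden, rng):
    # one ELM: random hidden layer, least-squares readout
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    beta, *_ = np.linalg.lstsq(np.tanh(X @ W + b), y, rcond=None)
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

def fit_ensemble(X, y, n_models=20, n_hidden=50):
    # bootstrap-fit an ensemble, keeping the out-of-bag rows of each member
    models, oob_sets = [], []
    for _ in range(n_models):
        boot = rng.integers(0, len(X), size=len(X))
        models.append(fit_elm(X[boot], y[boot], n_hidden, rng))
        oob_sets.append(np.setdiff1d(np.arange(len(X)), boot))
    return models, oob_sets

def ensemble_predict(models, X):
    # mean is the estimate; the spread across members is the model uncertainty
    P = np.stack([predict_elm(m, X) for m in models])
    return P.mean(axis=0), P.var(axis=0)

def oob_residuals(models, oob_sets, X, y):
    # reconstruct the out-of-bag prediction of every sample from the members
    # that did not see it, and return the absolute residuals
    s, c = np.zeros(len(X)), np.zeros(len(X))
    for m, oob in zip(models, oob_sets):
        s[oob] += predict_elm(m, X[oob])
        c[oob] += 1
    seen = c > 0
    return np.abs(y[seen] - s[seen] / c[seen]), seen

# toy demonstration of the two-stage procedure
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
models, oob_sets = fit_ensemble(X, y)
mean, model_unc = ensemble_predict(models, X)   # variance ~ model uncertainty
res, seen = oob_residuals(models, oob_sets, X, y)
res_models, _ = fit_ensemble(X[seen], res)      # residual ensemble
data_unc, _ = ensemble_predict(res_models, X)   # ~ data uncertainty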
# _**Notes**_:
# - **Should the validation set be different (random) for each ELM?**
# - We should find a better way to implement the bootstrap - the current method does not scale to **big data** that does not fit in memory
# - The computation of the residuals relies on the above assumption and requires all data to be loaded into memory
# - Parallelising across the different ELMs (embedded in *ELM_ensemble.py*) should significantly improve performance
# - Find out how HPELM implements the parallelisation of the *hdf5_tools* - so that we can use the same approach for the functions in *norms.py*
# - We need a solution for handling very large query datasets!
# Split the data into training, validation and test sets before writing them to files, in order to obtain a test dataset for checking the model quality:
import numpy as np

N = res.shape[0]                # res: data array loaded earlier
ind = np.random.permutation(N)  # shuffle the row indices
# keep 10% for testing, and 10% of the remaining data for validation
ind_tr = int(0.81 * N)   # first 81% of the shuffled indices -> training
ind_val = int(0.9 * N)   # next 9% -> validation, last 10% -> testing
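# The cutoffs above partition the shuffled index array; a sketch of how the three index sets would be taken (the names *tr*, *val* and *te* are illustrative):

tr = ind[:ind_tr]           # 81% training
val = ind[ind_tr:ind_val]   # 9% validation
te = ind[ind_val:]          # 10% testing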
# In order to train the residual model, we need to write the residuals to a file and compute their normalisation - this is the core functionality of the **Table_Writer** class. The class writes the file, computes the mean, norm, max and min, and returns a pandas DataFrame with these statistics, which is then kept in memory:
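# The actual **Table_Writer** lives in the project code; the following is only a minimal sketch of the behaviour described above, with an assumed interface (the class name, the *write* method, the CSV output and the use of the standard deviation as 'norm' are all illustrative):

import numpy as np
import pandas as pd

class TableWriterSketch:
    # hypothetical stand-in for Table_Writer: writes the data to disk and
    # returns the per-column statistics needed for later normalisation
    def write(self, data, path, columns):
        df = pd.DataFrame(np.asarray(data), columns=columns)
        df.to_csv(path, index=False)              # write the table to file
        return pd.DataFrame({"mean": df.mean(),   # statistics kept in memory
                             "norm": df.std(),    # assumed scale factor
                             "max": df.max(),
                             "min": df.min()})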
# The norm computed above can be loaded into a **Normalizer** object, which provides the functions *normalize* and *rescale* that are useful when working with the data. Below, the residual norm is loaded, and the normalized residuals for training and validation, as well as the targets for training, validation and testing, are saved in the file given by the path *res_T_XX*. The test targets are never needed in memory and hence do not need to be stored.
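# As with the writer, a minimal sketch of what such a **Normalizer** could look like, assuming it is built from the statistics DataFrame above (the attributes and method bodies are illustrative, not the project's implementation):

class NormalizerSketch:
    # hypothetical stand-in for Normalizer, built from the stats DataFrame
    def __init__(self, norm):
        self.mean = norm["mean"].to_numpy()
        self.scale = norm["norm"].to_numpy()

    def normalize(self, x):
        return (x - self.mean) / self.scale  # map to zero mean, unit scale

    def rescale(self, x):
        return x * self.scale + self.mean    # invert normalize()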