# coding: utf-8
# In[27]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import pandas as pd
import xarray as xr
import os
import time
from features import Training, Testing
from tables import open_file, Atom, Filters
# # Setup
# The _features_ file is higher-level than the meteo_data file and works with a well-defined directory structure. In the given directory (_data_path_), there should be one subfolder with _raw_data_ and one subfolder with _datasets_. All models are stored in the latter, together with the feature and target tables for training, validation, testing and query (which is what I call a set of __unlabelled__ features, for which we cannot compute any error, but which gives us the output that we are ultimately interested in).
# In[2]:
data_path = os.path.abspath("/Users/alinawalch/Documents/EPFL/Code/meteo") # folder in which raw data is stored
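# For orientation, these are the folders that this layout implies (a rough sketch; *raw_data* and *datasets* are named above, and *locations* is used for the location masks and query points further below):
# In[ ]:
print(os.path.join(data_path, 'raw_data'))   # raw meteo data
print(os.path.join(data_path, 'datasets'))   # one subfolder per model, holding the feature and target tables
print(os.path.join(data_path, 'locations'))  # location masks and query points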
# # Training, validation and test data
# In[3]:
# Lists of features and labels
ftr_list = ['x','y','z','month','hour']
lbl_list = ['SIS']
# In[4]:
start_date = '20120101' # Format: 'yyyymmdd'
end_date = '20121231'
# In[5]:
# set location masks for the training and test data
train_locs = "locations/grid100_train.txt"
test_locs = "locations/grid100_test.txt"
# ## Create feature table
# In[6]:
modelname = "mytest"
# In[25]:
new_set = Training(data_path, modelname, ftr_list, lbl_list)
# The new set now corresponds to a new folder, with the following subfolders:
# In[8]:
print(new_set.train_path) # this will contain both training and validation feature tables
print(new_set.test_path)
# So far they are empty; in order to make a new feature table, we must run *make_dataset*. We can either create the training and test sets at the same time (i.e. only run through the data once), or we can run *make_testset* separately.
# In[9]:
new_set.make_dataset(start_date, end_date, sample_name = train_locs, test_name = test_locs)
# This is redundant now, but if *test_name* had not been set above, we could create the test set separately as follows (running it here will simply overwrite the existing files):
# In[10]:
new_set.make_testset(start_date, end_date, sample_name = test_locs)
# You can now find the following files in the model folder (a quick check is shown after this list):<br>
# - **norm_features.csv** and **norm_targets.csv** are csv files containing the normalisation statistics (mean, std, min, max)
# - Inside *train*: **features.hdf5** and **targets.hdf5** as well as csv files with training locations and training dates
# - Inside *test*: similar information labelled with **_test...**
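# A quick way to check the contents of the *train* and *test* folders from within the notebook:
# In[ ]:
print(sorted(os.listdir(new_set.train_path)))
print(sorted(os.listdir(new_set.test_path)))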
# ## Normalisation
# The __normalize_all__ function checks whether training and test features exist, and creates a copy of the files with normalised data. The training data is randomly split into __training__ and __validation__ data. The inputs are the following:
# - **feature_norm** and **target_norm**: Set to *'mean'* to standardize the data to zero mean and unit variance, and to *'range'* to scale it to [0,1]; anything else will just copy the data without performing any normalisation (see the small sketch after this list)
# - **val_ratio**: Ratio according to which the tables are split into training and validation sets (default 0.8, i.e. 80% training and 20% validation)
# - **force_normalization**: Overwrites the files if they already exist
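# For intuition, this is roughly what the two options do to a single column (a plain numpy sketch, not the module's actual implementation):
# In[ ]:
x = np.array([2.0, 4.0, 6.0, 8.0])
x_mean = (x - x.mean()) / x.std()              # 'mean': standardize to zero mean and unit variance
x_range = (x - x.min()) / (x.max() - x.min())  # 'range': scale to [0, 1]
print(x_mean, x_range)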
# In[11]:
new_set.normalize_all(feature_norm = 'mean', target_norm = 'range', val_ratio = 0.8)
# Some files have now been added to the *train* folder, with the suffix **_norm_mean** for the features and **_norm_range** for the targets, and the prefixes __train___ and __val___. Alternatively, you can also normalise the training and test data separately:
# In[12]:
new_set.normalize_train(feature_norm = 'mean', target_norm = 'range', val_ratio = 0.8)
# In[13]:
new_set.normalize_test(feature_norm = 'mean', target_norm = 'range')
# # Query data (the actual modelling)
# I have created the fishnets that give the locations for the final grid in ArcGIS, so all I am doing here is converting the locations from csv into a list of features and adding the additional information that I want (e.g. hour and month). The query locations are expected to be in the subfolder _locations_ in the data path.
# In[16]:
query_locs = 'query_points_1600.csv'
# In[17]:
hours = list(range(3,20))
months = list(range(1,13))
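# Conceptually, the query presumably combines every location with every requested hour and month. A toy sketch of that cross product follows (the hypothetical *toy_locs* merely stands in for the csv locations; the real table is built by *make_query* below):
# In[ ]:
from itertools import product
toy_locs = [(0.1, 0.5), (0.2, 0.6)]  # illustrative (x, y) pairs
toy_query = pd.DataFrame([(x, y, h, m) for (x, y), h, m in product(toy_locs, hours, months)],
                         columns=['x', 'y', 'hour', 'month'])
print(len(toy_query))  # 2 locations x 17 hours x 12 months = 408 rows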
# In[18]:
myquery = Testing(data_path, modelname, query_name = 'grid1600')
# In[19]:
myquery.make_query(loc = query_locs, hour = hours, month = months)
# In[20]:
# The new file is now found in:
myquery.features_query
# Again, the data can be normalised (and also re-scaled on the other end):
# In[21]:
myquery.normalize_input(norm_type = 'mean')
# In order to get back to the original data range so we can plot the results, we need to rescale the targets. These would normally be computed by the model, but just for this demo we can __copy the targets__ from the test set (*test/test_targets_norm_range*) and then run this:
# In[23]:
# we can actually rescale any file:
myquery.rescale_output(h5file = 'test_targets_norm_range.hdf5', target_name = 'query_features_rescaled', norm_type = 'range')
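# For intuition, undoing the 'range' normalisation is just the inverse linear map, using the min and max stored in norm_targets.csv (again a plain numpy sketch, not the module's implementation):
# In[ ]:
y = np.array([100.0, 250.0, 400.0])
y_norm = (y - y.min()) / (y.max() - y.min())     # forward 'range' scaling
y_back = y_norm * (y.max() - y.min()) + y.min()  # rescaling recovers the original values
print(np.allclose(y, y_back))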
# # Reconverting to xarray
# To reconvert the tables to an xarray, we first need to concatenate the features and targets again. This example uses the original feature and target tables:
# In[30]:
ftrs = open_file(os.path.join(new_set.train_path, 'features.hdf5'), "r")
for f_node in ftrs.walk_nodes():
    pass # walk through all nodes; f_node ends up as the data array, whatever its name
dt = f_node.dtype
Nf, df = f_node.shape # HDF5 files are transposed, for Matlab compatibility
# In[ ]:
tgts = open_file(os.path.join(new_set.train_path, 'targets.hdf5'), "r")
for t_node in tgts.walk_nodes():
    pass # walk through all nodes; t_node ends up as the target array
Nt, dt = t_node.shape # HDF5 files are transposed, for Matlab compatibility
# In[35]:
X = pd.DataFrame(data = f_node[:,:], columns = new_set.features.cols)
Y = pd.DataFrame(data = t_node[:,:], columns = new_set.targets.cols)
# In[47]:
tbl = pd.concat([X, Y], axis = 1)
tbl
# In[48]:
tbl.set_index(['x','y','month','hour'], inplace = True)
# Since some information, such as the year or the day in this case, is not present in the data, converting anything other than the **query** data back to an xarray (which we should not normally need or want to do) can produce duplicates - multiple entries with the exact same index. Such a table causes trouble and cannot be converted to an xarray. To make sure that there are no duplicates, we can take the mean over all index variables:
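# We can check directly whether the index contains duplicates before attempting the conversion:
# In[ ]:
print(tbl.index.duplicated().any())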
# In[49]:
unique_table = tbl.groupby(['x','y','month','hour']).mean()
unique_table
# In[50]:
# et voila:
unique_table.to_xarray()
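# Finally, close the two HDF5 files that were opened above:
# In[ ]:
ftrs.close()
tgts.close()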
