# The _features_-file is more high-level than the meteo_data file and has a well-defined file structure. The given directory (_data_path_) should contain one subfolder with _raw_data_ and one subfolder with _datasets_. All models are stored there, together with the feature and target tables for training, validation, testing and query (the query set is what I call a set of __unlabelled__ features, for which we cannot compute any error but which gives us the output we are ultimately interested in).
# In[2]:
import os

data_path = os.path.abspath("/Users/alinawalch/Documents/EPFL/Code/meteo")  # folder in which raw data is stored
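# As a minimal sketch (the subfolder names follow the description above; the dataset class may also create them itself), the expected layout under *data_path* can be checked like this:
# In[ ]:
raw_data_path = os.path.join(data_path, 'raw_data')   # raw meteo data
datasets_path = os.path.join(data_path, 'datasets')   # feature/target tables and models
for path in (raw_data_path, datasets_path):
    print(path, '-> exists:', os.path.isdir(path))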
# The new set now exists as a folder of its own, with the subfolders as specified:
# In[8]:
print(new_set.train_path)  # this will contain both training and validation feature tables
print(new_set.test_path)
# So far they are empty; in order to make a new feature table, we must run *make_dataset*. We can either create training and testing at the same time (i.e. only run through the data once), or we can run *make_testset* separately.
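# A hedged sketch of the two calls (assuming *make_dataset* and *make_testset* are methods of the new dataset object; their exact arguments depend on the class definition):
# In[ ]:
new_set.make_dataset()   # builds the training (and optionally also the test) feature tables
new_set.make_testset()   # builds the test tables in a separate pass, if not done above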
# You can now find the following files in the model folder:
# - **norm_features.csv** and **norm_targets.csv** are CSV files containing the normalisation data (mean, std, min, max)
# - Inside *train*: **features.hdf5** and **targets.hdf5** as well as csv files with training locations and training dates
# - Inside *test*: similar information labelled with **_test...**
# ## Normalisation
# The __normalise_all__ function checks if training and testing features exist and creates a copy of the files with normalised data. The training data is randomly split into __training__ and __validation__ data. The inputs are the following (a sketch of a typical call follows the list):
# - **feature_norm** and **target_norm**: Set to *'mean'* to standardise the data to zero mean and unit variance, or to *'range'* to scale it to [0, 1]. Anything else will just copy the data and not perform any normalisation
# - **val_ratio**: Ratio according to which the tables are split into training and validation sets (default 0.8, i.e. 80% training and 20% validation)
# - **force_normalization**: Overwrites the files if they already exist
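# A short sketch of a typical call (assuming __normalise_all__ is a method of the dataset object; the parameter names follow the list above):
# In[ ]:
new_set.normalise_all(feature_norm='mean',        # standardise features to zero mean and unit variance
                      target_norm='range',        # scale targets to [0, 1]
                      val_ratio=0.8,              # 80% training, 20% validation
                      force_normalization=False)  # keep existing normalised files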
# Now, some files were added in the *train*-folder with the suffix **_norm_mean** for the features and **_norm_range** for the targets, and the prefixes **train_** and **val_**. Alternatively, you can also normalise the training and testing data separately:
# I have created the fishnets that give the locations for the final grid in ArcGIS, so all I do here is convert the locations from csv into a list of features and add the additional information that I want (e.g. hour and month). The query locations are expected to be in the subfolder _locations_ in the data path.
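# For illustration only, the conversion could look roughly like this in plain pandas (the file name *grid_locations.csv* is a placeholder; the actual query class does this work internally):
# In[ ]:
import pandas as pd

# read the grid locations exported from ArcGIS (placeholder file name)
locations = pd.read_csv(os.path.join(data_path, 'locations', 'grid_locations.csv'))

# repeat every location for each hour and month we want to query
hours = pd.DataFrame({'hour': range(24)})
months = pd.DataFrame({'month': range(1, 13)})
query_features = locations.merge(hours, how='cross').merge(months, how='cross')
print(query_features.head())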
# Again, the data can be normalised (and later rescaled back at the other end):
# In[21]:
myquery.normalize_input(norm_type='mean')  # standardise the query features to zero mean and unit variance
# In order to get back to the original data format so we can plot it, we need to rescale the targets. These must be computed by the model, but just for this demo we can __copy the targets__ from the test set (*test/test_targets_norm_range*) and then run this:
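# Purely for illustration (independent of the actual rescaling call), undoing the range scaling by hand from the values stored in **norm_targets.csv** would look roughly like this; the file layout and the variable *targets_norm* are assumptions:
# In[ ]:
import pandas as pd

# assumed: norm_targets.csv has one row per target variable with columns 'min' and 'max'
norm = pd.read_csv('norm_targets.csv', index_col=0)  # placeholder path to the model folder

# undo the [0, 1] range scaling column by column
targets = targets_norm * (norm['max'] - norm['min']) + norm['min']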
# To reconvert the tables to xarray, we first need to concatenate the features and targets again. This example uses the original feature and target tables:
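# A minimal sketch of the reconversion with pandas/xarray (the variable names *features* and *targets* and the index columns are assumptions about the table layout):
# In[ ]:
import pandas as pd

# put features and targets side by side, then use the coordinate columns as the index
table = pd.concat([features, targets], axis=1)
ds = table.set_index(['x', 'y', 'hour', 'month']).to_xarray()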
# As some information, such as the year or the day in this case, is not present in the data, trying to convert anything other than the **query** data back to an xarray (which we should not do or want to do anyway) may produce duplicates: multiple entries with the exact same index. Such a table causes trouble and cannot be converted to an xarray. To make sure that there are no duplicates, we can take the mean over all index variables:
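# With the same assumed index columns as above, this could look like:
# In[ ]:
index_cols = ['x', 'y', 'hour', 'month']           # assumed index variables
table_unique = table.groupby(index_cols).mean()    # average away duplicate index entries
ds = table_unique.to_xarray()                      # now every index combination is unique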