# coding: utf-8
# In[27]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import pandas as pd
import xarray as xr
import os
import time
from features import Training, Testing
from tables import open_file, Atom, Filters
# # Setup
# The _features_ file is higher-level than the meteo_data file and works with a well-defined directory structure. In the given directory (_data_path_), there should be one subfolder with _raw_data_ and one subfolder with _datasets_. All models are stored in the latter, together with the feature and target tables for training, validation, testing and query (which is what I call a set of __unlabelled__ features, for which we cannot compute any error, but which gives us the output that we are ultimately interested in).
# In[2]:
data_path = os.path.abspath("/Users/alinawalch/Documents/EPFL/Code/meteo") # folder in which raw data is stored
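# For orientation, these are the folders that this layout implies (a rough sketch; *raw_data* and *datasets* are named above, and *locations* is used for the location masks and query points further below):
# In[ ]:
print(os.path.join(data_path, 'raw_data'))   # raw meteo data
print(os.path.join(data_path, 'datasets'))   # one subfolder per model, holding the feature and target tables
print(os.path.join(data_path, 'locations'))  # location masks and query points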
# # Training, validation and test data
# In[3]:
# Lists of features and labels
ftr_list = ['x','y','z','month','hour']
lbl_list = ['SIS']
# In[4]:
start_date = '20120101' # Format: 'yyyymmdd'
end_date = '20121231'
# In[5]:
# set location masks for the training and test data
train_locs = "locations/grid100_train.txt"
test_locs = "locations/grid100_test.txt"
# ## Create feature table
# In[6]:
modelname = "mytest"
# In[25]:
new_set = Training(data_path, modelname, ftr_list, lbl_list)
# The new set now corresponds to a new folder, with the following subfolders:
# In[8]:
print(new_set.train_path) # this will contain both training and validation feature tables
print(new_set.test_path)
# So far they are empty; in order to make a new feature table, we must run *make_dataset*. We can either create the training and test sets at the same time (i.e. only run through the data once), or we can run *make_testset* separately.
# In[9]:
new_set.make_dataset(start_date, end_date, sample_name = train_locs, test_name = test_locs)
# This is redundant now, but if *test_name* had not been set above, we could create the test set separately as follows (running it here will simply overwrite the existing files):
# In[10]:
new_set.make_testset(start_date, end_date, sample_name = test_locs)
# You can now find the following files in the model folder (a quick check is shown after this list):<br>
# - **norm_features.csv** and **norm_targets.csv** are csv files containing the normalisation statistics (mean, std, min, max)
# - Inside *train*: **features.hdf5** and **targets.hdf5** as well as csv files with training locations and training dates
# - Inside *test*: similar information labelled with **_test...**
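# A quick way to check the contents of the *train* and *test* folders from within the notebook:
# In[ ]:
print(sorted(os.listdir(new_set.train_path)))
print(sorted(os.listdir(new_set.test_path)))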
# ## Normalisation
# The __normalize_all__ function checks whether training and test features exist, and creates a copy of the files with normalised data. The training data is randomly split into __training__ and __validation__ data. The inputs are the following:
# - **feature_norm** and **target_norm**: Set to *'mean'* to standardize the data to zero mean and unit variance, and to *'range'* to scale it to [0,1]; anything else will just copy the data without performing any normalisation (see the small sketch after this list)
# - **val_ratio**: Ratio according to which the tables are split into training and validation sets (default 0.8, i.e. 80% training and 20% validation)
# - **force_normalization**: Overwrites the files if they already exist
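# For intuition, this is roughly what the two options do to a single column (a plain numpy sketch, not the module's actual implementation):
# In[ ]:
x = np.array([2.0, 4.0, 6.0, 8.0])
x_mean = (x - x.mean()) / x.std()              # 'mean': standardize to zero mean and unit variance
x_range = (x - x.min()) / (x.max() - x.min())  # 'range': scale to [0, 1]
print(x_mean, x_range)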
# In[11]:
new_set.normalize_all(feature_norm = 'mean', target_norm = 'range', val_ratio = 0.8)
# Some files have now been added to the *train* folder, with the suffix **_norm_mean** for the features and **_norm_range** for the targets, and the prefixes __train___ and __val___. Alternatively, you can also normalise the training and test data separately:
# In[12]:
new_set.normalize_train(feature_norm = 'mean', target_norm = 'range', val_ratio = 0.8)
# In[13]:
new_set.normalize_test(feature_norm = 'mean', target_norm = 'range')
# # Query data (the actual modelling)
# I have created the fishnets that give the locations for the final grid in ArcGIS, so all I am doing here is converting the locations from csv into a list of features and adding the additional information that I want (e.g. hour and month). The query locations are expected to be in the subfolder _locations_ in the data path.
# In[16]:
query_locs = 'query_points_1600.csv'
# In[17]:
hours = list(range(3,20))
months = list(range(1,13))
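# Conceptually, the query presumably combines every location with every requested hour and month. A toy sketch of that cross product follows (the hypothetical *toy_locs* merely stands in for the csv locations; the real table is built by *make_query* below):
# In[ ]:
from itertools import product
toy_locs = [(0.1, 0.5), (0.2, 0.6)]  # illustrative (x, y) pairs
toy_query = pd.DataFrame([(x, y, h, m) for (x, y), h, m in product(toy_locs, hours, months)],
                         columns=['x', 'y', 'hour', 'month'])
print(len(toy_query))  # 2 locations x 17 hours x 12 months = 408 rows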
# In[18]:
myquery = Testing(data_path, modelname, query_name = 'grid1600')
# In[19]:
myquery.make_query(loc = query_locs, hour = hours, month = months)
# In[20]:
# The new file is now found in:
myquery.features_query
# Again, the data can be normalised (and also re-scaled on the other end):
# In[21]:
myquery.normalize_input(norm_type = 'mean')
# In order to get back to the original data range so we can plot the results, we need to rescale the targets. These would normally be computed by the model, but just for this demo we can __copy the targets__ from the test set (*test/test_targets_norm_range*) and then run this:
# In[23]:
# we can actually rescale any file:
myquery.rescale_output(h5file = 'test_targets_norm_range.hdf5', target_name = 'query_features_rescaled', norm_type = 'range')
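# For intuition, undoing the 'range' normalisation is just the inverse linear map, using the min and max stored in norm_targets.csv (again a plain numpy sketch, not the module's implementation):
# In[ ]:
y = np.array([100.0, 250.0, 400.0])
y_norm = (y - y.min()) / (y.max() - y.min())     # forward 'range' scaling
y_back = y_norm * (y.max() - y.min()) + y.min()  # rescaling recovers the original values
print(np.allclose(y, y_back))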
# # Reconverting to xarray
# To reconvert the tables to an xarray, we first need to concatenate the features and targets again. This example uses the original feature and target tables:
# In[30]:
ftrs = open_file(os.path.join(new_set.train_path, 'features.hdf5'), "r")
for f_node in ftrs.walk_nodes():
    pass # walk through all nodes; f_node ends up as the data array, whatever its name
dt = f_node.dtype
Nf, df = f_node.shape # HDF5 files are transposed, for Matlab compatibility
# In[ ]:
tgts = open_file(os.path.join(new_set.train_path, 'targets.hdf5'), "r")
for t_node in tgts.walk_nodes():
    pass # walk through all nodes; t_node ends up as the target array
Nt, dt = t_node.shape # HDF5 files are transposed, for Matlab compatibility
# In[35]:
X = pd.DataFrame(data = f_node[:,:], columns = new_set.features.cols)
Y = pd.DataFrame(data = t_node[:,:], columns = new_set.targets.cols)
# In[47]:
tbl = pd.concat([X, Y], axis = 1)
tbl
# In[48]:
tbl.set_index(['x','y','month','hour'], inplace = True)
# Since some information, such as the year or the day in this case, is not present in the data, converting anything other than the **query** data back to an xarray (which we should not normally need or want to do) can produce duplicates - multiple entries with the exact same index. Such a table causes trouble and cannot be converted to an xarray. To make sure that there are no duplicates, we can take the mean over all index variables:
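# We can check directly whether the index contains duplicates before attempting the conversion:
# In[ ]:
print(tbl.index.duplicated().any())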
# In[49]:
unique_table = tbl.groupby(['x','y','month','hour']).mean()
unique_table
# In[50]:
# et voila:
unique_table.to_xarray()
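# Finally, close the two HDF5 files that were opened above:
# In[ ]:
ftrs.close()
tgts.close()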
