import matplotlib
matplotlib.use('Agg')  # Must be called before importing matplotlib.pyplot or pylab!
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 100)
# In[ ]:
# import scikit learn packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# In[ ]:
from helpers import *
# **Settings**
#
# *Booleans*
# - ``seasonwise``: If set to ``True``, perform 4 separate regressions, splitting the dataset season by season. Otherwise perform a single regression.
# - ``feature_selection``: If set to ``True``, keep only the features selected by random forest feature selection (80% importance, 13 out of 20 features). Otherwise use all 20 features.
# - ``output_y_only``: If set to ``True``, perform the regression with ``u_y`` as the only output. Otherwise consider the speed in both $x$ and $y$.
#
# *Parameters*
# - ``rnd_state``: Random seed used when splitting the dataset. Default in all algorithms is $50$.
# - ``alphas``: Candidate values of the regularization parameter among which the ridge regression selects the best one. Default is ``np.logspace(-10,5,200)``.
# - ``deg``: Degree of the polynomial expansion. Default is $3$.
# - ``train_dim``, ``test_dim``, ``validate_dim``: Fractions of the dataset used for training, testing and validation. Defaults are $0.6$, $0.2$ and $0.2$, respectively.
#
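# The parameters listed above can be illustrated with a minimal sketch. It is not
# necessarily the code used further down in this script: ``split_three_way`` and
# ``build_ridge_pipeline`` are hypothetical helper names, and the order of the
# pipeline steps (scaling, polynomial expansion, ridge) is an assumption.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def split_three_way(X, y, train_dim=0.6, validate_dim=0.2, test_dim=0.2, rnd_state=50):
    """Split X, y into train/validation/test parts using two successive splits."""
    if abs(train_dim + validate_dim + test_dim - 1.0) > 1e-9:
        raise ValueError("train_dim + validate_dim + test_dim must equal 1")
    # First cut: hold out the test fraction of the whole dataset.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_dim, random_state=rnd_state)
    # Second cut: the validation fraction is expressed relative to what remains.
    val_share = validate_dim / (train_dim + validate_dim)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_share, random_state=rnd_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


def build_ridge_pipeline(deg=3, alphas=None):
    """Standardize, expand to polynomial features of degree ``deg``, then fit RidgeCV."""
    if alphas is None:
        alphas = np.logspace(-10, 5, 200)  # default grid quoted in the settings above
    return make_pipeline(StandardScaler(), PolynomialFeatures(deg), RidgeCV(alphas=alphas))
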
# *Memory problem:* If a ``MemoryError`` arises (very unlikely with the current parameters and 32 GB of RAM), several changes can make the script less RAM-heavy. With ``seasonwise = True`` the regression is performed season by season, so each regression runs on a dataset $1/4$ of the size. Alternatively, while keeping the yearly regression, the matrix dimensions can be reduced by lowering the polynomial degree of the expansion (``deg``) or the size of the training dataset (``train_dim``). Both of these operations reduce the overall performance of the algorithm.
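# To make the memory argument above concrete, here is a rough sketch of how the size
# of the expanded design matrix grows. A full polynomial expansion of ``n_features``
# inputs up to degree ``deg`` (bias column included, as ``PolynomialFeatures`` does
# by default) has C(n_features + deg, deg) columns. ``poly_design_matrix_gib`` is a
# hypothetical helper, and 8 bytes per value (float64) is an assumption.
from math import comb


def poly_design_matrix_gib(n_samples, n_features, deg=3, bytes_per_value=8):
    """Return (number of expanded columns, approximate dense matrix size in GiB)."""
    n_cols = comb(n_features + deg, deg)
    size_gib = n_samples * n_cols * bytes_per_value / 2**30
    return n_cols, size_gib

# With all 20 features and deg = 3 the expansion has 1771 columns; with the 13
# selected features it has 560, and with 20 features and deg = 2 it has 231.
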
#
# The plot names and the titles on the pictures themselves explain their content. An interval is chosen at random to visualize the behaviour of the true values and of the predictions. The MSE, $R^2$, average prediction, average true values and average standard deviations are all saved in ``.txt`` format. The last entry of each line identifies the period (season or all year), while every number other than the first refers to the mast anemometers, from 1 to 6 in this order.
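
# A minimal sketch of how one MSE and one $R^2$ value per output column can be
# obtained with scikit-learn. ``per_output_scores`` is a hypothetical helper; how
# the columns map to the anemometers, and the layout of the saved ``.txt`` files,
# are not reproduced here.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def per_output_scores(y_true, y_pred):
    """Return (mse, r2) arrays with one entry per output column."""
    mse = mean_squared_error(y_true, y_pred, multioutput='raw_values')
    r2 = r2_score(y_true, y_pred, multioutput='raw_values')
    return mse, r2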