"Using `pandas`, we can compute the correlation between each pair of variables. To improve the readability of the resulting table, we will style the table with red colors for positive values, and blue colors for negative values:"
"- a. The corrlation matrix shows a strong negative correlation between the y-coordinate of the roof (roof_y) and SIS/SISDIR. Can you explain why? Why do we see no such correlation for roof_x?\n",
"- b. The correlation matrix also shows two correlated \"blocks\" of the horizon values. What causes these blocks and what does that imply for the model?\n",
"- c. Which are the five variables with the strongest correlation to the target variable? Which are the five variables with the smallest correlation?"
]
},
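{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the styled correlation table described above is given in the next cell. It assumes the dataset is available as the `data` DataFrame used elsewhere in this notebook; the colour map is one possible choice:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: styled correlation matrix (red = positive, blue = negative)\n",
"corr = data.corr()  # pairwise (Pearson) correlations of the numeric columns\n",
"corr.style.background_gradient(cmap='coolwarm', vmin=-1, vmax=1)  # vmin/vmax require a recent pandas version"
]
},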
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### *SOLUTION*\n",
"- a. The strong correlation is due to the fact that the y-coordinate represents the northern *latitude*, which determines the maximum solar angle and the total daylight hours; this influences the global and direct horizontal radiation - the further north we are, the lower the horizontal radiation. The roof_x is the *longitude*, which impacts the annual radiation to a much lesser extent.\n",
"- b. The two \"blocks\" are formed around the east and west horizon values, i.e. the horizon values for eastern azimuths are strongly positively correlated, and are slightly negatively correlated to the horizon values for western azimuths. The strong positive correlation suggests that we don't need *all* the horizon values, but a subset would suffice.\n",
"- c. The strongest correlation with the target is observed for the horizon values for **S, SSE, SSW, SEE and SWW**. The smallest correlation is seen for **roof_x, horizon_NWW, roof_area, DHI and roof_aspect**.<br>**NOTE**: Just because the 5 horizons have the highest correlation, it does not mean that these are the best features to select!"
]
},
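{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to verify the answer to question c is to rank the absolute correlations with the target. The sketch below assumes the dataset is in `data` and that the name of the target column is stored in `target_name` (a placeholder here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: rank features by their (absolute) correlation with the target\n",
"# 'target_name' is a placeholder for the name of the target column\n",
"corr_with_target = data.corr()[target_name].drop(target_name)  # correlation of each feature with the target\n",
"print(corr_with_target.abs().sort_values(ascending=False).head(5))  # five strongest correlations\n",
"print(corr_with_target.abs().sort_values().head(5))                 # five weakest correlations"
]
},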
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1.3 - Visualising features vs. targets "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start, we set up the matplotlib interactive plotting:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib notebook\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and define the list of features and the target, to know what to plot against each other:"
" data.plot.scatter(x = feature, y = target_name, ax = ax[i], s = 2) # Plot feature against target\n",
" # Format figure:\n",
" ax[i].set_xlabel('')\n",
" ax[i].set_title(feature)\n",
" ax[i].grid(ls = '--')\n",
" \n",
" # For the roof area only, show the x-axis with logarithmic values:\n",
" if feature == 'roof_area': ax[i].set_xscale('log')\n",
"for j in range(i+1, len(ax)): ax[j].axis('off') # deactivate unused sub-axes\n",
"plt.tight_layout()"
]
},
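{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a self-contained sketch of such a feature-vs-target scatter grid is shown below. The feature list (all columns except the target), the grid size and the variable `target_name` are assumptions/placeholders, not necessarily the exact setup used above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: scatter plot of every feature against the target\n",
"features = [col for col in data.columns if col != target_name]  # assumption: all columns except the target\n",
"n_cols = 4\n",
"n_rows = (len(features) + n_cols - 1) // n_cols  # enough rows to hold all features\n",
"fig, ax = plt.subplots(n_rows, n_cols, figsize=(10, 2.5 * n_rows))\n",
"ax = ax.flatten()\n",
"for i, feature in enumerate(features):\n",
"    data.plot.scatter(x=feature, y=target_name, ax=ax[i], s=2)  # plot feature against target\n",
"    ax[i].set_xlabel('')\n",
"    ax[i].set_title(feature)\n",
"    ax[i].grid(ls='--')\n",
"    if feature == 'roof_area':\n",
"        ax[i].set_xscale('log')  # logarithmic x-axis for the roof area only\n",
"for j in range(i + 1, len(ax)):\n",
"    ax[j].axis('off')  # deactivate unused sub-axes\n",
"plt.tight_layout()"
]
},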
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## *Questions*\n",
"- a. Based on the scatterplots, which variable shows the strongest relatioship with the target?\n",
"- b. What is the correlation coefficient between this varialbe and the target?\n",
"- c. How do you explain the discrepancy, and what does this mean for the modelling?\n",
"- d. Can you think of a way to transform this variable in order to increase its correlation coefficient?\n",
"- e. For the other 4 features with the smallest correlation to the target from Step 1.2, do you observe a similar discrepancy or are the correlation coefficients confirmed by the plots?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### *SOLUTION*\n",
"- a. The scatterplots suggests that the tilted radiation depends strongly on the *__roof aspect__*. This does not come as a surprise; north-facing roofs receive less solar radiation than south-facing roofs.\n",
"- b. The correlation coefficient of the roof aspect is **-0.036**, making it one of the *lowest* coefficients in the dataset.\n",
"- c. The relationship of the roof_aspect to the tilted radiation is highly **non-linear** (rather quadratic). Consequently, the measure of *linear correlation* is low. This is an important pitfall of using correlation coefficients to measure the importance of features for the modelling. We should hence always cross-check the scatterplots.\n",
"- d. We could transform the roof aspect by dividing it into two features: *abs(roof_aspect)* and *sign(roof_aspect)*. This would help to *linearize* thia feature, and can significantly improve the performance of the linear regression that we have seen in Tutorial 1. Some estimators do not require linear features, but in some cases this kind of **feature transformation** can significantly improve model performance.\n",
"- e. In the other 4 cases (roof_x, horizon_NWW, roof_area, DHI), the low correlation coefficient is confirmed by a near-random scatterplot."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1.4 - Feature transformation\n",
"Let's transform the data in this way by computing first the sign and then replacing the roof aspect by it's absolute value:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# SOLUTION\n",
"data['aspect_sign'] = np.sign(data['roof_aspect']) ## Compute the sign of the roof_aspect\n",
"data['roof_aspect'] = abs(data['roof_aspect']) ## Compute the absolute value of the roof_aspect"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check the correlation coefficient of the new features with the tilted radiation by using the `corrwith` function, and will see that it is much improved and that the roof_aspect becomes the *most important feature*:"
"cv_score = cross_val_score(knn, X_reduced, y, cv = N_FOLDS)\n",
"cv_score"
]
},
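{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the `corrwith` check mentioned in Step 1.4 could look like the following sketch (with `target_name` again a placeholder for the target column):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: correlation of the transformed features with the target\n",
"# 'target_name' is a placeholder for the name of the target column\n",
"data[['roof_aspect', 'aspect_sign']].corrwith(data[target_name])"
]
},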
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `cv_score` shows the results for all folds, but often it is more informative to obtain the *mean* and *standard deviation* instead:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7629847749956069\n",
"0.043022680058039615\n"
]
}
],
"source": [
"print( cv_score.mean() )\n",
"print( cv_score.std() )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Comparison with all features\n",
"As a comparison, let's look at the score obtained from the default knn with *all* features:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.6614220209767506\n",
"0.056629430840869355\n"
]
}
],
"source": [
"# Solution\n",
"knn = KNeighborsRegressor() # Initiate model\n",
"cv_score = cross_val_score(knn, X_scaled, y, cv = N_FOLDS)\n",
"\n",
"print( cv_score.mean() )\n",
"print( cv_score.std() )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Comparison with default feature set (before feature transformation)\n",
"Another comparison that we can perform is to the default dataset, i.e. before applying the feature transformation from Step 1.4. We have wrapped this analysis in the function `cross_val_score_default_knn()` (found in the lib_file.py). The default CV score (all features, no transformation) is:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.505666303468904\n",
"0.03045349442565308\n"
]
}
],
"source": [
"cv_score = cross_val_score_default_knn()\n",
"print( cv_score.mean() )\n",
"print( cv_score.std() )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## *Questions*\n",
"- a. Which performance metric is used in the `cross_val_score` method? What does a score of 1 represent? What about a score of 0?\n",
"- b. How is the performance of the model with the default feature set improved through feature transformation (with all features) and subsequently feature selection (keeping 8 features)?\n",
"- c. What does this very first glance suggest about the importance of feature engineering (transformation, selection, etc.)?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### *SOLUTION*\n",
"- a. The cross_val_score uses the default `score`-method of the passed algorithm, i.e. the R2-coefficient of determination, which characterises the model's goodness of fit (i.e. the percentage of variance explained). A score of 1 represents a perfect model, with all variance explained. A score of 0 (or near-zero) represents complete randomness.\n",
"- b. By feature transformation, the score is improved from around 0.5 to around 0.65, and by feature selection, it is further improved to above 0.75 - this is a significant improvement, **just by improving the data used for modelling, not by changing the model itself.**\n",
"- c. Feature engineering is at least as important as the modelling process itself!"
"- a. What is the optimal number of features? **NOTE**: These results can vary due to the re-shuffling of the data!\n",
"- b. How much of the improvement compared to the default model from tutorial 2 can be attributed to the feature transformation of the `roof_aspect`? How do you measure that?\n",
"- c. How much of the improvement can be attributed to the feature selection?"
]
},
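{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder for solution a above, the R2 score returned by `cross_val_score` for a regressor is the coefficient of determination\n",
"\n",
"$$R^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2},$$\n",
"\n",
"where $y_i$ are the observed targets, $\\hat{y}_i$ the predictions and $\\bar{y}$ the mean of the observed targets."
]
},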
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### *SOLUTION*\n",
"- a. The optimal number of features should lie somewhere between 4 - 9 features.\n",
"- b. If we compare the results with all features, we see that the score is increased from around 0.4 to 0.65, simply by transforming the `roof_aspect`.\n",
"- c. If we compare the score for the optimal number of features (around 0.75-0.8) to the score for all features, we see that the feature selection increases the average cv-score by another +0.1 approximately."