Prepare the data for Ridge Regression. The report discusses each method used to suppress outliers.
Both the report and the Jupyter notebook include a heatmap of the data that explains why some features are dropped and others are kept.
We strongly encourage the reader to look at the notebook Ridge_Regression_analysis, specifically the part on "Data Exploratory Analysis".
"""
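# load one or more data files (for instance years 2016 and 2017)
if isinstance(file_names, str):
    # a single path was given: wrap it so the loop below handles both cases
    file_names = [file_names]
dfs = []
for file_name in file_names:
    local_df = pd.read_csv(file_name)
    dfs.append(local_df)
# stack all files into one frame, keeping the original column order
df = pd.concat(dfs, sort=False)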
# extract the target and the installation location
y = df['Produktion [kWh]**'].values
emplacement = df['Anlage_Ort # Emplacement de installation'].values
# drop the target and the location from the data
df = df.drop(columns=['Produktion [kWh]**', 'Anlage_Ort # Emplacement de installation'])
# remove features selected by hand and using cross-validation
df = df.drop(columns=[
    'Total Anlage',
    'natürliche Personen',
    'Population: Habitants',
    'Répartition par âge en %: 0-19 ans',
    'Mouvement de la population (en ‰): Taux brut de nuptialité',
    'Mouvement de la population (en ‰): Taux brut de divortialité',
    'Mouvement de la population (en ‰): Taux brut de natalité',
    'Economie: Secteur primaire',
df = df.drop(columns=['Constructions et logements: Taux de logements vacants', 'Ménage: Ménages privés', 'Economie: Emplois total', 'Surface: Variation en ha.1', 'C5', 'C2'])