Building machine learning models to predict biotransformation half-lives of micropollutants in activated sludge
- Key Words:
- *Machine Learning, Bayesian Inference, Combined Learning, Biotransformation, Organic Micropollutants*
This project involved curating the Eawag-Sludge package in enviPath database and using the curated data to develop machine learning models that can predict biotransformation half-lives of organic micropollutants in activated sludge.
To overcome the limited size of training data, we combined the kinetics data of the Eawag-Soil package, which contained 895 data points, with the 160 half-lives of the Eawag-Sludge package. To support the theoretical basis for this combined learning approach, we performed a read-across analysis on 27 compounds that were common to both data sets. Our results indicate that using the Bayesian mean is more effective in capturing the relationship between the half-lives of sludge and soil.
How to use these scripts?
Step 1: Fetch the data from enviPath database
extract_all_compounds.py input: (none) output: sludge_Original_raw.tsv sludge_Leo_raw.tsv sludge_Rich_raw.tsv
package_id: The url of the Eawag-Sludge package or expanded packages created by other contributors. For example:
- Original sludge package:
- Rich sludge package:
- Leo sludge package: https://envipath.org/package/195bc500-f0c6-4bcb-b2fe-f1602b5f20a2
Columns of data were extracted from enviPath including the molecular identifier, kinetics data, biomass related data, bioreactor, the addition of nutrients, oxygen uptake rate, location, and WWTP purpose. First, all the pathways were extracted by the get_pathway() method of the Package class. Secondly, each pathway could have multiple nodes, and every node has its own dictionary that collects distinct scenarios. These scenario objects include the kinetics data of the compound measured under distinct experimental conditions. The id of the scenario is one of the arguments of the scenario class. By instantiating the scenario class, the get_additional_information() method can access to multiple objects such as halflife objects or rate constant objects. The get_values() and get_unit() attributes can eventually fetch the value and unit of the object.
The node also has the get_default_structure() attribute which can provide the molecular identifier such as the molecular id, molecular name and SMILES representation. These original SMILES representations were further transformed to canonicalize SMILES by the built-in attribute of rdkit package to ensure the bijective mapping between the molecular representation and the molecular structure.
Step 2: Convert the rate constant into half-life
calculation_new.py input: slduge_Original_raw.tsv slduge_Leo_raw.tsv slduge_Rich_raw.tsv output: sludge_raw_bay3_check.tsv
The purpose of this Python script is to combine various sources of the Eawag-Sludge package. Since some of the packages only have either rate constants or half-lives, we can utilize the reaction order formula to convert kinetic data between them. The default reaction type assumed is a first-order reaction. In addition to this, the script also incorporates biomass for each half-life. Finally, the relevant kinetic data is collected and subjected to log calculations, including:
Step 3: Get the data distribution of the calculated target variables and decide the reasonable one.
calculate_target_variables.py input: sludge_raw_bay2.tsv output: sludge_biomass_bay2.tsv
Markdown the part of the calculation of BI mean.
This script aim to understand the data distribution of the target variables.
Step 4: find the prior distribution of the target variable for subsequent calculation of Bayesian Inference mean (BI mean)
Chowag.ipynb input: sludge_biomass_bay2.tsv output: (None) mean of biomass_hl_log_gmean std of biomass_hl_log_gmean mean of biomass_hl_log_std std of biomass_hl_log_std
This step use the Visual Studio Code console to visualize the data distribution of each type of target variables.
First group by the "reduced_smiles", "biomass_hl_log_std", "biomass_hl_log_median", "biomass_hl_log_gmean", then calculate the mean and std of the target variables which would be set up as the parameter of Bayesian.py.
The experimental half-life values could locate outside the range of the limit of quantification (LOQ). Removing these data points is not a wise decision in the data scarcity case since reducing the data set limits the ML model performance. Bayesian Inference can address the problem of data points beyond the LOQ and estimate their true mean value and true standard deviation, so the data volume will not be compressed and the values that lie beyond the LOQ can benefit to the model as well.
The goal of Markov Chain Monte Carlo (MCMC) is to construct a Markov chain that converges to a stationary distribution, which approximates the posterior distribution of interest in Bayesian inference.
Step 5: Use the MCMC to get the prosterior distribution and the BI mean of the target variable
calculate_target_variables.py input: sludge_biomass_bay2.tsv output: sludge_bay_PriorMuStd_bay2.tsv
Markup the part of the calculation of BI mean as we want to calculate the mean of Bayesian Inference of the target variable.
Calculate the Bayesian Inference mean of the target variables such as loh_hl_biomass_corrected.
bayesian.set_prior_mu(mean=-0.08, std=2.0) mean = mean of biomass_hl_log_gmean std = std of biomass_hl_log_gmean
bayesian.set_prior_sigma(mean=0.23, std=1.0) mean = mean of biomass_hl_log_std std = std of biomass_hl_log_std
Chowag.ipynb input: sludge_bay_PriorMuStd_bay2.tsv output: sludge_bay_PriorMuStd_bay3.tsv
Group by the following columns: "hl_log_std", "hl_log_median", "hl_log_gmean", "hl_log_bayesian_mean","biomass_hl_log_std", "biomass_hl_log_median", "biomass_hl_log_gmean", "biomass_hl_log_bayesian_mean".
Step 7: Try Machine Learning with multiple feature representations (molecular descriptor, molecular fingerprint, enviPath biotransformation rules)
combined_learning.py input: soil_model_input_all_full.tsv sludge_bay_PriorMuStd_bay3. soil_bayesian_merge_padel_reduced.tsv output: (RMSE and R2 figures.)
This section take not only include the target variables but also the multiple combinations of the molecular descriptor or molecular fingerprint, such as:
- PaDEL (Calculated by OPERA model, using the set of SMILES as the input.)
- MACCS (Calculated by Kunyang)
- btrules (Calculated by Jasmin)
- PaDEL + MACCS
- PaDEL + btrules
- MACCS + btrules
- PaDEL + MACCS + btrules
These descriptor or fingerprint were preprocess and merge together before the model training
In the beginning of the script, there are 7 global variables that can be set up. For example, when modleing on the sludge dataset, we can set the COMBINED_LEARNING to False. ITERATIONS can set to 20 or even more. DESCRIPTORS can use the above mentioned 7 name, this global variables simply deliever the string to the title of the output figures and the file name. DENDROGRAM set to False as we don't need to plot the dendrogram everytime during the random forest feature selection. CLUSTER_THRESHOLD were set to 0.001 or 0.01 to remove the collinearity among the features. TOPS represent the number of selected top features for subsequent modelling.
I extend my heartfelt thanks to Kunyang Zhang for his unwavering help on teaching me enviPath API and calculation of MACCS fingerprint during my research, offering valuable insights and suggestions that significantly improved the quality of my work.
The Eawag-Soil dataset, Bayesian.py and the calculation of enviPath biotransformation rules (btrules) were supported by Jasmin Hafner.