# Building machine learning models to predict biotransformation half-lives of micropollutants in activated sludge
- Key words: *Machine Learning, Bayesian Inference, Combined Learning, Biotransformation, Organic Micropollutants*
This project involved curating the Eawag-Sludge package in the enviPath database and using the curated data to develop machine learning models that predict biotransformation half-lives of organic micropollutants in activated sludge.

To overcome the limited size of the training data, we combined the kinetics data of the Eawag-Soil package, which contains 895 data points, with the 160 half-lives of the Eawag-Sludge package. To support the theoretical basis for this combined learning approach, we performed a read-across analysis on 27 compounds common to both data sets. Our results indicate that the Bayesian mean is more effective at capturing the relationship between half-lives in sludge and in soil.
## How to use these scripts?
## Step 1: Fetch the data from the enviPath database
extract_all_compounds.py
- input: (none)
- output: sludge_Original_raw.tsv, sludge_Leo_raw.tsv, sludge_Rich_raw.tsv
package_id: the URL of the Eawag-Sludge package or of expanded packages created by other contributors, for example:
- Original sludge package:
- Rich sludge package:
- Leo sludge package: https://envipath.org/package/195bc500-f0c6-4bcb-b2fe-f1602b5f20a2
The following columns were extracted from enviPath: the molecular identifier, kinetics data, biomass-related data, bioreactor, the addition of nutrients, oxygen uptake rate, location, and WWTP purpose. First, all the pathways were extracted with the get_pathway() method of the Package class. Second, each pathway can have multiple nodes, and every node has its own dictionary that collects distinct scenarios. These scenario objects include the kinetics data of the compound measured under distinct experimental conditions. The id of the scenario is one of the arguments of the Scenario class. After instantiating the Scenario class, the get_additional_information() method gives access to multiple objects such as half-life objects or rate constant objects, and the get_values() and get_unit() methods finally fetch the value and unit of each object.
The node also has a get_default_structure() method, which provides molecular identifiers such as the molecular id, the molecular name, and the SMILES representation. These original SMILES representations were further converted to canonical SMILES with the built-in function of the rdkit package to ensure a bijective mapping between the molecular representation and the molecular structure.
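The extraction flow described above can be sketched as pseudocode. Method names are taken from the description above; the exact signatures of the enviPath Python client differ in places, so treat this as a sketch rather than runnable code:

```
# Pseudocode sketch of the Step 1 extraction loop
package = Package(package_id)              # e.g. the Eawag-Sludge package URL
for pathway in package.get_pathway():
    for node in pathway.nodes:
        smiles = node.get_default_structure()          # molecular identifier
        for scenario_id in node.scenarios:
            scenario = Scenario(scenario_id)
            for obj in scenario.get_additional_information():
                # half-life or rate constant objects
                value, unit = obj.get_values(), obj.get_unit()
                write_row(smiles, value, unit)         # -> sludge_*_raw.tsv
```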
## Step 2: Convert the rate constant into half-life
calculation_new.py
- input: sludge_Original_raw.tsv, sludge_Leo_raw.tsv, sludge_Rich_raw.tsv
- output: sludge_raw_bay3_check.tsv
The purpose of this Python script is to combine the various sources of the Eawag-Sludge package. Since some packages contain only rate constants or only half-lives, the reaction-order formula is used to convert kinetic data between the two; a first-order reaction is assumed by default. The script also applies the biomass correction to each half-life. Finally, the relevant kinetic data are collected and log-transformed.
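The first-order conversion used here follows from integrated rate law C(t) = C0 · exp(−k·t), so t₁/₂ = ln(2)/k. A minimal sketch of the conversion (units assumed to be 1/day and days; the actual script also handles unit parsing):

```python
import math

def rate_constant_to_half_life(k):
    """Convert a first-order rate constant k (assumed 1/day) to a half-life
    in days, using t_1/2 = ln(2) / k."""
    if k <= 0:
        raise ValueError("rate constant must be positive")
    return math.log(2) / k

def half_life_to_rate_constant(t_half):
    """Inverse conversion: k = ln(2) / t_1/2."""
    if t_half <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2) / t_half

# Example: k = 0.1 / day gives a half-life of about 6.93 days.
print(rate_constant_to_half_life(0.1))
```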
## Step 3: Get the data distribution of the calculated target variables and decide which is reasonable
calculate_target_variables.py
- input: sludge_raw_bay2.tsv
- output: sludge_biomass_bay2.tsv
Comment out the part that calculates the BI mean at this step. This script aims to understand the data distribution of the target variables.
## Step 4: Find the prior distribution of the target variable for the subsequent calculation of the Bayesian Inference mean (BI mean)
Chowag.ipynb
- input: sludge_biomass_bay2.tsv
- output: (none); mean of biomass_hl_log_gmean, std of biomass_hl_log_gmean, mean of biomass_hl_log_std, std of biomass_hl_log_std
This step uses the Visual Studio Code console to visualize the data distribution of each type of target variable.
First, group by "reduced_smiles" to obtain "biomass_hl_log_std", "biomass_hl_log_median", and "biomass_hl_log_gmean", then calculate the mean and std of these target variables, which are set as the parameters of Bayesian.py.
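The per-compound summaries can be illustrated with a small stdlib-only sketch. The half-life values and SMILES below are made up for illustration; the actual scripts work on the TSV columns named above:

```python
import math
import statistics

# Replicate half-lives (days) per compound; values are illustrative only.
half_lives = {
    "CCO": [1.2, 1.8, 0.9],
    "c1ccccc1": [10.0, 14.0],
}

def log_summaries(values):
    """Return (log10 geometric mean, std of log10 values) for one compound.
    The mean of the log10 values equals the log10 of the geometric mean."""
    logs = [math.log10(v) for v in values]
    gmean_log = statistics.mean(logs)
    std_log = statistics.stdev(logs) if len(logs) > 1 else 0.0
    return gmean_log, std_log

per_compound = {smiles: log_summaries(v) for smiles, v in half_lives.items()}

# Prior parameters for Bayesian.py: mean across compounds of each summary.
prior_mu_mean = statistics.mean(g for g, _ in per_compound.values())
prior_sigma_mean = statistics.mean(s for _, s in per_compound.values())
print(prior_mu_mean, prior_sigma_mean)
```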
Experimental half-life values can lie outside the limit of quantification (LOQ). Removing these data points is not a wise decision when data are scarce, since reducing the data set limits the ML model's performance. Bayesian inference can address data points beyond the LOQ and estimate their true mean and true standard deviation, so the data volume is not compressed and the values beyond the LOQ can also benefit the model.
The goal of Markov Chain Monte Carlo (MCMC) is to construct a Markov chain that converges to a stationary distribution, which approximates the posterior distribution of interest in Bayesian inference.
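The project's Bayesian.py handles the actual sampling; the core MCMC idea can be illustrated with a minimal Metropolis sampler on a toy target (a standard normal "posterior"; the step size and burn-in below are arbitrary choices for the demo):

```python
import math
import random

def metropolis(log_post, n_samples, x0=0.0, step=1.0, seed=0):
    """Minimal random-walk Metropolis sampler: propose x' ~ N(x, step),
    accept with probability min(1, p(x') / p(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Compare in log space for numerical stability.
        if math.log(rng.random()) < log_post(proposal) - log_post(x):
            x = proposal
        samples.append(x)
    return samples

# Toy target: standard normal, log p(x) = -x^2 / 2 (up to a constant).
chain = metropolis(lambda x: -0.5 * x * x, n_samples=5000)
posterior_mean = sum(chain[1000:]) / len(chain[1000:])  # discard burn-in
```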
## Step 5: Use MCMC to get the posterior distribution and the BI mean of the target variable
calculate_target_variables.py
- input: sludge_biomass_bay2.tsv
- output: sludge_bay_PriorMuStd_bay2.tsv
Uncomment the part that calculates the BI mean, since at this step we want to calculate the Bayesian Inference mean of the target variable. Calculate the Bayesian Inference mean of target variables such as log_hl_biomass_corrected.
- bayesian.set_prior_mu(mean=-0.08, std=2.0), where mean = mean of biomass_hl_log_gmean and std = std of biomass_hl_log_gmean
- bayesian.set_prior_sigma(mean=0.23, std=1.0), where mean = mean of biomass_hl_log_std and std = std of biomass_hl_log_std
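Bayesian.py samples the full posterior with MCMC; when sigma is treated as known, the BI mean has a closed form (the normal-normal conjugate update) that illustrates how the prior and the data are weighted. The data values below are illustrative log half-lives, not project data:

```python
def posterior_mean_mu(prior_mu, prior_std, data, sigma):
    """Posterior mean of mu for a Normal likelihood with known sigma and a
    Normal(prior_mu, prior_std) prior on mu: a precision-weighted average
    of the prior mean and the sample mean."""
    n = len(data)
    xbar = sum(data) / n
    prec_prior = 1.0 / prior_std ** 2
    prec_data = n / sigma ** 2
    return (prior_mu * prec_prior + xbar * prec_data) / (prec_prior + prec_data)

# Priors from Step 4; the data points are illustrative log half-lives
# for a single compound.
bi_mean = posterior_mean_mu(prior_mu=-0.08, prior_std=2.0,
                            data=[0.3, 0.5, 0.1], sigma=0.23)
```

With a wide prior (large prior_std) the BI mean approaches the sample mean; with few or noisy observations it shrinks toward the prior mean.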
## Step 6: Group the target variables

Chowag.ipynb
- input: sludge_bay_PriorMuStd_bay2.tsv
- output: sludge_bay_PriorMuStd_bay3.tsv
Group by the following columns: "hl_log_std", "hl_log_median", "hl_log_gmean", "hl_log_bayesian_mean","biomass_hl_log_std", "biomass_hl_log_median", "biomass_hl_log_gmean", "biomass_hl_log_bayesian_mean".
## Step 7: Try machine learning with multiple feature representations (molecular descriptor, molecular fingerprint, enviPath biotransformation rules)
combined_learning.py
- input: soil_model_input_all_full.tsv, sludge_bay_PriorMuStd_bay3., soil_bayesian_merge_padel_reduced.tsv
- output: RMSE and R2 figures
This section takes as input not only the target variables but also multiple combinations of molecular descriptors and fingerprints, such as:
- PaDEL (Calculated by OPERA model, using the set of SMILES as the input.)
- MACCS (Calculated by Kunyang)
- btrules (Calculated by Jasmin)
- PaDEL + MACCS
- PaDEL + btrules
- MACCS + btrules
- PaDEL + MACCS + btrules
These descriptors and fingerprints were preprocessed and merged before model training.
At the beginning of the script, there are 7 global variables that can be set. For example, when modeling on the sludge dataset alone, set COMBINED_LEARNING to False. ITERATIONS can be set to 20 or more. DESCRIPTORS can take any of the 7 names listed above; this global variable simply passes the string to the title of the output figures and to the file name. DENDROGRAM is set to False, since we do not need to plot the dendrogram every time during random forest feature selection. CLUSTER_THRESHOLD is set to 0.001 or 0.01 to remove collinearity among the features. TOPS represents the number of selected top features for subsequent modelling.
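A possible configuration block matching the description above (only the names and the values mentioned in the text are from the source; the remaining defaults are illustrative assumptions):

```python
# Global configuration for combined_learning.py (illustrative values).
COMBINED_LEARNING = False      # False: train on the sludge dataset only
ITERATIONS = 20                # number of modelling repetitions
DESCRIPTORS = "PaDEL+MACCS"    # label used in figure titles and file names
DENDROGRAM = False             # skip the dendrogram plot in feature selection
CLUSTER_THRESHOLD = 0.01       # collinearity threshold (0.001 or 0.01)
TOPS = 20                      # number of top features kept (assumed value)
```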
## Acknowledgements

I extend my heartfelt thanks to Kunyang Zhang for his unwavering help in teaching me the enviPath API and the calculation of MACCS fingerprints during my research, and for offering valuable insights and suggestions that significantly improved the quality of my work.

The Eawag-Soil dataset, Bayesian.py, and the calculation of enviPath biotransformation rules (btrules) were contributed by Jasmin Hafner.