## **Building machine learning models to predict biotransformation half-lives of micropollutants in activated sludge**

### **Key Words:**

#### *Machine Learning, Bayesian Inference, Combined Learning, Biotransformation, Organic Micropollutants*

### **Project Description:**

This project involved curating the Eawag-Sludge package in the enviPath database and using the curated data to develop machine learning models that predict biotransformation half-lives of organic micropollutants in activated sludge. To overcome the limited size of the training data, we combined the kinetics data of the Eawag-Soil package, which contains 895 data points, with the 160 half-lives of the Eawag-Sludge package. To support the theoretical basis of this combined learning approach, we performed a read-across analysis on 27 compounds that are common to both data sets. Our results indicate that the Bayesian mean is more effective in capturing the relationship between the half-lives in sludge and in soil.

### **How to use these scripts?**

#### Step 1: Fetch the data from the enviPath database

`extract_all_compounds.py`

- input: (none)
- output: `sludge_Original_raw.tsv`, `sludge_Leo_raw.tsv`, `sludge_Rich_raw.tsv`

`package_id`: the URL of the Eawag-Sludge package or of the expanded packages created by other contributors. For example:

- Original sludge package: https://envipath.org/package/4a3cd0f4-4d2b-4f00-b3e6-a29e721f7038
- Rich sludge package: https://envipath.org/package/8d3d7ca2-ae4e-4779-a6a2-d3539237c439
- Leo sludge package: https://envipath.org/package/195bc500-f0c6-4bcb-b2fe-f1602b5f20a2

![](https://i.imgur.com/1Uf84kq.png)

The extracted columns include the molecular identifier, kinetics data, biomass-related data, the bioreactor, the addition of nutrients, the oxygen uptake rate, the location, and the WWTP purpose. First, all pathways are extracted with the `get_pathway()` method of the Package class. Each pathway can contain multiple nodes, and every node has its own dictionary that collects distinct scenarios. These scenario objects hold the kinetics data of the compound measured under distinct experimental conditions. The scenario id is one of the arguments of the Scenario class; after instantiating it, the `get_additional_information()` method gives access to objects such as half-life or rate-constant objects, whose `get_values()` and `get_unit()` methods fetch the value and the unit. The node also has a `get_default_structure()` method that provides molecular identifiers such as the molecule id, the molecule name, and the SMILES representation. The original SMILES are then canonicalized with RDKit to ensure a bijective mapping between the molecular representation and the molecular structure:

https://www.rdkit.org/docs/source/rdkit.Chem.MolStandardize.rdMolStandardize.html
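To make the traversal concrete, here is a minimal sketch of the loop described above. It is not the actual `extract_all_compounds.py`: the import paths and method names are taken from the description and the `enviPath-python` client, so treat the exact signatures as assumptions (the real script also re-instantiates each Scenario from its id, which is omitted here).

```python
# Sketch of the Step 1 traversal; method names follow the description above
# and may differ from the installed enviPath-python version.
from enviPath_python.enviPath import enviPath
from rdkit import Chem

eP = enviPath('https://envipath.org')
# package_id: Original sludge package (see the list above)
package = eP.get_package('https://envipath.org/package/4a3cd0f4-4d2b-4f00-b3e6-a29e721f7038')

rows = []
for pathway in package.get_pathway():             # all pathways of the package
    for node in pathway.get_nodes():              # every node of the pathway
        structure = node.get_default_structure()  # molecule id, name, SMILES
        # Canonicalize so that one structure maps to exactly one SMILES string.
        smiles = Chem.MolToSmiles(Chem.MolFromSmiles(structure.get_smiles()))
        for scenario in node.get_scenarios():     # distinct experimental conditions
            for obj in scenario.get_additional_information():  # e.g. half-life objects
                rows.append((smiles, obj.get_values(), obj.get_unit()))
```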
#### Step 2: Convert the rate constant into half-life

`calculation_new.py`

- input: `sludge_Original_raw.tsv`, `sludge_Leo_raw.tsv`, `sludge_Rich_raw.tsv`
- output: `sludge_raw_bay3_check.tsv`

This script combines the various sources of the Eawag-Sludge package. Since some of the packages report only rate constants or only half-lives, the reaction-order formula is used to convert the kinetic data between the two; a first-order reaction is assumed by default. In addition, the script incorporates the biomass for each half-life. Finally, the relevant kinetic data are collected and log-transformed, including:

- `k_combined`
- `k_biomass_corrected`
- `hl_combined`
- `hl_biomass_corrected`
- `log_k_combined`
- `log_k_biomass_corrected`
- `log_hl_combined`
- `log_hl_biomass_corrected`
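The conversion itself follows from the first-order rate law: t<sub>1/2</sub> = ln(2)/k. Below is a minimal sketch of both directions plus one plausible biomass normalization; the exact correction applied in `calculation_new.py` may differ.

```python
import numpy as np

LN2 = float(np.log(2))

def k_to_halflife(k: float) -> float:
    """First-order rate constant (1/d) -> half-life (d): t_1/2 = ln(2) / k."""
    return LN2 / k

def halflife_to_k(hl: float) -> float:
    """Half-life (d) -> first-order rate constant (1/d): k = ln(2) / t_1/2."""
    return LN2 / hl

def biomass_corrected_halflife(hl: float, biomass: float) -> float:
    """Assumed normalization: multiplying the half-life by the biomass
    concentration (g/L) is equivalent to dividing the rate constant by it,
    which makes reactors with different sludge contents comparable."""
    return hl * biomass

# Example: a rate constant of 0.35 1/d measured at a biomass of 2.0 g/L
hl = k_to_halflife(0.35)                      # hl_combined, ~1.98 d
hl_bio = biomass_corrected_halflife(hl, 2.0)  # hl_biomass_corrected
log_hl_bio = np.log10(hl_bio)                 # log_hl_biomass_corrected
```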
#### Step 3: Get the data distribution of the calculated target variables and decide on a reasonable one

`calculate_target_variables.py`

- input: `sludge_raw_bay2.tsv`
- output: `sludge_biomass_bay2.tsv`

Comment out the part that calculates the BI mean; it is only needed in Step 5. This script aims to characterize the data distribution of the target variables.

#### Step 4: Find the prior distribution of the target variable for the subsequent calculation of the Bayesian inference mean (BI mean)

`Chowag.ipynb`

- input: `sludge_biomass_bay2.tsv`
- output: (no file; the notebook reports the mean and std of `biomass_hl_log_gmean` and the mean and std of `biomass_hl_log_std`)

This step uses the Visual Studio Code console to visualize the data distribution of each type of target variable. First group by `reduced_smiles` to obtain the per-compound `biomass_hl_log_std`, `biomass_hl_log_median`, and `biomass_hl_log_gmean`; then calculate the mean and std of these target variables, which are set as the prior parameters of `Bayesian.py`.

![](https://i.imgur.com/ArMhfpa.png)

Experimental half-life values can lie outside the range of the limit of quantification (LOQ). Removing these data points is not a wise decision in a data-scarce setting, since shrinking the data set limits the performance of the ML model. Bayesian inference can handle data points beyond the LOQ and estimate their true mean and true standard deviation, so the data volume is not reduced and the values that lie beyond the LOQ can benefit the model as well.

![](https://i.imgur.com/H75DmZQ.png)

The goal of Markov chain Monte Carlo (MCMC) is to construct a Markov chain that converges to a stationary distribution, which approximates the posterior distribution of interest in Bayesian inference (a toy sampler illustrating this idea is sketched at the end of this README).

#### Step 5: Use MCMC to get the posterior distribution and the BI mean of the target variable

`calculate_target_variables.py`

- input: `sludge_biomass_bay2.tsv`
- output: `sludge_bay_PriorMuStd_bay2.tsv`

Uncomment the part that calculates the BI mean, since we now want the Bayesian inference mean of the target variables, such as `log_hl_biomass_corrected`. The priors are set from the Step 4 statistics:

```python
# mean and std of biomass_hl_log_gmean (from Step 4)
bayesian.set_prior_mu(mean=-0.08, std=2.0)
# mean and std of biomass_hl_log_std (from Step 4)
bayesian.set_prior_sigma(mean=0.23, std=1.0)
```

#### Step 6: Group the calculated target variables

`Chowag.ipynb`

- input: `sludge_bay_PriorMuStd_bay2.tsv`
- output: `sludge_bay_PriorMuStd_bay3.tsv`

Group by the following columns: `hl_log_std`, `hl_log_median`, `hl_log_gmean`, `hl_log_bayesian_mean`, `biomass_hl_log_std`, `biomass_hl_log_median`, `biomass_hl_log_gmean`, `biomass_hl_log_bayesian_mean`.

#### Step 7: Try machine learning with multiple feature representations (molecular descriptors, molecular fingerprints, enviPath biotransformation rules)

`combined_learning.py`

- input: `soil_model_input_all_full.tsv`, `sludge_bay_PriorMuStd_bay3.tsv`, `soil_bayesian_merge_padel_reduced.tsv`
- output: RMSE and R² figures

This script takes as input not only the target variables but also multiple combinations of molecular descriptors and fingerprints, such as:

- PaDEL (calculated by the OPERA model, using the set of SMILES as input)
- MACCS (calculated by Kunyang)
- btrules (calculated by Jasmin)
- PaDEL + MACCS
- PaDEL + btrules
- MACCS + btrules
- PaDEL + MACCS + btrules

These descriptors and fingerprints are preprocessed and merged before model training (a sketch of the feature-selection and evaluation loop is given at the end of this README).

At the beginning of the script there are seven global variables that can be set, for example:

- `COMBINED_LEARNING`: set to `False` when modeling on the sludge data set alone.
- `ITERATIONS`: can be set to 20 or more.
- `DESCRIPTORS`: one of the seven feature-set names listed above; this variable simply passes the string into the titles of the output figures and the file names.
- `DENDROGRAM`: set to `False`, since the dendrogram does not need to be plotted on every run of the random forest feature selection.
- `CLUSTER_THRESHOLD`: set to 0.001 or 0.01 to remove collinearity among the features.
- `TOPS`: the number of selected top features used for subsequent modelling.

### **Acknowledgement:**

I extend my heartfelt thanks to Kunyang Zhang for his unwavering help in teaching me the enviPath API and the calculation of the MACCS fingerprint during my research, and for offering valuable insights and suggestions that significantly improved the quality of my work. The Eawag-Soil dataset, `Bayesian.py`, and the calculation of the enviPath biotransformation rules (btrules) were provided by Jasmin Hafner.
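As referenced in Steps 4 and 5, the following toy random-walk Metropolis sampler shows the idea behind the BI mean. It is a self-contained illustration, not `Bayesian.py`: the observations are made up, the likelihood is a plain normal on the log half-lives (the real script additionally handles values beyond the LOQ), and the priors reuse the Step 4 numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy log10 half-lives for one compound (made-up values).
obs = np.array([0.1, 0.3, -0.2])

PRIOR_MU = (-0.08, 2.0)     # mean/std of biomass_hl_log_gmean (Step 4)
PRIOR_SIGMA = (0.23, 1.0)   # mean/std of biomass_hl_log_std (Step 4)

def log_posterior(mu, sigma):
    """Log of prior * likelihood, up to an additive constant."""
    if sigma <= 0:
        return -np.inf
    lp = -0.5 * ((mu - PRIOR_MU[0]) / PRIOR_MU[1]) ** 2             # prior on mu
    lp += -0.5 * ((sigma - PRIOR_SIGMA[0]) / PRIOR_SIGMA[1]) ** 2   # prior on sigma
    lp += np.sum(-0.5 * ((obs - mu) / sigma) ** 2 - np.log(sigma))  # likelihood
    return lp

# Random-walk Metropolis: the accepted states form a Markov chain whose
# stationary distribution is the posterior.
mu, sigma = 0.0, 1.0
lp = log_posterior(mu, sigma)
chain = []
for _ in range(20000):
    mu_new = mu + 0.1 * rng.normal()
    sigma_new = sigma + 0.1 * rng.normal()
    lp_new = log_posterior(mu_new, sigma_new)
    if np.log(rng.random()) < lp_new - lp:   # accept with prob min(1, ratio)
        mu, sigma, lp = mu_new, sigma_new, lp_new
    chain.append(mu)

# Discard burn-in; the mean of the chain approximates the BI mean of mu.
print("BI mean of mu:", np.mean(chain[5000:]))
```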
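And the feature-selection and evaluation loop referenced in Step 7, again as a sketch under assumptions: random data stands in for the merged descriptor table, the clustering criterion mirrors the `CLUSTER_THRESHOLD`/`DENDROGRAM` description, and the exact logic of `combined_learning.py` may differ.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

CLUSTER_THRESHOLD = 0.01   # collinearity cut-off, as in the description
TOPS = 20                  # number of top features to keep

def decorrelate(X: pd.DataFrame, threshold: float) -> list:
    """Hierarchically cluster features on 1 - |Spearman correlation| and keep
    one representative per cluster (this linkage is what DENDROGRAM plots)."""
    dist = 1 - X.corr(method="spearman").abs().values
    np.fill_diagonal(dist, 0)
    linkage = hierarchy.ward(squareform(dist, checks=False))
    labels = hierarchy.fcluster(linkage, t=threshold, criterion="distance")
    pos = pd.Series(range(X.shape[1])).groupby(labels).first()
    return [X.columns[i] for i in pos]

# Random stand-ins for the descriptor table and the target variable.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(160, 50)), columns=[f"d{i}" for i in range(50)])
y = rng.normal(size=160)   # e.g. log_hl_biomass_corrected

cols = decorrelate(X, CLUSTER_THRESHOLD)
X_tr, X_te, y_tr, y_te = train_test_split(X[cols], y, random_state=0)

# Rank features by random forest importance, keep the TOPS best, and refit.
rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
top = pd.Series(rf.feature_importances_, index=cols).nlargest(TOPS).index
rf = RandomForestRegressor(random_state=0).fit(X_tr[top], y_tr)

pred = rf.predict(X_te[top])
print(f"RMSE = {mean_squared_error(y_te, pred) ** 0.5:.2f}, "
      f"R2 = {r2_score(y_te, pred):.2f}")
```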