Diffusion MOLEKUEHL (master)

MOLEKUEHL

Project on algorithms for computational chemistry

Recent Commits

Commit	Author	Details	Committed
385c016290e9	matt	Forgot to fix merge issues.	Jan 10 2022
932081c4a349	matt	Merge branch 'master' of https://c4science.ch/diffusion/11301/molekuehl	Jan 10 2022
93eb3e78a418	matt	Added support for vector representations of qm7 database.	Jan 10 2022
7361b4b17ae7	puckvg	seems ok	Jan 7 2022
90e3ba17ab56	puckvg	acm ok?	Jan 7 2022
1c17dba2d764	matt	Changed preprocess.py for old qm7 data to have same structure as the new ones.	Jan 6 2022
1392955adabe	puckvg	work	Jan 6 2022
9d7adea85e44	puckvg	fix CM	Jan 4 2022
16d8a316e929	puckvg	aCM doesnt work	Dec 21 2021
7fdebda55844	puckvg	merge conflict	Dec 21 2021
7459716bca1e	puckvg	aCM seems better but obj value still huge	Dec 21 2021
d00e36abbefe	matt	Added modifiable constant for fragmentation penalty	Dec 21 2021
fa27a77d2a83	matt	Support for 3 new vector representations. Model parameters to avoid getting…	Dec 21 2021
b3236b2c31ff	puckvg	add lots of new reps to try	Dec 20 2021
500a60d21a77	puckvg	updated database and search is ok	Dec 20 2021

README.md

Data
1. Structures

The representation matrices for 3 target structures (to synthesize) and a database of 7165 query structures (to combine to build a target) are compressed in data.npz.

In Python, it can be loaded with:

import numpy as np
data = np.load("data.npz", allow_pickle=True)

Here data.files returns the names of the stored NumPy arrays, which should be target_labels, target_CMs, target_ncharges, database_labels, database_CMs and database_ncharges. The CMs are the representation matrices of the targets and of the database, respectively. Each array can be accessed by its name, for example:

data["target_labels"]

For more details, see the numpy.savez documentation: https://het.as.utexas.edu/HET/Software/Numpy/reference/generated/numpy.savez.html
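As a quick sanity check after loading, it can help to list every stored array together with its shape. The following is only a sketch: summarize_npz is a hypothetical helper that is not part of the repository, and the demo writes a tiny stand-in archive with made-up values so that it runs without data.npz.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array name: shape} for every array stored in an .npz file."""
    # allow_pickle=True is required when an array has dtype=object,
    # e.g. matrices of different sizes stored together in one array.
    with np.load(path, allow_pickle=True) as data:
        return {name: data[name].shape for name in data.files}


# Stand-in archive with made-up values (NOT the real data.npz).
toy_cms = np.empty(2, dtype=object)
toy_cms[0] = np.zeros((3, 3))  # a 3-atom molecule
toy_cms[1] = np.zeros((5, 5))  # a 5-atom molecule

toy_path = os.path.join(tempfile.mkdtemp(), "toy_data.npz")
np.savez(toy_path, target_labels=np.array(["a", "b"]), target_CMs=toy_cms)

summary = summarize_npz(toy_path)
for name, shape in summary.items():
    print(name, shape)
```

For the real file, the same helper would be called as summarize_npz("data.npz").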

Connectivity / functional group information

Adjacency matrices and functional group information derived from the connectivity are compressed in connectivity_data.npz.

In Python, it can be loaded with:

connectivity_data = np.load("connectivity_data.npz")

The corresponding keys are fg_counts_targets for the functional group counts of each of the 3 target molecules, fg_counts_frags for the functional group counts of each of the fragment molecules, frag_adj_matrices for the adjacency matrices of the fragments, and target_adj_matrices for the adjacency matrices of the target molecules. The entries are in the same order as the structures in data.
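Because the two files share the same ordering, entries can be matched purely by position. The sketch below illustrates this with a hypothetical helper, pair_by_index, and made-up stand-in values; it assumes that labels are unique and that fg_counts_frags lines up with database_labels, which should be checked against the real files.

```python
def pair_by_index(labels, fg_counts):
    """Map each molecule label to its functional group counts by position."""
    if len(labels) != len(fg_counts):
        raise ValueError("labels and fg_counts must have the same length")
    return {str(label): counts for label, counts in zip(labels, fg_counts)}


# Made-up stand-ins for data["database_labels"] and
# connectivity_data["fg_counts_frags"]; the real arrays may look different.
toy_labels = ["frag_a", "frag_b"]
toy_fg_counts = [[1, 0, 2], [0, 1, 0]]

paired = pair_by_index(toy_labels, toy_fg_counts)
print(paired["frag_b"])
```

The length check is there to catch the case where a filtered file is accidentally combined with an unfiltered one.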

Optimal databases

Dedicated databases of small molecules are saved for each target, all compressed in the file amons_data.npz.

data = np.load("amons_data.npz")

The loaded data contains the same information as the original data, but specific to each target. Target 0 (qm9) has the arrays:

qm9_amons_labels
qm9_amons_ncharges
qm9_amons_CMs

where the CMs are the representation matrices.

Similarly, target 1 (vitc) has the same arrays with the prefix vitc_, and likewise for vitd. These databases are much smaller than the full database, which makes the search faster.
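Since the array names differ only in their prefix, the three arrays for one target can be collected with a small helper. This is a sketch only: select_amons is a hypothetical name, and a plain dict with made-up values stands in for the object returned by np.load("amons_data.npz").

```python
def select_amons(data, prefix):
    """Collect the three per-target arrays that share a prefix such as 'qm9'."""
    fields = ("labels", "ncharges", "CMs")
    return {field: data[f"{prefix}_amons_{field}"] for field in fields}


# Made-up stand-in for np.load("amons_data.npz"); a plain dict is enough
# here because only key lookup is used.
toy_data = {
    "qm9_amons_labels": ["x"],
    "qm9_amons_ncharges": [[6, 1]],
    "qm9_amons_CMs": [[[0.0, 1.0], [1.0, 0.0]]],
}

qm9 = select_amons(toy_data, "qm9")
print(sorted(qm9))
```

With the real file, select_amons(data, "vitc") would look up vitc_amons_labels, vitc_amons_ncharges and vitc_amons_CMs in the same way.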

Optimal databases and vector data

Rather than representing a molecule by a symmetric matrix in which each row/column index corresponds to an atom, we can represent each atom directly by a vector of fixed length. A molecule then becomes an asymmetric (rectangular) matrix of dimensions N_atoms x V_dim, where V_dim is the length of the per-atom vector, and the representation of an atom is the row at its index. V_dim varies with the atoms present in the target system, but is the same for a target and its database candidates.
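To make the shapes concrete, the sketch below builds two toy N_atoms x V_dim matrices from random numbers and compares their atoms row by row. All values and sizes are made up, and the Euclidean distance is used purely as an illustration; it is not claimed to be the comparison used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)
v_dim = 4  # made-up vector length shared by target and fragments

target_rep = rng.random((5, v_dim))    # toy target with 5 atoms
fragment_rep = rng.random((3, v_dim))  # toy fragment with 3 atoms

# The representation of one atom is one row of the matrix.
atom_vector = target_rep[2]

# Because V_dim matches, any target atom can be compared with any fragment atom.
# distances[i, j] = Euclidean distance between target atom i and fragment atom j.
diff = target_rep[:, None, :] - fragment_rep[None, :, :]
distances = np.linalg.norm(diff, axis=-1)
print(atom_vector.shape, distances.shape)
```

If V_dim differed between the two matrices, the subtraction would fail with a broadcasting error, which is why a consistent V_dim matters.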

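We now have datasets for 4 different asymmetric representations: aCM, SLATM, SOAP and FCHL. The files are named target_repname_data.npz for the targets and amons_repname_data.npz for the fragments, where repname is the name of the representation (for example target_SOAP_data.npz and amons_SOAP_data.npz).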
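The naming pattern can be turned into a small lookup so that switching representation only means changing one string. The helper below is a hypothetical sketch, not code from the repository; the list of names is taken from the sentence above.

```python
KNOWN_REPS = ("aCM", "SLATM", "SOAP", "FCHL")


def rep_filenames(repname):
    """Return (target file, fragment file) for one representation name."""
    if repname not in KNOWN_REPS:
        raise ValueError(f"unknown representation: {repname!r}")
    return f"target_{repname}_data.npz", f"amons_{repname}_data.npz"


target_file, amons_file = rep_filenames("SOAP")
print(target_file, amons_file)
```

The two returned names can then be passed to np.load, one for the targets and one for the fragments.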

.gitignore
.ipynb_checkpoints/
GetCM.ipynb
GetCMAmons.ipynb
GetFCHL.ipynb
GetFCHLAmons.ipynb
GetSLATM.ipynb
GetSLATMAmons.ipynb
GetSOAP.ipynb
GetSOAPAmons.ipynb
GetaCM.ipynb
GetaCMAmons.ipynb
Project.pdf
Project.tex
README.md
__pycache__/
amons-qm9/
amons-vitc/
amons-vitd/
amons_CM_data.npz
amons_FCHL_data.npz
amons_SLATM_data.npz
amons_SOAP_data.npz
amons_aCM_data.npz
amons_vector_data.npz
connectivity_data.npz
connectivity_datafilter.npz
data.npz
datafilter.npz
gurobi.py
matrix.py
namingconvention
onepass.py
preprocess.py
qm7/
qm7_CM_data.npz
target_CM_data.npz
target_FCHL_data.npz
target_SLATM_data.npz
target_SOAP_data.npz
target_aCM_data.npz
target_qm7_data.npz
target_vector_data.npz
targets/
utf8math.sty
xyz2mol.py
xyz2sdf.ipynb
