Project on algorithms for computational chemistry
Recent Commits
Commit | Author | Details | Committed | ||||
---|---|---|---|---|---|---|---|
385c016290e9 | matt | Forgot to fix merge issues. | Jan 10 2022 | ||||
932081c4a349 | matt | Merge branch 'master' of https://c4science.ch/diffusion/11301/molekuehl | Jan 10 2022 | ||||
93eb3e78a418 | matt | Added support for vector representations of qm7 database. | Jan 10 2022 | ||||
7361b4b17ae7 | puckvg | seems ok | Jan 7 2022 | ||||
90e3ba17ab56 | puckvg | acm ok? | Jan 7 2022 | ||||
1c17dba2d764 | matt | Changed preprocess.py for old qm7 data to have same structure as the new ones. | Jan 6 2022 | ||||
1392955adabe | puckvg | work | Jan 6 2022 | ||||
9d7adea85e44 | puckvg | fix CM | Jan 4 2022 | ||||
16d8a316e929 | puckvg | aCM doesnt work | Dec 21 2021 | ||||
7fdebda55844 | puckvg | merge conflict | Dec 21 2021 | ||||
7459716bca1e | puckvg | aCM seems better but obj value still huge | Dec 21 2021 | ||||
d00e36abbefe | matt | Added modifiable constant for fragmentation penalty | Dec 21 2021 | ||||
fa27a77d2a83 | matt | Support for 3 new vector representations. Model parameters to avoid getting… | Dec 21 2021 | ||||
b3236b2c31ff | puckvg | add lots of new reps to try | Dec 20 2021 | ||||
500a60d21a77 | puckvg | updated database and search is ok | Dec 20 2021 |
README.md
- Data
- Structures
The matrices for 3 target structures (to synthesize) and a database of 7165 query structures (to combine to build the target) are compressed in data.npz
Within python, it can be read like:
data = np.load("data.npz", allow_pickle=True)
where data.files will return the names of the numpy arrays (should be target_labels, target_CMs, target_ncharges, database_labels, database_CMs, database_ncharges) where CMs are the matrices (of target and database respectively) and the corresponding arrays can be accessed like:
data["target_labels"]
For more details see the documentation: https://het.as.utexas.edu/HET/Software/Numpy/reference/generated/numpy.savez.html
Connectivity / functional group information
Adjacency matrices and functional group information derived from the connectivity are compressed in connectivity_data.npz.
Within python, it can be read like:
connectivity_data = np.load("connectivity_data.npz")
the corresponding keys are fg_counts_targets for the functional group counts of each of the 3 target molecules,fg_counts_frags for the functional group counts of each of the fragment molecules, frag_adj_matrices for the adjacency matrices of the fragments and target_adj_matrices for the adjacency matrices of the target molecules. The order is the same as those in data containing the structures.
Optimal databases
Dedicated databases of small molecules are saved for each target, all compressed in the file amons_data.npz.
data = np.load("amons_data.npz")
contains the same information as in the original data, but now specific to each target. Target 0 (qm9) has the data:
qm9_amons_labels qm9_amons_ncharges qm9_amons_CMs
where the CMs are the representation matrices.
Similarly, target 1 (vitc) has the same data with the prefix vitc_. Same for vitd. These databases are much smaller, making the search faster.
Optimal databases and vector data
Rather than using symmetric matrices to represent our molecules where each row/column index represents an atom index, we can directly use a vector of the same length for each atom index. In other words, we have an asymmetric matrix of dimensions N_atoms x V_dim where V_dim is the length of the vector. We can access the representation for each atom as the appropriate index of the asymmetric matrix. V_dim will vary based on the atoms present in the target system, but will be consistent between the target and database candidates.
Now we have datasets for 4 different asymmetric representations: aCM, SLATM, SOAP and FCHL, all named like target_repname_data.npz for the target and amons_repname_data.npz for the fragments.