Page MenuHomec4science

convert_cisbp_to_meme.md
No OneTemporary

File Metadata

Created
Thu, Mar 28, 11:13

convert_cisbp_to_meme.md

---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.2'
jupytext_version: 1.3.3
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Formatting Cis-BP probability weight matrices to MEME
## Goal:
Prototype and convert probability weight matrices (PWMs) from Cis-BP to MEME format, for use with FIMO (to annotate gene and TE promoters) at a later stage.
Documentation for the MEME format can be found [here](http://meme-suite.org/doc/meme-format.html).
## Data provenance:
The complete "Homo Sapiens" species archive was downloaded from <http://cisbp.ccbr.utoronto.ca/bulk.php>, on 2020.06.15.
The zipped directory was exported to the newly created directory `cisbp_homo_sapiens`.
## New fields required in the MEME PWM format
For MEME formatted PWMs, find the documentation [here](http://meme-suite.org/doc/meme-format.html).
### Version number
The version number of the MEME suite version we are using. We are using MEME version 5.1.1, therefore the field shoud be:
`MEME version 5.1.1`
### Alphabet
It is possible to specify a custom alaphabet (for example with special characters other than ACGT for DNA), but we don't need it there. We use standard DNA alphabet.
`ALPHABET= ACGT`
### Background frequencies
It is possible to specify the background frequency at two stages: either in the MEME PWMs or in a separate Markov Background Model Format text file, where only the 0-order is taken into account. We will use that option and set uniform background frequencies of 0.25 per nucleotide.
- [documentation for Markov Background Model Format](http://meme-suite.org/doc/bfile-format.html)
- [fasta-get-markov](http://meme-suite.org/doc/fasta-get-markov.html) generates Markov Background Model Format of any order.
To set uniform background frequencies:
`A 0.25 C 0.25 G 0.25 T 0.25`
### Motif name line
The unique motif identifier, we take it from the cisbp PMW filename. If the PWM is named `motif123.txt`, then the Motif name line contains:
`MOTIF motif123`
### Motif letter probability matrix lines
Rows are positions in the motif, columns are letters in the alphabet (here the DNA, in the order specified in the alphabet)
We do not define any optional value. Therefore, we only need to declare the matrix after the line:
`letter-probability matrix:`
... matrix goes here
## Prototyping the conversion on a single PWM
### Getting all the information from a Cis-BP-formatted PWM
Parsing the PWM matrix text file.
We sanity check the order of the columns to be `A C G T` to avoid bad surprises.
```python
import os
# relative paths
project_rootdir = os.path.join(*os.getcwd().split(os.sep)[:-1])
project_rootdir = '/'+project_rootdir # there must bet a cleaner way to add the first slash...
pwms_cisbp_path = os.path.join(project_rootdir, 'cisbp_homo_sapiens/pwms_all_motifs/')
# Random test motif: HOCOMOMCO FOXA2 motif
cisbp_motif_filename = 'M09104_2.00.txt'
# removing the extension (and the path) from the filename
# code from https://stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python
from pathlib import Path
motif_name = Path(cisbp_motif_filename).resolve().stem
pwm_path = os.path.join(pwms_cisbp_path, cisbp_motif_filename)
print(motif_name, pwm_path)
```
```python
import pandas as pd
pwm_df = pd.read_csv(pwm_path, header=0, sep='\t')
pwm_df
```
The matrix section of the MEME formatted PWM is the content of this table without the position column (it is implicit) and without the header (it is specified in at the `ALPHABHET` line.
```python
# sanity checking that columns are always in the same order
assert all(pwm_df.columns == pd.Index(['Pos', 'A', 'C', 'G', 'T'])),\
'Motif {}: the letters of the DNA alphabet are not in the correct order [A, C, G, T]'.format(motif_name)
```
```python
# sanity checking that rowsums of the PWM sum up to 1 (within tolerance)
import numpy as np
assert all(pwm_df.drop(columns='Pos').apply(axis=1, func=lambda x: np.isclose(sum(x), 1))), \
'Motif {}: some rows in the PWM do not sum up to 1'.format(motif_name)
```
### Writing the Cis-BP PWM to MEME-format
```python
meme_version_number = '5.1.1'
meme_version_line = 'MEME version ' + meme_version_number
alphabet_line = 'ALPHABET= ACGT'
strands_line = 'strands: + -'
background_letter_frequencies_lines = ['Background letter frequencies', 'A 0.25 C 0.25 G 0.25 T 0.25']
motif_line = 'MOTIF ' + motif_name
letter_probability_line = 'letter-probability matrix:'
lines_head = [meme_version_line, alphabet_line, strands_line]
lines_head.extend(background_letter_frequencies_lines)
lines_before_matrix = [motif_line, letter_probability_line]
[l for l in lines_before_matrix]
```
To write the PWM in MEME format:
- create an output directory and file for all PWMs
- write the `lines_before_matrix`
- for each motif:
- write the motif line
- append `pmw_df` without index, header and the `Pos` column
```python
pwms_meme_output_dir = os.path.join(project_rootdir, 'pwms_MEME')
import shutil
if os.path.exists(pwms_meme_output_dir):
shutil.rmtree(pwms_meme_output_dir, ignore_errors=True)
os.mkdir(pwms_meme_output_dir)
motif_source = 'cisbp'
outfile_path = os.path.join(pwms_meme_output_dir, 'pwms_'+motif_source+'_meme.txt')
# head of the file (only written once for all motifs)
with open(outfile_path, 'w') as f:
[f.write(l+'\n') for l in lines_head]
# specific to each motif
with open(outfile_path, 'a') as f:
[f.write(l+'\n') for l in lines_before_matrix]
pwm_df[['A', 'C', 'G', 'T']]\
.round(6)\
.applymap('{:.6f}'.format)\
.to_csv(outfile_path, sep='\t', index=None, header=None, mode='a')
```
## Testing the new MEME matrix with FIMO
To make sure that the matrix behaves as expected, we will scan a mock fasta file (random DNA sequence + the consensus TF motif at base 7, counting with 1-indexing, and at position final base -7 -length of the motif for the reverse complement).
The random DNA sequence was generated as a fasta file [here](https://www.bioinformatics.org/sms2/random_dna.html).
The expected match is the following sequence (in capital letters): CCTTGTTTACTTT
```python
motif_test_consensus = 'CCTTGTTTACTTT'
motif_test_consensus_rev = 'AAAGTAAACAAGG'
# deleting the fimo outdir if it already exists
fimo_outdir = os.path.join(project_rootdir, 'fimo_out')
if os.path.exists(fimo_outdir):
shutil.rmtree(fimo_outdir, ignore_errors=True)
# is fimo installed ?
assert shutil.which('fimo') is not None, 'Error: please install FIMO 5.1.1 on this machine'
# Running a test instance of FIMO:
os.chdir(project_rootdir)
os.system('fimo pwms_MEME/pwms_cisbp_meme.txt test_fimo/test_sequence.fa')
os.chdir(os.path.join(project_rootdir, 'notebooks'))
```
```python
fimo_test_out = pd.read_csv(os.path.join(project_rootdir, 'fimo_out/fimo.tsv'), sep='\t', header=0)
fimo_test_out
```
```python
assert all(fimo_test_out['matched_sequence'].dropna() == ['CCTTGTTTACTTT', 'CCTTGTTTACTTT']),\
'Error: we do not find the expected number of motifs (2 motifs, one in reverse and one in forward)'
```
# Converting all Cis-BP PWMs to MEME format (into a single file)
## Filtering on open-source human PWMs with direct experimental evidence
Because all PWMs are always included, no matter the selection on Cis-BP, we need to filter out motifs:
- Not from human TFs
- Not with direct experimental evidence
- from Transfac
```python
cisbp_metadata = pd.read_csv(os.path.join(project_rootdir, 'cisbp_homo_sapiens/TF_Information.txt'), sep='\t', header=0)
cisbp_metadata
```
```python
cisbp_metadata.columns
```
```python
cisbp_metadata = cisbp_metadata[(cisbp_metadata['TF_Species']=='Homo_sapiens') & \
(cisbp_metadata['TF_Status']=='D') & \
(cisbp_metadata['Motif_Type']!='Transfac')].reset_index()
len(cisbp_metadata)
```
```python
motifs_selected = set(cisbp_metadata['Motif_ID'].unique())
```
```python
def cisbp_to_meme(pmws_cisbp_path, motifs_selected, outfile_path, meme_version_number='5.1.1'):
"""
convert cis-BP-formatted position weight matrices text files to a single MEME PWM text file.
pws_cisbp_path: path of directory where all files are cisbp-formatted PWMS (str)
outfile_path: path of target file (str)
meme_version_number: version number (str)
"""
# header of the MEME formatted file
meme_version_line = 'MEME version ' + meme_version_number
alphabet_line = 'ALPHABET= ACGT'
strands_line = 'strands: + -'
background_letter_frequencies_lines = ['Background letter frequencies', 'A 0.25 C 0.25 G 0.25 T 0.25']
lines_head = [meme_version_line, alphabet_line, strands_line]
lines_head.extend(background_letter_frequencies_lines)
with open(outfile_path, 'w') as f:
[f.write(l+'\n') for l in lines_head]
cisbp_motif_files = [f for f in os.listdir(pmws_cisbp_path) if (os.path.isfile(os.path.join(pwms_cisbp_path, f)) and Path(f).resolve().stem in motifs_selected)]
cisbp_pwms_stats = pd.DataFrame({'motif_filename':cisbp_motif_files})
cisbp_pwms_max_prob_errors = np.zeros(len(cisbp_motif_files))
for idx, cisbp_motif_filename in enumerate(cisbp_motif_files):
print(idx, cisbp_motif_filename)
pwm_path = os.path.join(pwms_cisbp_path, cisbp_motif_filename)
pwm_df = pd.read_csv(pwm_path, header=0, sep='\t')
motif_name = Path(cisbp_motif_filename).resolve().stem
# sanity checking that columns are always in the same order
assert all(pwm_df.columns == pd.Index(['Pos', 'A', 'C', 'G', 'T'])),\
'Motif {}: the letters of the DNA alphabet are not in the correct order [A, C, G, T]'.format(motif_name)
# sanity checking that rowsums of the PWM sum up to 1 (within tolerance)
prob_error = pwm_df.drop(columns='Pos').apply(axis=1, func=lambda x: abs(sum(x)-1))
cisbp_pwms_max_prob_errors[idx] = max(prob_error)
if not all(np.isclose(prob_error, 0, 1e-3, 1e-3)):
print('Motif {} warning: some rows do not sum up to 1 within 1e-3 absolute and relative precision'.format(motif_name))
motif_line = 'MOTIF ' + motif_name
letter_probability_line = 'letter-probability matrix:'
lines_before_matrix = [motif_line, letter_probability_line]
with open(outfile_path, 'a') as f:
[f.write(l+'\n') for l in lines_before_matrix]
pwm_df[['A', 'C', 'G', 'T']]\
.round(6)\
.applymap('{:.6f}'.format)\
.to_csv(outfile_path, sep='\t', index=None, header=None, mode='a')
cisbp_pwms_stats['max_prob_error'] = cisbp_pwms_max_prob_errors
return cisbp_pwms_stats
```
```python
cisbp_pwms_stats = cisbp_to_meme(pmws_cisbp_path=pwms_cisbp_path, motifs_selected=motifs_selected, outfile_path= os.path.join(project_rootdir, 'pwms_MEME/pwms_cisbp_meme.txt'))
```
## Problems met when scaling up to converting all PWMs from Cis-BP:
The transfac files are present in the `pwms_all_motifs` directory, but they are empty, therefore they make the conversion crash. Example: `M10930_2.00.txt`. This is now taken care of with an additional filtering step.
There are some PWMs where the rowsums do not sum up to 1 even within 1e-2 precision... Therefore, we keep track of the maximum error (that is the max value between rowsums subtracted from 1, the theoretical value of each rowsum).
How bad is the problem ?
```python
cisbp_pwms_stats.sort_values(by='max_prob_error', ascending=False).head(10)
```
The largest error is 0.2, for motif `M05847_2.00.txt` for ZKCAN1, that is clearly an error due to some glitch in the computing of the motif PWM by the curators of Cis-BP. The corresponding site in the logo file is empty as well. We redo the analysis by excluding that motif.
```python
motifs_selected.remove('M05847_2.00')
cisbp_pwms_stats = cisbp_to_meme(pmws_cisbp_path=pwms_cisbp_path, motifs_selected=motifs_selected, outfile_path= os.path.join(project_rootdir, 'pwms_MEME/pwms_cisbp_meme.txt'))
```
## Testing the integrity of the new PWM file
```python
# Running a test instance of FIMO:
os.chdir(project_rootdir)
os.system('fimo pwms_MEME/pwms_cisbp_meme.txt test_fimo/test_sequence.fa')
os.chdir(os.path.join(project_rootdir, 'notebooks'))
```
```python
fimo_test_out = pd.read_csv(os.path.join(project_rootdir, 'fimo_out/fimo.tsv'), sep='\t', header=0)
fimo_test_out
```
The integrity of the matrix is fine.
However, we find > 1000 hits within 1000 bp of DNA, for about 4000 motifs. That nicely illustrates the curse of the number of false positives we find with this method. The linear regression step is really crucial.
```python
fimo_test_out['motif_id'].nunique()
```

Event Timeline