convert_cisbp_to_meme.md
No OneTemporary
Actions

Subscribers

None

File Metadata

Created: Sat, Jul 27, 20:16

convert_cisbp_to_meme.md
View Options

	---
	jupyter:
	jupytext:
	formats: ipynb,md
	text_representation:
	extension: .md
	format_name: markdown
	format_version: '1.2'
	jupytext_version: 1.3.3
	kernelspec:
	display_name: Python 3
	language: python
	name: python3
	---

	# Formatting Cis-BP probability weight matrices to MEME

	## Goal:

	Prototype and convert probability weight matrices (PWMs) from Cis-BP to MEME format, for use with FIMO (to annotate gene and TE promoters) at a later stage.

	Documentation for the MEME format can be found [here](http://meme-suite.org/doc/meme-format.html).

	## Data provenance:

	The complete "Homo Sapiens" species archive was downloaded from <http://cisbp.ccbr.utoronto.ca/bulk.php>, on 2020.06.15.
	The zipped directory was exported to the newly created directory `cisbp_homo_sapiens`.

	## New fields required in the MEME PWM format

	For MEME formatted PWMs, find the documentation [here](http://meme-suite.org/doc/meme-format.html).

	### Version number

	The version number of the MEME suite version we are using. We are using MEME version 5.1.1, therefore the field shoud be:
	`MEME version 5.1.1`

	### Alphabet

	It is possible to specify a custom alaphabet (for example with special characters other than ACGT for DNA), but we don't need it there. We use standard DNA alphabet.
	`ALPHABET= ACGT`

	### Background frequencies
	It is possible to specify the background frequency at two stages: either in the MEME PWMs or in a separate Markov Background Model Format text file, where only the 0-order is taken into account. We will use that option and set uniform background frequencies of 0.25 per nucleotide.

	- [documentation for Markov Background Model Format](http://meme-suite.org/doc/bfile-format.html)
	- [fasta-get-markov](http://meme-suite.org/doc/fasta-get-markov.html) generates Markov Background Model Format of any order.

	To set uniform background frequencies:
	`A 0.25 C 0.25 G 0.25 T 0.25`

	### Motif name line

	The unique motif identifier, we take it from the cisbp PMW filename. If the PWM is named `motif123.txt`, then the Motif name line contains:
	`MOTIF motif123`

	### Motif letter probability matrix lines

	Rows are positions in the motif, columns are letters in the alphabet (here the DNA, in the order specified in the alphabet)
	We do not define any optional value. Therefore, we only need to declare the matrix after the line:
	`letter-probability matrix:`
	... matrix goes here

	## Prototyping the conversion on a single PWM

	### Getting all the information from a Cis-BP-formatted PWM
	Parsing the PWM matrix text file.
	We sanity check the order of the columns to be `A C G T` to avoid bad surprises.
	```python
	import os

	# relative paths
	project_rootdir = os.path.join(*os.getcwd().split(os.sep)[:-1])
	project_rootdir = '/'+project_rootdir # there must bet a cleaner way to add the first slash...

	pwms_cisbp_path = os.path.join(project_rootdir, 'cisbp_homo_sapiens/pwms_all_motifs/')

	# Random test motif: HOCOMOMCO FOXA2 motif
	cisbp_motif_filename = 'M09104_2.00.txt'

	# removing the extension (and the path) from the filename
	# code from https://stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python
	from pathlib import Path
	motif_name = Path(cisbp_motif_filename).resolve().stem

	pwm_path = os.path.join(pwms_cisbp_path, cisbp_motif_filename)
	print(motif_name, pwm_path)
	```

	```python
	import pandas as pd
	pwm_df = pd.read_csv(pwm_path, header=0, sep='\t')
	pwm_df
	```
	The matrix section of the MEME formatted PWM is the content of this table without the position column (it is implicit) and without the header (it is specified in at the `ALPHABHET` line.
	```python
	# sanity checking that columns are always in the same order
	assert all(pwm_df.columns == pd.Index(['Pos', 'A', 'C', 'G', 'T'])),\
	'Motif {}: the letters of the DNA alphabet are not in the correct order [A, C, G, T]'.format(motif_name)
	```

	```python
	# sanity checking that rowsums of the PWM sum up to 1 (within tolerance)
	import numpy as np

	assert all(pwm_df.drop(columns='Pos').apply(axis=1, func=lambda x: np.isclose(sum(x), 1))), \
	'Motif {}: some rows in the PWM do not sum up to 1'.format(motif_name)
	```
	### Writing the Cis-BP PWM to MEME-format

	```python
	meme_version_number = '5.1.1'
	meme_version_line = 'MEME version ' + meme_version_number

	alphabet_line = 'ALPHABET= ACGT'
	strands_line = 'strands: + -'
	background_letter_frequencies_lines = ['Background letter frequencies', 'A 0.25 C 0.25 G 0.25 T 0.25']

	motif_line = 'MOTIF ' + motif_name
	letter_probability_line = 'letter-probability matrix:'

	lines_head = [meme_version_line, alphabet_line, strands_line]
	lines_head.extend(background_letter_frequencies_lines)
	lines_before_matrix = [motif_line, letter_probability_line]

	[l for l in lines_before_matrix]
	```
	To write the PWM in MEME format:
	- create an output directory and file for all PWMs
	- write the `lines_before_matrix`
	- for each motif:
	- write the motif line
	- append `pmw_df` without index, header and the `Pos` column

	```python
	pwms_meme_output_dir = os.path.join(project_rootdir, 'pwms_MEME')
	import shutil
	if os.path.exists(pwms_meme_output_dir):
	shutil.rmtree(pwms_meme_output_dir, ignore_errors=True)
	os.mkdir(pwms_meme_output_dir)
	motif_source = 'cisbp'
	outfile_path = os.path.join(pwms_meme_output_dir, 'pwms_'+motif_source+'_meme.txt')

	# head of the file (only written once for all motifs)
	with open(outfile_path, 'w') as f:
	[f.write(l+'\n') for l in lines_head]

	# specific to each motif
	with open(outfile_path, 'a') as f:
	[f.write(l+'\n') for l in lines_before_matrix]

	pwm_df[['A', 'C', 'G', 'T']]\
	.round(6)\
	.applymap('{:.6f}'.format)\
	.to_csv(outfile_path, sep='\t', index=None, header=None, mode='a')
	```

	## Testing the new MEME matrix with FIMO

	To make sure that the matrix behaves as expected, we will scan a mock fasta file (random DNA sequence + the consensus TF motif at base 7, counting with 1-indexing, and at position final base -7 -length of the motif for the reverse complement).

	The random DNA sequence was generated as a fasta file [here](https://www.bioinformatics.org/sms2/random_dna.html).

	The expected match is the following sequence (in capital letters): CCTTGTTTACTTT


	```python
	motif_test_consensus = 'CCTTGTTTACTTT'
	motif_test_consensus_rev = 'AAAGTAAACAAGG'

	# deleting the fimo outdir if it already exists
	fimo_outdir = os.path.join(project_rootdir, 'fimo_out')
	if os.path.exists(fimo_outdir):
	shutil.rmtree(fimo_outdir, ignore_errors=True)

	# is fimo installed ?
	assert shutil.which('fimo') is not None, 'Error: please install FIMO 5.1.1 on this machine'

	# Running a test instance of FIMO:
	os.chdir(project_rootdir)
	os.system('fimo pwms_MEME/pwms_cisbp_meme.txt test_fimo/test_sequence.fa')
	os.chdir(os.path.join(project_rootdir, 'notebooks'))
	```

	```python
	fimo_test_out = pd.read_csv(os.path.join(project_rootdir, 'fimo_out/fimo.tsv'), sep='\t', header=0)
	fimo_test_out
	```

	```python
	assert all(fimo_test_out['matched_sequence'].dropna() == ['CCTTGTTTACTTT', 'CCTTGTTTACTTT']),\
	'Error: we do not find the expected number of motifs (2 motifs, one in reverse and one in forward)'
	```

	# Converting all Cis-BP PWMs to MEME format (into a single file)
	## Filtering on open-source human PWMs with direct experimental evidence
	Because all PWMs are always included, no matter the selection on Cis-BP, we need to filter out motifs:
	- Not from human TFs
	- Not with direct experimental evidence
	- from Transfac

	```python
	cisbp_metadata = pd.read_csv(os.path.join(project_rootdir, 'cisbp_homo_sapiens/TF_Information.txt'), sep='\t', header=0)
	cisbp_metadata
	```

	```python
	cisbp_metadata.columns
	```

	```python
	cisbp_metadata = cisbp_metadata[(cisbp_metadata['TF_Species']=='Homo_sapiens') & \
	(cisbp_metadata['TF_Status']=='D') & \
	(cisbp_metadata['Motif_Type']!='Transfac')].reset_index()
	len(cisbp_metadata)
	```

	```python
	motifs_selected = set(cisbp_metadata['Motif_ID'].unique())
	```

	```python
	def cisbp_to_meme(pmws_cisbp_path, motifs_selected, outfile_path, meme_version_number='5.1.1'):


	"""
	convert cis-BP-formatted position weight matrices text files to a single MEME PWM text file.

	pws_cisbp_path: path of directory where all files are cisbp-formatted PWMS (str)
	outfile_path: path of target file (str)
	meme_version_number: version number (str)
	"""

	# header of the MEME formatted file
	meme_version_line = 'MEME version ' + meme_version_number
	alphabet_line = 'ALPHABET= ACGT'
	strands_line = 'strands: + -'
	background_letter_frequencies_lines = ['Background letter frequencies', 'A 0.25 C 0.25 G 0.25 T 0.25']
	lines_head = [meme_version_line, alphabet_line, strands_line]
	lines_head.extend(background_letter_frequencies_lines)
	with open(outfile_path, 'w') as f:
	[f.write(l+'\n') for l in lines_head]



	cisbp_motif_files = [f for f in os.listdir(pmws_cisbp_path) if (os.path.isfile(os.path.join(pwms_cisbp_path, f)) and Path(f).resolve().stem in motifs_selected)]

	cisbp_pwms_stats = pd.DataFrame({'motif_filename':cisbp_motif_files})
	cisbp_pwms_max_prob_errors = np.zeros(len(cisbp_motif_files))

	for idx, cisbp_motif_filename in enumerate(cisbp_motif_files):
	print(idx, cisbp_motif_filename)
	pwm_path = os.path.join(pwms_cisbp_path, cisbp_motif_filename)
	pwm_df = pd.read_csv(pwm_path, header=0, sep='\t')
	motif_name = Path(cisbp_motif_filename).resolve().stem

	# sanity checking that columns are always in the same order
	assert all(pwm_df.columns == pd.Index(['Pos', 'A', 'C', 'G', 'T'])),\
	'Motif {}: the letters of the DNA alphabet are not in the correct order [A, C, G, T]'.format(motif_name)

	# sanity checking that rowsums of the PWM sum up to 1 (within tolerance)
	prob_error = pwm_df.drop(columns='Pos').apply(axis=1, func=lambda x: abs(sum(x)-1))
	cisbp_pwms_max_prob_errors[idx] = max(prob_error)
	if not all(np.isclose(prob_error, 0, 1e-3, 1e-3)):
	print('Motif {} warning: some rows do not sum up to 1 within 1e-3 absolute and relative precision'.format(motif_name))
	motif_line = 'MOTIF ' + motif_name
	letter_probability_line = 'letter-probability matrix:'
	lines_before_matrix = [motif_line, letter_probability_line]

	with open(outfile_path, 'a') as f:
	[f.write(l+'\n') for l in lines_before_matrix]

	pwm_df[['A', 'C', 'G', 'T']]\
	.round(6)\
	.applymap('{:.6f}'.format)\
	.to_csv(outfile_path, sep='\t', index=None, header=None, mode='a')

	cisbp_pwms_stats['max_prob_error'] = cisbp_pwms_max_prob_errors
	return cisbp_pwms_stats

	```

	```python
	cisbp_pwms_stats = cisbp_to_meme(pmws_cisbp_path=pwms_cisbp_path, motifs_selected=motifs_selected, outfile_path= os.path.join(project_rootdir, 'pwms_MEME/pwms_cisbp_meme.txt'))
	```

	## Problems met when scaling up to converting all PWMs from Cis-BP:
	The transfac files are present in the `pwms_all_motifs` directory, but they are empty, therefore they make the conversion crash. Example: `M10930_2.00.txt`. This is now taken care of with an additional filtering step.

	There are some PWMs where the rowsums do not sum up to 1 even within 1e-2 precision... Therefore, we keep track of the maximum error (that is the max value between rowsums subtracted from 1, the theoretical value of each rowsum).

	How bad is the problem ?

	```python
	cisbp_pwms_stats.sort_values(by='max_prob_error', ascending=False).head(10)
	```

	The largest error is 0.2, for motif `M05847_2.00.txt` for ZKCAN1, that is clearly an error due to some glitch in the computing of the motif PWM by the curators of Cis-BP. The corresponding site in the logo file is empty as well. We redo the analysis by excluding that motif.

	```python
	motifs_selected.remove('M05847_2.00')
	cisbp_pwms_stats = cisbp_to_meme(pmws_cisbp_path=pwms_cisbp_path, motifs_selected=motifs_selected, outfile_path= os.path.join(project_rootdir, 'pwms_MEME/pwms_cisbp_meme.txt'))
	```

	## Testing the integrity of the new PWM file

	```python
	# Running a test instance of FIMO:
	os.chdir(project_rootdir)
	os.system('fimo pwms_MEME/pwms_cisbp_meme.txt test_fimo/test_sequence.fa')
	os.chdir(os.path.join(project_rootdir, 'notebooks'))
	```

	```python
	fimo_test_out = pd.read_csv(os.path.join(project_rootdir, 'fimo_out/fimo.tsv'), sep='\t', header=0)
	fimo_test_out
	```

	The integrity of the matrix is fine.

	However, we find > 1000 hits within 1000 bp of DNA, for about 4000 motifs. That nicely illustrates the curse of the number of false positives we find with this method. The linear regression step is really crucial.

	```python
	fimo_test_out['motif_id'].nunique()
	```

convert_cisbp_to_meme.mdNo OneTemporaryActions

File Metadata

convert_cisbp_to_meme.mdView Options

Event Timeline

convert_cisbp_to_meme.md
No OneTemporary
Actions

convert_cisbp_to_meme.md
View Options