Page Menu
Home
c4science
Search
Configure Global Search
Log In
Files
F74437695
convert_cisbp_to_meme.md
No One
Temporary
Actions
Download File
Edit File
Delete File
View Transforms
Subscribe
Mute Notifications
Award Token
Subscribers
None
File Metadata
Details
File Info
Storage
Attached
Created
Sat, Jul 27, 20:16
Size
12 KB
Mime Type
text/x-python
Expires
Mon, Jul 29, 20:16 (2 d)
Engine
blob
Format
Raw Data
Handle
19394002
Attached To
R10595 cisbp-to-meme
convert_cisbp_to_meme.md
View Options
---
jupyter
:
jupytext
:
formats
:
ipynb
,
md
text_representation
:
extension
:
.
md
format_name
:
markdown
format_version
:
'
1.2
'
jupytext_version
:
1.3.3
kernelspec
:
display_name
:
Python
3
language
:
python
name
:
python3
---
#
Formatting
Cis
-
BP
probability
weight
matrices
to
MEME
##
Goal
:
Prototype
and
convert
probability
weight
matrices
(
PWMs
)
from
Cis
-
BP
to
MEME
format
,
for
use
with
FIMO
(
to
annotate
gene
and
TE
promoters
)
at
a
later
stage
.
Documentation
for
the
MEME
format
can
be
found
[
here
](
http
:
//meme-suite.org/doc/meme-format.html).
##
Data
provenance
:
The
complete
"Homo Sapiens"
species
archive
was
downloaded
from
<
http
:
//cisbp.ccbr.utoronto.ca/bulk.php>, on 2020.06.15.
The
zipped
directory
was
exported
to
the
newly
created
directory
`cisbp_homo_sapiens`
.
##
New
fields
required
in
the
MEME
PWM
format
For
MEME
formatted
PWMs
,
find
the
documentation
[
here
](
http
:
//meme-suite.org/doc/meme-format.html).
###
Version
number
The
version
number
of
the
MEME
suite
version
we
are
using
.
We
are
using
MEME
version
5.1.1
,
therefore
the
field
shoud
be
:
`MEME version 5.1.1`
###
Alphabet
It
is
possible
to
specify
a
custom
alaphabet
(
for
example
with
special
characters
other
than
ACGT
for
DNA
),
but
we
don
'
t
need
it
there
.
We
use
standard
DNA
alphabet
.
`ALPHABET= ACGT`
###
Background
frequencies
It
is
possible
to
specify
the
background
frequency
at
two
stages
:
either
in
the
MEME
PWMs
or
in
a
separate
Markov
Background
Model
Format
text
file
,
where
only
the
0
-
order
is
taken
into
account
.
We
will
use
that
option
and
set
uniform
background
frequencies
of
0.25
per
nucleotide
.
-
[
documentation
for
Markov
Background
Model
Format
](
http
:
//meme-suite.org/doc/bfile-format.html)
-
[
fasta
-
get
-
markov
](
http
:
//meme-suite.org/doc/fasta-get-markov.html) generates Markov Background Model Format of any order.
To
set
uniform
background
frequencies
:
`A 0.25 C 0.25 G 0.25 T 0.25`
###
Motif
name
line
The
unique
motif
identifier
,
we
take
it
from
the
cisbp
PMW
filename
.
If
the
PWM
is
named
`motif123.txt`
,
then
the
Motif
name
line
contains
:
`MOTIF motif123`
###
Motif
letter
probability
matrix
lines
Rows
are
positions
in
the
motif
,
columns
are
letters
in
the
alphabet
(
here
the
DNA
,
in
the
order
specified
in
the
alphabet
)
We
do
not
define
any
optional
value
.
Therefore
,
we
only
need
to
declare
the
matrix
after
the
line
:
`letter-probability matrix:`
...
matrix
goes
here
##
Prototyping
the
conversion
on
a
single
PWM
###
Getting
all
the
information
from
a
Cis
-
BP
-
formatted
PWM
Parsing
the
PWM
matrix
text
file
.
We
sanity
check
the
order
of
the
columns
to
be
`A C G T`
to
avoid
bad
surprises
.
```
python
import
os
#
relative
paths
project_rootdir
=
os
.
path
.
join
(*
os
.
getcwd
().
split
(
os
.
sep
)[:-
1
])
project_rootdir
=
'/'
+
project_rootdir
#
there
must
bet
a
cleaner
way
to
add
the
first
slash
...
pwms_cisbp_path
=
os
.
path
.
join
(
project_rootdir
,
'
cisbp_homo_sapiens
/
pwms_all_motifs
/
'
)
#
Random
test
motif
:
HOCOMOMCO
FOXA2
motif
cisbp_motif_filename
=
'
M09104_2
.00
.
txt
'
#
removing
the
extension
(
and
the
path
)
from
the
filename
#
code
from
https
:
//stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python
from
pathlib
import
Path
motif_name
=
Path
(
cisbp_motif_filename
).
resolve
().
stem
pwm_path
=
os
.
path
.
join
(
pwms_cisbp_path
,
cisbp_motif_filename
)
print
(
motif_name
,
pwm_path
)
```
```
python
import
pandas
as
pd
pwm_df
=
pd
.
read_csv
(
pwm_path
,
header
=
0
,
sep
=
'\t'
)
pwm_df
```
The
matrix
section
of
the
MEME
formatted
PWM
is
the
content
of
this
table
without
the
position
column
(
it
is
implicit
)
and
without
the
header
(
it
is
specified
in
at
the
`ALPHABHET`
line
.
```
python
#
sanity
checking
that
columns
are
always
in
the
same
order
assert
all
(
pwm_df
.
columns
==
pd
.
Index
([
'
Pos
'
,
'A'
,
'C'
,
'G'
,
'T'
])),\
'
Motif
{}:
the
letters
of
the
DNA
alphabet
are
not
in
the
correct
order
[
A
,
C
,
G
,
T
]
'
.
format
(
motif_name
)
```
```
python
#
sanity
checking
that
rowsums
of
the
PWM
sum
up
to
1
(
within
tolerance
)
import
numpy
as
np
assert
all
(
pwm_df
.
drop
(
columns
=
'
Pos
'
).
apply
(
axis
=
1
,
func
=
lambda
x
:
np
.
isclose
(
sum
(
x
),
1
))),
\
'
Motif
{}:
some
rows
in
the
PWM
do
not
sum
up
to
1
'
.
format
(
motif_name
)
```
###
Writing
the
Cis
-
BP
PWM
to
MEME
-
format
```
python
meme_version_number
=
'
5.1.1
'
meme_version_line
=
'
MEME
version
'
+
meme_version_number
alphabet_line
=
'
ALPHABET
=
ACGT
'
strands_line
=
'
strands
:
+
-
'
background_letter_frequencies_lines
=
[
'
Background
letter
frequencies
'
,
'
A
0.25
C
0.25
G
0.25
T
0.25
'
]
motif_line
=
'
MOTIF
'
+
motif_name
letter_probability_line
=
'
letter
-
probability
matrix
:
'
lines_head
=
[
meme_version_line
,
alphabet_line
,
strands_line
]
lines_head
.
extend
(
background_letter_frequencies_lines
)
lines_before_matrix
=
[
motif_line
,
letter_probability_line
]
[
l
for
l
in
lines_before_matrix
]
```
To
write
the
PWM
in
MEME
format
:
-
create
an
output
directory
and
file
for
all
PWMs
-
write
the
`lines_before_matrix`
-
for
each
motif
:
-
write
the
motif
line
-
append
`pmw_df` without index, header and the `Pos`
column
```
python
pwms_meme_output_dir
=
os
.
path
.
join
(
project_rootdir
,
'
pwms_MEME
'
)
import
shutil
if
os
.
path
.
exists
(
pwms_meme_output_dir
):
shutil
.
rmtree
(
pwms_meme_output_dir
,
ignore_errors
=
True
)
os
.
mkdir
(
pwms_meme_output_dir
)
motif_source
=
'
cisbp
'
outfile_path
=
os
.
path
.
join
(
pwms_meme_output_dir
,
'
pwms_
'
+
motif_source
+
'
_meme
.
txt
'
)
#
head
of
the
file
(
only
written
once
for
all
motifs
)
with
open
(
outfile_path
,
'w'
)
as
f
:
[
f
.
write
(
l
+
'\n'
)
for
l
in
lines_head
]
#
specific
to
each
motif
with
open
(
outfile_path
,
'a'
)
as
f
:
[
f
.
write
(
l
+
'\n'
)
for
l
in
lines_before_matrix
]
pwm_df
[[
'A'
,
'C'
,
'G'
,
'T'
]]\
.
round
(
6
)\
.
applymap
(
'
{:
.6
f
}
'
.
format
)\
.
to_csv
(
outfile_path
,
sep
=
'\t'
,
index
=
None
,
header
=
None
,
mode
=
'a'
)
```
##
Testing
the
new
MEME
matrix
with
FIMO
To
make
sure
that
the
matrix
behaves
as
expected
,
we
will
scan
a
mock
fasta
file
(
random
DNA
sequence
+
the
consensus
TF
motif
at
base
7
,
counting
with
1
-
indexing
,
and
at
position
final
base
-
7
-
length
of
the
motif
for
the
reverse
complement
).
The
random
DNA
sequence
was
generated
as
a
fasta
file
[
here
](
https
:
//www.bioinformatics.org/sms2/random_dna.html).
The
expected
match
is
the
following
sequence
(
in
capital
letters
):
CCTTGTTTACTTT
```
python
motif_test_consensus
=
'
CCTTGTTTACTTT
'
motif_test_consensus_rev
=
'
AAAGTAAACAAGG
'
#
deleting
the
fimo
outdir
if
it
already
exists
fimo_outdir
=
os
.
path
.
join
(
project_rootdir
,
'
fimo_out
'
)
if
os
.
path
.
exists
(
fimo_outdir
):
shutil
.
rmtree
(
fimo_outdir
,
ignore_errors
=
True
)
#
is
fimo
installed
?
assert
shutil
.
which
(
'
fimo
'
)
is
not
None
,
'
Error
:
please
install
FIMO
5.1.1
on
this
machine
'
#
Running
a
test
instance
of
FIMO
:
os
.
chdir
(
project_rootdir
)
os
.
system
(
'
fimo
pwms_MEME
/
pwms_cisbp_meme
.
txt
test_fimo
/
test_sequence
.
fa
'
)
os
.
chdir
(
os
.
path
.
join
(
project_rootdir
,
'
notebooks
'
))
```
```
python
fimo_test_out
=
pd
.
read_csv
(
os
.
path
.
join
(
project_rootdir
,
'
fimo_out
/
fimo
.
tsv
'
),
sep
=
'\t'
,
header
=
0
)
fimo_test_out
```
```
python
assert
all
(
fimo_test_out
[
'
matched_sequence
'
].
dropna
()
==
[
'
CCTTGTTTACTTT
'
,
'
CCTTGTTTACTTT
'
]),\
'
Error
:
we
do
not
find
the
expected
number
of
motifs
(
2
motifs
,
one
in
reverse
and
one
in
forward
)
'
```
#
Converting
all
Cis
-
BP
PWMs
to
MEME
format
(
into
a
single
file
)
##
Filtering
on
open
-
source
human
PWMs
with
direct
experimental
evidence
Because
all
PWMs
are
always
included
,
no
matter
the
selection
on
Cis
-
BP
,
we
need
to
filter
out
motifs
:
-
Not
from
human
TFs
-
Not
with
direct
experimental
evidence
-
from
Transfac
```
python
cisbp_metadata
=
pd
.
read_csv
(
os
.
path
.
join
(
project_rootdir
,
'
cisbp_homo_sapiens
/
TF_Information
.
txt
'
),
sep
=
'\t'
,
header
=
0
)
cisbp_metadata
```
```
python
cisbp_metadata
.
columns
```
```
python
cisbp_metadata
=
cisbp_metadata
[(
cisbp_metadata
[
'
TF_Species
'
]==
'
Homo_sapiens
'
)
&
\
(
cisbp_metadata
[
'
TF_Status
'
]==
'D'
)
&
\
(
cisbp_metadata
[
'
Motif_Type
'
]!=
'
Transfac
'
)].
reset_index
()
len
(
cisbp_metadata
)
```
```
python
motifs_selected
=
set
(
cisbp_metadata
[
'
Motif_ID
'
].
unique
())
```
```
python
def
cisbp_to_meme
(
pmws_cisbp_path
,
motifs_selected
,
outfile_path
,
meme_version_number
=
'
5.1.1
'
):
"""
convert cis-BP-formatted position weight matrices text files to a single MEME PWM text file.
pws_cisbp_path: path of directory where all files are cisbp-formatted PWMS (str)
outfile_path: path of target file (str)
meme_version_number: version number (str)
"""
#
header
of
the
MEME
formatted
file
meme_version_line
=
'
MEME
version
'
+
meme_version_number
alphabet_line
=
'
ALPHABET
=
ACGT
'
strands_line
=
'
strands
:
+
-
'
background_letter_frequencies_lines
=
[
'
Background
letter
frequencies
'
,
'
A
0.25
C
0.25
G
0.25
T
0.25
'
]
lines_head
=
[
meme_version_line
,
alphabet_line
,
strands_line
]
lines_head
.
extend
(
background_letter_frequencies_lines
)
with
open
(
outfile_path
,
'w'
)
as
f
:
[
f
.
write
(
l
+
'\n'
)
for
l
in
lines_head
]
cisbp_motif_files
=
[
f
for
f
in
os
.
listdir
(
pmws_cisbp_path
)
if
(
os
.
path
.
isfile
(
os
.
path
.
join
(
pwms_cisbp_path
,
f
))
and
Path
(
f
).
resolve
().
stem
in
motifs_selected
)]
cisbp_pwms_stats
=
pd
.
DataFrame
({
'
motif_filename
'
:
cisbp_motif_files
})
cisbp_pwms_max_prob_errors
=
np
.
zeros
(
len
(
cisbp_motif_files
))
for
idx
,
cisbp_motif_filename
in
enumerate
(
cisbp_motif_files
):
print
(
idx
,
cisbp_motif_filename
)
pwm_path
=
os
.
path
.
join
(
pwms_cisbp_path
,
cisbp_motif_filename
)
pwm_df
=
pd
.
read_csv
(
pwm_path
,
header
=
0
,
sep
=
'\t'
)
motif_name
=
Path
(
cisbp_motif_filename
).
resolve
().
stem
#
sanity
checking
that
columns
are
always
in
the
same
order
assert
all
(
pwm_df
.
columns
==
pd
.
Index
([
'
Pos
'
,
'A'
,
'C'
,
'G'
,
'T'
])),\
'
Motif
{}:
the
letters
of
the
DNA
alphabet
are
not
in
the
correct
order
[
A
,
C
,
G
,
T
]
'
.
format
(
motif_name
)
#
sanity
checking
that
rowsums
of
the
PWM
sum
up
to
1
(
within
tolerance
)
prob_error
=
pwm_df
.
drop
(
columns
=
'
Pos
'
).
apply
(
axis
=
1
,
func
=
lambda
x
:
abs
(
sum
(
x
)-
1
))
cisbp_pwms_max_prob_errors
[
idx
]
=
max
(
prob_error
)
if
not
all
(
np
.
isclose
(
prob_error
,
0
,
1
e
-
3
,
1
e
-
3
)):
print
(
'
Motif
{}
warning
:
some
rows
do
not
sum
up
to
1
within
1
e
-
3
absolute
and
relative
precision
'
.
format
(
motif_name
))
motif_line
=
'
MOTIF
'
+
motif_name
letter_probability_line
=
'
letter
-
probability
matrix
:
'
lines_before_matrix
=
[
motif_line
,
letter_probability_line
]
with
open
(
outfile_path
,
'a'
)
as
f
:
[
f
.
write
(
l
+
'\n'
)
for
l
in
lines_before_matrix
]
pwm_df
[[
'A'
,
'C'
,
'G'
,
'T'
]]\
.
round
(
6
)\
.
applymap
(
'
{:
.6
f
}
'
.
format
)\
.
to_csv
(
outfile_path
,
sep
=
'\t'
,
index
=
None
,
header
=
None
,
mode
=
'a'
)
cisbp_pwms_stats
[
'
max_prob_error
'
]
=
cisbp_pwms_max_prob_errors
return
cisbp_pwms_stats
```
```
python
cisbp_pwms_stats
=
cisbp_to_meme
(
pmws_cisbp_path
=
pwms_cisbp_path
,
motifs_selected
=
motifs_selected
,
outfile_path
=
os
.
path
.
join
(
project_rootdir
,
'
pwms_MEME
/
pwms_cisbp_meme
.
txt
'
))
```
##
Problems
met
when
scaling
up
to
converting
all
PWMs
from
Cis
-
BP
:
The
transfac
files
are
present
in
the
`pwms_all_motifs` directory, but they are empty, therefore they make the conversion crash. Example: `M10930_2.00.txt`
.
This
is
now
taken
care
of
with
an
additional
filtering
step
.
There
are
some
PWMs
where
the
rowsums
do
not
sum
up
to
1
even
within
1
e
-
2
precision
...
Therefore
,
we
keep
track
of
the
maximum
error
(
that
is
the
max
value
between
rowsums
subtracted
from
1
,
the
theoretical
value
of
each
rowsum
).
How
bad
is
the
problem
?
```
python
cisbp_pwms_stats
.
sort_values
(
by
=
'
max_prob_error
'
,
ascending
=
False
).
head
(
10
)
```
The
largest
error
is
0.2
,
for
motif
`M05847_2.00.txt`
for
ZKCAN1
,
that
is
clearly
an
error
due
to
some
glitch
in
the
computing
of
the
motif
PWM
by
the
curators
of
Cis
-
BP
.
The
corresponding
site
in
the
logo
file
is
empty
as
well
.
We
redo
the
analysis
by
excluding
that
motif
.
```
python
motifs_selected
.
remove
(
'
M05847_2
.00
'
)
cisbp_pwms_stats
=
cisbp_to_meme
(
pmws_cisbp_path
=
pwms_cisbp_path
,
motifs_selected
=
motifs_selected
,
outfile_path
=
os
.
path
.
join
(
project_rootdir
,
'
pwms_MEME
/
pwms_cisbp_meme
.
txt
'
))
```
##
Testing
the
integrity
of
the
new
PWM
file
```
python
#
Running
a
test
instance
of
FIMO
:
os
.
chdir
(
project_rootdir
)
os
.
system
(
'
fimo
pwms_MEME
/
pwms_cisbp_meme
.
txt
test_fimo
/
test_sequence
.
fa
'
)
os
.
chdir
(
os
.
path
.
join
(
project_rootdir
,
'
notebooks
'
))
```
```
python
fimo_test_out
=
pd
.
read_csv
(
os
.
path
.
join
(
project_rootdir
,
'
fimo_out
/
fimo
.
tsv
'
),
sep
=
'\t'
,
header
=
0
)
fimo_test_out
```
The
integrity
of
the
matrix
is
fine
.
However
,
we
find
>
1000
hits
within
1000
bp
of
DNA
,
for
about
4000
motifs
.
That
nicely
illustrates
the
curse
of
the
number
of
false
positives
we
find
with
this
method
.
The
linear
regression
step
is
really
crucial
.
```
python
fimo_test_out
[
'
motif_id
'
].
nunique
()
```
Event Timeline
Log In to Comment