CHILI¶

Research Data Management Bootcamp¶

Wednesday December 7th, 2016¶

Karine Delvert, Aude Dieudé, Jan Krause, Nathalie Lambeng¶

datamanagementplan@epfl.ch

Part 1¶

1.1 - Introduction to RDM¶

Definition, context and best practices¶

Introduction: video,

Definition : Research data¶

The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations.

For the Organisation for Economic Co-operation and Development (OECD), research data are defined as factual records (numbers, texts, images and sounds) which are used as primary sources for scientific research and which are commonly accepted by the scientific community as necessary to validate research results.

Key elements to take into consideration during research data management are the legal, ethical and political aspects, which depend on the sensitivity of the data.

Research Data Lifecycle¶

Source: Formation URFIST, Rennes, 2016

1.2 - Actors and Skills¶

Actors¶

Actors

Skills¶

Requirements regarding research data management¶

Publishers¶

Many publishers and scientific journals require, under specific conditions, the publication of the data used to obtain the research results (permanent archiving, standardized formats, etc). This is the case, for instance, with PLoS and Nature Publishing Group. An overview of the editorial policies is available online on the Dryad website.

Funders¶

Examples of funders which require DMPs or equivalent:

Funding agency and DMP : Horizon 2020¶

Horizon 2020 is the biggest research and innovation funding programme of the European Commission, with nearly €80 billion of funding available over 7 years, from 2014 to 2020. Its main objective is to promote and support excellence in the scientific field.

For some research projects, Horizon 2020 makes the preparation of a data management plan mandatory in order to receive research funding.

As of 2017, the Commission will make open research data the default option, while ensuring opt-outs, for all new projects of the Horizon 2020 program.

Funding agency and DMP : SNSF¶

Research data management policy to be established in 2017
Submission of data management plans with the grant application

1.3 - Data Management Plan¶

Definition : Data Management Plan¶

Data Management Plan (DMP) refers to the strategies put into place to create, store, share, maintain, archive and preserve research data throughout their life cycle.

The DMP describes which data are going to be produced and how each type of data will be organized, classified, archived, shared, distributed, secured and preserved.

Here is a video, which illustrates how the DMP works concretely:

Why publish in a data archive?¶

Accelerate science and careers

Many studies show there are significant advantages for articles that share their code or data.

Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149

Avoid bias in science¶

FDA Turner

Machine learning needs¶

Barend Mons : Field

Machine learning is a promising discipline, but it requires access to data. Datamining is not a viable solution.

Source: Barend Mons, IDCC, Amsterdam 2016.

1.4 - DMP best practices¶

Best practices examples: DMPonline (UK)¶

http://dmponline.dcc.ac.uk

Best practices examples: EPFL (Switzerland)¶

To provide guidance in preparing a DMP, the EPFL-ETHZ checklist includes four categories to cover questions related to:

Research Data Acquisition : type, quantity, license, etc.
Research Data Format : format, metadata, identification, etc.
Research Data Sharing : embargo, intellectual property, etc.
Data Preservation : storage, sensitivity of the data, archiving, etc.

Part 2 - CHILI Specific Topics¶

Ethics, legal aspects, anonymization
Reproducibility
Collaborative coding and writing
Computational workflows
(Meta)data formats
Publication and long term preservation
Data visualization

2.1 - Ethics¶

2.1.1. When human beings are involved...¶

Human Beings

Ethics issues arise in many areas of research, in particular research involving the voluntary participation of research subjects and the collection of data that might be considered personal.

For instance, in such cases: "...collection of personal data, interviews, observations, questionnaires, recordings, tracking or the secondary use of information provided for other purposes, e.g. social media sites, other research projects etc."

You must protect your volunteers, yourself and your fellow researchers.

Do you work with personal, sensitive data?¶

Does your research practice involve collecting, processing and storing information on persons?
- ... identifiable persons?
- ... vulnerable persons?
- ... children?
What data do you typically use (collect, process, store) in the course of a research project?
Among these data, which ones are personal? Sensitive?
Do you need to identify the subject/person?
How do you inform persons/subjects on what you will be doing?
- The information should be simple, understandable and in a language adapted to their age.
- See the form on Research Office Ethics Assessment.
If disclosed, do the data you collect lead to the dissemination of personal information?
Do subjects/persons sometimes ask you for their performance, or for the data you collected about them?
How and to whom will the data be disseminated?

If you answered yes to any of the above questions, ethical and legal issues apply.

You should check the Research Office Checklist: Research Office Ethics Assessment.

Human Research Ethics Committee at EPFL (HREC)¶

When human beings are involved, the Human Research Ethics Committee at EPFL (HREC) should be consulted.

"The role of the HREC is to review any research project carried out at EPFL involving non-invasive human research from an ethical point of view, before the beginning of the project."

Sources:

H2020 Programme Guidance: How to complete your ethics self-assessment, 12 July 2016, page 1.
http://research-office.epfl.ch/op/edit/page-117394.html

2.1.2. Data? What data? Personal data? Sensitive data?¶

personal data (data)

all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)
examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these.

sensitive personal data

According to the Swiss FADP (article 3 c.), data on:

religious, ideological, political or trade union-related views or activities,
health, the intimate sphere or the racial origin,
social security measures,
administrative or criminal proceedings and sanctions.

If you work with personal or sensitive data, you should check the Research Office Checklist: Research Office Ethics Assessment, especially the checklists (login with Gaspar).

2.1.3. Doing what with data?¶

Personal or sensitive data processing¶

Swiss Federal Act on Data Protection (FADP) (or Loi sur la Protection des Données, LPD), article 3 e.: any operation with personal data, irrespective of the means applied and the procedure, and in particular:

the collection,
storage,
use,
revision,
disclosure,
archiving
or destruction

of data.

Personal data collection and processing implies compliance with the law on privacy and data protection. Notably:

processing must be carried out in good faith,
data may be processed only for the purpose indicated at the time of collection [...],
consent must be given expressly in the case of processing of sensitive personal data or personality profiles,
data must be accurate (and corrected or destroyed if required) (FADP article 5).

Correcting¶

Anyone who processes personal data must make certain that it is correct. He must take all reasonable measures to ensure that data that is incorrect or incomplete in view of the purpose of its collection is either corrected or destroyed.
Any data subject may request that incorrect data be corrected.

Right to information¶

Any person may request information from the controller of a data file as to whether data concerning them is being processed (Swiss FADP, article 8). The controller must notify the data subject:

of all available data concerning the subject (...),
including the available information on the source of the data (...) as well as the categories of the personal data processed, the other parties involved with the file and the data recipient.
(...) The information must normally be provided in writing, in the form of a printout or a photocopy, and is free of charge.

2.1.4. Protecting and disclosing personal data¶

Protection¶

Personal data must be protected against unauthorised processing through adequate technical and organisational measures (Swiss FADP, article 7).

Technical measures: notably, it is forbidden to store personal data in countries whose legislation is not compatible with Swiss law, such as the US.

This excludes the usage of many clouds: Dropbox, Google Drive, Microsoft Azure, Amazon S3...

Disclosure

Making personal data accessible, for example:

by permitting access,
transmission
or publication.

(Swiss FADP article 3 f.)

Cross-border disclosure

Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection.

(FADP Art. 6)

Anonymisation¶

Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if:

the data is rendered anonymous, as soon as the purpose of the processing permits;
the recipient only discloses the data with the consent of the federal body and
the results are published in such a manner that the data subjects may not be identified.

Recapitulation

Authorisations and information to provide, and DMP
Collect consent
Inform participants: sample information sheet
Authorisations

References

Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014) (235.1).
Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 31).
- As of 2018: Regulation (EU) 2016/679 repealing Directive 95/46/EC
H2020 Programme Guidance: How to complete your ethics self-assessment, 12.7.2016
Information
- http://research-office.epfl.ch/ethique-recherche/research-ethics-assessment/ethical-review/personal-data

2.2 - Anonymization methods¶

Privacy protection methods, either :

removing,
generalizing or
encrypting,

personal information from datasets.

There is more to this than these three methods (Privacy-Preserving Data Mining Methods / Charu Aggarwal and Philip Yu. 2008.):

k-anonymity¶

Definition¶

"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release" (Source).

Illustration¶

Example including removal and generalization (same source):

Name	Age	Gender	State of domicile	Religion	Disease
Ramsha	29	Female	Tamil Nadu	Hindu	Cancer
Yadu	24	Female	Kerala	Hindu	Viral infection
Salima	28	Female	Tamil Nadu	Muslim	TB
sunny	27	Male	Karnataka	Parsi	No illness
Joan	24	Female	Kerala	Christian	Heart-related
Bahuksana	23	Male	Karnataka	Buddhist	TB
Rambha	19	Male	Kerala	Hindu	Cancer
Kishor	29	Male	Karnataka	Hindu	Heart-related
Johnson	17	Male	Kerala	Christian	Heart-related
John	19	Male	Kerala	Christian	Viral infection

After anonymization (name and religion were removed, age was generalized):

Name	Age	Gender	State of domicile	Religion	Disease
*	20 < Age ≤ 30	Female	Tamil Nadu	*	Cancer
*	20 < Age ≤ 30	Female	Kerala	*	Viral infection
*	20 < Age ≤ 30	Female	Tamil Nadu	*	TB
*	20 < Age ≤ 30	Male	Karnataka	*	No illness
*	20 < Age ≤ 30	Female	Kerala	*	Heart-related
*	20 < Age ≤ 30	Male	Karnataka	*	TB
*	Age ≤ 20	Male	Kerala	*	Cancer
*	20 < Age ≤ 30	Male	Karnataka	*	Heart-related
*	Age ≤ 20	Male	Kerala	*	Heart-related
*	Age ≤ 20	Male	Kerala	*	Viral infection

This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.
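The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are copied from the table (names and religion left out), and the helper names `generalize_age` and `k_anonymity` are hypothetical, not part of any anonymization library.

```python
from collections import Counter

# (age, gender, state of domicile, disease) rows copied from the table above
ROWS = [
    (29, "Female", "Tamil Nadu", "Cancer"),
    (24, "Female", "Kerala", "Viral infection"),
    (28, "Female", "Tamil Nadu", "TB"),
    (27, "Male", "Karnataka", "No illness"),
    (24, "Female", "Kerala", "Heart-related"),
    (23, "Male", "Karnataka", "TB"),
    (19, "Male", "Kerala", "Cancer"),
    (29, "Male", "Karnataka", "Heart-related"),
    (17, "Male", "Kerala", "Heart-related"),
    (19, "Male", "Kerala", "Viral infection"),
]


def generalize_age(age):
    """Replace an exact age by the band used in the anonymized table."""
    return "Age <= 20" if age <= 20 else "20 < Age <= 30"


def k_anonymity(rows, quasi_identifier):
    """Size of the smallest group of rows sharing the same quasi-identifier."""
    groups = Counter(quasi_identifier(row) for row in rows)
    return min(groups.values())


generalized = [(generalize_age(a), g, s, d) for a, g, s, d in ROWS]

# Quasi-identifier = (age, gender, state): the first three fields of each row
k_raw = k_anonymity(ROWS, lambda row: row[:3])
k_generalized = k_anonymity(generalized, lambda row: row[:3])
print(k_raw, k_generalized)
```

With exact ages some individuals are unique (k = 1); after generalization the smallest group contains two rows, which is the 2-anonymity stated above.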

l-diversity - motivation¶

An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:

homogeneity attacks: when all rows of a group share the same sensitive value,
background knowledge attacks: when knowledge about a field reduces the set of possible sensitive values (e.g. knowing that heart attacks are not frequent in Japanese patients) (source).

Imagine the following group, or equivalence class, extracted from the whole dataset (table adapted from the one above):

Name	Age	Gender	State of domicile	Religion	Disease
*	20 < Age ≤ 30	Female	Tamil Nadu	*	AIDS
*	20 < Age ≤ 30	Female	Tamil Nadu	*	AIDS
*	20 < Age ≤ 30	Female	Tamil Nadu	*	AIDS

If it is known that Miss Smith was part of the study, is aged between 20 and 30 and lives in Tamil Nadu, then it is certain that she has AIDS, even though we have 3-anonymity.

l-diversity - definition¶

The l-diversity Principle : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.

There are several definitions of "well-represented" (source).
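As a minimal sketch, assuming the simplest reading of "well-represented" (counting distinct sensitive values, often called distinct l-diversity), the principle could be checked as follows. The function name and the `mixed` example class are hypothetical illustrations.

```python
from collections import defaultdict


def distinct_l_diversity(rows, quasi_identifier, sensitive):
    """Smallest number of distinct sensitive values in any equivalence class."""
    classes = defaultdict(set)
    for row in rows:
        classes[quasi_identifier(row)].add(sensitive(row))
    return min(len(values) for values in classes.values())


# The homogeneous equivalence class of the motivating example (3-anonymous)
homogeneous = [("20 < Age <= 30", "Female", "Tamil Nadu", "AIDS")] * 3

# A hypothetical class with the same quasi-identifier but varied diseases
mixed = [
    ("20 < Age <= 30", "Female", "Tamil Nadu", "AIDS"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "TB"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "Cancer"),
]

l_homogeneous = distinct_l_diversity(homogeneous, lambda r: r[:3], lambda r: r[3])
l_mixed = distinct_l_diversity(mixed, lambda r: r[:3], lambda r: r[3])
print(l_homogeneous, l_mixed)
```

Both classes are 3-anonymous, but only the second one hides the disease of a given person.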

By the way, l-diversity has weaknesses too, which is why t-closeness was invented.

t-closeness - motivation¶

The l-diversity requirement ensures "diversity" of sensitive values in each group, but it does not recognize that values may be semantically close. For example, an attacker could deduce that an individual has a stomach disease if the group containing that individual only lists three different stomach diseases (adapted from source).

t-closeness - definition¶

The t-closeness Principle: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness (source).
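The definition can be illustrated with a short Python sketch. It rests on an assumption that should be stated loudly: the distance used here is the variational distance (half the sum of absolute differences), a simple choice for categorical attributes, whereas other distances are possible. The function names and the small table are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def variational_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def t_closeness(rows, quasi_identifier, sensitive):
    """Largest distance between a class distribution and the table distribution."""
    overall = distribution(sensitive(row) for row in rows)
    classes = defaultdict(list)
    for row in rows:
        classes[quasi_identifier(row)].append(sensitive(row))
    return max(
        variational_distance(distribution(values), overall)
        for values in classes.values()
    )


# Hypothetical table: (equivalence class, sensitive value)
table = [("A", "flu"), ("A", "flu"), ("B", "flu"), ("B", "cancer")]
t = t_closeness(table, lambda r: r[0], lambda r: r[1])
print(t)
```

The table satisfies t-closeness for any threshold greater than or equal to the value printed.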

differential privacy¶

Anonymized data can be re-identified by linking it with another database: linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts.

Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1 (source).
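The linkage described above can be sketched in Python. All records below are invented for illustration, and `link_records` is a hypothetical helper: it joins two tables on the retained quasi-identifiers and reports the records that match exactly one named person.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("birthdate", "sex", "zip")

# Hypothetical "anonymized" medical records: names removed, quasi-identifiers kept
medical_records = [
    {"birthdate": "1950-03-12", "sex": "M", "zip": "02138", "diagnosis": "diagnosis A"},
    {"birthdate": "1972-11-02", "sex": "F", "zip": "02139", "diagnosis": "diagnosis B"},
    {"birthdate": "1985-06-30", "sex": "F", "zip": "02140", "diagnosis": "diagnosis C"},
]

# Hypothetical public register with names and the same quasi-identifiers
public_register = [
    {"name": "Person 1", "birthdate": "1950-03-12", "sex": "M", "zip": "02138"},
    {"name": "Person 2", "birthdate": "1972-11-02", "sex": "F", "zip": "02139"},
    {"name": "Person 3", "birthdate": "1972-11-02", "sex": "F", "zip": "02139"},
]


def link_records(records, register, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for records matching exactly one person."""
    names_by_key = defaultdict(list)
    for person in register:
        names_by_key[tuple(person[k] for k in keys)].append(person["name"])
    reidentified = []
    for record in records:
        candidates = names_by_key.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            reidentified.append((candidates[0], record["diagnosis"]))
    return reidentified


reidentified = link_records(medical_records, public_register)
print(reidentified)
```

Only the first record is re-identified: the second matches two people and the third matches nobody, which is why generalizing quasi-identifiers reduces this risk.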

Anonymization - theory and tools¶

Statistical Disclosure Control / Hundepool, & al. 2012.

Ebook provided by the EPFL library.

Tools

sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation (R package)
ARX Data Anonymization Tool (Java: library &GUI)
μ-ARGUS (Java, GUI)

2.3 - Reproducibility¶

According to a 2012 comment in Nature, 47 out of 53 landmark preclinical cancer studies could not be reproduced (1).

A previous study showed in 2009 that 16 out of 18 bioinformatics papers could not be reproduced entirely (2).

In 2004, it was found that less than 9% of papers share their code (3).

(1) Begley, C. G.; Ellis, L. M. (2012). "Drug development: Raise standards for preclinical cancer research". Nature 483 (7391): 531–533.
(2) Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.
(3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. "Reproducible research in signal processing." Signal Processing Magazine, IEEE 26.3 (2009): 37-47

[Slide inspired by https://github.com/saloot/IPythonClass, Amir Hessam Salavati & Robin Schiebler, 2015]

A workflow for reproducible research¶

Researchers often start to think about reproducibility at the end of projects. It is sometimes too late: by then numerous versions of code and datasets may be spread over various places (folders, Dropbox, USB drives...).

A practical five-point approach:

document everything
everything is a (text) file
files should be human readable
explicitly tie your files together
have a plan to organize, store and make your files available

Slide inspired by chapter 2 of Reproducible Research with R and RStudio.

More details:

document everything
- reproduction requires documentation of what you did

everything is a text file
- notably: data, code and results
- the simplest formats are the best: CSV / JSON, Markdown / $\LaTeX$, because they are future-proof

files should be human readable
- treat all files as if someone who does not know the project will have to use them
- otherwise they (or you 6 months later) will probably not understand them
- important elements to document:
  - description of what the file is or does (in general, local comments)
  - contributors
  - date of last update

explicitly tie your files together, including generated documents
- locally or using persistent identifiers
- formalize the way data is processed
- generally difficult to trace back (e.g.: how was a specific figure generated?)

have a plan to organize, store and make your files available

2.4 - Collaborative tools¶

Personal/group level: OwnCloud, free software: Mac, Windows, Linux, iOS, Android... Web.
- Your own server: OwnCloud https://owncloud.org/
- Many plugins: contacts, calendar, collaborative writing, image galleries, etc.

Swiss level: SwitchDrive https://drive.switch.ch/
- ownCloud with 25 GB per user,
- Restricted to members of Swiss universities.

A recent fork of ownCloud: NextCloud aims at more transparent development processes.

2.4.2 - Collaborative writing¶

People often need to collaborate at a finer level. More and more.

Source: Prof. Vandergheynst, EPFL Library Noon Talk, 25.8.2016.

In summary

The comment and revision-mode features of word processors are not sufficient for good collaboration.

Google Docs and related tools are not oriented towards scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.

$\Rightarrow$ we need something else!

ShareLaTeX is an alternative to Authorea: collaborative writing based on LaTeX. Suited for LaTeX power users.

Good, but only if all partners are LaTeX users.

Authorea¶

Authorea: collaborative writing, easy to use.

Authorea

Free account to test (limited to 1 private document, no limits on public documents). EPFL licence provided by the Library.

Simple syntax: WYSIWYG and Markdown (lightweight text formatting language). More complex formatting is possible using LaTeX.

Enables others to make comments

Supports interactive documents / figures (Jupyter)

Offline synchronization on personal computer (using the Git version control system)

2.4.3 - Collaborative versioning and branching¶

Git¶

Git is a multi-platform (Windows, Mac, GNU/Linux) version control tool.

Git Servers

GitHub: very popular, some data hosted in the US. Closed repositories are limited (paid, or subject to other conditions).
c4science is the Swiss collaborative development platform. Unlimited number of repositories (open / closed).

Git workflows¶

Git will however not do everything for you.

You need to think up a naming convention (folder structure, file names) e.g.
- PROJECT-Experiment-Researcher(ORCID)-YYYYMMDD.extension
- PROJECT-Experiment-Researcher(ORCID)-Software-Format-YYYYMMDD.extension
- PROJECT-Experiment-Researcher(ORCID)-Software-Version-Format-YYYYMMDD.extension
Set up an appropriate workflow.
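A naming convention is easier to follow when file names are generated rather than typed. The sketch below is one possible reading of the three patterns above; the function `build_filename`, its parameters and the example values are hypothetical.

```python
from datetime import date


def build_filename(project, experiment, researcher, day, extension,
                   software=None, version=None, data_format=None):
    """Join the name parts with '-' and end with the date as YYYYMMDD."""
    parts = [project, experiment, researcher]
    # Optional parts keep the order Software-Version-Format of the convention
    parts += [part for part in (software, version, data_format) if part]
    parts.append(day.strftime("%Y%m%d"))
    return "-".join(parts) + "." + extension


short_name = build_filename("CHILI", "Exp1", "JDoe", date(2016, 12, 7), "csv")
long_name = build_filename("CHILI", "Exp1", "JDoe", date(2016, 12, 7), "csv",
                           software="R", version="3.3", data_format="CSV")
print(short_name)
print(long_name)
```

Note that an ORCID itself contains hyphens, so if it is used as the researcher part, the name can no longer be split back into its parts on "-" alone.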

Locally

Source: J.-L. Falcone.

The easiest way is to use a centralized repository.

Centralized

Source: J.-L. Falcone.

For more complex projects, a project leader can manage the quality.

Centralized

Source: J.-L. Falcone.

For big projects, it is possible to distribute responsibilities.

Centralized

Source: J.-L. Falcone.

Non linear development is supported: branches

Centralized

Source: J.-L. Falcone.

Git and GitHub are not suited for long term preservation¶

Some git commands can delete data (namely: rebase and reset --hard)
Repositories can be deleted (including on GitHub)
A link GitHub $\Rightarrow$ Zenodo can be set, so each release will be automatically made citable through a DOI and preserved in Zenodo.

Guide : Making your code citable

2.4.4 - Jupyter, Jupyterhub, Sagemath¶

Jupyter¶

Interactive Jupyter Notebook documents: try.jupyter.org

Structure:

Rich-hyper-text cells (including tables, $\LaTeX$, images, videos)
Live code cells (with interactive widgets)

Characteristics

Over 50 languages supported : Python, R, Octave, BASH, Matlab, Scala, Java, Haskell...

Can be viewed online using nbviewer (e.g.: http://norvig.com/ipython/Economics.ipynb ).
- Nbviewer is integrated in GitHub and Zenodo

Jupyter Notebooks are JSON files $\rightarrow$ can be tracked with Git.

Nbconvert allows conversion to many formats, including python:
- jupyter nbconvert notebook.ipynb --to python
- jupyter nbconvert notebook.ipynb --to latex
- jupyter nbconvert notebook.ipynb --to markdown
- jupyter nbconvert notebook.ipynb --to slides
- jupyter nbconvert notebook.ipynb --to html

Executing from command line:
- jupyter nbconvert --to notebook --execute mynotebook.ipynb
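Because notebooks are plain JSON files, they can also be inspected without Jupyter. The sketch below builds a tiny notebook by hand and reads it back with the standard `json` module; the notebook content is a hypothetical example, and real notebooks carry more metadata than shown here.

```python
import json

# A minimal hand-written notebook (nbformat 4 layout), serialized as JSON text
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Prime numbers"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["primes = [2, 3, 5, 7, 11, 13]"]},
    ],
})

# Reading it back: cells are a list of dictionaries
notebook = json.loads(notebook_text)
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
code_cells = ["".join(cell["source"])
              for cell in notebook["cells"] if cell["cell_type"] == "code"]
print(cell_types)
print(code_cells)
```

This text-based structure is what allows Git to track changes in a notebook, although stored outputs can make the differences noisy.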

Powerful python libraries¶

Pandas is a powerful library providing high-performance, easy-to-use data structures and data analysis tools. Examples.
Numpy is the fundamental package for scientific computing with Python:
- N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
Matplotlib is a plotting library with great flexibility. It has features comparable to Matlab plotting. Examples.
Seaborn is a statistical plotting library built on Matplotlib that works closely with Pandas (see above). Examples.
NetworkX is suited for complex networks analysis and representation. Examples.
rpy2 is an interface to R running embedded in a Python process.

And web libraries¶

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation.
D3.js is an open source JavaScript library for creating interactive documents based on data. D3 helps bring data to life using HTML, SVG, and CSS. It can be used in Jupyter with matplotlib via mpld3.

Jupyterhub¶

Jupyter Multi Users server (system users, or via GitHub)
Collaborate in a local folder, by mounting a VPSI share, or with Git.

2.4.5 - R, RStudio and RStudio server¶

R¶

R is a free software environment for statistical computing and graphics. One of the best.

Platforms:

wide variety of GNU/Linux and UNIX platforms,
Windows
MacOS

Strength: the diversity of high-quality open extensions (easily installable from CRAN).

RStudio¶

RStudio is a free and open-source integrated development environment (IDE) for R.

R and reproducible research¶

Reproducible research and documents¶

knitr and rmarkdown
tying together results and their presentation in articles (pdf, word), presentations or web sites
notably in $\LaTeX$ (.Rtex) or Markdown (.Rmarkdown)
well integrated in RStudio

Rmarkdown¶

Include R code chunks in markdown:

# Prime numbers
 
 Storing a few prime numbers in a variable:
 
 ```{r}
 primes <- c(2,3,5,7,11,13)
 ```
 Done.

First you need to set up the document properties in YAML:

---
 title: "Rmarkdown example"
 author: "Jan Krause"
 date: "24 novembre 2016"
 output: pdf_document
 ---
 
 # Prime numbers
 
 Storing a few prime numbers in a variable:
 
 ```{r}
 primes <- c(2,3,5,7,11,13)
 ```
 Done.

RStudio Server¶

RStudio in your browser.

2.5 - Computational workflow management¶

Scientific results are often the outcome of complex workflows. Computation operations constitute a graph, which may be difficult to reproduce.

2.5.1 - AiiDA¶

AiiDA, a free software tool developed at EPFL (in materials science): http://www.aiida.net/

2.5.2 - SnakeMake : a simple tool¶

simple : nodes are connected through files (inspired by GNU Make)
complete :
- supports remote files (http(s), sftp, dropbox, googledrive)
- handles data provenance and rule versions,
- parallelization,
- suspend/resume,
- logging,
- creates schema
flexible : the SnakeFile is an extension of Python
http://snakemake.bitbucket.org/

Simple Rule:

rule sort:
     input:
         f = "path/to/dataset.txt"
     output:
         f = "dataset.sorted.txt"
     shell:
         "sort {input.f} > {output.f}"

Simple Rule (two inputs):

# input files must be separated by commas; sort merges and sorts both inputs
rule sort:
    input:
        f1 = "dataset1.txt",
        f2 = "dataset2.txt"
    output:
        f = "dataset.sorted.txt"
    shell:
        "sort {input.f1} {input.f2} > {output.f}"

Simple Rule (here in Python, but R scripts are supported too):

rule sort:
    input:
        a="path/to/dataset.txt"
    output:
        b="dataset.sorted.txt"
    run:
        # lines keep their trailing newline, so they are written back unchanged
        with open(input.a) as src, open(output.b, "w") as out:
            out.writelines(sorted(src))

More than one rule:

# target rule: requesting result.txt triggers the rules below
rule result:
    input:
        "result.txt"

# this rule needs no input file
rule generate_cal_2017:
    output:
        fname = "tmp/cal.txt"
    shell:
        "cal 2017 > {output.fname}"

rule describe:
    input:
        fname1 = "DESCRIPTION.txt",
        fname2 = "tmp/cal.txt"
    output:
        fname = "result.txt"
    shell:
        "cat {input.fname1} {input.fname2} > {output.fname}"

Expand (running rules in parallel):

DATASETS = ["D1", "D2", "D3", "D4", "D5", "D6"]

rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)

rule sort:
    input:
        "{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Output : example Graph

Output : Log (simplified)

output_file	date	rule	version
result.txt	Fri Nov 11 15:48:17 2016	cleanup	3.14
tmp/pre-result.txt	Fri Nov 11 15:48:17 2016	add_head_foot	1.02
tmp/FOOT.txt	Fri Nov 11 15:48:17 2016	generate_foot	5.6
tmp/HEAD.txt	Fri Nov 11 15:48:17 2016	generate_head	5.6
tmp/described_cal.txt	Fri Nov 11 15:48:17 2016	describe	0.1alpha
tmp/cal.txt	Fri Nov 11 15:48:17 2016	genrate_cal	8.234
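To make the Expand example above more concrete, here is a plain-Python sketch of the list of target files that such an expansion produces. It is an emulation written for illustration, not Snakemake's own implementation; `expand_pattern` is a hypothetical name.

```python
from itertools import product


def expand_pattern(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combinations = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combinations]


DATASETS = ["D1", "D2", "D3", "D4", "D5", "D6"]

# Files requested by the "all" rule
targets = expand_pattern("{dataset}.sorted.txt", dataset=DATASETS)

# Input file matched by the "sort" rule for each target
inputs = [target.replace(".sorted.txt", ".txt") for target in targets]
print(targets)
print(inputs)
```

Since each target depends on a different input file, the six sort jobs are independent of each other, which is what allows them to run in parallel.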

More about workflows

Another tool: Taverna which includes the desktop oriented Taverna Workbench, command-line and server applications.

Finally, myExperiment is a platform for sharing scientific workflows; it is fully supported by Taverna.

EPFL platforms

EPFL SV sLIMS
- http://sv-it.epfl.ch/slims
- Gaël Anex, Nicolas Argento, Peter Hliva
- Laboratory information management system

EPFL SCITAS (Victoria Rezzonico)
- High Performance Computing and data Storage

2.6 - Data and Storage¶

2.6.1 - (Meta)data formats¶

Metadata¶

Most common generalist metadata formats: Dublin Core (DCES), Dublin Core (DCMI), DataCite Metadata Schema.

Numerous specialized metadata formats are available for most disciplines. The Research Data Alliance Metadata Directory is a good starting point.

Data format¶

Prefer a

standard format,
open and
widely used

This way your data will not depend upon a particular software (or company), operating system, or platform. And you will be able to:

collaborate with more people (on various platforms)
avoid licensing problems
maximize the reusability in the future

Some open formats to take into account¶

Portable Document Format PDF/A, ISO standard, text [PDF for archiving, no encryption, embedded fonts...]

Text: a simple way to encode data. Can be read by most software.
- CSV tables, can be read by most software, and extended using CSV on the Web (metadata, datatypes, relation...)
- JSON: Simply structured, less bulky than XML, ideal for data exchange.

Geodata
- ISO 19115-1:2014 : the norm.
- GeoJson.org : lighter.

HDF5, more flexible (not text, but structured and indexed, supports arbitrary metadata, good performance).
- Compatible with many tools (Python, R, Matlab, Mathematica...)

Databases:
- SQL: PostgreSQL is relational, open and efficient
- BigData: MongoDB for volume, velocity, and variety

Data formats list¶

Sustainability of Digital Formats, by the US Library of Congress. This list is categorized by datatype (text, audio, image, video, geospatial, dataset, etc.).

2.6.2 - Storage, publication and preservation¶

Data access sustainability¶

A 2014 PLoS ONE study showed that more than 60% of links to datasets are broken after 10 years (1).

Another 2014 PLoS ONE article showed that the bibliography of one in five articles is affected by this phenomenon, known as reference rot (2).

(1) Pepe et al. (2014). How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers. PLoS ONE, 9(8). doi:10.1371/journal.pone.0104798
(2) Klein et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE, 9(12). doi:10.1371/journal.pone.0115253

Digital preservation¶

At CERN, a 2007 study (1,2) showed an error ratio of $10^{-7}$ (over 2 months).

Causes are complex and varied: disk errors, RAID errors, memory errors, etc.

For 1 gigabyte (1000 megabytes, i.e. $10^9$ bytes), we have: $10^9 \cdot 10^{-7} = 10^2 = 100$ bytes of bitrot.
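
The same arithmetic can be sketched in Python. The function name `expected_corrupted_bytes` is hypothetical, and applying the ratio linearly to a larger volume (1 TB below) is an assumption made for illustration, not a result from the CERN study.

```python
from fractions import Fraction

# Error ratio quoted above (observed over 2 months), kept exact as a fraction.
ERROR_RATIO = Fraction(1, 10**7)


def expected_corrupted_bytes(size_bytes, error_ratio=ERROR_RATIO):
    """Expected number of corrupted bytes, assuming a uniform error ratio."""
    return size_bytes * error_ratio


for label, size in [("1 GB", 10**9), ("1 TB", 10**12)]:
    print(label, int(expected_corrupted_bytes(size)), "bytes")
```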

(1) https://indico.cern.ch/event/13797/session/0/contribution/3/attachments/115080/163419/Data_integrity_v3.pdf
(2) http://www.zdnet.com/article/data-corruption-is-worse-than-you-know/
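
Silent corruption of this kind can be detected by storing a checksum alongside the data and recomputing it later. The following is a minimal sketch using SHA-256 from the Python standard library; the sample data and the helper names `sha256_hex` and `is_intact` are hypothetical, and this is one possible approach rather than the method used in the studies cited above.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """Check that `data` still matches the digest recorded earlier."""
    return sha256_hex(data) == expected_digest


# Hypothetical payload standing in for a data file.
original = b"research data " * 1000
reference = sha256_hex(original)  # recorded when the data is archived

# Simulate bitrot by flipping a single bit in one byte.
corrupted = bytearray(original)
corrupted[42] ^= 0b00000001

print(is_intact(original, reference))          # unchanged data
print(is_intact(bytes(corrupted), reference))  # data with one flipped bit
```

A checksum only detects the damage; repairing it still requires an intact copy.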

Preservation vs. Backup

EPFL storage options¶

EPFL offers many storage options, as described on the VPSI page Databases, Storage and Virtualization.

EPFL Storage Prices 2016

Prices Table

Why publish in a data archive?¶

Accelerate science and careers

Many studies show there are significant advantages for articles that share their code or data.

Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149

Avoid bias in science¶

FDA Turner

Machine learning needs¶

Barend Mons : Field

Machine learning is a promising discipline, but it requires access to data. Datamining is not a viable solution.

Source: Barend Mons, IDCC, Amsterdam 2016.

Data repositories¶

Zenodo (hosted by CERN, free) http://zenodo.org
- either EPFL or CHILI community

Other data repositories¶

Dryad («curated», non-profit organisation, partnership with publishers) http://datadryad.org/
Figshare (commercial, belongs to Macmillan [as does NPG]) http://figshare.com/
For more information, see re3data, which describes more than 1,500 data repositories.

Data Citation¶

Always use persistent identifiers to avoid broken links (about 60% of links are broken after 10 years)
The most common persistent identifier is the DOI (digital object identifier)
- e.g.: http://doi.org/10.5281/zenodo.7525
Zenodo, Figshare, Dryad and Infoscience can provide DOIs.
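
When citing data programmatically, it can help to normalize DOIs into resolver URLs. The sketch below is a hedged illustration: the helper names are hypothetical, and the regular expression is a simplified heuristic, not the official DOI syntax.

```python
import re

# Simplified heuristic: "10.", a numeric registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is written before the bare identifier.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value):
    """Strip a known prefix and return the bare DOI, or raise ValueError."""
    value = value.strip()
    for prefix in PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    if not DOI_PATTERN.match(value):
        raise ValueError(f"not a recognizable DOI: {value!r}")
    return value


def doi_url(value):
    """Return the resolver URL for a DOI given in any supported form."""
    return "https://doi.org/" + normalize_doi(value)


print(doi_url("http://doi.org/10.5281/zenodo.7525"))
print(doi_url("doi:10.1371/journal.pone.0104798"))
```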

2.7 - Licences¶

A licence lets you define how your data can be reused. For instance:

Creative Commons (CC0 and CC-BY) http://creativecommons.org/ Since version 4.0 of the licences, the sui generis right protecting database content is taken into account (in addition to the form, which is protected by copyright). See https://wiki.creativecommons.org/wiki/Data

You can contact us in the future here:

datamanagementplan@epfl.ch

We look forward to hearing from you!

Aude, Jan, Karine and Nathalie

Part 1¶

1.1 - Introduction to RDM¶

Definition, context and best practices¶

Definition : Research data¶

Research Data Lifecycle¶

1.2 - Actors and Skills¶

Actors¶

Skills¶

Requirements regarding research data management¶

Publishers¶

Funders¶

Funding agency and DMP : Horizon 2020¶

Funding agency and DMP : SNSF¶

1.3 - Data Management Plan¶

Definition : Data Management Plan¶

Why publish in a data archive?¶

Avoid bias in science¶

Machine learning needs¶

1.4 - DMP best practices¶

Best practices examples: DMPonline (UK)¶

Best practices examples: EPFL (Switzerland)¶

Part 2 - CHILI Specific Topics¶

2.1 - Ethics¶

2.1.1. Do you work with personal, sensitive data ?¶

2.1.1. When human beings are involved...¶

Human Research Ethics Committee at EPFL (HREC)¶

Collecting consent¶

2.1.2. Data ? What data ? Personal data ? Sensitive data ?¶

2.1.2. When human beings are involved...¶

2.1.3. Data ? What data ? Personal data ?¶

2.3.1. Doing what with data ?¶

Collecting consent¶

Processing¶

Correcting¶

Right to information¶

2.1.3 Doing what with data ?¶

Protecting¶

Personal or sensitive data processing¶

2.1.4. Disclosing personal data¶

2.1.4. Protecting and disclosing personal data¶

Protection¶

Anonymisation¶

Anonymisation¶

2.2 - Anonymization methods¶

k-anonymity¶

Definition¶

Illustration¶

l-diversity - motivation¶

l-diversity - definition¶

t-closeness - motivation¶

t-closeness - definition¶

differential privacy¶

Anonymization - theory and tools¶

2.3 - Reproducibility¶

2.3 - Reproducibility¶

A workflow for reproducible research¶

2.4 - Collaborative tools¶

2.4.1 - File sharing¶

2.4.2 - Collaborative writing¶

File sharing is not enough¶

Share LaTeX¶

Authorea¶

2.4.3 - Collaborative coding¶

Git¶

2.4.3 - Collaborative versioning and branching¶

Git¶

Git workflows¶

Git and GitHub are not suited for long term preservation¶

2.4.4 - Jupyter, Jupyterhub, Sagemath¶

Jupyter¶

Powerful python libraries¶

And web libraries¶

Jupyterhub¶

2.4.5 - R, RStudio and RStudio server¶

R¶

RStudio¶