diff --git a/OptimizeResearchDataManagement.html b/OptimizeResearchDataManagement.html index 7cef179..6acc6f6 100644 --- a/OptimizeResearchDataManagement.html +++ b/OptimizeResearchDataManagement.html @@ -1,14296 +1,14302 @@ OptimizeResearchDataManagement

CHILI

Research Data Management Bootcamp

Wednesday December 7th, 2016

Karine Delvert, Aude Dieudé, Jan Krause, Nathalie Lambeng


datamanagementplan@epfl.ch

.

Part 1

1.1 - Introduction to RDM

Definition, context and best practices

Definition : Research data

  • The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations.
  • For the Organisation for Economic Co-operation and Development (OECD), research data are defined as factual recordings (numbers, texts, images and sounds), which are used as principal sources for scientific research and which are often recognized by the scientific community as being necessary to validate research results.
  • One key element to take into consideration during research data management is the set of legal, ethical and political aspects tied to the sensitivity of the data.

Research Data Lifecycle

1.2 - Actors and Skills

Actors

Actors

Skills

Requirements regarding research data management

Publishers

Many publishers and scientific journals require, under specific conditions, the publication of the data used to obtain the research results (permanent archiving, standardized formats, etc.). This is the case, for instance, with PLoS and Nature Publishing Group. An overview of the editorial policies is available online on this Dryad website.

Funders

Examples of funders which require DMPs or equivalent:

Funding agency and DMP : Horizon 2020

.

  • Horizon 2020 is the European Commission's biggest research funding programme, with nearly €80 billion of funding available over 7 years from 2014 to 2020. Its main objective is to promote and support excellence in the scientific field.
  • For some research projects, Horizon 2020 requires a data management plan as a condition for receiving research funding.
  • As of 2017, the Commission will make open research data the default option, while ensuring opt-outs, for all new projects of the Horizon 2020 program.

Funding agency and DMP : SNSF

.

  • Research data management policy to be established in 2017

  • Submission of data management plans with the grant application

1.3 - Data Management Plan

Definition : Data Management Plan

  • Data Management Plan (DMP) refers to the strategies put into place to create, store, share, maintain, archive and preserve research data throughout their life cycle.
  • The DMP describes which data are going to be produced and how each type of data will be organized, classified, archived, shared, distributed, secured and preserved.
  • Here is a video, which illustrates how the DMP works concretely:

Why publish in a data archive?

Accelerate science and careers

Many studies show that articles sharing their code or data gain significant advantages.

Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149

Avoid bias in science

FDA Turner

Machine learning needs

Barend Mons : Field

Barend Mons : Field

Machine learning is a promising discipline, but it requires access to data. Datamining is not a viable solution.

Source: Barend Mons, IDCC, Amsterdam 2016.

1.4 - DMP best practices

Best practices examples: DMPonline (UK)

http://dmponline.dcc.ac.uk

Best practices examples: EPFL (Switzerland)

To provide guidance in preparing a DMP, the EPFL-ETHZ checklist includes four categories to cover questions related to:

  • Research Data Acquisition : type, quantity, license, etc.
  • Research Data Format : format, metadata, identification, etc.
  • Research Data Sharing : embargo, intellectual property, etc.
  • Data Preservation : storage, sensitivity of the data, archiving, etc.

Part 2 - CHILI Specific Topics

  • Ethics, legal aspects, anonymization
  • Reproducibility
  • Collaborative coding and writing
  • Computational workflows
  • (Meta)data formats
  • Publication and long term preservation
  • Data visualization

.

2.1 - Ethics

-

2.1.1. Do you work with personal, sensitive data ?

-
    -
  • Does your research practice involve collecting, processing and storing information on persons?
      -
    • ... identifiable persons ?
    • -
    • ... vulnerable persons ?
    • -
    • ... children ?
    • -
    -
  • -
+

2.1.1. When human beings are involved...

Human Beings

+

Ethics issues arise in many areas of research.

-
    -
  • How do you inform persons/subjects on what you will be doing ?
  • -
+

Research involving the voluntary participation of research subjects and the collection of data that might be considered as personal.

-
    -
  • What data do you typically use (collect, process, store) in the course of a research project ?
  • -
+

You must protect your volunteers, yourself and your researcher colleagues.

    -
  • Among these data which ones are personal ? Sensitive ?
  • +
  • Does your research practice involve collecting, processing and storing information on persons?
      +
    • ... identifiable persons ?
    • +
    • ... vulnerable persons ?
    • +
    • ... children ?
    • +
    +
    -
  • Do you need to identify the subject/person ?
  • +
  • How do you inform persons/subjects on what you will be doing ?
-
    -
  • If disclosed, do the data you collect lead to the dissemination of personal information ?
  • -
+

Human Research Ethics Committee at EPFL (HREC)

The role of the HREC is to review any research project carried out at EPFL involving non-invasive human research from an ethical point of view, before the beginning of the project.

-
    -
  • Do subjects/persons sometimes ask you for their performance ? The data you collected about them ?
  • +
+

Sources:

-

If you answered yes to any of the above question, ethical and legal issues apply.

-

You should check the Research Office Checklist: Research Office Ethics Assessment.

+

2.1.2. Data ? What data ? Personal data ? Sensitive data ?

+

personal data (data)

+
    +
  • all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)
  • +
  • examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these.
  • +
-

2.1.2. When human beings are involved...

Human Beings

-

For instance in such cases : "...collection of personal data, interviews, observations, questionnaires, recordings, tracking or the secondary use of information provided for other purposes, e.g. social media sites, other research projects etc.

-

In such cases the Human Research Ethics Committee at EPFL (HREC) should be consulted.

-

The role of the HREC is to review any research project carried out at EPFL involving non-invasive human research from an ethical point of view, before the beginning of the project."

-

http://research-office.epfl.ch/op/edit/page-117394.html

+

sensitive personal data

+

According to the Swiss FADP (article 3 c.) data on:

+
    +
  1. religious, ideological, political or trade union-related views or activities,
  2. +
  3. health, the intimate sphere or the racial origin,
  4. +
  5. social security measures,
  6. +
  7. administrative or criminal proceedings and sanctions;
  8. +
-

2.1.3. Data ? What data ? Personal data ?

-

personal data (data)

    -
  • all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)
  • -
  • examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these.
  • +
  • What data do you typically use (collect, process, store) in the course of a research project ?
-

sensitive personal data

-

According to the Swiss FADP (article 3 c.) data on:

-
    -
  1. religious, ideological, political or trade union-related views or activities,
  2. -
  3. health, the intimate sphere or the racial origin,
  4. -
  5. social security measures,
  6. -
  7. administrative or criminal proceedings and sanctions;
  8. -
-

2.3.1. Doing what with data ?

-
    -
  • Simple, understandable, in a language adapted to their age information
  • -
  • See form on Research Office Ethics Assessment.
  • +
      +
    • Among these data which ones are personal ?
-
Processing

Swiss FADP, article 3 e.: -any operation with personal data, irrespective of the means applied and the procedure, and in particular the collection, storage, use, revision, disclosure, archiving or destruction of data;

-

Notably:

    -
  • carried out in good faith
  • -
  • only for the purpose indicated at the time of collection (...)
  • -
  • consent must be given expressly in the case of processing of sensitive personal data or personality profiles.
  • +
  • Among these data which ones are sensitive ?
-
Correcting
    -
  • Anyone who processes personal data must make certain that it is correct. He must take all reasonable measures to ensure that data that is incorrect or incomplete in view of the purpose of its collection is either corrected or destroyed.
  • -
  • Any data subject may request that incorrect data be corrected.
  • -
+

If you work with personal or sensitive data,

+

you should check the Research Office Checklist: Research Office Ethics Assessment, especially the checklists (login with Gaspar).

-
Right to information

Any person may request information from the controller of a data file as to whether data concerning them is being processed.

-
    -
  • of all available data concerning the subject (...),
  • -
  • including the available information on the source of the data (...) as well as the categories of the personal data processed, the other parties involved with the file and the data recipient.
  • -
  • (...) The information must normally be provided in writing, in the form of a printout or a photocopy, and is free of charge.
  • -
-

Swiss FADP Article 8.

+

2.1.3 Doing what with data ?

-
Protecting

(Swiss FADP, article 7) Personal data must be protected against unauthorised processing through adequate technical and organisational measures.

-

Technical measures : notably it is forbidden to store personal data in countries that are not compatible with Swiss law, such as the US.

-

This excludes the usage of many clouds: Dropbox, Google Drive, Microsoft Azure, Amazon S3...

+
Personal or sensitive data processing

Swiss Federal Act on Data Protection (FADP) (or Loi sur la Protection des Données LPD), article 3 e.: +any operation with personal data, irrespective of the means applied and the procedure, and in particular:

+
    +
  • the collection,
  • +
  • storage,
  • +
  • use,
  • +
  • revision,
  • +
  • disclosure,
  • +
  • archiving
  • +
  • or destruction
  • +
+

of data;

-

2.1.4. Disclosing personal data

Personal Data collection and processing implies compliance with the law on privacy and data protection:

+

Processing data should notably:

+
    +
  • be carried out in good faith
  • +
  • only for the purpose indicated at the time of collection [...]
  • +
  • consent must be given expressly in the case of processing of sensitive personal data or personality profiles.
  • +
  • be accurate (and corrected or destroyed if required) (FADP article 5)
  • +
  • Any person may request information from the controller of a data file as to whether data concerning them is being processed (FADP article 8).
  • +
-

Disclosure

-

(Swiss FADP article 3 f.): making personal data accessible, for example:

    -
  • by permitting access,
  • -
  • transmission
  • -
  • or publication.
  • +
  • Do you need to identify the subject/person ?

    +
  • +
  • If disclosed, do the data you collect lead to the dissemination of personal information ?

    +
  • +
  • How and to whom will the data be disseminated ?

    +
-

Cross-border disclosure

-

Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection.

-

Art. 61Cross-border disclosure -Personal data must be protected against unauthorised processing through adequate technical and organisational measures.

+

2.1.4. Protecting and disclosing personal data

Protection

Personal data must be protected against unauthorised processing through adequate technical and organisational measures (Swiss FADP, article 7).

-

Anonymisation

    -
  1. Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if: -a. the data is rendered anonymous, as soon as the purpose of the processing permits; -b. the recipient only discloses the data with the consent of the federal body and -c. the results are published in such a manner that the data subjects may not be identified.
  2. -
+

Disclosure

+

Making personal data accessible, for example:

+
    +
  • by permitting access,
  • +
  • transmission
  • +
  • or publication.
  • +
+

(Swiss FADP article 3 f.)

-

Recapitulation

+

Cross-border disclosure

+

Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection.

+

Personal data disclosed across borders must be protected against unauthorised processing through adequate technical and organisational measures.

+

(FADP Art. 6)

+ +
+
+
+
+
+
+
+
+

Anonymisation

Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if:

    -
  • Autorisation and information to provide and DMP
  • -
  • Collect consent
  • -
  • Inform participants : sample information sheet
  • -
  • Autorisations
  • +
  • the data is rendered anonymous, as soon as the purpose of the processing permits;
  • +
  • the recipient only discloses the data with the consent of the federal body and
  • +
  • the results are published in such a manner that the data subjects may not be identified.

References

-
-
-
-

2.2 - Anonymization methods

Privacy protection methods, either :

  • removing,
  • generalizing or
  • encrypting,

personal information from datasets.

In passing, there is more to this (Privacy-Preserving Data Mining Methods / Charu Aggarwal and Philip Yu. 2008.):

Privacy-Preserving_Data_Mining__Methods.png

k-anonymity

Definition

"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release" (Source).

Illustration

Example including removal and generalization (same source):

| Name | Age | Gender | State of domicile | Religion | Disease |
|---|---|---|---|---|---|
| Ramsha | 29 | Female | Tamil Nadu | Hindu | Cancer |
| Yadu | 24 | Female | Kerala | Hindu | Viral infection |
| Salima | 28 | Female | Tamil Nadu | Muslim | TB |
| Sunny | 27 | Male | Karnataka | Parsi | No illness |
| Joan | 24 | Female | Kerala | Christian | Heart-related |
| Bahuksana | 23 | Male | Karnataka | Buddhist | TB |
| Rambha | 19 | Male | Kerala | Hindu | Cancer |
| Kishor | 29 | Male | Karnataka | Hindu | Heart-related |
| Johnson | 17 | Male | Kerala | Christian | Heart-related |
| John | 19 | Male | Kerala | Christian | Viral infection |

To (name and religion were removed, age was generalized):

| Name | Age | Gender | State of domicile | Religion | Disease |
|---|---|---|---|---|---|
| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | Cancer |
| * | 20 < Age ≤ 30 | Female | Kerala | * | Viral infection |
| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | TB |
| * | 20 < Age ≤ 30 | Male | Karnataka | * | No illness |
| * | 20 < Age ≤ 30 | Female | Kerala | * | Heart-related |
| * | 20 < Age ≤ 30 | Male | Karnataka | * | TB |
| * | Age ≤ 20 | Male | Kerala | * | Cancer |
| * | 20 < Age ≤ 30 | Male | Karnataka | * | Heart-related |
| * | Age ≤ 20 | Male | Kerala | * | Heart-related |
| * | Age ≤ 20 | Male | Kerala | * | Viral infection |

This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.
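The 2-anonymity claim can be checked mechanically. The following is a minimal Python sketch: the rows are copied from the generalized table, while the function name `k_anonymity` and the column indexes are illustrative choices, not part of any anonymization library.

```python
from collections import Counter

# Rows of the generalized table: (age range, gender, state of domicile, disease).
ROWS = [
    ("20 < Age <= 30", "Female", "Tamil Nadu", "Cancer"),
    ("20 < Age <= 30", "Female", "Kerala", "Viral infection"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "TB"),
    ("20 < Age <= 30", "Male", "Karnataka", "No illness"),
    ("20 < Age <= 30", "Female", "Kerala", "Heart-related"),
    ("20 < Age <= 30", "Male", "Karnataka", "TB"),
    ("Age <= 20", "Male", "Kerala", "Cancer"),
    ("20 < Age <= 30", "Male", "Karnataka", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Heart-related"),
    ("Age <= 20", "Male", "Kerala", "Viral infection"),
]


def k_anonymity(rows, quasi_identifier_indexes):
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(row[i] for i in quasi_identifier_indexes) for row in rows
    )
    return min(groups.values())


# Age, gender and state of domicile are the quasi-identifiers (columns 0, 1, 2).
k = k_anonymity(ROWS, quasi_identifier_indexes=(0, 1, 2))
print(k)
```

The smallest group here contains two rows, so the table is 2-anonymous but not 3-anonymous with respect to these three attributes.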

l-diversity - motivation

An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:

  • homogeneity attacks: when all rows of a group share the same sensitive value,
  • background knowledge attacks: when knowledge about a field reduces the set of possible sensitive values (e.g. knowing that heart attacks are not frequent in Japanese patients) (source).

Imagine the group, or equivalence class, (extracted from the whole dataset) [table adapted from the one above] :

| Name | Age | Gender | State of domicile | Religion | Disease |
|---|---|---|---|---|---|
| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |
| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |
| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |

If it is known that Miss Smith was part of the study, is aged between 20 and 30 and lives in Tamil Nadu, then it is certain that she has AIDS, even though the table has 3-anonymity.

l-diversity - definition

The l-diversity Principle : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.

There are several definitions of "well-represented" (source).
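As an illustration, here is a minimal Python sketch of the simplest reading of "well-represented", often called distinct l-diversity: each equivalence class must contain at least l different sensitive values. The function name and the row layout are assumptions made for this example.

```python
from collections import defaultdict


def distinct_l_diversity(rows, quasi_identifier_indexes, sensitive_index):
    """Return the smallest number of distinct sensitive values found in any equivalence class."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in quasi_identifier_indexes)
        classes[key].add(row[sensitive_index])
    return min(len(values) for values in classes.values())


# The homogeneous equivalence class from the motivation slide:
# three identical quasi-identifier combinations, one single disease.
HOMOGENEOUS_CLASS = [
    ("20 < Age <= 30", "Female", "Tamil Nadu", "AIDS"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "AIDS"),
    ("20 < Age <= 30", "Female", "Tamil Nadu", "AIDS"),
]

l_value = distinct_l_diversity(HOMOGENEOUS_CLASS, (0, 1, 2), sensitive_index=3)
print(l_value)
```

The class is 3-anonymous yet only 1-diverse, which is exactly the situation a homogeneity attack exploits.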

By the way, l-diversity has weaknesses too, which is why t-closeness was introduced.

t-closeness - motivation

The l-diversity requirement ensures “diversity” of sensitive values in each group, but it does not recognize that values may be semantically close. For example, an attacker could deduce that an individual has a stomach disease if the group containing that individual lists only three different stomach diseases (adapted from source).

t-closeness - definition

The t-closeness Principle: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness (source).
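The definition can be sketched in Python as follows. This is a simplified illustration: it measures the distance between distributions with the total variation distance, which is an assumption of this sketch (the t-closeness literature generally uses the Earth Mover's Distance). The toy rows and all names are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute differences between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def t_closeness(rows, quasi_identifier_indexes, sensitive_index):
    """Return the largest distance between a class distribution and the whole-table distribution."""
    overall = distribution([row[sensitive_index] for row in rows])
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in quasi_identifier_indexes)
        classes[key].append(row[sensitive_index])
    return max(
        total_variation(distribution(values), overall)
        for values in classes.values()
    )


# Hypothetical toy table: (equivalence class label, sensitive value).
TOY_ROWS = [
    ("class A", "flu"),
    ("class A", "flu"),
    ("class B", "flu"),
    ("class B", "cancer"),
]

t = t_closeness(TOY_ROWS, quasi_identifier_indexes=(0,), sensitive_index=1)
print(t)
```

A table satisfies t-closeness for a chosen threshold when the value returned is no more than that threshold.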

differential privacy

Anonymized data can be re-identified by linking it with another database: linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts.

Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1 (source).
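The linkage attack described above can be sketched with two toy tables. All records below are hypothetical placeholders (none come from the GIC or voter data), and the function name `link` is an illustrative choice.

```python
# Hypothetical "anonymized" medical records: names removed, quasi-identifiers kept.
MEDICAL = [
    {"birthdate": "1950-01-01", "sex": "M", "zip": "00001", "diagnosis": "condition A"},
    {"birthdate": "1975-06-15", "sex": "F", "zip": "00002", "diagnosis": "condition B"},
]

# Hypothetical public register containing names and the same quasi-identifiers.
VOTERS = [
    {"name": "Person One", "birthdate": "1950-01-01", "sex": "M", "zip": "00001"},
    {"name": "Person Two", "birthdate": "1980-03-03", "sex": "F", "zip": "00003"},
]

QUASI_IDENTIFIERS = ("birthdate", "sex", "zip")


def link(medical, voters, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for medical records matching exactly one voter."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter["name"])
    matches = []
    for record in medical:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the person
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(MEDICAL, VOTERS)
print(reidentified)
```

Removing names is therefore not enough when the remaining attributes are unique in combination and available elsewhere.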

Anonymization - theory and tools

Statistical Disclosure Control / Hundepool, & al. 2012.

Ebook provided by the EPFL library.

Tools

  • sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation (R package)
  • ARX Data Anonymization Tool (Java: library &GUI)
  • μ-ARGUS (Java, GUI)
-

2.3 - Reproducibility

According to a Nature study in 2012, 47 out of 53 medical research papers are irreproducible (1).

+

2.3 - Reproducibility

+
+
+
+
+
+
+
+
+

According to a Nature study in 2012, 47 out of 53 medical research papers are irreproducible (1).

A previous study showed in 2009 that 16 out of 18 bioinformatics papers could not be reproduced entirely (2).

In 2004, it was found that less than 9% of papers share their code (3).

(1) Begley, C. G.; Ellis, L. M. (2012). "Drug development: Raise standards for preclinical cancer research". Nature 483 (7391): 531–533.
(2) Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.
(3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. "Reproducible research in signal processing." Signal Processing Magazine, IEEE 26.3 (2009): 37-47

[Slide inspired by https://github.com/saloot/IPythonClass, Amir Hessam Salavati & Robin Schiebler, 2015]

A workflow for reproducible research

Researchers often start to think about reproducibility at the end of a project. By then it is sometimes too late: numerous versions of code and datasets may be spread across various places (folders, Dropbox, USB drives...).

A practical five-point approach:

  1. document everything
  2. everything is a (text) file
  3. files should be human readable
  4. explicitly tie your files together
  5. have a plan to organize, store and make your files available

Slide inspired by chapter 2 of Reproducible Research with R and RStudio.

More details:

  • document everything
    • reproduction requires documentation of what you did
  • everything is a text file
    • notably: data, code and results
    • the simplest formats are the best: CSV / JSON, Markdown / $\LaTeX$, because they are future-proof
  • files should be human readable
    • treat all files as if someone who does not know the project will have to use them
    • otherwise they (or you 6 months later) will probably not understand them
    • important elements to document:
      • description of what the file is or does (in general, local comments)
      • contributors
      • date of last update
  • explicitly tie your files together, including generated documents
    • locally or using persistent identifiers
    • formalize the way data is processed
    • generally difficult to trace back (e.g.: how was a specific figure generated?)
  • have a plan to organize, store and make your files available
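One possible way to "explicitly tie your files together" is a plain-text manifest listing every file of a project with its checksum. The sketch below uses only the Python standard library; the function names and the idea of a manifest file are assumptions of this example, not a tool prescribed by the slides.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file under `folder` (relative path) to its checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def manifest_as_json(folder):
    """Serialize the manifest as human-readable JSON text."""
    return json.dumps(build_manifest(folder), indent=2, sort_keys=True)
```

Calling `manifest_as_json("my_project")` (a hypothetical folder name) returns text that can be stored next to the project and compared later to see which data, code or result files changed.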

2.4 - Collaborative tools

2.4.1 - File sharing

.

  • Personal/group level: OwnCloud, free software: Mac, Windows, Linux, iOS, Android... Web.
    • Your own server: OwnCloud https://owncloud.org/
    • Many plugins: contacts, calendar, collaborative writing, image galleries, etc.
  • Swiss level: SwitchDrive https://drive.switch.ch/
    • ownCloud with 25 GB per user,
    • Restricted to members of Swiss universities.
  • Nextcloud, a recent fork of ownCloud, aims for more transparent development processes.

2.4.2 - Collaborative writing

File sharing is not enough

People increasingly need to collaborate at a finer level.

... Source: Pr. Vandergheynst, EPFL Library Noon Talk, 25.8.2016.

... Source: Pr. Vandergheynst, EPFL Library Noon Talk, 25.8.2016.

... Source: Pr. Vandergheynst, EPFL Library Noon Talk, 25.8.2016.

In summary

Text processing comments / revision mode functionalities are not sufficient for good collaboration.

Google Docs and related tools are not oriented toward scientific writing, particularly regarding references, citations, bibliography management and (interactive) figures.

$\Rightarrow$ we need something else!

Share LaTeX

Share LaTeX is an alternative to Authorea: collaborative writing based on LaTeX, suited for LaTeX power users.

Good, but only if all partners are LaTeX users.

Authorea

Authorea: collaborative writing, easy to use.

Authorea

  • Free account to test (limited to 1 private document, no limits on public documents). EPFL licence provided by the Library.
  • Simple syntax : WYSIWYG and Markdown (lightweight text formatting language). More complex formatting is possible using LaTeX
  • Enables others to make comments
  • Supports interactive documents / figures (Jupyter)
  • Offline synchronization on personal computer (using the Git version control system)
-

2.4.3 - Collaborative coding

Git

.

+

2.4.3 - Collaborative versioning and branching

Git

.

Git is a multi-platform (Windows, Mac, GNU/Linux) version control tool.

Git Servers

  • GitHub: very popular, some data hosted in the US. Closed repositories are limited (payment or subject to other conditions).
  • c4science is the Swiss collaborative development platform. Unlimited number of repositories (open or closed).

Git workflows

Git will however not do everything for you.

  • You need to think up a naming convention (folder structure, file names) e.g.
    • PROJECT-Experiment-Researcher(ORCID)-YYYYMMDD.extension
    • PROJECT-Experiment-Researcher(ORCID)-Software-Format-YYYYMMDD.extension
    • PROJECT-Experiment-Researcher(ORCID)-Software-Version-Format-YYYYMMDD.extension
  • Set up an appropriate workflow.
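A naming convention is easier to follow when file names are generated and checked by code. The sketch below implements the first pattern of the list, assuming that "Researcher(ORCID)" means a researcher name followed by an ORCID iD in parentheses; the function name, the regular expression and the sample values are hypothetical.

```python
import re
from datetime import date


def build_filename(project, experiment, researcher, orcid, day, extension):
    """Build a name following PROJECT-Experiment-Researcher(ORCID)-YYYYMMDD.extension."""
    return f"{project}-{experiment}-{researcher}({orcid})-{day:%Y%m%d}.{extension}"


# Pattern used to validate existing names against the same convention.
FILENAME_PATTERN = re.compile(
    r"^(?P<project>[^-]+)-(?P<experiment>[^-]+)-(?P<researcher>[^-(]+)"
    r"\((?P<orcid>\d{4}-\d{4}-\d{4}-\d{3}[\dX])\)"
    r"-(?P<date>\d{8})\.(?P<extension>\w+)$"
)

# Placeholder values: the ORCID iD below is not a real identifier.
name = build_filename(
    "CHILI", "Exp1", "Doe", "0000-0000-0000-0000", date(2016, 12, 7), "csv"
)
print(name)
print(FILENAME_PATTERN.match(name) is not None)
```

The same pattern can be extended with the Software, Version and Format fields of the longer variants.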

Locally

Locally

Source: J.-L. Falcone.

The easiest way is to use a centralized repository.

Centralized

Source: J.-L. Falcone.

For more complex projects, a project leader can manage the quality.

Centralized

Source: J.-L. Falcone.

For big projects, it is possible to dispatch responsibilities.

Centralized

Source: J.-L. Falcone.

Non linear development is supported: branches

Centralized

Source: J.-L. Falcone.

Git and GitHub are not suited for long term preservation

.

  • Some git commands can delete data (namely: rebase and reset --hard)
  • Repositories can be deleted (including on GitHub)
  • A GitHub $\Rightarrow$ Zenodo link can be set up, so that each release is automatically made citable through a DOI and preserved in Zenodo.

Guide : Making your code citable

.

2.4.4 - Jupyter, Jupyterhub, Sagemath

Jupyter

Interactive Jupyter Notebook documents: try.jupyter.org

.

Structure:

  • Rich-hyper-text cells (including tables, $\LaTeX$, images, videos)
  • Live code cells (with interactive widgets)

Characteristics

  • Over 50 languages supported : Python, R, Octave, BASH, Matlab, Scala, Java, Haskell...
  • Jupyter Notebooks are JSON files $\rightarrow$ can be tracked with Git.
  • Nbconvert allows conversion to many formats, including Python:
    • jupyter nbconvert notebook.ipynb --to python
    • jupyter nbconvert notebook.ipynb --to latex
    • jupyter nbconvert notebook.ipynb --to markdown
    • jupyter nbconvert notebook.ipynb --to slides
    • jupyter nbconvert notebook.ipynb --to html
  • Executing from command line:
    • jupyter nbconvert --to notebook --execute mynotebook.ipynb
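Because notebooks are JSON files, they can also be inspected with nothing more than a JSON parser. The sketch below builds a tiny notebook-like document by hand (a simplified assumption of the version 4 notebook layout, with hypothetical content) and extracts its code cells using the Python standard library.

```python
import json

# A minimal, hand-written notebook document (simplified for illustration).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Prime numbers"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["primes = [2, 3, 5, 7, 11, 13]\n", "print(primes)"],
        },
    ],
})


def code_cells(text):
    """Return the source of every code cell found in a notebook given as JSON text."""
    notebook = json.loads(text)
    sources = []
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            source = cell["source"]
            # The source may be stored as one string or as a list of lines.
            sources.append(source if isinstance(source, str) else "".join(source))
    return sources


print(code_cells(notebook_text))
```

The same text-based structure is what makes notebooks trackable with Git, although stored outputs can make the differences between versions noisy.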

Powerful python libraries

  • Pandas is a powerful library providing high-performance, easy-to-use data structures and data analysis tools. Examples.
  • Numpy is the fundamental package for scientific computing with Python:
    • N-dimensional array object
    • sophisticated (broadcasting) functions
    • tools for integrating C/C++ and Fortran code
    • useful linear algebra, Fourier transform, and random number capabilities
  • Matplotlib is a plotting library with great flexibility. It has features comparable to Matlab plotting. Examples.
  • Seaborn is a statistical visualization library built on Matplotlib that works well with Pandas data structures. Examples.
  • NetworkX is suited for complex networks analysis and representation. Examples.
  • rpy2 is an interface to R running embedded in a Python process.

.

And web libraries

  • Bokeh is a Python interactive visualization library that targets modern web browsers for presentation.
  • D3.js is an open source JavaScript library for creating interactive documents based on data. D3 helps bring data to life using HTML, SVG, and CSS. It can be used in Jupyter with matplotlib via mpld3.

.

Jupyterhub

  • Jupyter Multi Users server (system users, or via GitHub)
  • Collaborate in a local folder, mounting a VSPI Share, or with Git.

.

2.4.5 - R, RStudio and RStudio server

R

R is a free software environment for statistical computing and graphics. One of the best.

Platforms:

  • wide variety of GNU/Linux and UNIX platforms,
  • Windows
  • MacOS

Strength: the diversity of quality open extensions (easily installable from CRAN).

RStudio

RStudio is a free and open-source integrated development environment (IDE) for R.

.

R and reproducible research

.

Reproducible research and documents

  • knitr and rmarkdown
  • tying together results and their presentation in articles (pdf, word), presentations or web sites
  • notably in $\LaTeX$ (.Rtex) or Markdown (.Rmarkdown)
  • well integrated in RStudio

Rmarkdown

Include R code chunks in markdown:

# Prime numbers

Storing a few prime numbers in a variable:

```{r}
primes <- c(2,3,5,7,11,13)
```
Done.

First you need to setup document properties in YAML:

---
title: "Rmarkdown example"
author: "Jan Krause"
date: "24 November 2016"
output: pdf_document
---

# Prime numbers

Storing a few prime numbers in a variable:

```{r}
primes <- c(2,3,5,7,11,13)
```
Done.

RStudio Server

RStudio in your browser.

.

2.5 - Computational workflow management

Scientific results are often the outcome of complex workflows. Computation operations constitute a graph, which may be difficult to reproduce.

2.5.1 - AiiDA

AiiDA is free software developed at EPFL (in materials science): http://www.aiida.net/

2.5.2 - SnakeMake : a simple tool

  • simple : nodes are connected through files (inspired by GNU Make)
  • complete :
    • supports remote files (http(s), sftp, dropbox, googledrive)
    • handles data provenance and rule versions,
    • parallelization,
    • suspend/resume,
    • logging,
    • creates schema
  • flexible : the Snakefile is an extension of Python
  • http://snakemake.bitbucket.org/

Simple Rule:

rule sort:
     input:
         f = "path/to/dataset.txt"
     output:
         f = "dataset.sorted.txt"
     shell:
         "sort {input.f} > {output.f}"

Simple Rule (two inputs):

# Named inputs are separated by commas; sort merges both files into one sorted output.
rule sort:
    input:
        f1 = "dataset1.txt",
        f2 = "dataset2.txt"
    output:
        f = "dataset.sorted.txt"
    shell:
        "sort {input.f1} {input.f2} > {output.f}"

Simple Rule (here in Python, but R scripts are supported too):

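rule sort:
    input:
        a="path/to/dataset.txt"
    output:
        b="dataset.sorted.txt"
    run:
        # Lines read from a file keep their trailing newline, so they are
        # written back unchanged (print() would add a blank line after each).
        with open(input.a) as source, open(output.b, "w") as out:
            out.writelines(sorted(source))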

More than one rule:

rule result:
    input:
        "result.txt"

# No input section: this rule creates its output from scratch.
rule generate_cal_2017:
    output:
        fname = "tmp/cal.txt"
    shell:
        "cal 2017 > {output.fname}"

rule describe:
    input:
        fname1 = "DESCRIPTION.txt",
        fname2 = "tmp/cal.txt"
    output:
        fname = "result.txt"
    shell:
        "cat {input.fname1} {input.fname2} > {output.fname}"

Expand (running rules in parallel):

DATASETS = ["D1", "D2", "D3", "D4", "D5", "D6"]

rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)

rule sort:
    input:
        "{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

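To make the role of the `{dataset}` wildcard more concrete, here is a plain-Python sketch of the file names involved in the rules above. It is not Snakemake's own code: the list comprehension only mimics what the `expand` call is expected to list, and the helper `input_for` is a hypothetical name.

```python
# Plain-Python illustration of the file names handled by the workflow.
DATASETS = ["D1", "D2", "D3", "D4", "D5", "D6"]

# Target files requested by the first rule, one per dataset.
targets = ["{dataset}.sorted.txt".format(dataset=name) for name in DATASETS]


def input_for(target, suffix=".sorted.txt"):
    """Recover the wildcard value from a target name and build the matching input name."""
    if not target.endswith(suffix):
        raise ValueError(f"{target!r} does not match the output pattern")
    dataset = target[: -len(suffix)]
    return f"{dataset}.txt"


for target in targets:
    print(input_for(target), "->", target)
```

Each target depends only on its own input file, which is why the six sorting jobs are independent and can run in parallel.
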
Output : example Graph

Output : Log (simplified)

| output_file | date | rule | version |
|---|---|---|---|
| result.txt | Fri Nov 11 15:48:17 2016 | cleanup | 3.14 |
| tmp/pre-result.txt | Fri Nov 11 15:48:17 2016 | add_head_foot | 1.02 |
| tmp/FOOT.txt | Fri Nov 11 15:48:17 2016 | generate_foot | 5.6 |
| tmp/HEAD.txt | Fri Nov 11 15:48:17 2016 | generate_head | 5.6 |
| tmp/described_cal.txt | Fri Nov 11 15:48:17 2016 | describe | 0.1alpha |
| tmp/cal.txt | Fri Nov 11 15:48:17 2016 | genrate_cal | 8.234 |

More about workflows

Another tool: Taverna which includes the desktop oriented Taverna Workbench, command-line and server applications.

Finally, myExperiment is a platform for sharing scientific workflows; it is fully supported by Taverna.

EPFL platforms

  • EPFL SCITAS (Victoria Rezzonico)
    • High Performance Computing and data Storage .
-

2.6 - Data and Storage

2.6.1 - (Meta)data formats

+

2.6 - Data and Storage

+
+
+
+
+
+
+
+
+

2.6.1 - (Meta)data formats

Metadata

  • Numerous specialized metadata formats are available for most disciplines; the Research Data Alliance Metadata Directory is a good starting point.

.

Data format

Prefer a

  • standard format,
  • open and
  • widely used

This way your data will not depend upon a particular software (or company), operating system, or platform. And you will be able to:

  • collaborate with more people (on various platforms)
  • avoid licensing problems
  • maximize the reusability in the future

Some open formats to take into account

  • Portable Document Format PDF/A: ISO standard, text [PDF for archiving: no encryption, embedded fonts...]
  • Text: a simple way to encode data that can be read by most software.
    • CSV: tables, readable by most software, and extensible with CSV on the Web (metadata, datatypes, relations...)
    • JSON: simply structured, less bulky than XML, ideal for data exchange.
  • HDF5: more flexible (not text, but structured and indexed; supports arbitrary metadata; good performance).
    • Compatible with many tools (Python, R, Matlab, Mathematica...)
  • Databases:
    • SQL: PostgreSQL is relational, open and efficient
    • BigData: MongoDB for volume, velocity, and variety
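
As an illustration of the text formats listed above, here is a minimal sketch that writes the same records as CSV and as JSON using only the Python standard library. The sample names, the column `temperature_c` and the `unit` field are made up for the example.

```python
import csv
import io
import json

# Hypothetical measurements, used only for illustration.
rows = [
    {"sample": "S1", "temperature_c": 21.5},
    {"sample": "S2", "temperature_c": 22.1},
]

# CSV: one header line, then one line per record.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["sample", "temperature_c"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSON: numbers stay numbers, and metadata can travel with the data.
json_text = json.dumps({"unit": "degree Celsius", "records": rows}, indent=2)

# Reading both back: CSV returns every value as a string, JSON keeps the types.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)["records"]
print(csv_rows[0]["temperature_c"], json_rows[0]["temperature_c"])
```

Plain CSV carries no data types, which is one reason to document columns in accompanying metadata (for example with CSV on the Web).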

Data formats list

Sustainability of digital formats by the US Library of Congress. This list is categorized by data type (text, audio, image, video, geospatial, dataset, etc.).

2.6.2 - Storage, publication and preservation

Data access sustainability

A PLoS ONE study showed in 2014 that more than 60% of links to datasets are broken after 10 years (1).

Another 2014 PLoS ONE article showed that 1 out of every 5 articles has a bibliography affected by this phenomenon, known as reference rot (2).

(1) Pepe et al. (2014). How Do Astronomers Share Data? Reliability and Persistence of Datasets Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers. PLoS ONE, 9(8). doi:10.1371/journal.pone.0104798
(2) Klein et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE, 9(12). doi:10.1371/journal.pone.0115253

Digital preservation

At CERN, a 2007 study (1, 2) measured an error ratio of $10^{-7}$ (over 2 months).

Causes are complex and varied: disk errors, RAID errors, memory errors, etc.

For 1 gigabyte (1000 megabytes), this gives: $10^9 \cdot 10^{-7} = 10^2 = 100$ bytes of bitrot.

(1) https://indico.cern.ch/event/13797/session/0/contribution/3/attachments/115080/163419/Data_integrity_v3.pdf
(2) http://www.zdnet.com/article/data-corruption-is-worse-than-you-know/
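
One common way to detect this kind of silent corruption is to store a checksum next to each file and recompute it later. The following is a minimal sketch, not a preservation system: the helper name `sha256_of` and the file `dataset.bin` are hypothetical, and the single flipped bit simulates bitrot.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "dataset.bin"  # hypothetical data file
    data_file.write_bytes(b"\x00" * 1000)

    # Reference checksum, recorded when the file is archived.
    reference = sha256_of(data_file)
    unchanged = sha256_of(data_file) == reference

    # Simulate bitrot by flipping a single bit in the middle of the file.
    corrupted = bytearray(data_file.read_bytes())
    corrupted[500] ^= 0x01
    data_file.write_bytes(bytes(corrupted))
    detected = sha256_of(data_file) != reference

print(unchanged, detected)
```

A checksum only reveals that a file has changed; repairing it still requires an intact copy stored elsewhere.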

Preservation vs. Backup

EPFL storage options

EPFL offers many storage options, as described on the VPSI page Databases, Storage and Virtualization.

EPFL Storage Prices 2016

Prices Table

Why publish in a data archive?

Accelerate science and careers

Many studies show there are significant advantages for articles that share their code or data.

Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149

Avoid bias in science

FDA Turner

Machine learning needs

Barend Mons : Field

Barend Mons : Field

Machine learning is a promising discipline, but it requires access to data. Datamining is not a viable solution.

Source: Barend Mons, IDCC, Amsterdam, 2016.

Data repositories

Other data repositories

  • Dryad («curated», non-profit organisation, partnership with publishers) http://datadryad.org/

  • Figshare (commercial, belongs to Macmillan [as does NPG]) http://figshare.com/

  • For more information, see re3data, which describes more than 1,500 data repositories.

Data Citation

  • Always use persistent identifiers to avoid broken links (about 60% of links to datasets are broken after 10 years)
  • The most common persistent identifier is the DOI (digital object identifier)
  • Zenodo, Figshare, Dryad and Infoscience can provide DOIs.
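
A DOI becomes a clickable, persistent link once it is prefixed with the `https://doi.org/` resolver. The sketch below normalizes the different ways a DOI is commonly written; the function name `doi_to_url` is hypothetical, and the DOIs are those of articles cited in this document.

```python
def doi_to_url(doi):
    """Turn a DOI, written with or without a prefix, into an https://doi.org/ link."""
    doi = doi.strip()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        # Strip a known prefix so that only the bare DOI remains.
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + doi


print(doi_to_url("doi:10.1371/journal.pone.0104798"))
print(doi_to_url("http://doi.org/10.18352/lq.10149"))
```

Citing the resolver link rather than the repository's current web address keeps the reference valid even if the dataset moves.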

2.7 - Licences

A licence defines how your data can be reused. For instance:

Creative Commons (CC0 and CC-BY) http://creativecommons.org/ Since version 4.0, Creative Commons licences also take into account the sui generis right protecting database content (in addition to the form, which is protected by copyright) https://wiki.creativecommons.org/wiki/Data

You can contact us in the future here:


datamanagementplan@epfl.ch




We look forward to hearing from you!




Aude, Jan, Karine and Nathalie

diff --git a/OptimizeResearchDataManagement.ipynb b/OptimizeResearchDataManagement.ipynb index 1e59cda..904f0c3 100755 --- a/OptimizeResearchDataManagement.ipynb +++ b/OptimizeResearchDataManagement.ipynb @@ -1,2608 +1,2612 @@ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# CHILI \n", "\n", "## Research Data Management Bootcamp\n", "\n", "#### Wednesday December 7th, 2016\n", "\n", "#### Karine Delvert, Aude Dieudé, Jan Krause, Nathalie Lambeng\n", "\n", "\n", "
datamanagementplan@epfl.ch\n", "\n", "![.](./Images/CC-By-NC-SA_88x31.png)\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1\n", "\n", "## 1.1 - Introduction to RDM" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition, context and best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Introduction: [video](https://www.youtube.com/watch?v=N2zK3sAtr-4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition : Research data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For the Organization for Economic Cooperation and Development [OCDE](http://www.oecd.org/fr/sti/sci-tech/38500823.pdf), research data are defined as factual recordings (numbers, texts, images and sounds), which are used as principal sources for scientific research and which are often recognized by the scientific community as being necessary to validate research results." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- One key element to take into consideration during research data management are the legal, ethical and political aspects based on the sensitivity of the data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Research Data Lifecycle" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", "\n", "[Source: Formation URFIST, Rennes, 2016](https://drive.google.com/file/d/0BxKZLWq08xX-TW5VOEUtd2FSRE0/view?pref=2&pli=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.2 - Actors and Skills" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Actors" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "![Actors](Images/Actors2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Skills" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Requirements regarding research data management\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Publishers\n", "\n", "Many publishers and scientific journals require, under specific\n", "conditions, the publication of used data to achieve the research project\n", "results (permanent archiving, standardized formats, etc). This is the case,\n", "for instance, with PLoS and Nature Publishing Group. An overview of the\n", "editorial policies are available online on this [Dryad website](http://wiki.datadryad.org/Journal_instructions)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funders" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Examples of funders which require DMPs or equivalent:\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funding agency and DMP : Horizon 2020\n", "\n", "![.](Images/H2020.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- [Horizon 2020](http://research-office.epfl.ch/financements/international/horizon-2020): is the biggest funding agency from the European Commission \n", "with nearly €80 billion of funding available over 7 years from 2014 to 2020. Its\n", "main objective is to promote and support excellence in the scientific field." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Horizon 2020 requires for some research projects the preparation of a [data management plan](http://ec.europa.eu/programmes/horizon2020/en/what-horizon-2020), which is mandatory in order to receive research funding. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "- [As of 2017](https://ec.europa.eu/digital-single-market/en/news/communication-european-cloud-initiative-building-competitive-data-and-knowledge-economy-europe), the Commission will make **open research data the default option**, while ensuring opt-outs, for all new projects of the Horizon 2020 program." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funding agency and DMP : SNSF\n", "![.](Images/SNSF.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Research data management policy** to be established in 2017\n", "\n", "- Submission of **data management plans** with the grant application" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.3 - Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition : Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Data Management Plan (DMP) refers to the strategies put into place to\n", "create, store, share, maintain, archive and preserve research data\n", "throughout their life cycle.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The DMP describes which data are going to be produced and how each\n", "type of data will be organized, classified, archived, shared, distributed,\n", "secured and preserved in a secure way." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Here is a [video](https://www.youtube.com/watch?v=gYDb-GP1CA4), which illustrates how the DMP works concretely:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Why publish in a data archive?\n", "\n", "**Accelerate science and careers**\n", "\n", "Many sutdies show there are significant advantages for articles that share their code or data.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### Avoid bias in science \n", "\n", "\n", "![FDA Turner](Images/FDA_Turner.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### Machine learning needs\n", "\n", "![Barend Mons : Field](Images/MonsField.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "![Barend Mons : Field](Images/MonsHelicopter.png)\n", "\n", "\n", "Machine learning is a promising discipline, but it requires access to data. Datamining is not a viable solution.\n", "\n", "Source: Barend Mons, IDCC, Amsterdamm 2016." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.4 - DMP best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Best practices examples: DMPonline (UK)\n", "
\n", "
http://dmponline.dcc.ac.uk
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Best practices examples: EPFL (Switzerland)\n", "To provide guidance in preparing a DMP, the **[EPFL-ETHZ checklist](http://library.epfl.ch/files/content/sites/library3/files/research-data/dmp/Data_management_plan_checklist_EPFL_2016.pdf)** includes\n", "four categories to cover questions related to:\n", "- Research Data Acquisition : type, quantity, license, etc.\n", "- Research Data Format : format, metadata, identification, etc.\n", "- Research Data Sharing : embargo, intellectual property, etc.\n", "- Data Preservation : storage, sensitivity of the data, archiving, etc.\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 2 - CHILI Specific Topics\n", "\n", "* Ethics, legal aspects, anonymization \n", "* Reproducibility\n", "* Collaborative coding and writing\n", "* Computational workflows\n", "* (Meta)data formats\n", "* Publication and long term preservation\n", "* Data visualization\n", "\n", "![.](Images/tools.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "slide" } }, "source": [ "## 2.1 - Ethics" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.1.1. Do you work with personal, sensitive data ?\n", - "\n", - "![](Images/question.png)\n", - "\n", - "- Does your research practice involve collecting, processing and storing information on persons?\n", - " - ... identifiable persons ?\n", - " - ... vulnerable persons ?\n", - " - ... children ?" - ] - }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "fragment" + "slide_type": "subslide" } }, "source": [ - "- How do you inform persons/subjects on what you will be doing ?" + "### 2.1.1. When human beings are involved...\n", + "\n", + "![Human Beings](Images/humanbeing.png)\n", + "\n", + "**Ethics issues arise in many areas of research**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "- What data do you typically use (collect, process, store) in the course of a research project ?" + "Research involving the voluntary participation of research subjects and the collection of **data that might be considered as personal**. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "- Among these data which ones are personal ? Sensitive ?" + "You must protect your **volunteers, yourself and your researcher colleagues**." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "- Do you need to identify the subject/person ?" + "- Does your research practice involve collecting, processing and storing information on persons?\n", + " - ... identifiable persons ?\n", + " - ... vulnerable persons ?\n", + " - ... children ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "- If disclosed, do the data you collect lead to the dissemination of personal information ?" + "- How do you inform persons/subjects on what you will be doing ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "fragment" + "slide_type": "subslide" } }, "source": [ - "- Do subjects/persons sometimes ask you for their performance ? The data you collected about them ?" + "#### Human Research Ethics Committee at EPFL (HREC)\n", + "\n", + "The role of the HREC is to **review any research project carried out at EPFL involving non-invasive human research** from an ethical point of view, before the beginning of the project.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "fragment" + "slide_type": "subslide" } }, "source": [ - "- How and to whom the data will be disseminated ?" + "#### Collecting consent \n", + "\n", + "- Simple, understandable, in a language adapted to their age information\n", + "- See form on [Research Office Ethics Assessment](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "If you answered **yes** to any of the above question, ethical and legal issues apply.\n", + "**Sources:**\n", "\n", - "You should check the Research Office Checklist: [Research Office Ethics Assessment](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment)." 
+ "* [H2020 Programme Guidance : How to complete your ethics self assessment](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf), 12th July 2016. Page 1.\n", + "* http://research-office.epfl.ch/op/edit/page-117394.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "### 2.1.2. When human beings are involved...\n", - "\n", - "![Human Beings](Images/humanbeing.png)\n", + "### 2.1.2. Data ? What data ? Personal data ? Sensitive data ?\n", "\n", - "For instance in such cases : \"...collection of personal data, interviews, observations, questionnaires, recordings, tracking or the secondary use of information provided for other purposes, e.g. social media sites, other research projects etc.\n", - "\n", - "In such cases the Human Research Ethics Committee at EPFL (HREC) should be consulted. \n", + "![](Images/personaldata.png)\n", "\n", - "The role of the HREC is to review any research project carried out at EPFL involving non-invasive human research from an ethical point of view, before the beginning of the project.\"\n", + "**personal data (data)**\n", "\n", - "http://research-office.epfl.ch/op/edit/page-117394.html" + "* all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)\n", + "* examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "### 2.1.3. Data ? What data ? 
Personal data ?\n", - "\n", - "![](Images/personaldata.png)\n", - "\n", - "**personal data (data)**\n", - "\n", - "* all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)\n", - "* examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these.\n", - "\n", "**sensitive personal data**\n", "\n", "According to the Swiss FADP (article 3 c.) data on: \n", "\n", "1. religious, ideological, political or trade union-related views or activities,\n", "2. **health, the intimate sphere or the racial origin**,\n", "3. social security measures,\n", "4. administrative or criminal proceedings and sanctions;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "### 2.3.1. Doing what with data ?\n", - "\n", - "\n", - "![](Images/dataanalysis.png)\n", - "\n", - "#### Collecting consent \n", - "\n", - "- Simple, understandable, in a language adapted to their age information\n", - "- See form on [Research Office Ethics Assessment](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment)." + "- What data do you typically use (collect, process, store) in the course of a research project ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "fragment" } }, "source": [ - "##### Processing\n", - "\n", - "**Swiss FADP, article 3 e.**: \n", - "any operation with personal data, irrespective of the means applied and the procedure, and in particular the collection, storage, use, revision, disclosure, archiving or destruction of data;\n", - "\n", - "Notably:\n", - "* carried out in good faith\n", - "* only for the purpose indicated at the time of collection (...) \n", - "* consent must be given expressly in the case of processing of sensitive personal data or personality profiles." 
+ "- Among these data which ones are **personal** ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "notes" + "slide_type": "fragment" } }, "source": [ - "##### Correcting\n", - "* Anyone who processes personal data must make certain that it is correct. He must take all reasonable measures to ensure that data that is incorrect or incomplete in view of the purpose of its collection is either corrected or destroyed.\n", - "* Any data subject may request that incorrect data be corrected.\n", - " " + "- Among these data which ones are **sensitive** ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "##### Right to information\n", - "\n", - "\n", - "Any person may request information from the controller of a data file as to whether data concerning them is being processed.\n", - " \n", - "- of all available data concerning the subject (...), \n", - "- including the available information on the source of the data (...) as well as the categories of the personal data processed, the other parties involved with the file and the data recipient.\n", - "- (...) The information must normally be provided in writing, in the form of a printout or a photocopy, and is free of charge. \n", + "If you work with personal or sensitive data,\n", "\n", - "Swiss FADP Article 8.\n" + "you should check the Research Office Checklist: [Research Office Ethics Assessment](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment), especially the [**checklists**](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment/ethical-issues-checklists) (login with Gaspar)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "##### Protecting\n", + "### 2.1.3 Doing what with data ?\n", "\n", - "(Swiss FADP, article 7) Personal data must be protected against unauthorised processing through adequate technical and organisational measures.\n", "\n", - "**Technical measures : notably it is forbidden to store personal data in countries that are not compatible with Swiss law, such as the US**. \n", - "\n", - "This excludes the usage of many clouds: Dropbox, Google Drive, Microsoft Azure, Amazon S3... \n" + "![](Images/dataanalysis.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "### 2.1.4. Disclosing personal data\n", + "##### Personal or sensitive data processing\n", "\n", + "**Swiss [Federal Act on Data Protection](https://www.admin.ch/opc/en/classified-compilation/19920153/index.html) (FADP) (or Loi sur la Protection des Données LPD), article 3 e.**: \n", + "any operation with personal data, irrespective of the means applied and the procedure, and in particular:\n", + "* the collection, \n", + "* storage, \n", + "* use, \n", + "* revision, \n", + "* disclosure, \n", + "* archiving \n", + "* or destruction \n", "\n", - "Personal Data collection and processing implies compliance with the law on privacy and data protection: " + "of data;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "**Disclosure**\n", + "**Processing data should notably:**\n", "\n", - "(Swiss FADP article 3 f.): making personal data accessible, for example:\n", - "* by permitting access,\n", - "* transmission\n", - "* or publication." + "* be carried out in good faith\n", + "* only for the purpose indicated at the time of collection [...] 
\n", + "* consent must be given expressly in the case of processing of sensitive personal data or personality profiles.\n", + "* be accurate (and corrected or destroyed if required) (FAPLD article 5)\n", + "* Any person may request information from the controller of a data file as to whether data concerning them is being processed (FAPD article 8)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "notes" } }, "source": [ - "**Cross-border disclosure**\n", + "- Do you need to identify the subject/person ?\n", "\n", - "Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection. \n", + "- If disclosed, do the data you collect lead to the dissemination of personal information ?\n", "\n", - "Art. 61Cross-border disclosure\n", - "Personal data must be protected against unauthorised processing through adequate technical and organisational measures. \n", - " " + "- How and to whom the data will be disseminated ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "#### Anonymisation\n", + "### 2.1.4. Protecting and disclosing personal data\n", + "\n", + "#### Protection\n", "\n", - "3. Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if:\n", - "a. the data is rendered anonymous, as soon as the purpose of the processing permits;\n", - "b. the recipient only discloses the data with the consent of the federal body and\n", - "c. the results are published in such a manner that the data subjects may not be identified." 
+ "Personal data must be protected against unauthorised processing through adequate technical and organisational measures (Swiss FADP, article 7).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "notes" + "slide_type": "subslide" } }, "source": [ - "**Recapitulation**\n", + "**Disclosure**\n", "\n", - "- Autorisation and information to provide and DMP\n", - "* Collect consent\n", - "* Inform participants : sample information sheet\n", - "* Autorisations" + "Making personal data accessible, for example:\n", + "* by permitting access,\n", + "* transmission\n", + "* or publication.\n", + "\n", + "(Swiss FADP article 3 f.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "**References**\n", + "**Cross-border disclosure**\n", "\n", - "* Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014) Federal law on data protection] (235.1).\n", + "Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection. \n", "\n", - "* Directive 95/46/EC of the European Parliament & of the Council, of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 
31).\n", - " * [Directive 95/46/EC](http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=URISERV%3Al14012)\n", - " * [H2020 Program Guidance : how to compleate your ethics self assessment](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf), 12.7.2016\n", - "* Information\n", - " * http://research-office.epfl.ch/ethique-recherche/research-ethics-assessment/ethical-review/personal-data" + "Cross-border disclosure of personal data must be protected against unauthorised processing through adequate technical and organisational measures. \n", + "\n", + "(FDAP Art. 6)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "notes" + "slide_type": "subslide" } }, "source": [ - "Anonymisation\n", - "* http://stackoverflow.com/questions/39337242/anonymize-names-in-paragraph-variable-by-matching-and-replacement\n", - "* http://stackexchange.com/sites#\n", - "* http://stackoverflow.com/questions/23795406/how-to-convert-an-existing-project-to-a-reproducible-research-rr-project-using\n", - "* http://stackoverflow.com/questions/31201685/reproducibility-in-scientific-computing/32104277#32104277\n", - "* http://stackoverflow.com/questions/34536197/alternative-approach-to-reproducible-research-where-source-code-is-the-primary-m" + "#### Anonymisation\n", + "\n", + "Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if:\n", + "* the data is rendered anonymous, as soon as the purpose of the processing permits;\n", + "* the recipient only discloses the data with the consent of the federal body and\n", + "* the results are published in such a manner that the data subjects may not be identified." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, + "source": [ + "**References**\n", + "\n", + "* Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014) Federal law on data protection] (235.1).\n", + "\n", + "* Directive 95/46/EC of the European Parliament & of the Council, of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 31).\n", + " * [Directive 95/46/EC](http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=URISERV%3Al14012)\n", + " * As of 2018: [REGULATION (EU) 2016/679 repealing Directive 95/46/EC](http://eur-lex.europa.eu/legal-content/de/TXT/?uri=CELEX%3A32016R0679)\n", + " * [H2020 Program Guidance : how to compleate your ethics self assessment](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf), 12.7.2016\n", + "* Information\n", + " * http://research-office.epfl.ch/ethique-recherche/research-ethics-assessment/ethical-review/personal-data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, "source": [ "## 2.2 - Anonymization methods\n", "\n", "Privacy protection methods, either :\n", "\n", "* removing,\n", "* generalizing or\n", "* encrypting,\n", "\n", "personal information from datasets." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In passing, there is more to this (Privacy-Preserving Data Mining Methods / Charu Affarwal and Philip Yu. 
2008.):\n", "\n", "![Privacy-Preserving_Data_Mining__Methods.png](Images/Privacy-Preserving_Data_Mining__Methods.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### k-anonymity\n", "\n", "\n", "##### Definition\n", "\n", "\"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release\" ([Source](https://en.wikipedia.org/wiki/K-anonymity))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Illustration\n", "\n", "Example including removal and generalization (same source):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|-----------|-----|--------|-------------------|-----------|-----------------|\n", "| Ramsha | 29 | Female | Tamil Nadu | Hindu | Cancer |\n", "| Yadu | 24 | Female | Kerala | Hindu | Viral infection |\n", "| Salima | 28 | Female | Tamil Nadu | Muslim | TB |\n", "| Sunny | 27 | Male | Karnataka | Parsi | No illness |\n", "| Joan | 24 | Female | Kerala | Christian | Heart-related |\n", "| Bahuksana | 23 | Male | Karnataka | Buddhist | TB |\n", "| Rambha | 19 | Male | Kerala | Hindu | Cancer |\n", "| Kishor | 29 | Male | Karnataka | Hindu | Heart-related |\n", "| Johnson | 17 | Male | Kerala | Christian | Heart-related |\n", "| John | 19 | Male | Kerala | Christian | Viral infection |\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "After anonymization (name and religion were removed, age was generalized):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|------|---------------|--------|-------------------|----------|-----------------|\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | Cancer |\n", "| * | 20 < Age ≤ 30 | Female | Kerala | * | Viral infection |\n", 
"| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | TB |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | No illness |\n", "| * | 20 < Age ≤ 30 | Female | Kerala | * | Heart-related |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | TB |\n", "| * | Age ≤ 20 | Male | Kerala | * | Cancer |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | Heart-related |\n", "| * | Age ≤ 20 | Male | Kerala | * | Heart-related |\n", "| * | Age ≤ 20 | Male | Kerala | * | Viral infection |\n", "\n", "This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile', since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### l-diversity - motivation\n", "\n", "An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:\n", "* **homogeneity attacks**: when all rows of a group share the same sensitive value,\n", "* **background knowledge attacks**: when knowledge about a field reduces the set of possible sensitive values (e.g. knowing that heart attacks are not frequent in Japanese patients) ([source](https://en.wikipedia.org/wiki/K-anonymity)). \n", "\n", "Imagine the following group, or equivalence class, extracted from the whole dataset (table adapted from the one above):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|------|---------------|--------|-------------------|----------|-----------------|\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "\n", "If it is known that Miss Smith took part in the study, is aged between 20 and 30, and lives in Tamil Nadu, then it is certain that she has AIDS, even though the data has 3-anonymity." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### l-diversity - definition\n", "\n", "**The l-diversity Principle**: An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.\n", "\n", "There are several definitions of \"well-represented\" ([source](https://en.wikipedia.org/wiki/L-diversity)).\n", "\n", "By the way, l-diversity has weaknesses too, which is why **t-closeness** was invented." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "##### t-closeness - motivation\n", "\n", "The l-diversity requirement ensures “diversity” of sensitive values in each group, but it does not recognize that values may be semantically close. For example, an attacker could deduce that a stomach disease applies to an individual if a sample containing the individual only listed three different stomach diseases (adapted from [source](https://en.wikipedia.org/wiki/T-closeness))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "##### t-closeness - definition\n", "\n", "**The t-closeness Principle**: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness ([source](https://en.wikipedia.org/wiki/T-closeness))." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### differential privacy\n", "\n", "**Re-identification by linking with another database**: linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1* ([source](https://en.wikipedia.org/wiki/Differential_privacy))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "slide" + "slide_type": "subslide" } }, "source": [ "#### Anonymization - theory and tools\n", "\n", "![](Images/sdc.jpg)\n", "\n", "Statistical Disclosure Control / Hundepool et al. 2012.\n", "\n", "Ebook [provided by the EPFL library](http://proquest.safaribooksonline.com/9781118348215). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Tools\n", "* **sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation (R package)**\n", "* ARX Data Anonymization Tool (Java: library & GUI)\n", "* μ-ARGUS (Java, GUI)" ] }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## 2.3 - Reproducibility" + ] + }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "## 2.3 - Reproducibility\n", "According to a 2012 study published in Nature, **47 out of 53** landmark preclinical cancer research papers could not be reproduced (1)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A previous study showed in 2009 that **16 out of 18 bioinformatics papers could not be reproduced** entirely (2)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In 2004, it was found that less than **9% of papers share their code** (3)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "(1) Begley, C. G.; Ellis, L. M. (2012). \"Drug development: Raise standards for preclinical cancer research\". Nature 483 (7391): 531–533.
(2) Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.
(3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. \"Reproducible research in signal processing.\" Signal Processing Magazine, IEEE 26.3 (2009): 37-47.

\n", "\n", "[Slide inspired by https://github.com/saloot/IPythonClass , Amir Hessam Salavati & Robin Schiebler, 2015]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### A workflow for reproducible research\n", "\n", "Researchers often start to think about reproducibility at the end of projects. By then it is sometimes too late: numerous versions of code and datasets may be spread over various places (folders, Dropbox, USB drives...). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A practical five-point approach:\n", "\n", "1. document everything \n", "2. everything is a (text) file\n", "3. files should be human readable\n", "4. explicitly tie your files together\n", "5. have a plan to organize, store and make your files available\n", "\n", "Slide inspired by chapter 2 of *Reproducible Research with R and RStudio*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "More details:\n", "\n", "* document everything \n", " * reproduction requires documentation of what you did" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* everything is a text file\n", " * notably: data, code and results\n", " * the simplest formats are the best: CSV / JSON, Markdown / $\\LaTeX$, because they are future-proof" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* files should be human readable\n", " * treat all files as if someone who does not know the project will have to use them\n", " * otherwise they (or you 6 months later) will probably not understand them\n", " * important elements to document: \n", " * description of what the file is or does (overall description and local comments)\n", " * contributors\n", " * date of last update" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } 
}, "source": [ "* explicitly tie your files together, including generated documents\n", " * locally or using persistent identifiers\n", " * formalize the way data is processed\n", " * otherwise results are generally difficult to trace back (e.g.: how was a specific figure generated?)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* have a plan to organize, store and make your files available" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "slide" } }, "source": [ "## 2.4 - Collaborative tools\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.1 - File sharing\n", "\n", "![.](Images/owncloud.png)\n", "\n", "- Personal/group level: ownCloud, free software: Mac, Windows, Linux, iOS, Android... Web.\n", " - Your own server: ownCloud https://owncloud.org/\n", " - Many plugins: contacts, calendar, collaborative writing, image galleries, etc.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Swiss level: SwitchDrive https://drive.switch.ch/\n", " - ownCloud with 25 GB per user, \n", " - Restricted to members of Swiss universities.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A recent fork of ownCloud: [NextCloud](https://nextcloud.com/) aims for more transparent development processes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.2 - Collaborative writing\n", "#### File sharing is not enough\n", "\n", "People increasingly need to collaborate at a finer level.\n", "\n", "![...](Images/Vandergheynst_Collaborative.png)\n", "Source: Prof. Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![...](Images/Vandergheynst_IncrCollab.png)\n", "Source: Prof. Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![...](Images/Vandergheynst_Versions.png)\n", "Source: Prof. Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**In summary**\n", "\n", "**Word processor** comment and revision-mode features are not sufficient for good collaboration.\n", "\n", "**Google Docs** and related tools are not oriented towards scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.\n", "\n", "**$\\Rightarrow$ we need something else!**\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### ShareLaTeX\n", "\n", "**[ShareLaTeX](https://de.sharelatex.com/)** is an alternative to Authorea: collaborative writing based on LaTeX. Suited for LaTeX power users. ![.](Images/ShareLaTeX.png)\n", "\n", "Good, but only if all partners are LaTeX users." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Authorea\n", "\n", "**[Authorea](https://www.authorea.com/)**: collaborative writing, easy to use.\n", "\n", "![Authorea](Images/Authorea.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Free account to test (limited to 1 private document, no limits on public documents). EPFL licence provided by the Library." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Simple syntax: WYSIWYG and Markdown (lightweight text formatting language). More complex formatting possible using LaTeX " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Enables others to make comments" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Supports interactive documents / figures (Jupyter)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Offline synchronization on personal computer (using the Git version control system) \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "slide" + "slide_type": "subslide" } }, "source": [ - "### 2.4.3 - Collaborative coding\n", + "### 2.4.3 - Collaborative versioning and branching\n", "\n", "#### Git\n", "\n", "![.](Images/git.png)\n", "\n", "Git is a **multi-platform** (Windows, Mac, GNU/Linux) version control tool.\n", "\n", "Git Servers\n", "* [GitHub](https://github.com/), very popular, data hosted in the US. Closed repositories are limited (payment or subject to other conditions).\n", "* [c4science](https://c4science.ch/) is the Swiss collaborative development platform. Unlimited number of repositories (open / closed). \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Git workflows\n", "\n", "Git will not, however, do everything for you.\n", "\n", "- You need to think up a naming convention (folder structure, file names) e.g.\n", " - PROJECT-Experiment-Researcher(ORCID)-YYYYMMDD.extension\n", " - PROJECT-Experiment-Researcher(ORCID)-Software-Format-YYYYMMDD.extension\n", " - PROJECT-Experiment-Researcher(ORCID)-Software-Version-Format-YYYYMMDD.extension\n", "- Set up an appropriate workflow." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Locally\n", "\n", "![Locally](Images/git_workdir_staging_repro_falcone.png)\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The easiest way is to use a centralized repository.\n", "\n", "![Centralized](Images/git_centralized_falcone.png)\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For more complex projects, a project leader can manage the quality.\n", "\n", "![Centralized](Images/git_project_leader_falcone.png)\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For big projects, it is possible to distribute responsibilities.\n", "\n", "![Centralized](Images/git_dispatched_responsabilities_falcone.png)\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Non-linear development is supported through branches. \n", "\n", "![Centralized](Images/git_branches_falcone.png)\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Git and GitHub are not suited for long-term preservation\n", "\n", "![.](Images/git.png)\n", "\n", "* Some git commands can delete data (namely: *rebase* and *reset --hard*)\n", "* Repositories can be deleted (including on GitHub)\n", "* A GitHub $\\Rightarrow$ Zenodo link can be set up, so that each release is automatically made citable through a DOI and preserved in Zenodo.\n", "\n", "Guide: [Making your code citable](https://guides.github.com/activities/citable-code/)\n", "\n", "![.](Images/zenodo-logo.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.4 - Jupyter, Jupyterhub, Sagemath\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Jupyter\n", "\n", "Interactive **Jupyter Notebook** documents: [try.jupyter.org](http://try.jupyter.org)\n", "\n", "![.](Images/jupyterpreview.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Structure:\n", "\n", "* Rich hypertext cells (including tables, $\\LaTeX$, images, videos)\n", "* Live code cells (with interactive widgets)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Characteristics \n", "\n", "* Over 50 languages supported: Python, R, Octave, BASH, Matlab, Scala, Java, Haskell..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Can be viewed online using [nbviewer](http://norvig.com/ipython/Economics.ipynb) (e.g.: http://norvig.com/ipython/Economics.ipynb ).\n", " * Nbviewer is integrated in GitHub and Zenodo" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Jupyter Notebooks are JSON files $\\rightarrow$ can be tracked with Git." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "* Nbconvert allows conversion to many formats, including Python:\n", " * jupyter nbconvert notebook.ipynb --to python\n", " * jupyter nbconvert notebook.ipynb --to latex\n", " * jupyter nbconvert notebook.ipynb --to markdown\n", " * jupyter nbconvert notebook.ipynb --to slides\n", " * jupyter nbconvert notebook.ipynb --to html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Executing from the command line:\n", " * jupyter nbconvert --to notebook --execute mynotebook.ipynb\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Powerful Python libraries\n", "\n", "- [Pandas](http://pandas.pydata.org/) is a powerful library providing high-performance, easy-to-use data structures and data analysis tools. [Examples](http://pandas.pydata.org/pandas-docs/stable/visualization.html).\n", "- [NumPy](http://www.numpy.org/) is the fundamental package for scientific computing with Python:\n", " - N-dimensional array object\n", " - sophisticated (broadcasting) functions\n", " - tools for integrating C/C++ and Fortran code\n", " - useful linear algebra, Fourier transform, and random number capabilities\n", "- [Matplotlib](http://matplotlib.org/) is a plotting library with great flexibility. It has features comparable to Matlab plotting. [Examples](http://matplotlib.org/gallery.html).\n", "- [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) relies on Pandas (see above). [Examples](https://stanford.edu/~mwaskom/software/seaborn/examples/).\n", "- [NetworkX](https://networkx.github.io/) is suited for complex network analysis and representation. 
[Examples](http://networkx.github.io/documentation/latest/gallery.html).\n", "- [rpy2](http://rpy2.bitbucket.org/) is an interface to R running embedded in a Python process. \n", "\n", "![.](Images/python.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### And web libraries \n", "\n", "* [Bokeh](http://bokeh.pydata.org/en/latest/) is a Python interactive visualization library that targets modern web browsers for presentation. \n", "* [D3.js](https://d3js.org/) is an open source JavaScript library for creating interactive documents based on data. D3 helps bring data to life using HTML, SVG, and CSS. It can be used in Jupyter with matplotlib via [mpld3](http://mpld3.github.io/). \n", "\n", "\n", "![.](Images/d3.js.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Jupyterhub\n", "\n", "* Multi-user Jupyter server (system users, or via GitHub)\n", "* Collaborate in a local folder, by mounting a VPSI share, or with Git.\n", "\n", "![.](Images/jupyterhublogo.svg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.5 - R, RStudio and RStudio server\n", "\n", "#### R\n", "\n", "R is a free software environment for statistical computing and graphics. [One of the best](https://en.wikipedia.org/wiki/R_(programming_language)).\n", "\n", "\n", "Platforms:\n", "* wide variety of GNU/Linux and UNIX platforms, \n", "* Windows\n", "* MacOS\n", "\n", "Strength: the diversity of high-quality open extensions (easily installable from [CRAN](https://cran.r-project.org/))." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### RStudio\n", "\n", "RStudio is a free and open-source integrated development environment (IDE) for R.\n", "\n", "![.](Images/Rstudio_Shiny_OtherPackages.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### R and reproducible research\n", "\n", "![.](Images/RStudio_ReprodResearch.png)\n", "\n", "#### Reproducible research and documents\n", "* *knitr* and *rmarkdown*\n", "* tying together results and their presentation in articles (PDF, Word), presentations or web sites \n", "* notably in $\\LaTeX$ (.Rtex) or Markdown (.Rmarkdown)\n", "* well integrated in RStudio\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Rmarkdown\n", "\n", "Include R code chunks in markdown:\n", "\n", "    # Prime numbers\n", "    \n", "    Storing a few prime numbers in a variable:\n", "\n", "    ```{r}\n", "    primes <- c(2,3,5,7,11,13)\n", "    ```\n", "    Done." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "First you need to set up the document properties in YAML:\n", "\n", "    ---\n", "    title: \"Rmarkdown example\"\n", "    author: \"Jan Krause\"\n", "    date: \"24 novembre 2016\"\n", "    output: pdf_document\n", "    ---\n", "\n", "    # Prime numbers\n", "    \n", "    Storing a few prime numbers in a variable:\n", "\n", "    ```{r}\n", "    primes <- c(2,3,5,7,11,13)\n", "    ```\n", "    Done." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "#### RStudio Server\n", "\n", "RStudio in your browser.\n", "\n", "![.](Images/rstudio-server.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "slide" } }, "source": [ "## 2.5 - Computational workflow management\n", "\n", "Scientific results are often the outcome of complex workflows. The computational steps form a graph, which may be difficult to reproduce.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.5.1 - AiiDA\n", "\n", "AiiDA, a free software package, has been developed at EPFL (in materials science): http://www.aiida.net/\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.5.2 - Snakemake: a simple tool\n", "\n", "* simple: nodes are connected through files (inspired by GNU Make)\n", "* complete:\n", " * supports remote files (http(s), sftp, dropbox, googledrive)\n", " * handles data provenance and rule versions, \n", " * parallelization, \n", " * suspend/resume, \n", " * logging, \n", " * generates a graph of the workflow\n", "* flexible: the Snakefile syntax is an extension of Python\n", "* http://snakemake.bitbucket.org/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Simple Rule:**\n", "\n", "\n", "    rule sort:\n", "        input:\n", "            f = \"path/to/dataset.txt\"\n", "        output:\n", "            f = \"dataset.sorted.txt\"\n", "        shell:\n", "            \"sort {input.f} > {output.f}\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Simple Rule (two inputs):**\n", "\n", "\n", "    rule sort:\n", "        input:\n", "            f1 = \"dataset1.txt\",\n", "            f2 = \"dataset2.txt\"\n", "        output:\n", "            f = \"dataset.sorted.txt\"\n", "        shell:\n", "            \"sort {input.f1} {input.f2} > {output.f}\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Simple Rule (here in Python, but R scripts are supported too):**\n", "\n", "\n", "    rule sort:\n", "        input:\n", "            a=\"path/to/dataset.txt\"\n", "        output:\n", "            b=\"dataset.sorted.txt\"\n", "        run:\n", "            with open(input.a) as src, open(output.b, \"w\") as out:\n", "                for line in sorted(src):\n", "                    out.write(line)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**More than one rule:**\n", "\n", "\n", "    rule result:\n", "        input:\n", "            'result.txt'\n", "\n", "    rule generate_cal_2017:\n", "        input:\n", "            ()\n", "        output:\n", "            fname = \"tmp/cal.txt\"\n", "        shell:\n", "            \"cal 2017 > {output.fname}\"\n", "\n", "    rule describe:\n", "        input:\n", "            fname1 = \"DESCRIPTION.txt\",\n", "            fname2 = \"tmp/cal.txt\"\n", "        output:\n", "            fname = \"result.txt\"\n", "        shell:\n", "            \"cat {input.fname1} {input.fname2} > {output.fname}\"\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Expand (running rules in parallel):**\n", "\n", "    DATASETS = [\"D1\", \"D2\", \"D3\", \"D4\", \"D5\", \"D6\"]\n", "\n", "    rule all:\n", "        input:\n", "            expand(\"{dataset}.sorted.txt\", dataset=DATASETS)\n", "\n", "    rule sort:\n", "        input:\n", "            \"{dataset}.txt\"\n", "        output:\n", "            \"{dataset}.sorted.txt\"\n", "        shell:\n", "            \"sort {input} > {output}\"\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Output: example graph**\n", "\n", "![](Images/snakemake_workflow.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Output: log (simplified)**\n", "\n", "| output_file | date | rule | version |\n", "|-----------------------|--------------------------|---------------|----------|\n", "| result.txt | Fri Nov 11 15:48:17 2016 | cleanup | 3.14 |\n", "| tmp/pre-result.txt | Fri Nov 11 15:48:17 2016 | add_head_foot | 1.02 |\n", "| tmp/FOOT.txt | Fri Nov 11 15:48:17 2016 | generate_foot | 5.6 |\n", "| tmp/HEAD.txt | Fri Nov 11 15:48:17 2016 | generate_head | 5.6 |\n", "| tmp/described_cal.txt | Fri Nov 11 15:48:17 2016 | describe | 0.1alpha |\n", "| tmp/cal.txt | Fri Nov 11 15:48:17 2016 | generate_cal | 8.234 |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "**More about workflows**\n", "\n", "Another tool is Taverna, which includes the desktop-oriented [Taverna Workbench](https://taverna.incubator.apache.org/download/) (multi-platform and open source) as well as command-line and server applications.\n", "\n", "Finally, **myExperiment** is a platform for sharing scientific workflows, and is notably fully supported by Taverna." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "**EPFL platforms**\n", "\n", "- **EPFL SV** sLIMS\n", " - http://sv-it.epfl.ch/slims\n", " - Gaël Anex, Nicolas Argento, Peter Hliva\n", " - Laboratory information management system ![.](Images/SLims.png)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- **EPFL SCITAS** (Victoria Rezzonico)\n", " - High-performance computing and data storage ![.](Images/scitas.png)" ] }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "## 2.6 - Data and Storage" + ] + }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "## 2.6 - Data and Storage\n", - "\n", "### 2.6.1 - (Meta)data formats" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Metadata\n", "\n", "- Most common generalist metadata formats: [Dublin Core (DCES)](http://dublincore.org/documents/dces/), [Dublin Core (DCMI)](http://dublincore.org/documents/dcmi-terms/), [DataCite Metadata Schema](https://schema.datacite.org/). 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Numerous specialized metadata formats are available for most disciplines; the Research Data Alliance [Metadata Directory](http://rd-alliance.github.io/metadata-directory/) is a good starting point.\n", "\n", "![.](Images/MetadataDirectory.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data format\n", "\n", "Prefer a format that is:\n", "\n", "- **standard**,\n", "- **open** and\n", "- **widely used** \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This way your data will not depend upon a particular piece of software (or company), operating system, or platform. And you will be able to:\n", "- collaborate with more people (on various platforms)\n", "- avoid licensing problems\n", "- maximize reusability in the future" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Some open formats to take into account\n", "- Portable Document Format **PDF/A, ISO standard**, text [PDF for archiving, no encryption, embedded fonts...]\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Text**: a simple way to encode data. Can be read by most software.\n", " - CSV tables can be read by most software, and extended using [CSV on the Web](https://www.w3.org/standards/techs/csv) (metadata, datatypes, relations...)\n", " - JSON: simply structured, less bulky than XML, ideal for data exchange." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Geodata**\n", " * [ISO 19115-1:2014](http://www.iso.org/iso/catalogue_detail.htm?csnumber=53798): the standard.\n", " * [GeoJson.org](http://geojson.org/): lighter." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **HDF5**, more flexible (not text, but structured and indexed, supports arbitrary metadata, good performances).\n", " - Compatible with many tools (Python, R, Matlab, Mathematica...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Databases:** \n", " - SQL: [Postgresql](https://www.postgresql.org/) is relational, open and efficient\n", " - BigData: [MongoDB](https://www.mongodb.com/) for volume, velocity, and variety" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data formats list\n", "\n", "Sustainability of digital formats by the US Library of Congress. [This list](http://www.digitalpreservation.gov/formats/) is categorized by datatypes (text, audio, image, video, geospacial, dataset, etc.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.6.2 - Storage, publication and preservation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Data access sustainability\n", "\n", "A Plos One study showed in 2014 that **more than 60% of links to datasets are broken after 10 years** (1)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Another Plos One 2014 article showed that **the bibliography of 1 out of every 5 is impacted by that phenomenon** (2)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "(1) Pepe et al. (2014). How Do Astronomers Share Data? Reliability and Persistence of Datasets\n", "Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers.\n", "PLoS ONE, 9(8). doi:10.1371/journal.pone.0104798
\n", "(2) Klein et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference\n", "Rot. PLoS ONE, 9(12). doi:10.1371/journal.pone.0115253
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Digital preservation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "At CERN, a 2007 studies (1,2) showed that the error ratio was of $10^{-7}$ (over 2 months).\n", "\n", "Causes are complex and varied: disk errors, RAID errors, memory errors, etc.\n", "\n", "For 1 Gigabyte (1000 Mégabytes), we have:\n", "$10^9 \\cdot 10^{-7} = 10^2 = 100$ bytes of bitrot." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "\n", "(1) https://indico.cern.ch/event/13797/session/0/contribution/3/attachments/115080/163419/Data_integrity_v3.pdf
\n", "(2) http://www.zdnet.com/article/data-corruption-is-worse-than-you-know/\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![Preservation vs. Backup](Images/Preservervation_vs_Storage.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### EPFL storage options\n", "\n", "![.](Images/epfl_logo.png)\n", "\n", "EPFL offers many storage options, as described on the VPSI page [Databases, Storage and Virtualization](https://it.epfl.ch/business_service.do?sysparm_document_key=cmdb_ci_service,90cbd58e0ff121009f8579f692050eb7&sysparm_service=Bases_de_donnees_et_Stockage_Serveurs).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**EPFL Storage Prices 2016**\n", "\n", "![Prices Table](Images/EPFL_Storage_Prices_2015-2017.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Why publish in a data archive?\n", "\n", "**Accelerate science and careers**\n", "\n", "Many studies show there are significant advantages for articles that share their code or data.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Avoid bias in science \n", "\n", "\n", "![FDA Turner](Images/FDA_Turner.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Machine learning needs\n", "\n", "![Barend Mons : Field](Images/MonsField.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "![Barend Mons : Field](Images/MonsHelicopter.png)\n", "\n", "\n", "Machine learning is a promising discipline, but it requires access to data. 
Data mining is not a viable solution.\n", "\n", "Source: Barend Mons, IDCC, Amsterdam 2016." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data repositories\n", "\n", "- **Zenodo** (hosted by CERN, free) http://zenodo.org\n", " - either EPFL or CHILI community\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Other data repositories\n", "\n", "- **Dryad** («curated», non-profit organisation, partnership with publishers) http://datadryad.org/\n", "\n", "- **Figshare** (commercial, belongs to Macmillian [as does NPG]) http://figshare.com/\n", "\n", "- For more information see **[re3data](http://re3data.org)** in which more than 1'500 data repositoris are described." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data Citation\n", "\n", "- Always use persistent identifiers to avoid broken links (about 60% after 10 years)\n", "- The most common persistent identifier is the DOI (digital object identifier)\n", " - e.g.: http://doi.org/10.5281/zenodo.7525\n", "- Zenodo, Figshare, Dryad and Infoscience can provide DOIs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "slide" } }, "source": [ "## 2.7 - Licences\n", "\n", "A licence allows to define the way your data can be reused. For instance:\n", "\n", "\n", "Creative Commons (**CC0** and **CC-BY**) http://creativecommons.org/ Since CC4.0, sui generis law protecting database content is taken into account (in addition to the form protected by copyright) https://wiki.creativecommons.org/wiki/Data\n", "\n", "![.](Images/CCbyncsa_others.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "slide" } }, "source": [ "
You can contact us in the future here: \n", "


\n", "datamanagementplan@epfl.ch
\n", "\n", "


\n", "\n", "
We look forward to hearing from you!
\n", "\n", "


\n", "\n", "
Aude, Jan, Karine and Nathalie
\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }