diff --git a/OptimizeResearchDataManagement.html b/OptimizeResearchDataManagement.html index 7cef179..6acc6f6 100644 --- a/OptimizeResearchDataManagement.html +++ b/OptimizeResearchDataManagement.html @@ -1,14296 +1,14302 @@
Many publishers and scientific journals require, under specific conditions, the publication of the data used to obtain the research results (permanent archiving, standardized formats, etc.). This is the case, for instance, with PLoS and Nature Publishing Group. An overview of the editorial policies is available online on the Dryad website.
Examples of funders which require DMPs or equivalent:
Research data management policy to be established in 2017
Submission of data management plans with the grant application
Accelerate science and careers
Many studies show there are significant advantages for articles that share their code or data.
Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149
Machine learning is a promising discipline, but it requires access to data. Datamining is not a viable solution.
Source: Barend Mons, IDCC, Amsterdam 2016.
To provide guidance in preparing a DMP, the EPFL-ETHZ checklist includes four categories to cover questions related to:
Ethics issues arise in many areas of research.
Research involving the voluntary participation of research subjects and the collection of data that might be considered as personal.
You must protect your volunteers, yourself and your researcher colleagues.
The role of the HREC is to review any research project carried out at EPFL involving non-invasive human research from an ethical point of view, before the beginning of the project."
Sources:
If you answered yes to any of the above questions, ethical and legal issues apply.
-You should check the Research Office Checklist: Research Office Ethics Assessment.
+personal data (data)
+For instance in such cases : "...collection of personal data, interviews, observations, questionnaires, recordings, tracking or the secondary use of information provided for other purposes, e.g. social media sites, other research projects etc.
-In such cases the Human Research Ethics Committee at EPFL (HREC) should be consulted.
-The role of the HREC is to review any research project carried out at EPFL involving non-invasive human research from an ethical point of view, before the beginning of the project."
-http://research-office.epfl.ch/op/edit/page-117394.html
+sensitive personal data
+According to the Swiss FADP (article 3 c.) data on:
+personal data (data)
sensitive personal data
-According to the Swiss FADP (article 3 c.) data on:
-Swiss FADP, article 3 e.: -any operation with personal data, irrespective of the means applied and the procedure, and in particular the collection, storage, use, revision, disclosure, archiving or destruction of data;
-Notably:
If you work with personal or sensitive data,
+you should check the Research Office Checklist: Research Office Ethics Assessment, especially the checklists (login with Gaspar).
Any person may request information from the controller of a data file as to whether data concerning them is being processed.
-Swiss FADP Article 8.
+(Swiss FADP, article 7) Personal data must be protected against unauthorised processing through adequate technical and organisational measures.
-Technical measures : notably it is forbidden to store personal data in countries that are not compatible with Swiss law, such as the US.
-This excludes the usage of many clouds: Dropbox, Google Drive, Microsoft Azure, Amazon S3...
+Swiss Federal Act on Data Protection (FADP) (or Loi sur la Protection des Données LPD), article 3 e.: +any operation with personal data, irrespective of the means applied and the procedure, and in particular:
+of data;
Personal Data collection and processing implies compliance with the law on privacy and data protection:
+Processing data should notably:
+Disclosure
-(Swiss FADP article 3 f.): making personal data accessible, for example:
Do you need to identify the subject/person ?
+If disclosed, do the data you collect lead to the dissemination of personal information ?
+How and to whom will the data be disseminated ?
+Cross-border disclosure
-Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection.
-Art. 61Cross-border disclosure -Personal data must be protected against unauthorised processing through adequate technical and organisational measures.
+Personal data must be protected against unauthorised processing through adequate technical and organisational measures (Swiss FADP, article 7).
Disclosure
+Making personal data accessible, for example:
+(Swiss FADP article 3 f.)
Recapitulation
+Cross-border disclosure
+Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection.
+Cross-border disclosure of personal data must be protected against unauthorised processing through adequate technical and organisational measures.
+(FADP Art. 6)
+ +Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if:
References
Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014) [Federal law on data protection] (235.1).
Directive 95/46/EC of the European Parliament & of the Council, of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 31).
Anonymisation
-Privacy protection methods, either :
personal information from datasets.
In passing, there is more to this (Privacy-Preserving Data Mining Methods / Charu Aggarwal and Philip Yu, 2008):
Example including removal and generalization (same source):
Name | Age | Gender | State of domicile | Religion | Disease |
---|---|---|---|---|---|
Ramsha | 29 | Female | Tamil Nadu | Hindu | Cancer |
Yadu | 24 | Female | Kerala | Hindu | Viral infection |
Salima | 28 | Female | Tamil Nadu | Muslim | TB |
Sunny | 27 | Male | Karnataka | Parsi | No illness |
Joan | 24 | Female | Kerala | Christian | Heart-related |
Bahuksana | 23 | Male | Karnataka | Buddhist | TB |
Rambha | 19 | Male | Kerala | Hindu | Cancer |
Kishor | 29 | Male | Karnataka | Hindu | Heart-related |
Johnson | 17 | Male | Kerala | Christian | Heart-related |
John | 19 | Male | Kerala | Christian | Viral infection |
To (name and religion were removed, age was generalized):
Name | Age | Gender | State of domicile | Religion | Disease |
---|---|---|---|---|---|
* | 20 < Age ≤ 30 | Female | Tamil Nadu | * | Cancer |
* | 20 < Age ≤ 30 | Female | Kerala | * | Viral infection |
* | 20 < Age ≤ 30 | Female | Tamil Nadu | * | TB |
* | 20 < Age ≤ 30 | Male | Karnataka | * | No illness |
* | 20 < Age ≤ 30 | Female | Kerala | * | Heart-related |
* | 20 < Age ≤ 30 | Male | Karnataka | * | TB |
* | Age ≤ 20 | Male | Kerala | * | Cancer |
* | 20 < Age ≤ 30 | Male | Karnataka | * | Heart-related |
* | Age ≤ 20 | Male | Kerala | * | Heart-related |
* | Age ≤ 20 | Male | Kerala | * | Viral infection |
This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.
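The generalization step and the 2-anonymity claim above can be checked with a short script. This is a minimal sketch, not a reference implementation: the field names and the helper functions `generalize_age`, `anonymize` and `k_anonymity` are assumptions introduced for illustration, and the records are copied from the first table.

```python
from collections import Counter

# Records copied from the first table above (field names are illustrative).
RECORDS = [
    {"name": "Ramsha", "age": 29, "gender": "Female", "state": "Tamil Nadu", "religion": "Hindu", "disease": "Cancer"},
    {"name": "Yadu", "age": 24, "gender": "Female", "state": "Kerala", "religion": "Hindu", "disease": "Viral infection"},
    {"name": "Salima", "age": 28, "gender": "Female", "state": "Tamil Nadu", "religion": "Muslim", "disease": "TB"},
    {"name": "Sunny", "age": 27, "gender": "Male", "state": "Karnataka", "religion": "Parsi", "disease": "No illness"},
    {"name": "Joan", "age": 24, "gender": "Female", "state": "Kerala", "religion": "Christian", "disease": "Heart-related"},
    {"name": "Bahuksana", "age": 23, "gender": "Male", "state": "Karnataka", "religion": "Buddhist", "disease": "TB"},
    {"name": "Rambha", "age": 19, "gender": "Male", "state": "Kerala", "religion": "Hindu", "disease": "Cancer"},
    {"name": "Kishor", "age": 29, "gender": "Male", "state": "Karnataka", "religion": "Hindu", "disease": "Heart-related"},
    {"name": "Johnson", "age": 17, "gender": "Male", "state": "Kerala", "religion": "Christian", "disease": "Heart-related"},
    {"name": "John", "age": 19, "gender": "Male", "state": "Kerala", "religion": "Christian", "disease": "Viral infection"},
]


def generalize_age(age):
    """Map an exact age to the coarse ranges used in the anonymized table."""
    if age <= 20:
        return "Age ≤ 20"
    if age <= 30:
        return "20 < Age ≤ 30"
    return "Age > 30"


def anonymize(record):
    """Suppress name and religion, generalize age; other fields are kept."""
    out = dict(record)
    out["name"] = "*"
    out["religion"] = "*"
    out["age"] = generalize_age(record["age"])
    return out


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


QUASI_IDENTIFIERS = ["age", "gender", "state"]
anonymized = [anonymize(r) for r in RECORDS]
k_raw = k_anonymity(RECORDS, QUASI_IDENTIFIERS)
k_anonymized = k_anonymity(anonymized, QUASI_IDENTIFIERS)
print(k_raw, k_anonymized)
```

On the raw table the smallest group contains a single person, while after generalization every combination of age range, gender and state covers at least two rows.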
An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:
Imagine the group, or equivalence class, (extracted from the whole dataset) [table adapted from the one above] :
Name | Age | Gender | State of domicile | Religion | Disease |
---|---|---|---|---|---|
* | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |
* | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |
* | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |
If it is known that Miss Smith was part of the study, is aged between 20 and 30, and lives in Tamil Nadu, then it is certain that she has AIDS, even though the table has 3-anonymity.
The l-diversity Principle : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.
There are several definitions of "well-represented" (source).
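One of the simplest readings of "well-represented" is to count the distinct sensitive values in each equivalence class (often called distinct l-diversity). The sketch below assumes that reading; the function name `distinct_l_diversity` and the field names are hypothetical, and the three rows reproduce the AIDS equivalence class shown above.

```python
from collections import defaultdict


def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any equivalence class."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].add(row[sensitive])
    return min(len(values) for values in classes.values())


QUASI_IDENTIFIERS = ["age", "gender", "state"]

# The equivalence class from the table above: 3-anonymous, but a single disease.
aids_class = [
    {"age": "20 < Age ≤ 30", "gender": "Female", "state": "Tamil Nadu", "disease": "AIDS"},
    {"age": "20 < Age ≤ 30", "gender": "Female", "state": "Tamil Nadu", "disease": "AIDS"},
    {"age": "20 < Age ≤ 30", "gender": "Female", "state": "Tamil Nadu", "disease": "AIDS"},
]

# An illustrative variant of the same class with three different diseases.
mixed_class = [
    {"age": "20 < Age ≤ 30", "gender": "Female", "state": "Tamil Nadu", "disease": "AIDS"},
    {"age": "20 < Age ≤ 30", "gender": "Female", "state": "Tamil Nadu", "disease": "TB"},
    {"age": "20 < Age ≤ 30", "gender": "Female", "state": "Tamil Nadu", "disease": "Cancer"},
]

l_aids = distinct_l_diversity(aids_class, QUASI_IDENTIFIERS, "disease")
l_mixed = distinct_l_diversity(mixed_class, QUASI_IDENTIFIERS, "disease")
print(l_aids, l_mixed)
```

The first class has only one distinct value, which is exactly why 3-anonymity does not protect Miss Smith.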
By the way, l-diversity has weaknesses too, which is why t-closeness was invented.
The l-diversity requirement ensures “diversity” of sensitive values in each group, but it does not recognize that values may be semantically close. For example, an attacker could deduce that an individual has a stomach disease if the group containing the individual lists only three different stomach diseases (adapted from source).
The t-closeness Principle: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness (source).
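The principle can be illustrated by measuring, for each equivalence class, how far its distribution of the sensitive attribute is from the distribution in the whole table. The sketch below is only an illustration under stated assumptions: it uses total variation distance as a simple stand-in for the distance measure (the t-closeness literature works with the Earth Mover's Distance), and the function names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def t_closeness(rows, quasi_identifiers, sensitive):
    """Return the largest class-to-table distance; the table is t-close for any t at or above it."""
    overall = distribution([row[sensitive] for row in rows])
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row[sensitive])
    return max(total_variation(distribution(v), overall) for v in classes.values())


# Toy table (made up for illustration): two equivalence classes of two rows each.
rows = [
    {"group": "A", "disease": "Flu"},
    {"group": "A", "disease": "Flu"},
    {"group": "B", "disease": "Flu"},
    {"group": "B", "disease": "Cancer"},
]

t = t_closeness(rows, ["group"], "disease")
print(t)
```

A smaller value means that learning someone's equivalence class reveals less about the sensitive attribute than what the whole table already discloses.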
By linking with another database: linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts.
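The mechanics of such a linkage can be sketched as a join on the shared quasi-identifiers. Everything in the example below is made up for illustration (records, names, field names and the `link` helper); it only shows why keeping birthdate, sex and ZIP code can be enough to re-identify a person.

```python
def link(medical_rows, voter_rows, keys=("birthdate", "sex", "zip")):
    """Return (name, diagnosis) pairs for medical rows that match exactly one voter."""
    index = {}
    for voter in voter_rows:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter["name"])
    matches = []
    for record in medical_rows:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            matches.append((names[0], record["diagnosis"]))
    return matches


# Fictional "anonymized" medical records: no names, but quasi-identifiers kept.
medical = [
    {"birthdate": "1950-01-01", "sex": "M", "zip": "00001", "diagnosis": "Hypertension"},
    {"birthdate": "1980-05-05", "sex": "F", "zip": "00002", "diagnosis": "Asthma"},
]

# Fictional public register containing names and the same quasi-identifiers.
voters = [
    {"name": "Person A", "birthdate": "1950-01-01", "sex": "M", "zip": "00001"},
    {"name": "Person B", "birthdate": "1980-05-05", "sex": "F", "zip": "00002"},
    {"name": "Person C", "birthdate": "1980-05-05", "sex": "F", "zip": "00002"},
]

matches = link(medical, voters)
print(matches)
```

Only the first record is re-identified here, because two voters share the quasi-identifiers of the second one. This is the intuition behind k-anonymity.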
Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1 (source).
Statistical Disclosure Control / Hundepool et al., 2012.
Ebook provided by the EPFL library.
Tools
According to a 2012 article in Nature, the findings of 47 out of 53 landmark preclinical cancer research papers could not be reproduced (1).
A previous study showed in 2009 that 16 out of 18 bioinformatics papers could not be reproduced entirely (2).
In 2004, it was found that less than 9% of papers share their code (3).
(1) Begley, C. G.; Ellis, L. M. (2012). "Drug development: Raise standards for preclinical cancer research". Nature 483 (7391): 531–533.
(2) Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.
(3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. "Reproducible research in signal processing." Signal Processing Magazine, IEEE 26.3 (2009): 37-47
[Slide inspired by https://github.com/saloot/IPythonClass, Amir Hessam Salavati & Robin Schiebler, 2015]
Researchers often start to think about reproducibility at the end of a project. That is sometimes too late: by then, numerous versions of code and datasets may be spread across various places (folders, Dropbox, USB drives...).
A practical five-point approach:
Slide inspired by chapter 2 of Reproducible Research with R and RStudio.
More details:
People increasingly need to collaborate at a finer level.
Source: Pr. Vandergheynst, EPFL Library Noon Talk, 25.8.2016.
Source: Pr. Vandergheynst, EPFL Library Noon Talk, 25.8.2016.
Source: Pr. Vandergheynst, EPFL Library Noon Talk, 25.8.2016.
In summary
The comment and revision-mode features of word processors are not sufficient for good collaboration.
Google Docs and related tools are not oriented toward scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.
$\Rightarrow$ we need something else!
ShareLaTeX is an alternative to Authorea: collaborative writing based on LaTeX. Suited for LaTeX power users.
Good, but only if all partners are LaTeX users.
Git will however not do everything for you.
Locally
Source: J.-L. Falcone.
The easiest way is to use a centralized repository.
Source: J.-L. Falcone.
For more complex projects, a project leader can manage the quality.
Source: J.-L. Falcone.
For big projects, it is possible to dispatch responsibilities.
Source: J.-L. Falcone.
Non-linear development is supported through branches.
Source: J.-L. Falcone.
Guide : Making your code citable
Interactive Jupyter Notebook documents: try.jupyter.org
Structure:
Characteristics
R is a free software environment for statistical computing and graphics. One of the best.
Platforms:
Strength: the diversity of high-quality open extensions (easily installable from CRAN).
RStudio is a free and open-source integrated development environment (IDE) for R.
Include R code chunks in markdown:
# Prime numbers
Storing a few prime numbers in a variable:
```{r}
primes <- c(2,3,5,7,11,13)
```
Done.
First you need to set up the document properties in YAML:
---
title: "Rmarkdown example"
author: "Jan Krause"
date: "24 November 2016"
output: pdf_document
---
# Prime numbers
Storing a few prime numbers in a variable:
```{r}
primes <- c(2,3,5,7,11,13)
```
Done.
RStudio in your browser.
Scientific results are often the outcome of complex workflows. The computation steps form a graph, which may be difficult to reproduce.
AiiDA, a free software tool, has been developed at EPFL (in materials science): http://www.aiida.net/
Simple Rule:
rule sort:
    input:
        f = "path/to/dataset.txt"
    output:
        f = "dataset.sorted.txt"
    shell:
        "sort {input.f} > {output.f}"
Simple Rule (two inputs):
rule sort:
    input:
        # Several named inputs must be separated by commas.
        f1 = "dataset1.txt",
        f2 = "dataset2.txt"
    output:
        f = "dataset.sorted.txt"
    shell:
        "cat {input.f1} {input.f2} > {output.f}"
Simple Rule (here in Python, but R scripts are supported too):
rule sort:
    input:
        a="path/to/dataset.txt"
    output:
        b="dataset.sorted.txt"
    run:
        # Lines keep their trailing newline, so print must not add another one.
        with open(input.a) as src, open(output.b, "w") as out:
            for line in sorted(src):
                print(line, file=out, end="")
More than one rule:
rule result:
    input:
        'result.txt'

rule genrate_cal_2017:
    input:
        ()
    output:
        fname = "tmp/cal.txt"
    shell:
        "cal 2017 > {output.fname}"

rule describe:
    input:
        fname1 = "DESCRIPTION.txt",
        fname2 = "tmp/cal.txt"
    output:
        fname = "result.txt"
    shell:
        "cat {input.fname1} {input.fname2} > {output.fname}"
Expand (running rules in parallel):
DATASETS = ["D1", "D2", "D3", "D4", "D5", "D6"]

rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)

rule sort:
    input:
        "{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"
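To see what the `expand` call in the example above asks for, it helps to write out the list of target files it produces. The plain-Python sketch below mimics that behaviour without Snakemake; `expand_sketch` is a hypothetical stand-in written for illustration, not Snakemake's own function.

```python
from itertools import product

DATASETS = ["D1", "D2", "D3", "D4", "D5", "D6"]


def expand_sketch(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


targets = expand_sketch("{dataset}.sorted.txt", dataset=DATASETS)
print(targets)
```

Each target matches the output pattern of the `sort` rule with a different value of `dataset`, so the six sorting jobs do not depend on one another and can be scheduled in parallel (for instance with `snakemake --cores 4`).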
Output : example Graph
Output : Log (simplified)
output_file | date | rule | version |
---|---|---|---|
result.txt | Fri Nov 11 15:48:17 2016 | cleanup | 3.14 |
tmp/pre-result.txt | Fri Nov 11 15:48:17 2016 | add_head_foot | 1.02 |
tmp/FOOT.txt | Fri Nov 11 15:48:17 2016 | generate_foot | 5.6 |
tmp/HEAD.txt | Fri Nov 11 15:48:17 2016 | generate_head | 5.6 |
tmp/described_cal.txt | Fri Nov 11 15:48:17 2016 | describe | 0.1alpha |
tmp/cal.txt | Fri Nov 11 15:48:17 2016 | genrate_cal | 8.234 |
More about workflows
Another tool is Taverna, which includes the desktop-oriented Taverna Workbench as well as command-line and server applications.
Finally, myExperiment is a platform for sharing scientific workflows, and it is notably fully supported by Taverna.
EPFL platforms
This way your data will not depend upon a particular software (or company), operating system, or platform. And you will be able to:
A PLoS ONE study showed in 2014 that more than 60% of links to datasets are broken after 10 years (1).
Another 2014 PLoS ONE article showed that the bibliography of 1 out of every 5 articles is affected by that phenomenon (2).
(1) Pepe et al. (2014). How Do Astronomers Share Data? Reliability and Persistence of Datasets
Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers.
PLoS ONE, 9(8). doi:10.1371/journal.pone.0104798
(2) Klein et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference
Rot. doi:10.1371/journal.pone.0115253
At CERN, a 2007 study (1, 2) showed an error ratio of $10^{-7}$ (over 2 months).
Causes are complex and varied: disk errors, RAID errors, memory errors, etc.
For 1 gigabyte (1000 megabytes), we have: $10^9 \cdot 10^{-7} = 10^2 = 100$ bytes of bit rot.
(1) https://indico.cern.ch/event/13797/session/0/contribution/3/attachments/115080/163419/Data_integrity_v3.pdf
(2) http://www.zdnet.com/article/data-corruption-is-worse-than-you-know/
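The same arithmetic can be repeated for other storage sizes. This is a minimal sketch that assumes the $10^{-7}$ ratio quoted above applies uniformly per byte; the function and constant names are introduced here for illustration only.

```python
# One corrupted byte per 10^7 bytes, i.e. the 10^-7 error ratio quoted above.
BYTES_PER_ERROR = 10**7

GIGABYTE = 10**9
TERABYTE = 10**12


def expected_corrupted_bytes(size_bytes):
    """Expected number of corrupted bytes, using integer arithmetic to avoid rounding."""
    return size_bytes // BYTES_PER_ERROR


corrupted_per_gb = expected_corrupted_bytes(GIGABYTE)
corrupted_per_tb = expected_corrupted_bytes(TERABYTE)
print(corrupted_per_gb, corrupted_per_tb)
```

Under this assumption the expected damage grows linearly with the volume stored: 100 bytes per gigabyte becomes 100,000 bytes per terabyte.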
EPFL offers many storage options, as described on the VPSI page Databases, Storage and Virtualization.
EPFL Storage Prices 2016
Accelerate science and careers
Many studies show there are significant advantages for articles that share their code or data.
Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. DOI: http://doi.org/10.18352/lq.10149
Machine learning is a promising discipline, but it requires access to data. Datamining is not a viable solution.
Source: Barend Mons, IDCC, Amsterdam 2016.
Dryad («curated», non-profit organisation, partnership with publishers) http://datadryad.org/
Figshare (commercial, belongs to Macmillan [as does NPG]) http://figshare.com/
For more information see re3data, in which more than 1,500 data repositories are described.
A licence lets you define how your data can be reused. For instance:
Creative Commons (CC0 and CC-BY) http://creativecommons.org/ Since version 4.0 of the CC licences, the sui generis right protecting database content is taken into account (in addition to the form protected by copyright) https://wiki.creativecommons.org/wiki/Data