diff --git a/OptimizeResearchDataManagement.ipynb b/OptimizeResearchDataManagement.ipynb index 970bfc5..4315a1e 100755 --- a/OptimizeResearchDataManagement.ipynb +++ b/OptimizeResearchDataManagement.ipynb @@ -1,1808 +1,1808 @@ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "# Optimizing Research Data Management\n", "\n", "## EPFL SFP\n", "\n", "### Thursday June the 8th, 2017\n", "\n", "#### Aude Dieudé & Jan Krause\n", "\n", "
Contact: researchdata@epfl.ch\n", "\n", "![.](./Images/CC-By-NC-SA_88x31.png)\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1\n", "\n", "## 1.1 - Introduction to RDM" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition, context and best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Introduction: [video](https://www.youtube.com/watch?v=N2zK3sAtr-4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition : Research data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For the Organization for Economic Cooperation and Development [OCDE](http://www.oecd.org/fr/sti/sci-tech/38500823.pdf), research data are defined as factual recordings (numbers, texts, images and sounds), which are used as principal sources for scientific research and which are often recognized by the scientific community as being necessary to validate research results." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- One key element to take into consideration during research data management are the legal, ethical and political aspects based on the sensitivity of the data." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Research Data Lifecycle" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![Sidney Cycle](Images/SidneyLifecycle.png)\n", "\n", "[Source: Formation URFIST, Rennes, 2016](https://drive.google.com/file/d/0BxKZLWq08xX-TW5VOEUtd2FSRE0/view?pref=2&pli=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.2 - Actors and Skills" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Actors" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![Actors](Images/Actors2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Skills" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "![Skills](./Images/Skills.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Requirements regarding research data management\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Publishers\n", "\n", "Many publishers and scientific journals require, under specific\n", "conditions, the publication of the data used to obtain the research project\n", "results (permanent archiving, standardized formats, etc.). This is the case,\n", "for instance, with PLoS and Nature Publishing Group. A list of\n", "editorial policies is available online on this [Dryad website](http://wiki.datadryad.org/Journal_instructions). Note: This page seems to be a one-off publication and is not exhaustive."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funders" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Examples of funders which require DMPs or equivalent:\n", "\n", "![Funders](./Images/FundersWhichRequireDMPs.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funding agency and DMP : Horizon 2020\n", "\n", "![.](Images/H2020.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- [Horizon 2020](https://ec.europa.eu/programmes/horizon2020/): is the biggest funding agency from the European Commission \n", "with nearly €80 billion of funding available over 7 years from 2014 to 2020. Its\n", "main objective is to promote and support excellence in the scientific field." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Horizon 2020 requires the preparation of a [data management plan](http://ec.europa.eu/programmes/horizon2020/en/what-horizon-2020), for [all disciplines](https://ec.europa.eu/digital-single-market/en/news/communication-european-cloud-initiative-building-competitive-data-and-knowledge-economy-europe). DMPs are living documents; at least 3 versions (official deliverables) are required. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funding agency and DMP : SNSF\n", "![.](Images/SNSF.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Submission of **data management plans** with the grant application **is now mandatory**. 
DMPs **are published** on the Web (P3) at the end of the project: \n", "\n", "* See the [SNSF guidelines, policy and FAQ](www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx) regarding open research data.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.3 - Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition : Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A Data Management Plan (DMP) sets out the strategies put into place to\n", "create, store, share, maintain, archive and preserve research data\n", "throughout their life cycle.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The DMP describes which data are going to be produced and how each\n", "type of data will be organized, classified, archived, shared, distributed,\n", "secured and preserved." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Here is a [video](https://www.youtube.com/watch?v=gYDb-GP1CA4), which illustrates how the DMP works concretely:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.4 - DMP best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Best practices examples: DMPonline (UK)\n", "![Best Practices](./Images/DMP_DMPonline.png)\n", "\n", "
http://dmponline.dcc.ac.uk
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Best practices examples: EPFL (Switzerland)\n", "To provide guidance in preparing a DMP, the **[EPFL-ETHZ checklist](http://library.epfl.ch/files/content/sites/library3/files/research-data/dmp/Data_management_plan_checklist_EPFL_2016.pdf)** includes\n", "four categories to cover questions related to:\n", "- Research Data Acquisition : type, quantity, license, etc.\n", "- Research Data Format : format, metadata, identification, etc.\n", "- Research Data Sharing : embargo, intellectual property, etc.\n", "- Data Preservation : storage, sensitivity of the data, archiving, etc.\n", "![EPFL Checklist](./Images/EPFL-checklist.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 2 - Practical issues\n", "\n", "* Ethics, legal aspects, anonymization \n", "* Collaborative coding and writing\n", "* (Meta)data formats\n", "* Publication and long term preservation\n", "\n", "![.](Images/tools.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.1 - Ethics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.1. When human beings are involved...\n", "\n", "![Human Beings](Images/humanbeing.png)\n", "\n", "**Ethics issues arise in many areas of research**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Research involving the voluntary participation of research subjects and the collection of **data that might be considered as personal**. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You must protect your **volunteers, yourself and your researcher colleagues**." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "[H2020 Programme Guidance How to complete your ethics self-assessment, p.1, 12 July 2016](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![Questions](Images/question.png)\n", "\n", "- Does your research practice involve collecting, processing and storing information on persons?\n", " - ... identifiable persons ?\n", " - ... vulnerable persons ?\n", " - ... children ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- How do you inform persons/subjects about what you will be doing ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Human Research Ethics Committee at EPFL (HREC)\n", "\n", "The role of the [HREC](http://research-office.epfl.ch/research-ethics/research-ethics-assessment/epfl-human-research-ethics-committee/hrec) is to **review any research project carried out at EPFL involving non-invasive human research** from an ethical point of view, before the beginning of the project.\n", "\n", "Contact person: [Esther van der Velde](https://people.epfl.ch/elisabeth.vandervelde?lang=fr)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Collecting consent \n", "\n", "“Research involving human beings may only be carried out if, […], the persons concerned have given their informed consent or, after being duly informed, have not exercised their right to dissent. […] The persons concerned may withhold or revoke their consent at any time, without stating their reasons.” *Human Research Act (HRA), article 7.*\n", "\n", "The consent must be:\n", "- Simple, understandable,\n", "- Adapted to the subject (child, teenager...) 
(HRA Art. 21-22)\n", "- See the following resources :\n", " - [Ethical Issues Checklists](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment/ethical-issues-checklists) (restricted access)\n", " - [Non Invasive Research](http://research-office.epfl.ch/op/edit/page-117394.html)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Sources:**\n", "\n", "* [H2020 Programme Guidance : How to complete your ethics self assessment](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf), 12th July 2016. Page 1.\n", - "* http://research-office.epfl.ch/op/edit/page-117394.html\n", + "* [Non-invasive research](http://research-office.epfl.ch/op/edit/page-117394.html)\n", "* Swiss Academy of Medical Sciences (SAMS) (2015). “Research with human subjects. A manual for practitioners.” 2nd edition, http://swissethics.ch/doc/swissethics/manual_research_nov2015_e.pdf\n", "* Federal Act on Research involving Human Beings (Human Research Act, HRA) of 30 September 2011 (Status as of 1 January 2014). https://www.admin.ch/opc/en/classified-compilation/20061313/index.html\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.2. Data ? What data ? Personal data ? Sensitive data ?\n", "\n", "![](Images/personaldata.png)\n", "\n", "**personal data**\n", "\n", "* all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)\n", "* examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**sensitive personal data**\n", "\n", "According to the Swiss FADP (article 3 c.), data on: \n", "\n", "1. 
religious, ideological, political or trade union-related views or activities,\n", "2. **health, the intimate sphere or the racial origin**,\n", "3. social security measures,\n", "4. administrative or criminal proceedings and sanctions;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![Questions](Images/question.png)\n", "\n", "- What data do you typically use (collect, process, store) in the course of a research project ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Among these data which ones are **personal** ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Among these data which ones are **sensitive** ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If you work with personal or sensitive data,\n", "\n", "you should check the Research Office website: [Research Office Ethics Assessment](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment), especially the [**checklists**](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment/ethical-issues-checklists) (login with Gaspar)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.3 Doing what with data ?\n", "\n", "\n", "![](Images/dataanalysis.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Personal or sensitive data processing\n", "\n", "**Swiss [Federal Act on Data Protection](https://www.admin.ch/opc/en/classified-compilation/19920153/index.html) (FADP) (or Loi sur la Protection des Données LPD), article 3 e.**: \n", "any operation with personal data, irrespective of the means applied and the procedure, and in particular:\n", "* the collection, \n", "* storage, \n", "* use, \n", "* revision, \n", "* disclosure, \n", "* archiving \n", "* or destruction \n", "\n", "of data;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Make sure that...**\n", "\n", "* processing data is carried out in good faith and only for the purpose indicated at the time of collection [...] (FADP article 4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* consent is valid (given voluntarily on the provision of adequate information) (FADP article 4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* data is correct (FADP article 5)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* participants can be informed about their personal data (FADP article 8, HRA article 8)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* data are rendered anonymous as soon as the purpose of the processing permits (FADP article 22, Processing for research)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.4. 
Protecting and disclosing personal data\n", "\n", "#### Protection\n", "\n", "Personal data must be protected against unauthorised processing through adequate technical and organisational measures (Swiss FADP, article 7).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Disclosure**\n", "\n", "Making personal data accessible, for example:\n", "* by permitting access,\n", "* transmission\n", "* or publication.\n", "\n", "(Swiss FADP article 3 f.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Cross-border disclosure**\n", "\n", "Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection. \n", "\n", "Personal data disclosed across borders must be protected against unauthorised processing through adequate technical and organisational measures. \n", "\n", "(FADP Art. 6)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**References**\n", "\n", "* Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014), Federal law on data protection (235.1).\n", "\n", "* Directive 95/46/EC of the European Parliament and of the Council, of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 
31).\n", " * [Directive 95/46/EC](http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=URISERV%3Al14012)\n", " * As of 2018: [REGULATION (EU) 2016/679 repealing Directive 95/46/EC](http://eur-lex.europa.eu/legal-content/de/TXT/?uri=CELEX%3A32016R0679)\n", " * [H2020 Program Guidance : how to complete your ethics self assessment](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf), 12.7.2016\n", "* Information\n", " * http://research-office.epfl.ch/ethique-recherche/research-ethics-assessment/ethical-review/personal-data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.2 - Anonymization methods\n", "\n", "Privacy protection methods, either :\n", "\n", "* removing,\n", "* generalizing or\n", "* encrypting,\n", "\n", "personal information from datasets." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Note:** Anonymization or de-identification\n", "\n", "* **Anonymization** is irreversible.\n", - "* **De-identification** may include preserving indentifiers that can be re-linked by a trusted party.\n", + "* **De-identification** may include preserving identifiers that can be re-linked by a trusted party.\n", "\n", "For legal aspects see HRA 35." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In passing, there is more to this (Privacy-Preserving Data Mining Methods / Charu Aggarwal and Philip Yu. 
2008.):\n", "\n", "![Privacy-Preserving_Data_Mining__Methods.png](Images/Privacy-Preserving_Data_Mining__Methods.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### k-anonymity\n", "\n", "\n", "##### Definition\n", "\n", "\"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release\" ([Source](https://en.wikipedia.org/wiki/K-anonymity))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Illustration\n", "\n", "Example including removal and generalization (same source):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|-----------|-----|--------|-------------------|-----------|-----------------|\n", "| Ramsha | 29 | Female | Tamil Nadu | Hindu | Cancer |\n", "| Yadu | 24 | Female | Kerala | Hindu | Viral infection |\n", "| Salima | 28 | Female | Tamil Nadu | Muslim | TB |\n", "| sunny | 27 | Male | Karnataka | Parsi | No illness |\n", "| Joan | 24 | Female | Kerala | Christian | Heart-related |\n", "| Bahuksana | 23 | Male | Karnataka | Buddhist | TB |\n", "| Rambha | 19 | Male | Kerala | Hindu | Cancer |\n", "| Kishor | 29 | Male | Karnataka | Hindu | Heart-related |\n", "| Johnson | 17 | Male | Kerala | Christian | Heart-related |\n", "| John | 19 | Male | Kerala | Christian | Viral infection |\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To (name and religion were removed, age was generalized):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|------|---------------|--------|-------------------|----------|-----------------|\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | Cancer |\n", "| * | 20 < Age ≤ 30 | Female | Kerala | * | Viral infection |\n", 
"| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | TB |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | No illness |\n", "| * | 20 < Age ≤ 30 | Female | Kerala | * | Heart-related |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | TB |\n", "| * | Age ≤ 20 | Male | Kerala | * | Cancer |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | Heart-related |\n", "| * | Age ≤ 20 | Male | Kerala | * | Heart-related |\n", "| * | Age ≤ 20 | Male | Kerala | * | Viral infection |\n", "\n", "This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### l-diversity - motivation\n", "\n", "An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:\n", "* **homogeneity attacks**: in the case that a group of lines are homogeneous ,\n", "* **background knowledge attacks**: when knowledge about a field reduces the set of possible sensible values (e.g. knowing that heart attacks are not frequent in Japanese patients) ([source](https://en.wikipedia.org/wiki/K-anonymity)). \n", "\n", "Imagine the group, or equivalence class, (extracted from the whole dataset) [table adapted from the one above] :\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|------|---------------|--------|-------------------|----------|-----------------|\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "\n", "If it is known that Miss Smith: was part of the study, is aged between 20 and 30, lives in Tamil Nadu. Then it is certain that she has AIDS, even though we have 3-anonymity." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### l-diversity - definition\n", "\n", "**The l-diversity Principle** : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.\n", "\n", "There are several definitions of \"well-represented\" ([source](https://en.wikipedia.org/wiki/L-diversity)).\n", "\n", "By the way, l-diversity has weaknesses too, which is why people invented **t-closeness**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### differential privacy\n", "\n", "**By linking with another database**: linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1* ([source](https://en.wikipedia.org/wiki/Differential_privacy))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Anonymization - theory and tools\n", "\n", "![](Images/sdc.jpg)\n", "\n", "Statistical Disclosure Control / Hundepool et al. 2012. [Ebook / EPFL library](http://proquest.safaribooksonline.com/9781118348215). 
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Tools\n", "* **sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation (R package)**\n", "* ARX Data Anonymization Tool (Java: library & GUI)\n", "* μ-ARGUS (Java, GUI)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.3 - Collaborative tools\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.3.1 - File sharing\n", "\n", "![.](Images/owncloud.png)\n", "\n", "- Personal/group level: ownCloud, free software: Mac, Windows, Linux, iOS, Android... Web.\n", " - Your own server: ownCloud https://owncloud.org/\n", " - Many plugins: contacts, calendar, collaborative writing, image galleries, etc.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Swiss level: SwitchDrive https://drive.switch.ch/\n", " - ownCloud with 50 GB per user, \n", " - Restricted to members of Swiss universities.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A recent fork of ownCloud: [NextCloud](https://nextcloud.com/) aims at more transparent development processes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.3.2 - Collaborative writing\n", "#### File sharing is not enough\n", "\n", "People increasingly need to collaborate at a finer level.\n", "\n", "![...](Images/Vandergheynst_Collaborative.png)\n", "Source: Pr. Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![...](Images/Vandergheynst_IncrCollab.png)\n", "Source: Pr. 
Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![...](Images/Vandergheynst_Versions.png)\n", "Source: Pr. Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**In summary**\n", "\n", "**Word processor** comment and revision-mode features are not sufficient for good collaboration.\n", "\n", "**Google Documents** and related tools are not geared towards scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.\n", "\n", "**$\\Rightarrow$ we need something else!**\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Share LaTeX\n", "\n", "**[Share LaTeX](https://de.sharelatex.com/)** is an alternative to Authorea: collaborative writing based on LaTeX. Suited for LaTeX power users. ![.](Images/ShareLaTeX.png)\n", "\n", "Access provided by SWITCH, via the [Sandstorm platform](https://sandstorm.cloud.switch.ch/).\n", "\n", "Good, but only if all partners are LaTeX users." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Authorea\n", "\n", "**[Authorea](https://www.authorea.com/)**: collaborative writing, easy to use.\n", "\n", "![Authorea](Images/Authorea.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Simple syntax : WYSIWYG and Markdown (lightweight text formatting language). 
More complex formatting possible using LaTeX " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Enables others to make comments" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Supports interactive documents / figures (Jupyter)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Offline synchronization on personal computer (using the Git version control system) \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.3.3 - Tools for coding and analyzing data\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![.](Images/git.png)\n", "\n", "Git is a **multi-platform** (Windows, Mac, GNU/Linux) version control tool.\n", "\n", "Git Servers\n", "* [GitHub](https://github.com/), very popular, some data hosted in the US. Closed repositories are limited (payment or subject to other conditions).\n", "* [c4science](https://c4science.ch/) is the Swiss collaborative development platform. Unlimited number of repositories (open or closed). \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Interactive **Jupyter Notebook** documents: [try.jupyter.org](http://try.jupyter.org)\n", "\n", "![.](Images/jupyterpreview.png)\n", "\n", "Structure:\n", "\n", "* Rich-hyper-text cells (including tables, $\\LaTeX$, images, videos)\n", "* Live code cells (with interactive widgets)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Characteristics \n", "\n", "* Over 50 languages supported : Python, R, Octave, BASH, Matlab, Scala, Java, Haskell...\n", "\n", "* Can be viewed online using [nbviewer](http://norvig.com/ipython/Economics.ipynb). 
(e.g.: http://norvig.com/ipython/Economics.ipynb ).\n", " * Nbviewer is integrated in GitHub and Zenodo\n", "\n", "* Jupyter Notebooks are JSON files $\\rightarrow$ can be tracked with Git.\n", "\n", "* Nbconvert allows conversion to many formats, including Python:\n", " * jupyter nbconvert notebook.ipynb --to python\n", " * jupyter nbconvert notebook.ipynb --to latex\n", " * jupyter nbconvert notebook.ipynb --to pdf\n", " * jupyter nbconvert notebook.ipynb --to slides\n", " * jupyter nbconvert notebook.ipynb --to html\n", "\n", "* Executing from command line:\n", " * jupyter nbconvert --to notebook --execute mynotebook.ipynb" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Computational workflow tools (e.g. AiiDA, Snakemake, Taverna, Kepler and Pegasus)\n", "![AiiDA](./Images/AiiDA.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.4 - Data and Storage" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.1 - (Meta)data formats" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Metadata\n", "\n", "- Most common generalist metadata formats: [Dublin Core (DCES)](http://dublincore.org/documents/dces/), [Dublin Core (DCMI)](http://dublincore.org/documents/dcmi-terms/), [DataCite Metadata Schema](https://schema.datacite.org/). 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "- Numerous specialized metadata formats are available for most disciplines, the Research Data Alliance [Metadata Directory](http://rd-alliance.github.io/metadata-directory/) is a good starting point.\n", - "\n", - "![.](Images/MetadataDirectory.png)\n" + "- Numerous specialized metadata formats are available for most disciplines, the Research Data Alliance [Metadata Directory](http://rd-alliance.github.io/metadata-directory/) is a good starting point.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data format\n", "\n", "Prefer a\n", "\n", "- **standard format**,\n", "- **open** and\n", "- **widely used** \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This way your data will not depend upon a particular software (or company), operating system, or platform. And you will be able to:\n", "- collaborate with more people (on various platforms)\n", "- avoid licensing problems\n", "- maximize the reusability in the future" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Some open formats to take into account\n", "- Portable Document Format **PDF/A, ISO standard**, text [PDF for archiving, no ciphers, included fonts...]\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Text** simple way to encode data. Can be read by most software.\n", " - CSV tables, can be read by most software, and extended using [CSV on the Web](https://www.w3.org/standards/techs/csv) (metadata, datatypes, relation...)\n", " - JSON: Simply structured, less bulky than XML, ideal for data exchange." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Geodata**\n", " * [ISO 19115-1:2014](http://www.iso.org/iso/catalogue_detail.htm?csnumber=53798) : the standard.\n", " * [GeoJson.org](http://geojson.org/) : lighter." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **HDF5**, more flexible (not text, but structured and indexed, supports arbitrary metadata, good performance).\n", " - Compatible with many tools (Python, R, Matlab, Mathematica...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Databases:** \n", " - SQL: [PostgreSQL](https://www.postgresql.org/) is relational, open and efficient\n", " - BigData: [MongoDB](https://www.mongodb.com/) for volume, velocity, and variety" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data formats list\n", "\n", "Sustainability of digital formats by the US Library of Congress. 
[This list](http://www.digitalpreservation.gov/formats/) is categorized by datatypes (text, audio, image, video, geospatial, dataset, etc.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.2 - Storage, publication and preservation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### 2.4.2.1 - Storage at EPFL" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### EPFL storage options\n", "\n", "![.](Images/epfl_logo.png)\n", "\n", "EPFL offers many storage options, as described on the VPSI page [Databases, Storage and Virtualization](https://it.epfl.ch/business_service.do?sysparm_document_key=cmdb_ci_service,90cbd58e0ff121009f8579f692050eb7&sysparm_service=Bases_de_donnees_et_Stockage_Serveurs).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**EPFL Storage Prices 2017**\n", "\n", - "![Prices Table](Images/EPFL_Storage_Prices_2015-2017.png)" + "![Prices Table](Images/EPFL_Storage_Prices_2015-2017.png)\n", + "\n", + "Disclaimer: Google servers are not located in Switzerland, thus personal and sensitive data should not be stored on that platform." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**EPFL platforms**\n", "\n", "- **EPFL SV** sLIMS\n", " - http://sv-it.epfl.ch/slims\n", " - Gaël Anex, Nicolas Argento, Peter Hliva\n", " - Laboratory information management system ![.](Images/SLims.png)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- **EPFL SCITAS** (Victoria Rezzonico)\n", " - High Performance Computing and data Storage ![.](Images/scitas.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### 2.4.2.2 - Publication and preservation\n", "\n", "#### Research data publication\n", "\n", "“ It is the **release of research data, associated metadata, accompanying documentation, and software code […] for re-use and analysis** in such a manner that they can be discovered on the Web and referred to in a unique and persistent way.\n", "\n", "![Data Publishing](Images/DataPublishing.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Data publishing occurs **via dedicated data repositories and/or (data) journals** which ensure that the published research objects are well documented, curated, archived for the long term, interoperable, citable, quality assured and discoverable \n", "– all aspects of data publishing that are important for future reuse of data by third party end-users.”" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "fragment" + "slide_type": "notes" } }, "source": [ - "Austin, C. C., Bloom, T. K., Dallmeier-Tiessen, S., Khodiyar, V., Murphy, F., Nurnberger, A., . . . Whyte, A. (2016). 
" + "[Add reference]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data Papers, Data Journals\n", "\n", "![DataPaper-DataJournal](Images/DataPaper-DataJournal.PNG)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Comparison: Dryad - Figshare - Zenodo\n", "\n", "![Z-D-F](Images/Comp_Zenodo_Dryad_Figshare_1.PNG)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![Z-D-F](Images/Comp_Zenodo_Dryad_Figshare_2.PNG)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Backup vs. Preservation\n", "\n", "![Preservation vs. Backup](Images/Preservervation_vs_Storage.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Why publish in a data archive?\n", "\n", - "**Accelerate science and careers**\n", + "*Accelerate science and careers*\n", "\n", - "Many studies show there are significant advantages for articles that share their code or data.\n" + "Many studies show there are significant advantages for articles that share their code or data as described in Drachen's article.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. 
DOI: http://doi.org/10.18352/lq.10149" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data repositories\n", "\n", "- **Zenodo** (hosted by CERN, free) http://zenodo.org\n", " - either EPFL or CHILI community\n", "\n", "![Zenodo](./Images/Zenodo.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Other data repositories\n", "\n", "- **Dryad** («curated», non-profit organisation, partnership with publishers) http://datadryad.org/\n", "\n", "- **Figshare** (commercial, belongs to Macmillan [as does NPG]) http://figshare.com/\n", "\n", - "- For more information see **[re3data](http://re3data.org)** in which more than 1'500 data repositoris are described.\n", + "- For more information see **[re3data](http://re3data.org)** in which more than 1'500 data repositories are described.\n", "\n", "![re3data](Images/re3datalogo_black.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data Citation\n", "\n", "- Always use persistent identifiers to avoid broken links (about 60% of links are broken after 10 years)\n", "- The most common persistent identifier is the DOI (digital object identifier)\n", " - e.g.: http://doi.org/10.5281/zenodo.7525\n", "- Zenodo, Figshare, Dryad and Infoscience can provide DOIs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.5 - Licences\n", "\n", "A licence defines how your data can be reused. 
For instance:\n", "\n", "\n", "Creative Commons (**CC0** and **CC-BY**) http://creativecommons.org/ Since CC4.0, sui generis law protecting database content is taken into account (in addition to the form protected by copyright) https://wiki.creativecommons.org/wiki/Data\n", "\n", "![.](Images/CCbyncsa_others.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# - Supplementary materials\n", "\n", "* [Authorea](Authorea.ipynb)\n", "* [Jupyter](Jupyter.ipynb)\n", "* [Git](Git.BASH.ipynb)\n", "* [SnakeMake](SnakeMake.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
You can contact us in the future here: \n", "


\n", "
Contact: researchdata@epfl.ch\n", "\n", "


\n", "\n", "
We look forward to hearing from you!
\n", "\n", "


\n", "\n", "
Aude & Jan
\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 } diff --git a/TODO.md b/TODO.md new file mode 100644 index 0000000..8209e3f --- /dev/null +++ b/TODO.md @@ -0,0 +1,6 @@ +* Differential version between Handouts and Presentation using notes +* Replace: Actors -> Stakeholders +* Check definition of "Intimate sphere" in section 2.1.x, and link it to the definition +* Example of cross-border disclosure, example of list, e.g. question on list (see CHILI branch) +* Credit Karine +* Storage Table: update!