diff --git a/OptimizeResearchDataManagement.ipynb b/OptimizeResearchDataManagement.ipynb index 5baf7b9..ddfd953 100755 --- a/OptimizeResearchDataManagement.ipynb +++ b/OptimizeResearchDataManagement.ipynb @@ -1,2454 +1,2454 @@ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "# Optimizing Research Data Management\n", "\n", "## University of Basel\n", "\n", - "### Wednesday March 22 and Thursday May 11, 2017\n", + "### Monday, 12th February, Tuesday, 13th February, 2018\n", "\n", - "#### Aude Dieudé , Jan Krause , Lorenza Salvatori (EPFL) & Silke Bellanger (UNIBAS) \n", + "#### Aude Dieudé , Eliane Blumer , Raphaël Rey (EPFL) & Silke Bellanger (UNIBAS) \n", "\n", "\n", "<br />Contact: <font size=\"4\" color=\"blue\">silke.bellanger@unibas.ch</font> & <font size=\"4\" color=\"blue\">researchdata@epfl.ch</font>\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1\n", "\n", "## 1.1 - Introduction to RDM" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition, context and best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Introduction: [video](https://www.youtube.com/watch?v=N2zK3sAtr-4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "<center><iframe width=\"300\" height=\"196\" src=\"https://www.youtube.com/embed/N2zK3sAtr-4?start=20&autoplay=0\" frameborder=\"0\" allowfullscreen></iframe></center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition : Research data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The definition of research data is not fixed or rigid: several definitions are 
possible based on specific fields, institutions, and organizations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- For the Organisation for Economic Co-operation and Development ([OECD](http://www.oecd.org/fr/sti/sci-tech/38500823.pdf)), research data are defined as factual records (numbers, texts, images and sounds) that are used as primary sources for scientific research and that are commonly accepted by the scientific community as necessary to validate research results." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Key elements to take into consideration during research data management are the legal, ethical and political aspects, which depend on the sensitivity of the data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Research Data Lifecycle" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "<center><img src=\"./Images/SidneyLifecycle.png\" width=\"600\" height=\"450\" /></center>\n", "\n", "[Source: Formation URFIST, Rennes, 2016](https://drive.google.com/file/d/0BxKZLWq08xX-TW5VOEUtd2FSRE0/view?pref=2&pli=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.2 - Actors and Skills" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Actors" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "<!-- <center><img src=\"./Images/Actors.png\" width=\"600\" height=\"450\" /></center> -->\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Skills" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "<center><img 
src=\"./Images/Skills.png\" width=\"600\" height=\"450\" /></center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Requirements regarding research data management\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Publishers\n", "\n", "Many publishers and scientific journals require, under specific\n", "conditions, the publication of the data used to obtain the research project\n", "results (permanent archiving, standardized formats, etc.). This is the case,\n", "for instance, with PLoS and Nature Publishing Group. A list of\n", "editorial policies is available online on this [Dryad website](http://wiki.datadryad.org/Journal_instructions). Note: This page seems to be a one-off publication and is not exhaustive." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funders" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Examples of funders which require DMPs or equivalent:\n", "<center><img src=\"./Images/FundersWhichRequireDMPs.png\" width=\"600\" height=\"450\" /></center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funding agency and DMP : Horizon 2020\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- [Horizon 2020](https://ec.europa.eu/programmes/horizon2020/) is the biggest research funding programme of the European Commission,\n", "with nearly €80 billion of funding available over 7 years from 2014 to 2020. Its\n", "main objective is to promote and support excellence in the scientific field." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Horizon 2020 requires for some research projects the preparation of a [data management plan](http://ec.europa.eu/programmes/horizon2020/en/what-horizon-2020), which is mandatory in order to receive research funding. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "- [As of 2017](https://ec.europa.eu/digital-single-market/en/news/communication-european-cloud-initiative-building-competitive-data-and-knowledge-economy-europe), the Commission will make **open research data the default option**, while ensuring opt-outs, for all new projects of the Horizon 2020 program." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Funding agency and DMP : SNSF\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Submission of **data management plans** with the grant application will be **mandatory as of October 2017**. See the [communication](http://www.snf.ch/en/researchinFocus/newsroom/Pages/news-170306-towards-open-research-data.aspx)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.3 - Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Definition : Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Data Management Plan (DMP) refers to the strategies put into place to\n", "create, store, share, maintain, archive and preserve research data\n", "throughout their life cycle.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The DMP describes which data are going to be produced and how each\n", "type of data will be organized, classified, archived, shared, distributed,\n", "secured and preserved in a secure way." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Here is a [video](https://www.youtube.com/watch?v=gYDb-GP1CA4), which illustrates how the DMP works concretely:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "<center><iframe width=\"300\" height=\"196\" src=\"https://www.youtube.com/embed/gYDb-GP1CA4?start=20&autoplay=0\" frameborder=\"0\" allowfullscreen></iframe></center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.4 - DMP best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Best practices examples: DMPonline (UK)\n", "<center><img src=\"./Images/DMP_DMPonline.png\" width=\"400\" height=\"300\" /></center>\n", "<center> <a href=\"http://dmponline.dcc.ac.uk/\">http://dmponline.dcc.ac.uk</a> </center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Best practices examples: EPFL (Switzerland)\n", "To provide guidance in 
preparing a DMP, the **[EPFL-ETHZ checklist](http://library.epfl.ch/files/content/sites/library3/files/research-data/dmp/Data_management_plan_checklist_EPFL_2016.pdf)** includes\n", "four categories to cover questions related to:\n", "- Research Data Acquisition : type, quantity, license, etc.\n", "- Research Data Format : format, metadata, identification, etc.\n", "- Research Data Sharing : embargo, intellectual property, etc.\n", "- Data Preservation : storage, sensitivity of the data, archiving, etc.\n", "<center><img src=\"./Images/EPFL-checklist.png\" width=\"600\" height=\"450\" /></center>\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Guidelines and Policies, University of Basel\n", "\n", "A research data policy is in preparation. \n", "\n", "Guidelines regarding good scientific practice: https://www.unibas.ch/en/Research/Research-in-Basel/Values-and-Principles.html \n", "\n", "Information regarding general data and IT guidelines: https://its.unibas.ch/content.cfm?content=586 " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 2 - Practical issues\n", "\n", "* Ethics, legal aspects, anonymization \n", "* Collaborative coding and writing\n", "* (Meta)data formats\n", "* Publication and long-term preservation\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.1 - Ethics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.1. When human beings are involved...\n", "\n", "\n", "\n", "**Ethics issues arise in many areas of research**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This includes research involving the voluntary participation of research subjects and the collection of **data that might be considered as personal**. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You must protect your **volunteers, yourself and your researcher colleagues**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "[H2020 Programme Guidance How to complete your ethics self-assessment, p.1, 12 July 2016](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "- Does your research practice involve collecting, processing and storing information on persons?\n", " - ... identifiable persons ?\n", " - ... vulnerable persons ?\n", " - ... children ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- How do you inform persons/subjects on what you will be doing ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Human Research Ethics Committee at EPFL (HREC)\n", "\n", "The role of the [HREC](http://research-office.epfl.ch/research-ethics/research-ethics-assessment/epfl-human-research-ethics-committee/hrec) is to **review any research project carried out at EPFL involving non-invasive human research** from an ethical point of view, before the beginning of the project.\n", "\n", "Contact person: [Esther van der Velde](https://people.epfl.ch/elisabeth.vandervelde?lang=fr)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Collecting consent \n", "\n", "“Research involving human beings may only be carried out if, […], the persons concerned have given their informed consent or, after being duly informed, have not exercised their right to dissent. 
[…] The persons concerned may withhold or revoke their consent at any time, without stating their reasons.” *Human Research Act (HRA), article 7. *\n", "\n", "The consent must be:\n", "- Simple, understandable,\n", "- Adapted to the subject (child, teenager...) (HRA Art. 21-22)\n", "- See the following resources :\n", " - [Ethical Issues Checklists](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment/ethical-issues-checklists) (restricted access)\n", " - [Non Invasive Research](http://research-office.epfl.ch/op/edit/page-117394.html)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Sources:**\n", "\n", "* [H2020 Programme Guidance : How to complete your ethics self assessment](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf), 12th July 2016. Page 1.\n", "* http://research-office.epfl.ch/op/edit/page-117394.html\n", "* Swiss Academy of Medical Sciences (SAMS) (2015). “Research with human subjects. A manual for practitioners.” 2nd edition, http://swissethics.ch/doc/swissethics/manual_research_nov2015_e.pdf\n", "* Federal Act on Research involving Human Beings (Human Research Act, HRA) of 30 September 2011 (Status as of 1 January 2014). https://www.admin.ch/opc/en/classified-compilation/20061313/index.html\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.2. Data ? What data ? Personal data ? Sensitive data ?\n", "\n", "\n", "\n", "**personal data**\n", "\n", "* all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)\n", "* examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**sensitive personal data**\n", "\n", "According to the Swiss FADP (article 3 c.), data on: \n", "\n", "1. religious, ideological, political or trade union-related views or activities,\n", "2. **health, the intimate sphere or the racial origin**,\n", "3. social security measures,\n", "4. administrative or criminal proceedings and sanctions;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "- What data do you typically use (collect, process, store) in the course of a research project ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Among these data which ones are **personal** ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Among these data which ones are **sensitive** ?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If you work with personal or sensitive data,\n", "\n", "you should check the Research Office website: [Research Office Ethics Assessment](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment), especially the [**checklists**](http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment/ethical-issues-checklists) (login with Gaspar)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.3 Doing what with data ?\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Personal or sensitive data processing\n", "\n", "**Swiss [Federal Act on Data Protection](https://www.admin.ch/opc/en/classified-compilation/19920153/index.html) (FADP) (or Loi sur la Protection des Données LPD), article 3 e.**: \n", "any operation with personal data, irrespective of the means applied and the procedure, and in particular:\n", "* the collection, \n", "* storage, \n", "* use, \n", "* revision, \n", "* disclosure, \n", "* archiving \n", "* or destruction \n", "\n", "of data;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Make sure that...**\n", "\n", "* data processing is carried out in good faith and only for the purpose indicated at the time of collection [...] (FADP article 4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* consent is valid (given voluntarily on the provision of adequate information) (FADP article 4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* data is correct (FADP article 5)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* participants can obtain information about their personal data (FADP article 8, HRA article 8)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* data are rendered anonymous as soon as the purpose of the processing permits (FADP article 22, Processing for research)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.1.4. 
Protecting and disclosing personal data\n", "\n", "#### Protection\n", "\n", "Personal data must be protected against unauthorised processing through adequate technical and organisational measures (Swiss FADP, article 7).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Disclosure**\n", "\n", "Making personal data accessible, for example:\n", "* by permitting access,\n", "* transmission\n", "* or publication.\n", "\n", "(Swiss FADP article 3 f.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Cross-border disclosure**\n", "\n", "Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection. \n", "\n", "Cross-border disclosure of personal data must be protected against unauthorised processing through adequate technical and organisational measures. \n", "\n", "(FADP Art. 6)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Anonymisation\n", "\n", "Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if:\n", "* the data is rendered anonymous, as soon as the purpose of the processing permits;\n", "* the recipient only discloses the data with the consent of the federal body and\n", "* the results are published in such a manner that the data subjects may not be identified." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**References**\n", "\n", "* Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014) (235.1).\n", "\n", "* Directive 95/46/EC of the European Parliament & of the Council, of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 31).\n", " * [Directive 95/46/EC](http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=URISERV%3Al14012)\n", " * As of 2018: [REGULATION (EU) 2016/679 repealing Directive 95/46/EC](http://eur-lex.europa.eu/legal-content/de/TXT/?uri=CELEX%3A32016R0679)\n", " * [H2020 Programme Guidance : How to complete your ethics self-assessment](http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf), 12.7.2016\n", "* Information\n", " * http://research-office.epfl.ch/ethique-recherche/research-ethics-assessment/ethical-review/personal-data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.2 - Anonymization methods\n", "\n", "Privacy protection methods work by either:\n", "\n", "* removing,\n", "* generalizing or\n", "* encrypting\n", "\n", "personal information from datasets." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Note:** Anonymization or de-identification\n", "\n", "* **Anonymization** is irreversible.\n", "* **De-identification** may include preserving identifiers that can be re-linked by a trusted party.\n", "\n", "For legal aspects see HRA, article 35." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In passing, there is more to this (Privacy-Preserving Data Mining Methods / Charu Aggarwal and Philip Yu. 
2008.):\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### k-anonymity\n", "\n", "\n", "##### Definition\n", "\n", "\"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release\" ([Source](https://en.wikipedia.org/wiki/K-anonymity))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Illustration\n", "\n", "Example including removal and generalization (same source):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|-----------|-----|--------|-------------------|-----------|-----------------|\n", "| Ramsha | 29 | Female | Tamil Nadu | Hindu | Cancer |\n", "| Yadu | 24 | Female | Kerala | Hindu | Viral infection |\n", "| Salima | 28 | Female | Tamil Nadu | Muslim | TB |\n", "| sunny | 27 | Male | Karnataka | Parsi | No illness |\n", "| Joan | 24 | Female | Kerala | Christian | Heart-related |\n", "| Bahuksana | 23 | Male | Karnataka | Buddhist | TB |\n", "| Rambha | 19 | Male | Kerala | Hindu | Cancer |\n", "| Kishor | 29 | Male | Karnataka | Hindu | Heart-related |\n", "| Johnson | 17 | Male | Kerala | Christian | Heart-related |\n", "| John | 19 | Male | Kerala | Christian | Viral infection |\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To (name and religion were removed, age was generalized):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|------|---------------|--------|-------------------|----------|-----------------|\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | Cancer |\n", "| * | 20 < Age ≤ 30 | Female | Kerala | * | Viral infection |\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | TB |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | 
* | No illness |\n", "| * | 20 < Age ≤ 30 | Female | Kerala | * | Heart-related |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | TB |\n", "| * | Age ≤ 20 | Male | Kerala | * | Cancer |\n", "| * | 20 < Age ≤ 30 | Male | Karnataka | * | Heart-related |\n", "| * | Age ≤ 20 | Male | Kerala | * | Heart-related |\n", "| * | Age ≤ 20 | Male | Kerala | * | Viral infection |\n", "\n", "This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### l-diversity - motivation\n", "\n", "An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:\n", "* **homogeneity attacks**: when all rows in a group share the same sensitive value,\n", "* **background knowledge attacks**: when knowledge about a field reduces the set of possible sensitive values (e.g. knowing that heart attacks are not frequent in Japanese patients) ([source](https://en.wikipedia.org/wiki/K-anonymity)). \n", "\n", "Imagine the following group, or equivalence class, extracted from the whole dataset (table adapted from the one above):\n", "\n", "| Name | Age | Gender | State of domicile | Religion | Disease |\n", "|------|---------------|--------|-------------------|----------|-----------------|\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "| * | 20 < Age ≤ 30 | Female | Tamil Nadu | * | AIDS |\n", "\n", "If it is known that Miss Smith was part of the study, is aged between 20 and 30 and lives in Tamil Nadu, then it is certain that she has AIDS, even though we have 3-anonymity." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### l-diversity - definition\n", "\n", "**The l-diversity Principle** : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.\n", "\n", "There are several definitions of \"well-represented\" ([source](https://en.wikipedia.org/wiki/L-diversity)).\n", "\n", "By the way, l-diversity has weaknesses too, which is why **t-closeness** was introduced." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "##### t-closeness - motivation\n", "\n", "The l-diversity requirement ensures “diversity” of sensitive values in each group, but it does not recognize that values may be semantically close. For example, an attacker could deduce that an individual has a stomach disease if the group containing that individual lists only three different stomach diseases (adapted from [source](https://en.wikipedia.org/wiki/T-closeness))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "##### t-closeness - definition\n", "\n", "**The t-closeness Principle**: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness ([source](https://en.wikipedia.org/wiki/T-closeness))." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### differential privacy\n", "\n", "**By linking with another database**: linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "*Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1* ([source](https://en.wikipedia.org/wiki/Differential_privacy))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Anonymization - theory and tools\n", "\n", "\n", "\n", "Statistical Disclosure Control / Hundepool et al. 2012. [Ebook / EPFL library](http://proquest.safaribooksonline.com/9781118348215). \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Tools\n", "* **sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation (R package)**\n", "* ARX Data Anonymization Tool (Java: library & GUI)\n", "* μ-ARGUS (Java, GUI)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "## x.y - Reproducibility\n", "\n", "According to a 2012 article in Nature, the findings of **47 out of 53** landmark preclinical cancer research papers could not be reproduced (1).\n", "\n", "An earlier study showed in 2009 that **16 out of 18 bioinformatics papers could not be reproduced** entirely (2).\n", "\n", "In 2004, it was found that fewer than **9% of papers share their code** (3).\n", "\n", "<font size=\"1\">(1) Begley, C. G.; Ellis, L. M. (2012). \"Drug development: Raise standards for preclinical cancer research\". Nature 483 (7391): 531–533.<br /> (2) Ioannidis JPA, Allison DB, Ball CA, et al. 
Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.<br /> (3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. \"Reproducible research in signal processing.\" Signal Processing Magazine, IEEE 26.3 (2009): 37-47 </font><br />\n", "\n", "<font size=\"1\">[Slide inspired by https://github.com/saloot/IPythonClass , Amir Hessam Salavati & Robin Schiebler, 2015]</font>\n", "\n", "### A workflow for reproducible research\n", "\n", "Researchers often start to think about reproducibility at the end of projects. That is sometimes too late: by then numerous versions of code and datasets may be spread across various places (folders, Dropbox, USB drives...). \n", "\n", "A practical five-point approach:\n", "\n", "1. document everything \n", "2. everything is a (text) file\n", "3. files should be human readable\n", "4. explicitly tie your files together\n", "5. have a plan to organize, store and make your files available\n", "\n", "Slide inspired by chapter 2 of *Reproducible Research with R and RStudio*.\n", "\n", "More details:\n", "\n", "* document everything \n", " * reproduction requires documentation of what you did\n", "\n", "* everything is a text file\n", " * notably: data, code and results\n", " * the simplest formats are the best: CSV / JSON, Markdown / $\\LaTeX$, because they are future-proof\n", "\n", "* files should be human readable\n", " * treat all files as if someone who does not know the project will have to use them\n", " * otherwise they (or you 6 months later) will probably not understand them\n", " * important elements to document: \n", " * description of what the file is or does (in general, local comments)\n", " * contributors\n", " * date of last update\n", "\n", "* explicitly tie your files together, including generated documents\n", " * locally or using persistent identifiers\n", " * formalize the way data is processed\n", " * generally difficult to trace back (e.g.: how was a specific figure 
generated?)\n", "\n", "* have a plan to organize, store and make your files available\n", " * the data management plan :)\n", "\n", "#### Further readings\n", "\n", "* Implementing Reproducible Research / Victoria Stodden, Friedrich Leisch, and Roger D. Peng. [Ebook via EPFL Library](http://www.crcnetbase.com/isbn/9781466561601)\n", "* Reproducible Research with R and RStudio / Christopher Gandrud [at EPFL Library](http://library.epfl.ch/nebis/?isbn=978-1-4987-1537-9)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.3 - Collaborative tools\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.3.1 - File sharing\n", "\n", "\n", "\n", "- Personal/group level: ownCloud, free software available for Mac, Windows, Linux, iOS, Android and the web.\n", " - Your own server: ownCloud https://owncloud.org/\n", " - Many plugins: contacts, calendar, collaborative writing, image galleries, etc.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Swiss level: SwitchDrive https://drive.switch.ch/\n", " - ownCloud with 25 GB per user, \n", " - Restricted to members of Swiss universities.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A recent fork of ownCloud: [NextCloud](https://nextcloud.com/) aims for more transparent development processes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.3.2 - Collaborative writing\n", "#### File sharing is not enough\n", "\n", "People increasingly need to collaborate at a finer level.\n", "\n", "\n", "Source: Pr. Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "Source: Pr. 
Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "Source: Pr. Vandergheynst, EPFL Library [Noon Talk, 25.8.2016](http://library.epfl.ch/noon-talks/en)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**In summary**\n", "\n", "**Word processors**: their comment and revision-mode features are not sufficient for good collaboration.\n", "\n", "**Google Documents** and related tools are not geared towards scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.\n", "\n", "**$\\Rightarrow$ we need something else!**\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Share LaTeX\n", "\n", "**[Share LaTeX](https://de.sharelatex.com/)** is an alternative to Authorea: collaborative writing based on LaTeX. Suited for LaTeX power users. \n", "\n", "Access provided by SWITCH, via the [Sandstorm platform](https://sandstorm.cloud.switch.ch/).\n", "\n", "Good, but only if all partners are LaTeX users." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Authorea\n", "\n", "**[Authorea](https://www.authorea.com/)**: collaborative writing, easy to use.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "- Free account to test (limited to 1 private document, no limits on public documents). EPFL licence provided by the Library." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Simple syntax: WYSIWYG and Markdown (lightweight text formatting language). 
More complex formatting possible using LaTeX " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Enables others to make comments" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Supports interactive documents / figures (Jupyter)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Offline synchronization on a personal computer (using the Git version control system) \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.3.3 - Tools for coding and analyzing data\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Computational workflow (e.g. SnakeMake)\n", "<center><img src=\"./Images/AiiDA.png\" width=\"400\" height=\"300\" /></center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### x.y.z - Collaborative versioning and branching\n", "\n", "#### Git\n", "\n", "\n", "\n", "Git is a **multi-platform** (Windows, Mac, GNU/Linux) version control tool.\n", "\n", "Git servers\n", "* [GitHub](https://github.com/), very popular, data hosted in the US. Closed repositories are limited (payment or subject to other conditions).\n", "* [c4science](https://c4science.ch/) is the Swiss collaborative development platform. Unlimited number of repositories (open / closed). 
\n", "\n", "\n", "#### Git workflows\n", "\n", "Git will not, however, do everything for you.\n", "\n", "- You need to think up a naming convention (folder structure, file names) e.g.\n", " - PROJECT-Experiment-Researcher(ORCID)-YYYYMMDD.extension\n", " - PROJECT-Experiment-Researcher(ORCID)-Software-Format-YYYYMMDD.extension\n", " - PROJECT-Experiment-Researcher(ORCID)-Software-Version-Format-YYYYMMDD.extension\n", "- Set up an appropriate workflow.\n", "\n", "Locally\n", "\n", "\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA).\n", "\n", "The easiest way is to use a centralized repository.\n", "\n", "\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA).\n", "\n", "For more complex projects, a project leader can manage the quality.\n", "\n", "\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA).\n", "\n", "For big projects, it is possible to delegate responsibilities.\n", "\n", "\n", "\n", "Source: [J.-L. Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA).\n", "\n", "Non-linear development is supported through branches \n", "\n", "\n", "\n", "Source: [J.-L. 
Falcone](https://www.youtube.com/watch?v=KrHrJoGNpaA).\n", "\n", "#### Git and GitHub are not suited for long-term preservation\n", "\n", "\n", "\n", "* Some Git commands can delete data (namely: *rebase* and *reset --hard*)\n", "* Repositories can be deleted (including on GitHub)\n", "* A GitHub $\\Rightarrow$ Zenodo link can be set up, so that each release is automatically made citable through a DOI and preserved in Zenodo.\n", "\n", "Guide: [Making your code citable](https://guides.github.com/activities/citable-code/)\n", "\n", "\n", "\n", "\n", "\n", "### x.y.z - Jupyter, Jupyterhub, Sagemath\n", "\n", "\n", "\n", "#### Jupyter\n", "\n", "Interactive **Jupyter Notebook** documents: [try.jupyter.org](http://try.jupyter.org)\n", "\n", "\n", "\n", "Structure:\n", "\n", "* Rich hypertext cells (including tables, $\\LaTeX$, images, videos)\n", "* Live code cells (with interactive widgets)\n", "\n", "Characteristics \n", "\n", "* Over 50 languages supported: Python, R, Octave, Bash, Matlab, Scala, Java, Haskell...\n", "\n", "* Can be visualized online using [nbviewer](https://nbviewer.jupyter.org/). 
(e.g.: http://norvig.com/ipython/Economics.ipynb ).\n", " * Nbviewer is integrated in GitHub and Zenodo\n", "\n", "* Jupyter Notebooks are JSON files $\\rightarrow$ can be tracked with Git.\n", "\n", "* Nbconvert allows conversion to many formats, including Python:\n", " * jupyter nbconvert notebook.ipynb --to python\n", " * jupyter nbconvert notebook.ipynb --to latex\n", " * jupyter nbconvert notebook.ipynb --to markdown\n", " * jupyter nbconvert notebook.ipynb --to pdf\n", " * jupyter nbconvert notebook.ipynb --to slides\n", " * jupyter nbconvert notebook.ipynb --to html\n", "\n", "* Executing from the command line:\n", " * jupyter nbconvert --to notebook --execute mynotebook.ipynb\n", "\n", "\n", " \n", "\n", "#### Powerful Python libraries\n", "\n", "- [Pandas](http://pandas.pydata.org/) is a powerful library providing high-performance, easy-to-use data structures and data analysis tools. [Examples](http://pandas.pydata.org/pandas-docs/stable/visualization.html).\n", "- [NumPy](http://www.numpy.org/) is the fundamental package for scientific computing with Python:\n", " - N-dimensional array object\n", " - sophisticated (broadcasting) functions\n", " - tools for integrating C/C++ and Fortran code\n", " - useful linear algebra, Fourier transform, and random number capabilities\n", "- [Matplotlib](http://matplotlib.org/) is a plotting library with great flexibility. It has features comparable to Matlab plotting. [Examples](http://matplotlib.org/gallery.html).\n", "- [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) relies on Pandas (see above). [Examples](https://stanford.edu/~mwaskom/software/seaborn/examples/).\n", "- [NetworkX](https://networkx.github.io/) is suited for complex network analysis and representation. [Examples](http://networkx.github.io/documentation/latest/gallery.html).\n", "- [rpy2](http://rpy2.bitbucket.org/) is an interface to R running embedded in a Python process. 
\n", "\n", "\n", "\n", "\n", "#### And web libraries \n", "\n", "* [Bokeh](http://bokeh.pydata.org/en/latest/) is a Python interactive visualization library that targets modern web browsers for presentation. \n", "* [D3.js](https://d3js.org/) is an open source JavaScript library for creating interactive documents based on data. D3 helps bring data to life using HTML, SVG, and CSS. It can be used in Jupyter with matplotlib via [mpld3](http://mpld3.github.io/). \n", "\n", "\n", "\n", "\n", "\n", "\n", "#### Jupyterhub\n", "\n", "* Multi-user Jupyter server (system users, or via GitHub)\n", "* Collaborate in a local folder, by mounting a VPSI share, or with Git.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### x.y.z - R, RStudio and RStudio server\n", "\n", "#### R\n", "\n", "R is a free software environment for statistical computing and graphics. [One of the best](https://en.wikipedia.org/wiki/R_(programming_language)).\n", "\n", "\n", "Platforms:\n", "* wide variety of GNU/Linux and UNIX platforms, \n", "* Windows\n", "* macOS\n", "\n", "Strength: the diversity of quality open extensions (easily installable with [CRAN](https://cran.r-project.org/)).\n", "\n", "#### RStudio\n", "\n", "RStudio is a free and open-source integrated development environment (IDE) for R.\n", "\n", "\n", "\n", "#### R and reproducible research\n", "\n", "\n", "\n", "#### Reproducible research and documents\n", "* *knitr* and *rmarkdown*\n", "* tying together results and their presentation in articles (PDF, Word), presentations or web sites \n", "* notably in $\\LaTeX$ (.Rtex) or Markdown (.Rmarkdown)\n", "* well integrated in RStudio\n", "\n", "\n", "#### Rmarkdown\n", "\n", "Include R code chunks in Markdown:\n", "\n", " # Prime numbers\n", " \n", " Storing a few prime numbers in a variable:\n", "\n", " ```{r}\n", " primes <- c(2,3,5,7,11,13)\n", " ```\n", " Done.\n", "\n", "First you need to set up 
document properties in YAML:\n", "\n", " ---\n", " title: \"Rmarkdown example\"\n", " author: \"Jan Krause\"\n", " date: \"24 November 2016\"\n", " output: pdf_document\n", " ---\n", "\n", " # Prime numbers\n", " \n", " Storing a few prime numbers in a variable:\n", "\n", " ```{r}\n", " primes <- c(2,3,5,7,11,13)\n", " ```\n", " Done.\n", "\n", "\n", "\n", "#### RStudio Server\n", "\n", "RStudio in your browser.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "## x.y - Computational workflow management\n", "\n", "Scientific results are often the outcome of complex workflows. The computation steps form a graph, which may be difficult to reproduce.\n", "\n", "\n", "### x.y.z - AiiDA\n", "\n", "AiiDA is free software developed at EPFL (in materials science): http://www.aiida.net/\n", "\n", "<center><img src=\"./Images/AiiDA.png\" width=\"400\" height=\"300\" /></center>\n", "\n", "\n", "### x.y.z - SnakeMake: a simple tool\n", "\n", "* simple: nodes are connected through files (inspired by GNU Make)\n", "* complete:\n", " * supports remote files (http(s), sftp, dropbox, googledrive)\n", " * handles data provenance and rule versions, \n", " * parallelization, \n", " * suspend/resume, \n", " * logging, \n", " * creates schema\n", "* flexible: the Snakefile is an extension of Python\n", "* http://snakemake.bitbucket.org/\n", "\n", "**Simple Rule:**\n", "\n", "\n", " rule sort:\n", " input:\n", " f = \"path/to/dataset.txt\"\n", " output:\n", " f = \"dataset.sorted.txt\"\n", " shell:\n", " \"sort {input.f} > {output.f}\"\n", "\n", "**Simple Rule (two inputs):**\n", "\n", "\n", " rule sort:\n", " input:\n", " f1 = \"dataset1.txt\",\n", " f2 = \"dataset2.txt\"\n", " output:\n", " f = \"dataset.sorted.txt\"\n", " shell:\n", " \"sort {input.f1} {input.f2} > {output.f}\"\n", "\n", "**Simple Rule (here in Python, but R scripts are supported too):**\n", "\n", "\n", " rule sort:\n", " 
input:\n", " a=\"path/to/dataset.txt\"\n", " output:\n", " b=\"dataset.sorted.txt\"\n", " run:\n", " with open(output.b, \"w\") as out:\n", " for l in sorted(open(input.a)):\n", " out.write(l)\n", " \n", "\n", "**More than one rule:**\n", "\n", "\n", " rule result:\n", " input:\n", " 'result.txt'\n", "\n", " rule generate_cal_2017:\n", " input:\n", " ()\n", " output:\n", " fname = \"tmp/cal.txt\"\n", " shell:\n", " \"cal 2017 > {output.fname}\"\n", "\n", " rule describe:\n", " input:\n", " fname1 = \"DESCRIPTION.txt\",\n", " fname2 = \"tmp/cal.txt\"\n", " output:\n", " fname = \"result.txt\"\n", " shell:\n", " \"cat {input.fname1} {input.fname2} > {output.fname}\"\n", "\n", " \n", "\n", "**Expand (running rules in parallel):**\n", "\n", " DATASETS = [\"D1\", \"D2\", \"D3\", \"D4\", \"D5\", \"D6\"]\n", "\n", " rule all:\n", " input:\n", " expand(\"{dataset}.sorted.txt\", dataset=DATASETS)\n", "\n", " rule sort:\n", " input:\n", " \"{dataset}.txt\"\n", " output:\n", " \"{dataset}.sorted.txt\"\n", " shell:\n", " \"sort {input} > {output}\"\n", " \n", "\n", "**Output: example graph**\n", "\n", "\n", "\n", "**Output: log (simplified)**\n", "\n", "| output_file | date | rule | version |\n", "|-----------------------|--------------------------|---------------|----------|\n", "| result.txt | Fri Nov 11 15:48:17 2016 | cleanup | 3.14 |\n", "| tmp/pre-result.txt | Fri Nov 11 15:48:17 2016 | add_head_foot | 1.02 |\n", "| tmp/FOOT.txt | Fri Nov 11 15:48:17 2016 | generate_foot | 5.6 |\n", "| tmp/HEAD.txt | Fri Nov 11 15:48:17 2016 | generate_head | 5.6 |\n", "| tmp/described_cal.txt | Fri Nov 11 15:48:17 2016 | describe | 0.1alpha |\n", "| tmp/cal.txt | Fri Nov 11 15:48:17 2016 | generate_cal | 8.234 |\n", "\n", "**More about workflows**\n", "\n", "Another tool is Taverna, which includes the desktop-oriented [Taverna Workbench](https://taverna.incubator.apache.org/download/) (multi-platform and open source) as well as command-line and server applications.\n", "\n", "Finally, 
**myExperiment** is a platform for sharing scientific workflows; it is notably fully supported by Taverna." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "**EPFL platforms**\n", "\n", "- **EPFL SV** sLIMS\n", " - http://sv-it.epfl.ch/slims\n", " - Gaël Anex, Nicolas Argento, Peter Hliva\n", " - Laboratory information management system \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "- **EPFL SCITAS** (Victoria Rezzonico)\n", " - High-performance computing and data storage " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.4 - Data and Storage" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.1 - (Meta)data formats" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Metadata\n", "\n", "- Most common generalist metadata formats: [Dublin Core (DCES)](http://dublincore.org/documents/dces/), [Dublin Core (DCMI)](http://dublincore.org/documents/dcmi-terms/), [DataCite Metadata Schema](https://schema.datacite.org/). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Numerous specialized metadata formats are available for most disciplines; the Research Data Alliance [Metadata Directory](http://rd-alliance.github.io/metadata-directory/) is a good starting point.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data format\n", "\n", "Prefer a\n", "\n", "- **standard format**,\n", "- **open** and\n", "- **widely used** \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This way your data will not depend upon a particular piece of software (or company), operating system, or platform. 
And you will be able to:\n", "- collaborate with more people (on various platforms)\n", "- avoid licensing problems\n", "- maximize reusability in the future" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Some open formats to take into account\n", "- Portable Document Format **PDF/A, ISO standard**, text [PDF for archiving, no encryption, embedded fonts...]\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Text**: a simple way to encode data. Can be read by most software.\n", " - CSV: tables, can be read by most software, and extended using [CSV on the Web](https://www.w3.org/standards/techs/csv) (metadata, datatypes, relations...)\n", " - JSON: simply structured, less bulky than XML, ideal for data exchange." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* **Geodata**\n", " * [ISO 19115-1:2014](http://www.iso.org/iso/catalogue_detail.htm?csnumber=53798): the standard.\n", " * [GeoJSON](http://geojson.org/): lighter." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **HDF5**, more flexible (not text, but structured and indexed, supports arbitrary metadata, good performance).\n", " - Compatible with many tools (Python, R, Matlab, Mathematica...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Databases:** \n", " - SQL: [PostgreSQL](https://www.postgresql.org/) is relational, open and efficient\n", " - Big data: [MongoDB](https://www.mongodb.com/) for volume, velocity, and variety" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data formats list\n", "\n", "Sustainability of digital formats by the US Library of Congress. 
[This list](http://www.digitalpreservation.gov/formats/) is categorized by datatypes (text, audio, image, video, geospatial, dataset, etc.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.4.2 - Storage, publication and preservation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Data access sustainability\n", "\n", "A 2014 PLOS ONE study showed that **more than 60% of links to datasets are broken after 10 years** (1)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Another 2014 PLOS ONE article showed that **the bibliography of 1 out of every 5 articles is affected by that phenomenon** (2)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "<font size=\"1\">(1) Pepe et al. (2014). How Do Astronomers Share Data? Reliability and Persistence of Datasets\n", "Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers.\n", "PLoS ONE, 9(8). doi:10.1371/journal.pone.0104798 <br />\n", "(2) Klein et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference\n", "Rot. doi:10.1371/journal.pone.0115253</font>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Digital preservation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "At CERN, studies from 2007 (1,2) showed an error ratio of $10^{-7}$ (over 2 months).\n", "\n", "Causes are complex and varied: disk errors, RAID errors, memory errors, etc.\n", "\n", "For 1 gigabyte (1000 megabytes, i.e. $10^9$ bytes), we have:\n", "$10^9 \\cdot 10^{-7} = 10^2 = 100$ bytes of bitrot." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "<font size=\"1\">\n", "(1) https://indico.cern.ch/event/13797/session/0/contribution/3/attachments/115080/163419/Data_integrity_v3.pdf <br />\n", "(2) http://www.zdnet.com/article/data-corruption-is-worse-than-you-know/\n", "</font>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### 2.4.2.1 - Storage at UNIBAS" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Digital Humanities Lab**\n", "\n", "General information: http://dhlab.unibas.ch/ \n", "\n", "* Organizing the national Data and Service Center for the Humanities (DaSCH): http://dh-center.ch/ \n", "* Services for students and researchers at the University of Basel and national institutes.\n", "* Storage options for primary data as and secondary data as databases\n", "* Focus on data from the humanities, audiovisual data\n", "\n", "Contact: [sekretariat-dhlab@unibas.ch](mailto:sekretariat-dhlab@unibas.ch) or [info@dasch.swiss](mailto:info@dasch.swiss) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**SciCore**\n", "\n", "General information: https://scicore.unibas.ch/ \n", "\n", "Services for students and researchers at the University of Basel, associated institutes and the Swiss Institute of Bioinformatics.\n", "\n", "* Providing **high-performance computing resources** (computing cluster with 8000 cores)\n", "* Providing **high-performance storage** for researchers with large data sets (~1-10 TB) and/or with complex computational requirements (e.g. Linux workflows) and/or subject to special requirements (e.g. 
sensitive data)\n", "* Providing **storage for projects with large data volumes** (over 10 TB, up to 500 TB); this requires a dedicated project definition, discussed with the PI\n", "* Providing **scientific-service hosting (web sites)** for resources with significant back-end requirements (storage and/or calculation)\n", "* Providing various types of **consulting for data analysis and management**.\n", "\n", "Contact for technical questions: [scicore-admin@unibas.ch](mailto:scicore-admin@unibas.ch)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**University Library Basel**\n", "\n", "General information: http://ub.unibas.ch/ \n", "\n", "* Support and information on using **disciplinary and multi-storage options/repositories**\n", "* Support for **individual solutions**\n", "* Building up **infrastructures** – looking for **test cases**\n", "\n", "Contact: [silke.bellanger@unibas.ch](mailto:silke.bellanger@unibas.ch)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### 2.4.2.2 - Publication and preservation\n", "\n", "#### Research data publication\n", "\n", "“It is the **release of research data, associated metadata, accompanying documentation, and software code […] for re-use and analysis** in such a manner that they can be discovered on the Web and referred to in a unique and persistent way.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Data publishing occurs **via dedicated data repositories and/or (data) journals** which ensure that the published research objects are well documented, curated, archived for the long term, interoperable, citable, quality assured and discoverable \n", "– all aspects of data publishing that are important for future reuse of data by third party end-users.”" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, 
"source": [ "Austin, C. C., Bloom, T. K., Dallmeier-Tiessen, S., Khodiyar, V., Murphy, F., Nurnberger, A., . . . Whyte, A. (2016). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Comparison: Dryad - Figshare - Zenodo\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data Papers, Data Journals\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Backup vs. Preservation\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### EPFL storage options\n", "\n", "\n", "\n", "EPFL offers many storage options, as described on the VPSI page [Databases, Storage and Virtualization](https://it.epfl.ch/business_service.do?sysparm_document_key=cmdb_ci_service,90cbd58e0ff121009f8579f692050eb7&sysparm_service=Bases_de_donnees_et_Stockage_Serveurs).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "**EPFL Storage Prices 2016**\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Why publish in a data archive?\n", "\n", "**Accelerate science and careers**\n", "\n", "Many studies show there are significant advantages for articles that share their code or data.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Source: Drachen, T.M. et al., (2016). Sharing data increases citations. LIBER Quarterly. 26(2), pp.67–82. 
DOI: http://doi.org/10.18352/lq.10149" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### Avoid bias in science \n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### Machine learning needs\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "\n", "\n", "\n", "Machine learning is a promising discipline, but it requires access to data. Data mining is not a viable solution.\n", "\n", "Source: Barend Mons, IDCC, Amsterdam 2016." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data repositories\n", "\n", "- **Zenodo** (hosted by CERN, free) http://zenodo.org\n", " - either EPFL or CHILI community\n", "\n", "<center><img src=\"./Images/Zenodo.png\" width=\"600\" height=\"450\" /></center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Other data repositories\n", "\n", "- **Dryad** («curated», non-profit organisation, partnership with publishers) http://datadryad.org/\n", "\n", "- **Figshare** (commercial, belongs to Macmillan [as does NPG]) http://figshare.com/\n", "\n", "- For more information see **[re3data](http://re3data.org)**, in which more than 1,500 data repositories are described.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data Citation\n", "\n", "- Always use persistent identifiers to avoid broken links (about 60% of links are broken after 10 years)\n", "- The most common persistent identifier is the DOI (digital object identifier)\n", " - e.g.: http://doi.org/10.5281/zenodo.7525\n", "- Zenodo, Figshare, Dryad and Infoscience can provide DOIs." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2.5 - Licences\n", "\n", "A licence lets you define how your data can be reused. For instance:\n", "\n", "\n", "Creative Commons (**CC0** and **CC-BY**) http://creativecommons.org/ Since version 4.0 of the licences, the sui generis right protecting database content is taken into account (in addition to the form protected by copyright) https://wiki.creativecommons.org/wiki/Data\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/8YkbeycRa2A\" frameborder=\"0\" allowfullscreen></iframe>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Integration\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "<font size=\"6\"><center>You can contact us in the future here: \n", "<br /><br /><br />\n", "<br />Contact: <font size=\"6\" color=\"blue\">silke.bellanger@unibas.ch</font> & <font size=\"6\" color=\"blue\">researchdata@epfl.ch</font>\n", "\n", "<br /><br /><br />\n", "\n", "<font size=\"6\"><center>We look forward to hearing from you!</center></font>\n", "\n", "<br /><br /><br />\n", "\n", "<font size=\"6\"><center>Silke, Aude, Jan, and Lorenza</center></font>\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 2", "language": "python", - "name": "python3" + "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", - "version": 3 + "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.2" + "pygments_lexer": "ipython2", + "version": "2.7.14" } }, "nbformat": 4, - "nbformat_minor": 0 + "nbformat_minor": 1 }