diff --git a/OptimizeResearchDataManagement.ipynb b/OptimizeResearchDataManagement.ipynb index 3f60a88..b476e24 100755 --- a/OptimizeResearchDataManagement.ipynb +++ b/OptimizeResearchDataManagement.ipynb @@ -1,1541 +1,1553 @@ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\n", "
\n", "\n", "

\n", "\n", "

Optimizing your research data management

\n", "

Theory, practice and customized advice

\n", "
\n", "

Aude Dieudé and Jan Krause

\n", "

\n", "datamanagementplan@epfl.ch\n", "\n", "
\n", "
Licence Creative Commons, \"CC-By-SA\"
\n", "
Friday May 20th, 2016
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1.1 - Introduction" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Definition, context and best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Introduction: [video](https://www.youtube.com/watch?v=N2zK3sAtr-4)," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Data Management Plan (DMP): definition, actors and context," ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Key examples of DMPs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Inclusion of the participants' questions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1.2 - Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Definition : Research data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "- For the Organization for Economic Cooperation and Development [OCDE](http://www.oecd.org/fr/sti/sci-tech/38500823.pdf), research data are defined as factual recording (numbers, texts, images and sounds), which are used as principal sources for scientific research and which are often recognized by the scientific community as being necessary to validate research results." + "- For the Organization for Economic Cooperation and Development [OCDE](http://www.oecd.org/fr/sti/sci-tech/38500823.pdf), research data are defined as factual recordings (numbers, texts, images and sounds), which are used as principal sources for scientific research and which are often recognized by the scientific community as being necessary to validate research results." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- One key element to take into consideration during research data management are the legal, ethical and political aspects based on the sensitivity of the data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Definition : Data Management Plan" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Data Management Plan (DMP) refers to the strategies put into place to\n", "create, store, share, maintain, archive and preserve research data\n", "throughout their life cycle.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- The DMP describes which data are going to be produced and how each\n", "type of data will be organized, classified, archived, shared, distributed,\n", "secured and preserved in a secure way." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "- Here is a [video](https://www.youtube.com/watch?v=gYDb-GP1CA4), which illustrates how the DMP works concretly:" + "- Here is a [video](https://www.youtube.com/watch?v=gYDb-GP1CA4), which illustrates how the DMP works concretely:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Research Data Lifecycle" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1.3 - Actors, Skills, Policies and Guidelines" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Actors" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "![Actors](Images/Actors2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Skills" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Policies and Guidelines" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Useful resources\n", "The Digital Curation Center has set up many resources to help institutions develop their own institutional policies and guidelines for research data management:\n", "- [DCC Policy tools and guidance](http://www.dcc.ac.uk/resources/policy-and-legal/policy-tools-and-guidance/policy-tools-and-guidance)\n", "- [Five Steps to Developing a Research Data Policy](http://www.dcc.ac.uk/sites/default/files/documents/publications/DCC-FiveStepsToDevelopingAnRDMpolicy.pdf)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Examples of institutional policies:\n", "- [University of Cambridge](http://www.data.cam.ac.uk/research-data-policies)\n", "- [University of Oxford](http://www.admin.ox.ac.uk/media/global/wwwadminoxacuk/localsites/researchdatamanagement/documents/Policy_on_the_Management_of_Research_Data_and_Records.pdf)\n", "- [University of Edinburgh](http://www.ed.ac.uk/information-services/about/policies-and-regulations/research-data-policy)\n", "- [Humboldt-Universität zu Berlin](https://www.cms.hu-berlin.de/de/ueberblick/projekte/dataman/policy/policy-en/rdm-eng-policy)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Requirements regarding research data management\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Editors\n", "\n", "Many editors and scientific journals require, under specific\n", "conditions, the publication of used data to achieve the research project\n", "results (permanent archiving, standardized formats, etc). This is the case,\n", "for instance, with PLoS and Nature Publishing Group. 
An overview of the\n", "editorial policies are available online on this [Dryad website](http://wiki.datadryad.org/Journal_instructions)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Funding agencies\n", "See next slide" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1.4 - Funders and DMPs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Examples of funders which require DMPs or equivalent:\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Funding agency and DMP : Horizon 2020\n", "\n", "![.](Images/H2020.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- [Horizon 2020](http://research-office.epfl.ch/financements/international/horizon-2020): is the biggest funding agency from the European Commission \n", "with nearly €80 billion of funding available over 7 years from 2014 to 2020. Its\n", "main objective is to promote and support excellence in the scientific field." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Horizon 2020 requires for some research projects the preparation of a [data management plan](http://ec.europa.eu/programmes/horizon2020/en/what-horizon-2020), which is mandatory in order to receive research funding. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "- [As of 2017](https://ec.europa.eu/digital-single-market/en/news/communication-european-cloud-initiative-building-competitive-data-and-knowledge-economy-europe), the Commission will make **open research data the default option**, while ensuring opt-outs, for all new projects of the Horizon 2020 program." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Funding agency and DMP : SNSF\n", "![.](Images/SNSF.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- SNSF Policy and coordination with research communities and other actors [to be established](http://forscenter.ch/wp-content/uploads/2014/11/DART_Slides_iki.pdf):\n", "\n", "- Ongoing developement of a **research data management policy** together with infrastructure policy\n", "\n", "- Submission of **data management plans** with the grant application" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1.5 - DMP best practices" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Best practices examples: DMPonline (UK)\n", "
\n", "
http://dmponline.dcc.ac.uk
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Best practices examples: EPFL (Switzerland)\n", "To provide guidance in preparing a DMP, the **[EPFL-ETHZ checklist](http://library.epfl.ch/files/content/sites/library3/files/research-data/dmp/Data_management_plan_checklist_EPFL_2016.pdf)** includes\n", "four categories to cover questions related to:\n", "- Research Data Acquisition : type, quantity, license, etc.\n", "- Research Data Format : format, metadata, identification, etc.\n", "- Research Data Sharing : embargo, intellectual property, etc.\n", "- Data Preservation : storage, sensitivity of the data, archiving, etc.\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 2 - Tools Selection Through the Datacycle\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Part 2.1 - Issues related to data: \n", "\n", "\n", "### Reproducibility issues\n", "According to a Nature study in 2012, **47 out of 53** medical research papers are irreproducible (1)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A previous study showed in 2009 that **16 out of 18 bioinformatics papers could not be reproduced** entirely (2)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In 2004, it was found that less than **9% of papers share their code** (3)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "(1) Begley, C. G.; Ellis, L. M. (2012). \"Drug development: Raise standards for preclinical cancer research\". Nature 483 (7391): 531–533.
(2) Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.
(3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. \"Reproducible research in signal processing.\" Signal Processing Magazine, IEEE 26.3 (2009): 37-47

\n", "\n", "[Slide inspired by https://github.com/saloot/IPythonClass , Amir Hessam Salavati & ,Robin Schiebler 2015 ]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Data access sustainability\n", "\n", "A Plos One study showed in 2014 that **more than 60% of links to datasets are broken after 10 years** (1)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ - "Another Plos One 2014 article showed that **the bibliography of 1 article out of 5 is impacted by that phenomenon** (2)." + "Another Plos One 2014 article showed that **the bibliography of 1 out of every 5 articles is impacted by that phenomenon** (2)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "(1) Pepe et al. (2014). How Do Astronomers Share Data? Reliability and Persistence of Datasets\n", "Linked in AAS Publications and a Qualitative Study of Data Practices among US Astronomers.\n", "PLoS ONE, 9(8). doi:10.1371/journal.pone.0104798
\n", "(2) Klein et al. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference\n", "Rot. doi:10.1371/journal.pone.0115253
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Find and reuse" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**It is generally not easy to find pertinent datasets... by lack of a good description.**\n", "\n", - "**Their reuse require an even more precise description.**\n" + "**Their reuse requires an even more precise description.**\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Digital preservation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "At CERN, a 2007 studies (1,2) showed that the error ratio was of $10^{-7}$ (over 2 months).\n", "\n", "Causes are complex and varied: disk errors, RAID errors, memory errors, etc.\n", "\n", "For 1 Gigabyte (1000 Mégabytes), we have:\n", "$10^9 \\cdot 10^{-7} = 10^2 = 100$ bytes of bitrot." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "(1) https://indico.cern.ch/event/13797/session/0/contribution/3/attachments/115080/163419/Data_integrity_v3.pdf
\n", "(2) http://www.zdnet.com/article/data-corruption-is-worse-than-you-know/\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 2.2 - Tools and Solutions\n", "\n", "![.](Images/tools.jpg)\n", "For more tools, see A Selection of Research Data Management Tools Throughout the Data Lifecycle / Jan Krause" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 2.2.1 - A trusted data repository\n", "\n", "Criteria:\n", "- **Broken links**: use persistent identifiers such as **DOIs**,\n", "- **Reliability**: data preservation (e.g. OAIS standard),\n", "- **Visibility**: schema.org for search engines, OAI-PMH2 standard and/or **well known community repository**\n", "- **Searchability**: at least a basic metadata standard (e.g. **DublinCore**)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "## Part 2.2.5 - Data repositories\n", + "## 2.2.2 - Data repositories\n", "\n", "- **Zenodo** (hosted by CERN, free) http://zenodo.org\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Other data repositories\n", "\n", "- **Dryad** («curated», non-profit organisation, partnership with publishers) http://datadryad.org/\n", "\n", "- **Figshare** (commercial, belongs to Macmillian [as does NPG]) http://figshare.com/\n", "\n", "- Form more information see [re3data](http://re3data.org) in which more than 1'500 data repositoris are described." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "## 2.2.2 - Adequate metadata\n", + "## 2.2.3 - Adequate metadata\n", "\n", "- Most common generalist metadata formats: [Dublin Core](http://dublincore.org/documents/dces/), [Qualified Dublin Core](http://dublincore.org/documents/usageguide/qualifiers.shtml), [DataCite Metadata Schema](https://schema.datacite.org/). \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Numerous specilized metadata formats are available for most disciplines, the Research Data Alliance [Metadata Directory](http://rd-alliance.github.io/metadata-directory/) is a good starting point.\n", "\n", "![.](Images/MetadataDirectory.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "## 2.2.3 - Adequate data format\n", + "## 2.2.4 - Adequate data format\n", "\n", "Prefer a\n", "- **standard format**,\n", "- **open** and\n", "- **widely used** \n", "\n", "standard.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This way your data will not depend upon a particular software (or\n", "company), operating system, or platform.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Some open formats to take into account\n", "- Portable Document Format **PDF/A, ISO standard**, text [PDF for archiving, no 
ciphers, included fonts...]\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **Text CSV** mind the encoding, unicode, e.g. UTF-8 a good solution." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Structured Querry Language (**SQL**). Supports relations between tables." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **HDF5**, more flexible (but structured and indexed, supports arbitrary metadata)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Interactive **Jupyter Notebooks** documents. Richtext, formulas (LaTeX), charts and code. All dynamic. It can also be used for presentations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Jupyter Notebooks: [try.jupyter.org](http://try.jupyter.org)\n", "\n", "![.](Images/jupyterpreview.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Data formats list\n", "\n", "Sustainability of digital formats by the US Library of Congress. [This list](http://www.digitalpreservation.gov/formats/) is categorized by datatypes (text, audio, image, video, geospacial, dataset, etc.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "## Part 2.2.4 - Adequate licences\n", + "## Part 2.2.5 - Adequate licences\n", "\n", "A licence allows to define the way your data can be reused. 
For instance:\n", "\n", "\n", "Creative Commons (**CC0** and **CC-BY**) http://creativecommons.org/ Since CC4.0, sui generis law protecting database content is taken into account (in addition to the form protected by copyright) https://wiki.creativecommons.org/wiki/Data\n", "\n", "![.](Images/CCbyncsa_others.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "## Part 2.2.5 - Collaborative tools\n" + "## Part 2.2.6 - Collaborative tools\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### File sharing\n", "\n", "![.](Images/owncloud.png)\n", "\n", "- Personal/group level: OwnCloud, free software: Mac, Windows, Linux, iOS, Android... Web.\n", " - Your own server: OwnCloud https://owncloud.org/\n", - " - Many plugins: contacts, calendar, collaborative writing, images galleries, etc.\n", + " - Many plugins: contacts, calendar, collaborative writing, image galleries, etc.\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Swiss level: SwitchDrive https://drive.switch.ch/\n", " - Owncloud with 25 Go by user, \n", " - Restricted to Swiss universities members.\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- FIXME: https://cozy.io" + ] + }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### EPFL storage options\n", "\n", "![.](Images/epfl_logo.png)\n", "\n", "EPFL offers many storage options, as described on the VPSI page [Databases, Storage and Virtualization](https://it.epfl.ch/business_service.do?sysparm_document_key=cmdb_ci_service,90cbd58e0ff121009f8579f692050eb7&sysparm_service=Bases_de_donnees_et_Stockage_Serveurs).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } 
}, "source": [ "### Code sharing, branching and versioning\n", "\n", "![.](Images/git.png)\n", "\n", - "[c4science](https://c4science.ch/) is the Swiss collaborative development platform. Accessible to all academic members via Switch AAI, will allow invitation of external colleagues (probably starting in June 2016). c4science offers:" + "[c4science](https://c4science.ch/) is the Swiss collaborative development platform. Accessible to all academic members via Switch AAI, it allows invitation of external colleagues (via a GitHub, Google or even a local account). \n", + "\n", + "c4science offers:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- **managing code versions and branches** with Git, SVN or Mercurial.\n", "- **group work** (and subgroups)\n", "- **project management** (tasks and backlog)\n", "- **documentation** (integrated wiki)\n", "- **communication** (discussions)\n", "- **supports large binary files** (git-lfs) \n", "- **continuous integration** (witn Jenkins)\n", "- and much more." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Scientific workflow management\n", "\n", - "Scientific results are often the outcome of complex worflows. Computation operations constitute a graph, which may be difficult to reproduce." + "Scientific results are often the outcome of complex workflows. Computation operations constitute a graph, which may be difficult to reproduce.\n", + "\n", + "FIXME: extend the benefits." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "fragment" + "slide_type": "subslide" } }, "source": [ - "At EPFL, development of AiiDA a free software (in material sciences): http://www.aiida.net/\n", + "Taverna is an excellent workflow engine. 
It includes the desktop-oriented [Taverna Workbench](https://taverna.incubator.apache.org/download/) (multi-platform and open source), command-line and server applications:\n", "\n", - "
" + "![Taverna](Images/Taverna_Workbench_example1.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { - "slide_type": "subslide" + "slide_type": "fragment" } }, "source": [ - "Taverna is an excellent workflow engine. It includes the desktop oriented [Taverna Workbench](https://taverna.incubator.apache.org/download/ (multi-platform and open source), command-line and server applications:\n", + "In addition, AiiDA a free software has been developed at EPFL (in material sciences): http://www.aiida.net/\n", "\n", - "![Taverna](Images/Taverna_Workbench_example1.png)" + "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ - "Other workflow management tools:\n", - "\n", - "- [Kepler Project](https://kepler-project.org/), a desktop tool.\n", - "- [Pegasus](https://pegasus.isi.edu/), a sever tool.\n", - "\n", "**myExperiment** is a platform for sharing scientific workfows, and fully supported by Taverna." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Collaborative writing tools\n", "\n", "- **[Authorea](https://www.authorea.com/)**: collaborative writing, easy to use, LaTeX supported but not required (EPFL licence provided by the Library) ![.](Images/Authorea.png)\n", "\n", "- **[Sahre LaTeX](https://de.sharelatex.com/)**: collaborative writing based on LaTeX. Suited for LaTeX power users. ![.](Images/ShareLaTeX.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- **[Zotero](https://www.zotero.org/)**: bibliographic management, citation, sharing and discovery tool (SFP cours from 2016 by the Library and [Dr Zotero](http://library.epfl.ch/doctor-zotero/en)) ![.](Images/Zotero.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### EPFL platforms\n", "\n", "- **EPFL SV** sLIMS\n", " - http://sv-it.epfl.ch/slims\n", " - Gaël Anex, Nicolas Argento, Peter Hliva\n", " - Laboratory information management system ![.](Images/SLims.png)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- **EPFL SCITAS** (Victoria Rezzonico)\n", " - High Performance Computing and data Storage ![.](Images/scitas.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ - "## Part 2.2.6 - Why Open Data and Reproducibility?\n", + "## Part 2.2.7 - Why Open Data and Reproducibility?\n", "\n", - "- For better reproducibility & for 
the sake of science" + "- For better reproducibility & for the sake of science. FIXME: use examples from SISB training." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "- Papers with shared data were cited about 70% more frequently (1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ - "- Open access papers seems to be consistently cited more (2)" + "- Open access papers seem to be consistently cited more (2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "- Papers with code available are cited more than those without code (3)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "- [Five selfish reasons to work reproducibly](http://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "- Increasingly required by funders and editor " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "\n", "(1) Piwowar, H. a et al. Sharing detailed research data is associated with increased citation rate. PloS one. 2, (2007), 308.
\n", "(2) Antelman, Kristin. \"Do open-access articles have a greater research impact?.\" College & research libraries 65.5 (2004):372-382.
\n", "(3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. \"Reproducible research in signal processing.\" Signal Processing Magazine, IEEE 26.3 (2009): 37-47.
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Part 2.3 - Data visualization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

2.3.1 - Interactive data visualization examples

\n", "\n", "\n", "\n", "\n", "\n", "
US Electricity Generation by Power Source (Washington Post, 2015)
Oil Drilling Collapse in the USA (Bloomberg, 2016)
Swiss Federal Office of Topography (SwissTopo, 2016)
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### 2.3.1 - Interactive data visualization examples\n", "\n", "- [US Electricity Generation by Power Source](https://www.washingtonpost.com/graphics/national/power-plants/) (Washington Post, 2015)\n", "- [Oil Drilling Collapse in the USA](http://www.bloomberg.com/graphics/2016-oil-rigs/) (Bloomberg, 2016)\n", "- [Swiss Federal Office of Topography](https://map.geo.admin.ch/?lang=fr&topic=energie&bgLayer=ch.swisstopo.pixelkarte-grau&layers_visibility=false,false,false,false&layers_timestamp=18641231,,,&catalogNodes=2419,2420,2427,2480,2429,2431,2434,2436,2441) (SwissTopo, 2016)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

2.3.2 - Data visualization posters

\n", "\n", "\n", "\n", "\n", "
US Energy Map (EcoWest, 2013)
The American energy spectrum (Eric Fenny, 2009)
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### 2.3.2 - Data visualization posters\n", "\n", "- [US Energy Map](http://ecowest.org/wp-content/uploads/2013/06/Saxum_Energy_Front.jpg) (EcoWest, 2013)\n", "- [The American energy spectrum](http://ericfenny.com/resources/images/data_viz/2a.jpg) (Eric Fenny, 2009)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 2.3.3 - Visualization software" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "### 2.3.3 - Visualization software\n", "\n", "Visualization tools may be categorized by their flexibility and simplicity of use. Here is a short selection:\n", "\n", - "| Flexible | In-between | Simple of use |\n", + "| Flexible | In-between | Simplicity of use |\n", "| ---------- | ---------- | ------------- |\n", "| Matplotlib | Seaborn | Gephi |\n", - "| NetworkX | Pandas | Tableau |\n" + "| NetworkX | Pandas | Tableau |\n", + "\n", + "FIXME: re-categorize in Gephi + Jupyter (bookeh (mpl), pandas, networkx) \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Gephi\n", "[Gephi](https://gephi.org/) : free multiplatform data analysis software. [More information and examples](https://gephi.org/features/). [See the video presentation](https://player.vimeo.com/video/9726202).\n", "\n", "To explore in more depth, see [video tutorial](https://www.youtube.com/watch?v=yZ0G9jljCto).\n", "\n", "![.](Images/gephi-tutorial-image.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "** Circos **\n", "\n", "[Circos](http://circos.ca/) is an open source desktop application for visualizing data in circular layouts. It is ideal for exploring relationships between objects or positions. 
[Examples](http://circos.ca/images/published/).\n", "\n", "![.](Images/circos3.jpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Tableau \n", "\n", "[Tableau](https://www.tableau.com) : commercial software, coming with different varieties: desktop, server, cloud, reader, online, or public. See [here](http://www.tableau.com/products/desktop) for example.\n", "\n", "![.](Images/TableauAnimated2.gif)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Seaborn, Pandas, Matplotlib, NetworkX (Python based)\n", "These open Python packages may:\n", "- be combined,\n", "- be used within a [Jupyter Notebook](http://jupyter.org/), for user-friendly interactive work, which can be shared easily.\n", "\n", "![.](Images/jupyterpreview.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- [Pandas](http://pandas.pydata.org/) is a powerful library providing high-performance, easy-to-use data structures and data analysis tools. [Examples](http://pandas.pydata.org/pandas-docs/stable/visualization.html).\n", "- [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) relies on Pandas (see below). [Examples](https://stanford.edu/~mwaskom/software/seaborn/examples/).\n", - "- [NetworkX](https://networkx.github.io/) is suited for complex netrwoks analysis and representation. [Examples](http://networkx.github.io/documentation/latest/gallery.html).\n", - "- [Matplotlib](http://matplotlib.org/) is a ploting library with a great flexibility. It has comparable features to Matlab ploting. [Examples](http://matplotlib.org/gallery.html).\n", + "- [NetworkX](https://networkx.github.io/) is suited for complex networks analysis and representation. [Examples](http://networkx.github.io/documentation/latest/gallery.html).\n", + "- [Matplotlib](http://matplotlib.org/) is a plotting library with great flexibility. 
It has features comparable to MATLAB plotting. [Examples](http://matplotlib.org/gallery.html).\n", "\n", "![.](Images/python.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Web oriented\n", "\n", "##### D3.js\n", "\n", "**[D3.js](https://d3js.org/) is an open source JavaScript library for creating interactive documents based on data**. D3 helps bring data to life using HTML, SVG, and CSS. As mentioned above, it can be used in conjunction with Matplotlib via [mpld3](http://mpld3.github.io/). [D3.js examples](https://github.com/mbostock/d3/wiki/Gallery). In addition, the [rickshaw](http://code.shutterstock.com/rickshaw/) library extends D3 for time-series representation. The [NVD3 project](http://nvd3.org/examples/index.html) offers a collection of reusable charts based on D3.js.\n", "\n", "![.](Images/d3.js.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### Other web oriented tools\n", "\n", "- **Time Series:** \n", " - [Cubism](https://square.github.io/cubism/) and\n", " - [Envision](http://www.humblesoftware.com/envision) are libraries for representing time series. \n", "\n", "- **Maps:** \n", " - [Kartograph](http://kartograph.org/) is a lightweight framework for building interactive map applications. The tool has two components: a Python library for creating maps, and a JavaScript library for creating interactive maps on the Web. \n", " - [Polymaps](http://polymaps.org/) is a library for making dynamic and interactive maps. [Examples](http://polymaps.org/ex/). \n", " - [Leaflet](http://leafletjs.com/) is a mobile-friendly library for interactive maps.\n", "\n", "- **Data applications:** \n", " - [Recline](http://okfnlabs.org/recline/) is a library for building data applications in pure JavaScript and HTML."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "### 2.3.4 - References regarding data visualization\n", "\n", "Examples (web):\n", "- Specialized visualizations: http://opendata.cern.ch/ \n", "\n", "References (books):\n", "- \"Data Flow: Visualising Information in Graphic Design\" ISBN: 978-3-89955-217-1\n", "- \"DataVision\" / David McCandless and Dorothee Cuneod. 2011. ISBN: 978-2221126752\n", " - see also [online examples](http://www.informationisbeautiful.net/).\n", "- \"DataVision 2\" / David McCandless and Dorothee Cuneod. 2014. ISBN: 978-222145920 " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## 2.4 - Resources for more information\n", "\n", "- [How to prepare a DMP (Université Paris Diderot)](http://www.univ-paris-diderot.fr/DocumentsFCK/recherche/Realiser_un_DMP_V1.pdf)\n", "\n", "- [Arts and humanities](https://docs.google.com/document/d/1WNYDmqEfv8OdiHQvdC63_yocUx7rgqMTOYUoqOt4R8U/edit?pli=1)\n", "\n", "- [INIST-CNRS](http://www.inist.fr/donnees/)\n", "\n", "- [EDINA](http://datalib.edina.ac.uk/xerte/play.php?template_id=2)\n", "\n", "- [MIT](http://libraries.mit.edu/data-management/plan/why/)\n", "\n", "- [Data sharing and management Snafu](https://www.youtube.com/watch?v=N2zK3sAtr-4)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
Thank you!
\n", "\n", "


\n", "\n", "
Any questions and suggestions are welcome :)
\n", "\n", "


\n", "\n", "
\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
You can contact us in the future here: \n", "


\n", "datamanagementplan@epfl.ch
\n", "\n", "


\n", "\n", "
We look forward to hearing from you!
\n", "\n", "


\n", "\n", "
Aude and Jan
\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.5.1+" + "version": "3.5.2" + }, + "widgets": { + "state": {}, + "version": "1.0.0" } }, "nbformat": 4, "nbformat_minor": 0 }