<h2 id="Definition,-context-and-best-practices">Definition, context and best practices<a class="anchor-link" href="#Definition,-context-and-best-practices">¶</a></h2>
<h2 id="Inclusion-of-the-participants'-questions">Inclusion of the participants' questions<a class="anchor-link" href="#Inclusion-of-the-participants'-questions">¶</a></h2>
<li>The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations.</li>
<li>The Organisation for Economic Co-operation and Development (<a href="http://www.oecd.org/fr/sti/sci-tech/38500823.pdf">OECD</a>) defines research data as factual records (numbers, texts, images and sounds) that are used as primary sources for scientific research and that are commonly accepted by the scientific community as necessary to validate research results.</li>
<li>Key elements to consider during research data management are the legal, ethical and political aspects, which depend on the sensitivity of the data.</li>
<h3 id="Useful-resources">Useful resources<a class="anchor-link" href="#Useful-resources">¶</a></h3><p>The Digital Curation Centre (DCC) has set up many resources to help institutions develop their own policies and guidelines for research data management:</p>
<ul>
<li><a href="http://dlcm.ch">Digital Lifecycle Management (DLCM) Swiss National Project</a></li>
<li><a href="http://www.dcc.ac.uk/resources/policy-and-legal/policy-tools-and-guidance/policy-tools-and-guidance">DCC Policy tools and guidance</a></li>
<li><a href="http://www.dcc.ac.uk/sites/default/files/documents/publications/DCC-FiveStepsToDevelopingAnRDMpolicy.pdf">Five Steps to Developing a Research Data Policy</a></li>
<h3 id="Examples-of-institutional-policies:">Examples of institutional policies:<a class="anchor-link" href="#Examples-of-institutional-policies:">¶</a></h3><ul>
<li><a href="http://www.data.cam.ac.uk/research-data-policies">University of Cambridge</a></li>
<li><a href="http://www.admin.ox.ac.uk/media/global/wwwadminoxacuk/localsites/researchdatamanagement/documents/Policy_on_the_Management_of_Research_Data_and_Records.pdf">University of Oxford</a></li>
<li><a href="http://www.ed.ac.uk/information-services/about/policies-and-regulations/research-data-policy">University of Edinburgh</a></li>
<li><a href="https://www.cms.hu-berlin.de/de/ueberblick/projekte/dataman/policy/policy-en/rdm-eng-policy">Humboldt-Universität zu Berlin</a></li>
<h2 id="Requirements-regarding-research-data-management">Requirements regarding research data management<a class="anchor-link" href="#Requirements-regarding-research-data-management">¶</a></h2>
<li><a href="http://research-office.epfl.ch/financements/international/horizon-2020">Horizon 2020</a> is the biggest research funding programme of the European Commission,
with nearly €80 billion of funding available over 7 years, from 2014 to 2020. Its
main objective is to promote and support excellence in the scientific field.</li>
<li>For some research projects, Horizon 2020 requires a <a href="http://ec.europa.eu/programmes/horizon2020/en/what-horizon-2020">data management plan</a> as a condition for receiving research funding.</li>
<li><a href="https://ec.europa.eu/digital-single-market/en/news/communication-european-cloud-initiative-building-competitive-data-and-knowledge-economy-europe">As of 2017</a>, the Commission will make <strong>open research data the default option</strong>, while ensuring opt-outs, for all new projects of the Horizon 2020 program.</li>
<h2 id="Best-practices-examples:-EPFL-(Switzerland)">Best practices examples: EPFL (Switzerland)<a class="anchor-link" href="#Best-practices-examples:-EPFL-(Switzerland)">¶</a></h2><p>To provide guidance in preparing a DMP, the <strong><a href="http://library.epfl.ch/files/content/sites/library3/files/research-data/dmp/Data_management_plan_checklist_EPFL_2016.pdf">EPFL-ETHZ checklist</a></strong> includes
four categories to cover questions related to:</p>
<ul>
<li>Research Data Acquisition: type, quantity, license, etc.</li>
<li>Research Data Format: format, metadata, identification, etc.</li>
<li>Research Data Sharing: embargo, intellectual property, etc.</li>
<li>Data Preservation: storage, sensitivity of the data, archiving, etc.<center><img src="./Images/EPFL-checklist.png" width="600" height="450" /></center></li>
<h2 id="Part-2.0---Benefits-of-data-publication">Part 2.0 - Benefits of data publication<a class="anchor-link" href="#Part-2.0---Benefits-of-data-publication">¶</a></h2><h3 id="Accelerate-science-and-careers">Accelerate science and careers<a class="anchor-link" href="#Accelerate-science-and-careers">¶</a></h3><p>Many studies show significant advantages for articles that share their code or data.</p>
<h3 id="An-example-in-astronomy">An example in astronomy<a class="anchor-link" href="#An-example-in-astronomy">¶</a></h3><p><img src="Images/Hubble.png" alt="Hubble"></p>
<h3 id="A-counter-example-in-the-biomedical-field">A counter example in the biomedical field<a class="anchor-link" href="#A-counter-example-in-the-biomedical-field">¶</a></h3><p><img src="Images/FDA_Turner.png" alt="FDA Turner"></p>
<h2 id="Part-2.1---Issues-related-to-data:">Part 2.1 - Issues related to data:<a class="anchor-link" href="#Part-2.1---Issues-related-to-data:">¶</a></h2><h3 id="Reproducibility-issues">Reproducibility issues<a class="anchor-link" href="#Reproducibility-issues">¶</a></h3><p>According to a 2012 article in Nature, the findings of <strong>47 out of 53</strong> preclinical cancer research papers could not be reproduced (1).</p>
<p><font size="1">(1) Begley, C. G.; Ellis, L. M. (2012). "Drug development: Raise standards for preclinical cancer research". Nature 483 (7391): 531–533.<br /> (2) Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.<br /> (3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. "Reproducible research in signal processing." Signal Processing Magazine, IEEE 26.3 (2009): 37-47 </font><br /></p>
<p><font size="1">[Slide inspired by https://github.com/saloot/IPythonClass, Amir Hessam Salavati and Robin Schiebler, 2015]</font></p>
<h3 id="Data-access-sustainability">Data access sustainability<a class="anchor-link" href="#Data-access-sustainability">¶</a></h3><p>A 2014 PLOS ONE study showed that <strong>more than 60% of links to datasets are broken after 10 years</strong> (1).</p>
<font size="3">For more tools, see <a href="https://infoscience.epfl.ch/record/211157">A Selection of Research Data Management Tools Throughout the Data Lifecycle / Jan Krause</a></font></p>
<h2 id="2.2.1---A-trusted-data-repository">2.2.1 - A trusted data repository<a class="anchor-link" href="#2.2.1---A-trusted-data-repository">¶</a></h2><p>Criteria:</p>
<ul>
<li><strong>Broken links</strong>: use persistent identifiers such as <strong>DOIs</strong>,</li>
<li><strong>Reliability</strong>: data preservation (e.g. OAIS standard),</li>
<li><strong>Visibility</strong>: schema.org markup for search engines, the OAI-PMH 2.0 standard and/or a <strong>well-known community repository</strong>,</li>
<li><strong>Searchability</strong>: at least a basic metadata standard (e.g. <strong>DublinCore</strong>).</li>
<li>Numerous specialized metadata formats are available for most disciplines; the Research Data Alliance <a href="http://rd-alliance.github.io/metadata-directory/">Metadata Directory</a> is a good starting point.</li>
<h3 id="Some-open-formats-to-take-into-account">Some open formats to take into account<a class="anchor-link" href="#Some-open-formats-to-take-into-account">¶</a></h3><ul>
<li>Portable Document Format <strong>PDF/A, ISO standard</strong>, for text [PDF for archiving: no encryption, embedded fonts...]</li>
<li><strong>Text CSV</strong>: a simple way to encode tables, readable by most software.<ul>
<li>CSV may be extended (metadata, datatypes, relations...) using the <a href="https://www.w3.org/standards/techs/csv">CSV on the Web</a> W3C specification.</li>
<li>Structured Query Language (<strong>SQL</strong>): supports relations between tables. <ul>
<li><a href="https://www.postgresql.org/">PostgreSQL</a> is open source and particularly efficient (see also <a href="http://postgis.net">PostGIS</a>)</li>
<li>MySQL (or <a href="https://mariadb.org/">MariaDB</a>) is supported by the EPFL central IT (<a href="http://mysql.epfl.ch">mysql.epfl.ch</a>)</li>
<li>Interactive <strong>Jupyter Notebook</strong> documents combine rich text, formulas (LaTeX), charts and code, all dynamic. They can also be used for presentations.<ul>
<h2 id="Data-formats-list">Data formats list<a class="anchor-link" href="#Data-formats-list">¶</a></h2><p>The US Library of Congress documents the sustainability of digital formats. <a href="http://www.digitalpreservation.gov/formats/">This list</a> is categorized by data type (text, audio, image, video, geospatial, dataset, etc.).</p>
<h2 id="Part-2.2.5---Adequate-licences">Part 2.2.5 - Adequate licences<a class="anchor-link" href="#Part-2.2.5---Adequate-licences">¶</a></h2><p>A licence defines how your data can be reused. For instance:</p>
<p>Creative Commons (<strong>CC0</strong> and <strong>CC-BY</strong>): <a href="http://creativecommons.org/">http://creativecommons.org/</a>. Since version 4.0, the Creative Commons licences also take into account the sui generis right protecting database content (in addition to the form, which is protected by copyright): <a href="https://wiki.creativecommons.org/wiki/Data">https://wiki.creativecommons.org/wiki/Data</a>.</p>
<h4 id="k-anonymity">k-anonymity<a class="anchor-link" href="#k-anonymity">¶</a></h4><h5 id="Definition">Definition<a class="anchor-link" href="#Definition">¶</a></h5><p>"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release" (<a href="https://en.wikipedia.org/wiki/K-anonymity">Source</a>).</p>
<h5 id="Illustration">Illustration<a class="anchor-link" href="#Illustration">¶</a></h5><p>Example including removal and generalization (same source):</p>
<p>Anonymized table (names and religions were removed, ages were generalized):</p>
<table>
<thead><tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
<th>State of domicile</th>
<th>Religion</th>
<th>Disease</th>
</tr>
</thead>
<tbody>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>Cancer</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Kerala</td>
<td>*</td>
<td>Viral infection</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>TB</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>No illness</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Kerala</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>TB</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Cancer</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Viral infection</td>
</tr>
</tbody>
</table>
<p>This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.</p>
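<p>The 2-anonymity claim above can be checked with a few lines of Python. This is only a minimal sketch: the function name <code>k_anonymity</code>, the tuple layout and the shortened age-band labels are illustrative assumptions, not part of the cited source.</p>

```python
from collections import Counter

# Rows of the anonymized table above: (age band, gender, state, disease).
# "20-30" stands for the band "20 < Age ≤ 30" and "0-20" for "Age ≤ 20".
ROWS = [
    ("20-30", "Female", "Tamil Nadu", "Cancer"),
    ("20-30", "Female", "Kerala", "Viral infection"),
    ("20-30", "Female", "Tamil Nadu", "TB"),
    ("20-30", "Male", "Karnataka", "No illness"),
    ("20-30", "Female", "Kerala", "Heart-related"),
    ("20-30", "Male", "Karnataka", "TB"),
    ("0-20", "Male", "Kerala", "Cancer"),
    ("20-30", "Male", "Karnataka", "Heart-related"),
    ("0-20", "Male", "Kerala", "Heart-related"),
    ("0-20", "Male", "Kerala", "Viral infection"),
]


def k_anonymity(rows, quasi_identifier_columns):
    """Return the size of the smallest group of rows that share the same
    values in the given quasi-identifier columns (hypothetical helper)."""
    groups = Counter(
        tuple(row[i] for i in quasi_identifier_columns) for row in rows
    )
    return min(groups.values())


# Age, gender and state of domicile are columns 0, 1 and 2.
k = k_anonymity(ROWS, (0, 1, 2))
print(k)  # 2: every combination occurs in at least two rows
```

<p>The smallest groups (for example the two women aged 20 to 30 living in Tamil Nadu) contain exactly two rows, which is why the table is 2-anonymous but not 3-anonymous.</p>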
<h4 id="l-diversity---motivation">l-diversity - motivation<a class="anchor-link" href="#l-diversity---motivation">¶</a></h4><p>An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:</p>
<ul>
<li><strong>homogeneity attacks</strong>: when all the rows of a group share the same sensitive value,</li>
<li><strong>background knowledge attacks</strong>: when knowledge about a field reduces the set of possible sensitive values (e.g. knowing that heart attacks are not frequent in Japanese patients) (<a href="https://en.wikipedia.org/wiki/K-anonymity">source</a>).</li>
</ul>
<p>Imagine the following group, extracted from the whole dataset (table adapted from the one above):</p>
<table>
<thead><tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
<th>State of domicile</th>
<th>Religion</th>
<th>Disease</th>
</tr>
</thead>
<tbody>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
</tbody>
</table>
<p>If it is known that Miss Smith was part of the study, is aged between 20 and 30 and lives in Tamil Nadu, then it is certain that she has AIDS, even though the group has 3-anonymity.</p>
<h5 id="l-diversity---definition">l-diversity - definition<a class="anchor-link" href="#l-diversity---definition">¶</a></h5><p><strong>The l-diversity Principle</strong> : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.</p>
<p>There are several definitions of "well-represented" (<a href="https://en.wikipedia.org/wiki/L-diversity">source</a>).</p>
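<p>As an illustration, the sketch below uses the simplest reading of "well-represented", namely counting distinct sensitive values per equivalence class (often called distinct l-diversity). That choice, the function name and the data layout are assumptions made for this example.</p>

```python
from collections import defaultdict


def distinct_l_diversity(rows, quasi_identifier_columns, sensitive_column):
    """Return the smallest number of distinct sensitive values found in any
    equivalence class (hypothetical helper, distinct l-diversity only)."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in quasi_identifier_columns)
        classes[key].add(row[sensitive_column])
    return min(len(values) for values in classes.values())


# The homogeneous group shown above: (age band, gender, state, disease).
HOMOGENEOUS_GROUP = [
    ("20-30", "Female", "Tamil Nadu", "AIDS"),
    ("20-30", "Female", "Tamil Nadu", "AIDS"),
    ("20-30", "Female", "Tamil Nadu", "AIDS"),
]

# For comparison, one group taken from the k-anonymity illustration.
MIXED_GROUP = [
    ("20-30", "Male", "Karnataka", "No illness"),
    ("20-30", "Male", "Karnataka", "TB"),
    ("20-30", "Male", "Karnataka", "Heart-related"),
]

l_homogeneous = distinct_l_diversity(HOMOGENEOUS_GROUP, (0, 1, 2), 3)
l_mixed = distinct_l_diversity(MIXED_GROUP, (0, 1, 2), 3)
print(l_homogeneous, l_mixed)
```

<p>Both groups are 3-anonymous, but only the second one offers any diversity in the sensitive attribute: the first has a single distinct value, the second has three.</p>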
<h4 id="t-closeness---motivation">t-closeness - motivation<a class="anchor-link" href="#t-closeness---motivation">¶</a></h4><p>Although the l-diversity requirement ensures “diversity” of sensitive values in each group, it does not recognize that values may be semantically close. For example, an attacker could deduce that an individual has a stomach disease if the group containing that individual lists only three different stomach diseases.</p>
<p>In real data sets, attribute values may be skewed or semantically similar. L-diversity hinders an attacker from leveraging the global distribution of an attribute's values to infer information about sensitive values, but not every value exhibits equal sensitivity (adapted from <a href="https://en.wikipedia.org/wiki/T-closeness">source</a>).</p>
<h4 id="t-closeness---Definition">t-closeness - Definition<a class="anchor-link" href="#t-closeness---Definition">¶</a></h4><p><strong>The t-closeness Principle</strong>: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness (<a href="https://en.wikipedia.org/wiki/T-closeness">source</a>).</p>
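<p>The definition leaves the choice of distance open. The sketch below assumes the variational distance (half the sum of absolute differences between two distributions) purely for illustration; other distances are possible, and all function names are hypothetical. It reuses the rows of the k-anonymity illustration, with shortened age-band labels.</p>

```python
from collections import Counter, defaultdict

# Rows of the k-anonymity illustration: (age band, gender, state, disease).
ROWS = [
    ("20-30", "Female", "Tamil Nadu", "Cancer"),
    ("20-30", "Female", "Kerala", "Viral infection"),
    ("20-30", "Female", "Tamil Nadu", "TB"),
    ("20-30", "Male", "Karnataka", "No illness"),
    ("20-30", "Female", "Kerala", "Heart-related"),
    ("20-30", "Male", "Karnataka", "TB"),
    ("0-20", "Male", "Kerala", "Cancer"),
    ("20-30", "Male", "Karnataka", "Heart-related"),
    ("0-20", "Male", "Kerala", "Heart-related"),
    ("0-20", "Male", "Kerala", "Viral infection"),
]


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def variational_distance(p, q):
    """Half the sum of absolute differences (assumed distance measure)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


def t_closeness(rows, quasi_identifier_columns, sensitive_column):
    """Largest distance between a class distribution and the overall one;
    the table has t-closeness for any threshold t at or above this value."""
    overall = distribution([row[sensitive_column] for row in rows])
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in quasi_identifier_columns)
        classes[key].append(row[sensitive_column])
    return max(
        variational_distance(distribution(values), overall)
        for values in classes.values()
    )


t = t_closeness(ROWS, (0, 1, 2), 3)
print(round(t, 3))
```

<p>Under these assumptions, the class furthest from the overall distribution is the group of two women aged 20 to 30 living in Tamil Nadu, whose diseases (cancer and TB) each make up half of the class but only a fifth of the whole table.</p>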
<h4 id="Anonimization---theory-and-tools">Anonymization - theory and tools<a class="anchor-link" href="#Anonimization---theory-and-tools">¶</a></h4><p><img src="Images/sdc.jpg" alt=""></p>
<p>Statistical Disclosure Control / Hundepool et al., 2012.</p>
<p>Ebook <a href="http://proquest.safaribooksonline.com/9781118348215">provided by the EPFL library</a>.</p>
<li>EPFL also offers various <a href="https://it.epfl.ch/business_service.do?sysparm_document_key=cmdb_ci_service,90cbd58e0ff121009f8579f692050eb7&sysparm_service=Bases_de_donnees_et_Stockage_Serveurs">storage options</a>; for HPC, contact the <a href="http://scitas.epfl.ch/">SCITAS</a> service.</li>
<h3 id="File-sharing-is-not-enough">File sharing is not enough<a class="anchor-link" href="#File-sharing-is-not-enough">¶</a></h3><p>People increasingly need to collaborate at a finer level.</p>
<h3 id="In-summary">In summary<a class="anchor-link" href="#In-summary">¶</a></h3><p><strong>Text processing</strong> comments / revision mode functionalities are sufficient for good collaboration.</p>
<p><strong>Google Scholar</strong> and related tools are not oriented towards scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.</p>
<p><strong> $\Rightarrow$ we need something else! </strong></p>
<p><strong><a href="https://de.sharelatex.com/">ShareLaTeX</a></strong> is an alternative to Authorea: a collaborative writing tool based on LaTeX, suited to LaTeX power users. <img src="Images/ShareLaTeX.png" alt="."></p>
<h3 id="Code-sharing,-branching-and-versioning">Code sharing, branching and versioning<a class="anchor-link" href="#Code-sharing,-branching-and-versioning">¶</a></h3><p><img src="Images/git.png" alt="."></p>
<p>Git is a <strong>multi-platform</strong> (Windows, Mac, GNU/Linux) version control tool.</p>
<p>Git Servers</p>
<ul>
<li><a href="https://github.com/">GitHub</a>: very popular; some data are hosted in the US. Closed repositories are limited (payment or subject to other conditions).</li>
<li><a href="https://c4science.ch/">c4science</a> is the Swiss collaborative development platform. It offers an unlimited number of repositories (open or closed).</li>
<h1 id="Git-and-GitHub-are-not-suited-for-long-term-preservation">Git and GitHub are not suited for long term preservation<a class="anchor-link" href="#Git-and-GitHub-are-not-suited-for-long-term-preservation">¶</a></h1><p><img src="Images/git.png" alt="."></p>
<ul>
<li>Some git commands can delete data (namely: <em>rebase</em> and <em>reset --hard</em>)</li>
<li>Repositories can be deleted (including on GitHub)</li>
<li>A GitHub $\Rightarrow$ Zenodo link can be set up so that each release is automatically made citable through a DOI and preserved in Zenodo.</li>
</ul>
<p><img src="Images/zenodo-logo.png" alt="."></p>
<p>More information: <a href="https://guides.github.com/activities/citable-code/">Making your code citable</a></p>
<p>EPFL offers many storage options, as described on the VPSI page <a href="https://it.epfl.ch/business_service.do?sysparm_document_key=cmdb_ci_service,90cbd58e0ff121009f8579f692050eb7&sysparm_service=Bases_de_donnees_et_Stockage_Serveurs">Databases, Storage and Virtualization</a>.</p>
<h3 id="Computational-workflow-management">Computational workflow management<a class="anchor-link" href="#Computational-workflow-management">¶</a></h3><p>Scientific results are often the outcome of complex workflows. The computational operations form a graph, which may be difficult to reproduce.</p>
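<p>To make the idea of a workflow graph concrete, here is a minimal sketch using the standard-library <code>graphlib</code> module (Python 3.9 or later). The step names are hypothetical; a real workflow tool would also run the steps and track their inputs and outputs.</p>

```python
from graphlib import TopologicalSorter

# Each (hypothetical) step maps to the set of steps it depends on.
WORKFLOW = {
    "clean": {"download"},
    "analyse": {"clean"},
    "plot": {"analyse"},
    "report": {"analyse", "plot"},
}

# A topological order lists every step after all of its dependencies,
# which is the order in which the workflow can be re-executed.
order = list(TopologicalSorter(WORKFLOW).static_order())
print(order)
```

<p>Recording the graph explicitly, rather than relying on memory or on the order of files in a folder, is what allows a workflow to be replayed by someone else.</p>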
<h3 id="More-about-workflows">More about workflows<a class="anchor-link" href="#More-about-workflows">¶</a></h3><p>Another tool is Taverna, which includes the desktop-oriented <a href="https://taverna.incubator.apache.org/download/">Taverna Workbench</a> (multi-platform and open source), command-line and server applications.</p>
<p>Finally, <strong>myExperiment</strong> is a platform for sharing scientific workflows and is notably fully supported by Taverna.</p>
<h4 id="Gephi">Gephi<a class="anchor-link" href="#Gephi">¶</a></h4><p><a href="https://gephi.org/">Gephi</a>: free, multi-platform data analysis software. <a href="https://gephi.org/features/">More information and examples</a>. <a href="https://player.vimeo.com/video/9726202">See the video presentation</a>.</p>
<p>To explore in more depth, see <a href="https://www.youtube.com/watch?v=yZ0G9jljCto">video tutorial</a>.</p>
<h4 id="Tableau">Tableau<a class="anchor-link" href="#Tableau">¶</a></h4><p><a href="https://www.tableau.com">Tableau</a>: commercial software available in several editions: desktop, server, cloud, reader, online and public. See <a href="http://www.tableau.com/products/desktop">here</a> for an example.</p>
<li><a href="http://pandas.pydata.org/">Pandas</a> is a powerful library providing high-performance, easy-to-use data structures and data analysis tools. <a href="http://pandas.pydata.org/pandas-docs/stable/visualization.html">Examples</a>.</li>
<li><a href="https://stanford.edu/~mwaskom/software/seaborn/">Seaborn</a> relies on Pandas (see above). <a href="https://stanford.edu/~mwaskom/software/seaborn/examples/">Examples</a>.</li>
<li><a href="https://networkx.github.io/">NetworkX</a> is suited to the analysis and representation of complex networks. <a href="http://networkx.github.io/documentation/latest/gallery.html">Examples</a>.</li>
<li><a href="http://matplotlib.org/">Matplotlib</a> is a plotting library with great flexibility. It has features comparable to Matlab plotting. <a href="http://matplotlib.org/gallery.html">Examples</a>.</li>
<h4 id="Web-oriented">Web oriented<a class="anchor-link" href="#Web-oriented">¶</a></h4><h5 id="Bokeh">Bokeh<a class="anchor-link" href="#Bokeh">¶</a></h5><p><a href="http://bokeh.pydata.org/en/latest/">Bokeh</a> is a Python interactive visualization library that targets modern web browsers for presentation.</p>
<h5 id="D3.js">D3.js<a class="anchor-link" href="#D3.js">¶</a></h5><p><a href="https://d3js.org/">D3.js</a> is an open source JavaScript library for creating interactive documents based on data. D3 helps bring data to life using HTML, SVG, and CSS. As mentioned above, it can be used in Jupyter with matplotlib via <a href="http://mpld3.github.io/">mpld3</a>.</p>
<h3 id="Notes:-3.2.1---Interactive-data-visualization-examples">Notes: 3.2.1 - Interactive data visualization examples<a class="anchor-link" href="#Notes:-3.2.1---Interactive-data-visualization-examples">¶</a></h3><ul>
<li><a href="https://www.washingtonpost.com/graphics/national/power-plants/">US Electricity Generation by Power Source</a> (Washington Post, 2015)</li>
<li><a href="http://www.bloomberg.com/graphics/2016-oil-rigs/">Oil Drilling Collapse in the USA</a> (Bloomberg, 2016)</li>
<li><a href="https://map.geo.admin.ch/?lang=fr&topic=energie&bgLayer=ch.swisstopo.pixelkarte-grau&layers_visibility=false,false,false,false&layers_timestamp=18641231,,,&catalogNodes=2419,2420,2427,2480,2429,2431,2434,2436,2441">Swiss Federal Office of Topography</a> (SwissTopo, 2016)</li>
<h2 id="4---Resources-for-more-information">4 - Resources for more information<a class="anchor-link" href="#4---Resources-for-more-information">¶</a></h2><ul>
<li><p><a href="http://www.univ-paris-diderot.fr/DocumentsFCK/recherche/Realiser_un_DMP_V1.pdf">How to prepare a DMP Paris</a></p>
</li>
<li><p><a href="https://docs.google.com/document/d/1WNYDmqEfv8OdiHQvdC63_yocUx7rgqMTOYUoqOt4R8U/edit?pli=1">Art and humanities</a></p>