<h1 id="Optimizing-Research-Data-Management">Optimizing Research Data Management<a class="anchor-link" href="#Optimizing-Research-Data-Management">¶</a></h1><h2 id="University-of-Basel">University of Basel<a class="anchor-link" href="#University-of-Basel">¶</a></h2><h3 id="Wednesday-March-22-and-Thursday-May-11,-2017">Wednesday March 22 and Thursday May 11, 2017<a class="anchor-link" href="#Wednesday-March-22-and-Thursday-May-11,-2017">¶</a></h3><h4 id="Aude-Dieudé-,-Jan-Krause-,-Lorenza-Salvatori-(EPFL)-&-Silke-Bellanger-(UNIBAS)">Aude Dieudé , Jan Krause , Lorenza Salvatori (EPFL) & Silke Bellanger (UNIBAS)<a class="anchor-link" href="#Aude-Dieudé-,-Jan-Krause-,-Lorenza-Salvatori-(EPFL)-&-Silke-Bellanger-(UNIBAS)">¶</a></h4><p><br />Contact: <font size="4" color="blue">silke.bellanger@unibas.ch</font> & <font size="4" color="blue">researchdata@epfl.ch</font></p>
<h3 id="Definition,-context-and-best-practices">Definition, context and best practices<a class="anchor-link" href="#Definition,-context-and-best-practices">¶</a></h3>
<li>The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations.</li>
<li>For the Organisation for Economic Co-operation and Development (<a href="http://www.oecd.org/fr/sti/sci-tech/38500823.pdf">OECD</a>), research data are defined as factual records (numbers, texts, images and sounds) that are used as primary sources for scientific research and that the scientific community generally recognizes as necessary to validate research results.</li>
<li>Key elements to consider in research data management are the legal, ethical and political aspects, which depend on the sensitivity of the data.</li>
<h3 id="Requirements-regarding-research-data-management">Requirements regarding research data management<a class="anchor-link" href="#Requirements-regarding-research-data-management">¶</a></h3>
<h3 id="Publishers">Publishers<a class="anchor-link" href="#Publishers">¶</a></h3><p>Many publishers and scientific journals require, under specific
conditions, the publication of the data used to obtain the research
results (permanent archiving, standardized formats, etc.). This is the case,
for instance, with PLoS and Nature Publishing Group. A list of
editorial policies is available online on this <a href="http://wiki.datadryad.org/Journal_instructions">Dryad website</a>. Note: This page seems to be a one-off publication and is not exhaustive.</p>
<li>For some research projects, Horizon 2020 requires a <a href="http://ec.europa.eu/programmes/horizon2020/en/what-horizon-2020">data management plan</a> as a condition for receiving research funding.</li>
<li><a href="https://ec.europa.eu/digital-single-market/en/news/communication-european-cloud-initiative-building-competitive-data-and-knowledge-economy-europe">As of 2017</a>, the Commission will make <strong>open research data the default option</strong>, while ensuring opt-outs, for all new projects of the Horizon 2020 program.</li>
<p>Submission of <strong>data management plans</strong> with the grant application will be <strong>mandatory as of October 2017</strong>. See the <a href="http://www.snf.ch/en/researchinFocus/newsroom/Pages/news-170306-towards-open-research-data.aspx">communication</a>.</p>
<h3 id="Best-practices-examples:-EPFL-(Switzerland)">Best practices examples: EPFL (Switzerland)<a class="anchor-link" href="#Best-practices-examples:-EPFL-(Switzerland)">¶</a></h3><p>To provide guidance in preparing a DMP, the <strong><a href="http://library.epfl.ch/files/content/sites/library3/files/research-data/dmp/Data_management_plan_checklist_EPFL_2016.pdf">EPFL-ETHZ checklist</a></strong> includes
four categories to cover questions related to:</p>
<ul>
<li>Research Data Acquisition: type, quantity, license, etc.</li>
<li>Research Data Format: format, metadata, identification, etc.</li>
<li>Research Data Sharing: embargo, intellectual property, etc.</li>
<li>Data Preservation: storage, sensitivity of the data, archiving, etc.<center><img src="./Images/EPFL-checklist.png" width="600" height="450" /></center></li>
<h3 id="Guidelines-and-Policies,-University-Basel">Guidelines and Policies, University Basel<a class="anchor-link" href="#Guidelines-and-Policies,-University-Basel">¶</a></h3><p>Research data policy is in preparation.</p>
<p>Guidelines regarding good scientific practice: <a href="https://www.unibas.ch/en/Research/Research-in-Basel/Values-and-Principles.html">https://www.unibas.ch/en/Research/Research-in-Basel/Values-and-Principles.html</a></p>
<p>Information regarding general data and IT guidelines: <a href="https://its.unibas.ch/content.cfm?content=586">https://its.unibas.ch/content.cfm?content=586</a></p>
<h3 id="2.1.1.-When-human-beings-are-involved...">2.1.1. When human beings are involved...<a class="anchor-link" href="#2.1.1.-When-human-beings-are-involved...">¶</a></h3><p><img src="Images/humanbeing.png" alt="Human Beings"></p>
<p><strong>Ethics issues arise in many areas of research</strong>.</p>
<p>This is notably the case for research involving the voluntary participation of research subjects and the collection of <strong>data that might be considered personal</strong>.</p>
<p><a href="http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf">H2020 Programme Guidance How to complete your ethics self-assessment, p.1, 12 July 2016</a></p>
<h4 id="Human-Research-Ethics-Committee-at-EPFL-(HREC)">Human Research Ethics Committee at EPFL (HREC)<a class="anchor-link" href="#Human-Research-Ethics-Committee-at-EPFL-(HREC)">¶</a></h4><p>The role of the <a href="http://research-office.epfl.ch/research-ethics/research-ethics-assessment/epfl-human-research-ethics-committee/hrec">HREC</a> is to <strong>review any research project carried out at EPFL involving non-invasive human research</strong> from an ethical point of view, before the beginning of the project.</p>
<p>Contact person: <a href="https://people.epfl.ch/elisabeth.vandervelde?lang=fr">Esther van der Velde</a></p>
<h4 id="Collecting-consent">Collecting consent<a class="anchor-link" href="#Collecting-consent">¶</a></h4><p>“Research involving human beings may only be carried out if, […], the persons concerned have given their informed consent or, after being duly informed, have not exercised their right to dissent. […] The persons concerned may withhold or revoke their consent at any time, without stating their reasons.” <em>Human Research Act (HRA), article 7. </em></p>
<p>The consent must be:</p>
<ul>
<li>Simple, understandable,</li>
<li>Adapted to the subject (child, teenager...) (HRA Art. 21-22)</li>
<li><a href="http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf">H2020 Programme Guidance : How to complete your ethics self assessment</a>, 12th July 2016. Page 1.</li>
<li>Swiss Academy of Medical Sciences (SAMS) (2015). “Research with human subjects. A manual for practitioners.” 2nd edition, <a href="http://swissethics.ch/doc/swissethics/manual_research_nov2015_e.pdf">http://swissethics.ch/doc/swissethics/manual_research_nov2015_e.pdf</a></li>
<li>Federal Act on Research involving Human Beings (Human Research Act, HRA) of 30 September 2011 (Status as of 1 January 2014). <a href="https://www.admin.ch/opc/en/classified-compilation/20061313/index.html">https://www.admin.ch/opc/en/classified-compilation/20061313/index.html</a></li>
<h3 id="2.1.2.-Data-?-What-data-?-Personal-data-?-Sensitive-data-?">2.1.2. Data ? What data ? Personal data ? Sensitive data ?<a class="anchor-link" href="#2.1.2.-Data-?-What-data-?-Personal-data-?-Sensitive-data-?">¶</a></h3><p><img src="Images/personaldata.png" alt=""></p>
<p><strong>personal data</strong></p>
<ul>
<li>all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)</li>
<li>examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these.</li>
<p>If you work with personal or sensitive data, you should check the Research Office website: <a href="http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment">Research Office Ethics Assessment</a>, especially the <a href="http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment/ethical-issues-checklists"><strong>checklists</strong></a> (login with Gaspar).</p>
<h3 id="2.1.3-Doing-what-with-data-?">2.1.3 Doing what with data ?<a class="anchor-link" href="#2.1.3-Doing-what-with-data-?">¶</a></h3><p><img src="Images/dataanalysis.png" alt=""></p>
<h5 id="Personal-or-sensitive-data-processing">Personal or sensitive data processing<a class="anchor-link" href="#Personal-or-sensitive-data-processing">¶</a></h5><p><strong>Swiss <a href="https://www.admin.ch/opc/en/classified-compilation/19920153/index.html">Federal Act on Data Protection</a> (FADP) (or Loi sur la Protection des Données LPD), article 3 e.</strong>:
any operation with personal data, irrespective of the means applied and the procedure, and in particular:</p>
<h3 id="2.1.4.-Protecting-and-disclosing-personal-data">2.1.4. Protecting and disclosing personal data<a class="anchor-link" href="#2.1.4.-Protecting-and-disclosing-personal-data">¶</a></h3><h4 id="Protection">Protection<a class="anchor-link" href="#Protection">¶</a></h4><p>Personal data must be protected against unauthorised processing through adequate technical and organisational measures (Swiss FADP, article 7).</p>
<p>Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection.</p>
<p>Cross-border disclosure of personal data must be protected against unauthorised processing through adequate technical and organisational measures.</p>
<li><p>Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014) (235.1).</p>
</li>
<li><p>Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 31).</p>
<li>As of 2018: <a href="http://eur-lex.europa.eu/legal-content/de/TXT/?uri=CELEX%3A32016R0679">REGULATION (EU) 2016/679 repealing Directive 95/46/EC</a></li>
<li><a href="http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf">H2020 Program Guidance : how to complete your ethics self assessment</a>, 12.7.2016</li>
<h4 id="k-anonymity">k-anonymity<a class="anchor-link" href="#k-anonymity">¶</a></h4><h5 id="Definition">Definition<a class="anchor-link" href="#Definition">¶</a></h5><p>"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release" (<a href="https://en.wikipedia.org/wiki/K-anonymity">Source</a>).</p>
<h5 id="Illustration">Illustration<a class="anchor-link" href="#Illustration">¶</a></h5><p>Example including removal and generalization (same source):</p>
<p>Anonymized table (name and religion were removed, age was generalized):</p>
<table>
<thead><tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
<th>State of domicile</th>
<th>Religion</th>
<th>Disease</th>
</tr>
</thead>
<tbody>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>Cancer</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Kerala</td>
<td>*</td>
<td>Viral infection</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>TB</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>No illness</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Kerala</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>TB</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Cancer</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Viral infection</td>
</tr>
</tbody>
</table>
<p>This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.</p>
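<p>The 2-anonymity claim above can be checked mechanically. The following is a minimal Python sketch, not a reference implementation: the rows are copied from the table above, the age ranges are abbreviated to the labels "20-30" and "up to 20", and the function name k_anonymity is an illustrative choice.</p>

```python
from collections import Counter

# Rows copied from the anonymized table: (age range, gender, state, disease).
# The age labels are shortened versions of the ranges shown in the table.
ROWS = [
    ("20-30", "Female", "Tamil Nadu", "Cancer"),
    ("20-30", "Female", "Kerala", "Viral infection"),
    ("20-30", "Female", "Tamil Nadu", "TB"),
    ("20-30", "Male", "Karnataka", "No illness"),
    ("20-30", "Female", "Kerala", "Heart-related"),
    ("20-30", "Male", "Karnataka", "TB"),
    ("up to 20", "Male", "Kerala", "Cancer"),
    ("20-30", "Male", "Karnataka", "Heart-related"),
    ("up to 20", "Male", "Kerala", "Heart-related"),
    ("up to 20", "Male", "Kerala", "Viral infection"),
]


def k_anonymity(rows, quasi_identifier_columns):
    """Return the size of the smallest group of rows that share
    the same values in the quasi-identifier columns."""
    groups = Counter(
        tuple(row[i] for i in quasi_identifier_columns) for row in rows
    )
    return min(groups.values())


# Columns 0, 1, 2 are age, gender and state of domicile.
print(k_anonymity(ROWS, (0, 1, 2)))  # prints 2
```

<p>The result depends on which attributes are treated as quasi-identifiers: restricting the check to the gender column alone yields groups of 4 and 6 rows.</p>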
<h4 id="l-diversity---motivation">l-diversity - motivation<a class="anchor-link" href="#l-diversity---motivation">¶</a></h4><p>An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:</p>
<ul>
<li><strong>homogeneity attacks</strong>: when all the rows in a group share the same sensitive value,</li>
<li><strong>background knowledge attacks</strong>: when knowledge about a field reduces the set of possible sensitive values (e.g. knowing that heart attacks are not frequent in Japanese patients) (<a href="https://en.wikipedia.org/wiki/K-anonymity">source</a>).</li>
</ul>
<p>Imagine the following group, or equivalence class, extracted from the whole dataset (table adapted from the one above):</p>
<table>
<thead><tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
<th>State of domicile</th>
<th>Religion</th>
<th>Disease</th>
</tr>
</thead>
<tbody>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
</tbody>
</table>
<p>If it is known that Miss Smith was part of the study, is aged between 20 and 30 and lives in Tamil Nadu, then it is certain that she has AIDS, even though the table has 3-anonymity.</p>
<h5 id="l-diversity---definition">l-diversity - definition<a class="anchor-link" href="#l-diversity---definition">¶</a></h5><p><strong>The l-diversity Principle</strong> : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.</p>
<p>There are several definitions of "well-represented" (<a href="https://en.wikipedia.org/wiki/L-diversity">source</a>).</p>
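<p>As an illustration, the sketch below uses the simplest reading of "well-represented", namely counting distinct sensitive values per equivalence class (often called distinct l-diversity). This choice, the function name and the shortened age label "20-30" are assumptions made for the example; the rows come from the tables above.</p>

```python
from collections import defaultdict


def distinct_l_diversity(rows, quasi_identifier_columns, sensitive_column):
    """Return the smallest number of distinct sensitive values
    found in any equivalence class of the table."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in quasi_identifier_columns)
        classes[key].add(row[sensitive_column])
    return min(len(values) for values in classes.values())


# Equivalence class from the homogeneity example: 3-anonymous, one disease.
HOMOGENEOUS_CLASS = [
    ("20-30", "Female", "Tamil Nadu", "AIDS"),
    ("20-30", "Female", "Tamil Nadu", "AIDS"),
    ("20-30", "Female", "Tamil Nadu", "AIDS"),
]

# Same quasi-identifiers in the k-anonymity table: two different diseases.
MIXED_CLASS = [
    ("20-30", "Female", "Tamil Nadu", "Cancer"),
    ("20-30", "Female", "Tamil Nadu", "TB"),
]

print(distinct_l_diversity(HOMOGENEOUS_CLASS, (0, 1, 2), 3))  # prints 1
print(distinct_l_diversity(MIXED_CLASS, (0, 1, 2), 3))        # prints 2
```

<p>A result of 1 means that knowing the quasi-identifiers is enough to learn the sensitive value, which is exactly the Miss Smith situation.</p>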
<p>By the way, l-diversity has weaknesses too, which is why <strong>t-closeness</strong> was invented.</p>
<h5 id="differential-privacy">differential privacy<a class="anchor-link" href="#differential-privacy">¶</a></h5><p><strong>By linking with another database</strong>: linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts.</p>
<p><em>Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1</em> (<a href="https://en.wikipedia.org/wiki/Differential_privacy">source</a>).</p>
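<p>The linking attack described above can be sketched in a few lines of Python. All records, names and diagnoses below are invented for illustration, and link_records is a hypothetical helper; the only element taken from the text is the join on birthdate, sex and ZIP code.</p>

```python
# Hypothetical "anonymized" medical records: names removed,
# but birthdate, sex and ZIP code retained.
MEDICAL = [
    {"birthdate": "1950-01-15", "sex": "M", "zip": "02100", "diagnosis": "condition A"},
    {"birthdate": "1982-06-30", "sex": "F", "zip": "02101", "diagnosis": "condition B"},
    {"birthdate": "1975-03-02", "sex": "F", "zip": "02102", "diagnosis": "condition C"},
]

# Hypothetical public voter list containing names and the same attributes.
VOTERS = [
    {"name": "Person One", "birthdate": "1950-01-15", "sex": "M", "zip": "02100"},
    {"name": "Person Two", "birthdate": "1982-06-30", "sex": "F", "zip": "02101"},
    {"name": "Person Three", "birthdate": "1982-06-30", "sex": "F", "zip": "02101"},
    {"name": "Person Four", "birthdate": "1990-12-24", "sex": "M", "zip": "02103"},
]


def link_records(medical, voters, keys=("birthdate", "sex", "zip")):
    """Return (name, diagnosis) pairs for medical records whose
    key attributes match exactly one voter."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter["name"])
    matches = []
    for record in medical:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            matches.append((names[0], record["diagnosis"]))
    return matches


print(link_records(MEDICAL, VOTERS))  # prints [('Person One', 'condition A')]
```

<p>The second medical record matches two voters and therefore stays ambiguous, which is the protection that k-anonymity with k = 2 aims for.</p>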
<h4 id="Anonymization---theory-and-tools">Anonymization - theory and tools<a class="anchor-link" href="#Anonymization---theory-and-tools">¶</a></h4><p><img src="Images/sdc.jpg" alt=""></p>
<p>Statistical Disclosure Control / Hundepool et al., 2012. <a href="http://proquest.safaribooksonline.com/9781118348215">Ebook / EPFL library</a>.</p>
<h3 id="2.3.2---Collaborative-writing">2.3.2 - Collaborative writing<a class="anchor-link" href="#2.3.2---Collaborative-writing">¶</a></h3><h4 id="File-sharing-is-not-enough">File sharing is not enough<a class="anchor-link" href="#File-sharing-is-not-enough">¶</a></h4><p>People increasingly need to collaborate at a finer level.</p>
<p>The comment and revision-mode features of <strong>word processors</strong> are not sufficient for good collaboration.</p>
<p><strong>Google Documents</strong> and related tools are not geared toward scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.</p>
<p><strong> $\Rightarrow$ we need something else! </strong></p>
<h4 id="Share-LaTeX">Share LaTeX<a class="anchor-link" href="#Share-LaTeX">¶</a></h4><p><strong><a href="https://de.sharelatex.com/">Share LaTeX</a></strong> is an alternative to Authorea: collaborative writing based on LaTeX. Suited for LaTeX power users. <img src="Images/ShareLaTeX.png" alt="."></p>
<p>Access provided by SWITCH, via the <a href="https://sandstorm.cloud.switch.ch/">Sandstorm platform</a>.</p>
<p>Good, but only if all partners are LaTeX users.</p>
<li>Numerous specialized metadata formats are available for most disciplines; the Research Data Alliance <a href="http://rd-alliance.github.io/metadata-directory/">Metadata Directory</a> is a good starting point.</li>
<h3 id="Some-open-formats-to-take-into-account">Some open formats to take into account<a class="anchor-link" href="#Some-open-formats-to-take-into-account">¶</a></h3><ul>
<li>Portable Document Format <strong>PDF/A, ISO standard</strong>, text [PDF for archiving: no encryption, embedded fonts...]</li>
<li><strong>Text</strong>: a simple way to encode data that can be read by most software.<ul>
<li>CSV: tables that can be read by most software and extended using <a href="https://www.w3.org/standards/techs/csv">CSV on the Web</a> (metadata, datatypes, relations...)</li>
<li>JSON: simply structured, less bulky than XML, ideal for data exchange.</li>
<h4 id="Data-formats-list">Data formats list<a class="anchor-link" href="#Data-formats-list">¶</a></h4><p>The US Library of Congress maintains a resource on the sustainability of digital formats. <a href="http://www.digitalpreservation.gov/formats/">This list</a> is categorized by data type (text, audio, image, video, geospatial, dataset, etc.).</p>
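<p>For the text-based open formats mentioned above (CSV and JSON), a minimal Python sketch using only the standard library might look as follows. The records and field names (sample_id, temperature_c) are invented for illustration.</p>

```python
import csv
import io
import json

# Hypothetical measurements; field names are illustrative only.
RECORDS = [
    {"sample_id": "S1", "temperature_c": 21.5},
    {"sample_id": "S2", "temperature_c": 19.0},
]


def to_csv(records):
    """Serialize a list of flat dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def to_json(records):
    """Serialize the records to indented JSON text."""
    return json.dumps(records, indent=2)


print(from_csv(to_csv(RECORDS))[0]["temperature_c"])      # prints 21.5 (a string)
print(json.loads(to_json(RECORDS))[0]["temperature_c"])   # prints 21.5 (a float)
```

<p>Plain CSV does not carry datatypes, so every value comes back as a string; this is the kind of gap that the metadata and datatype descriptions of CSV on the Web are meant to fill, whereas JSON keeps the distinction between numbers and strings.</p>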
<p>Services for students and researchers at the University of Basel, its associated institutes, and the Swiss Institute of Bioinformatics.</p>
<ul>
<li>Providing <strong>high-performance computing resources</strong> (computing cluster with 8000 cores)</li>
<li>Providing <strong>high-performance storage</strong> for researchers with large data sets (~1-10 TB) and/or with complex computational requirements (e.g. Linux workflows) and/or subject to special requirements (e.g. sensitive data)</li>
<li>Providing <strong>storage for projects with large data volume</strong> (over 10 TB, up to 500 TB); this requires a dedicated project definition in discussion with the PI</li>
<li>Providing <strong>scientific-service hosting (web sites)</strong> for resources with significant back-end requirements (storage and/or calculation)</li>
<li>Providing various types of <strong>consulting for data analysis and management</strong>.</li>
</ul>
<p>Contact for technical questions: <a href="mailto:scicore-admin@unibas.ch">scicore-admin@unibas.ch</a></p>
<h4 id="2.4.2.2---Publication-and-preservation">2.4.2.2 - Publication and preservation<a class="anchor-link" href="#2.4.2.2---Publication-and-preservation">¶</a></h4><h4 id="Research-data-publication">Research data publication<a class="anchor-link" href="#Research-data-publication">¶</a></h4><p>“ It is the <strong>release of research data, associated metadata, accompanying documentation, and software code […] for re-use and analysis</strong> in such a manner that they can be discovered on the Web and referred to in a unique and persistent way.</p>
<p>Data publishing occurs <strong>via dedicated data repositories and/or (data) journals</strong> which ensure that the published research objects are well documented, curated, archived for the long term, interoperable, citable, quality assured and discoverable
– all aspects of data publishing that are important for future reuse of data by third party end-users.”</p>
<h3 id="Backup-vs.-Preservation">Backup vs. Preservation<a class="anchor-link" href="#Backup-vs.-Preservation">¶</a></h3><p><img src="Images/Preservervation_vs_Storage.png" alt="Preservation vs. Backup"></p>
<h4 id="Why-publish-in-a-data-archive?">Why publish in a data archive?<a class="anchor-link" href="#Why-publish-in-a-data-archive?">¶</a></h4><p><strong>Accelerate science and careers</strong></p>
<p>Many studies show there are significant advantages for articles that share their code or data.</p>
<h2 id="2.5----Licences">2.5 - Licences<a class="anchor-link" href="#2.5----Licences">¶</a></h2><p>A licence defines how your data can be reused. For instance:</p>
<p>Creative Commons (<strong>CC0</strong> and <strong>CC-BY</strong>) <a href="http://creativecommons.org/">http://creativecommons.org/</a>. Since version 4.0 of the licences, the sui generis right protecting database content is taken into account (in addition to the form, which is protected by copyright): <a href="https://wiki.creativecommons.org/wiki/Data">https://wiki.creativecommons.org/wiki/Data</a></p>