<h3 id="Definition,-context-and-best-practices">Definition, context and best practices<a class="anchor-link" href="#Definition,-context-and-best-practices">¶</a></h3>
<li>The definition of research data is not fixed or rigid: several definitions are possible based on specific fields, institutions, and organizations.</li>
<li>For the Organisation for Economic Co-operation and Development (<a href="http://www.oecd.org/fr/sti/sci-tech/38500823.pdf">OECD</a>), research data are defined as factual records (numbers, texts, images and sounds) that are used as primary sources for scientific research and are commonly accepted by the scientific community as necessary to validate research results.</li>
<li>Key elements to take into consideration during research data management are the legal, ethical and political aspects, which depend on the sensitivity of the data.</li>
<h3 id="Requirements-regarding-research-data-management">Requirements regarding research data management<a class="anchor-link" href="#Requirements-regarding-research-data-management">¶</a></h3>
<h3 id="Publishers">Publishers<a class="anchor-link" href="#Publishers">¶</a></h3><p>Many publishers and scientific journals require, under specific
conditions, the publication of the data used to obtain the research
results (permanent archiving, standardized formats, etc.). This is the case,
for instance, with PLoS and Nature Publishing Group. An overview of
editorial policies is available online on this <a href="http://wiki.datadryad.org/Journal_instructions">Dryad website</a>.</p>
<li><a href="http://research-office.epfl.ch/financements/international/horizon-2020">Horizon 2020</a> is the biggest research funding programme of the European Commission,
with nearly €80 billion of funding available over 7 years, from 2014 to 2020. Its
main objective is to promote and support excellence in the scientific field.</li>
<li>Horizon 2020 requires some research projects to prepare a <a href="http://ec.europa.eu/programmes/horizon2020/en/what-horizon-2020">data management plan</a>, which is then mandatory in order to receive research funding. </li>
<li><a href="https://ec.europa.eu/digital-single-market/en/news/communication-european-cloud-initiative-building-competitive-data-and-knowledge-economy-europe">As of 2017</a>, the Commission will make <strong>open research data the default option</strong>, while ensuring opt-outs, for all new projects of the Horizon 2020 program.</li>
<h3 id="Why-publish-in-a-data-archive?">Why publish in a data archive?<a class="anchor-link" href="#Why-publish-in-a-data-archive?">¶</a></h3><p><strong>Accelerate science and careers</strong></p>
<p>Many studies show there are significant advantages for articles that share their code or data.</p>
<h3 id="Best-practices-examples:-EPFL-(Switzerland)">Best practices examples: EPFL (Switzerland)<a class="anchor-link" href="#Best-practices-examples:-EPFL-(Switzerland)">¶</a></h3><p>To provide guidance in preparing a DMP, the <strong><a href="http://library.epfl.ch/files/content/sites/library3/files/research-data/dmp/Data_management_plan_checklist_EPFL_2016.pdf">EPFL-ETHZ checklist</a></strong> includes
four categories to cover questions related to:</p>
<ul>
<li>Research Data Acquisition : type, quantity, license, etc.</li>
<li>Research Data Format : format, metadata, identification, etc.</li>
<li>Research Data Sharing : embargo, intellectual property, etc.</li>
<li>Data Preservation : storage, sensitivity of the data, archiving, etc.<center><img src="./Images/EPFL-checklist.png" width="600" height="450" /></center></li>
<h1 id="Part-2---CHILI-Specific-Topics">Part 2 - CHILI Specific Topics<a class="anchor-link" href="#Part-2---CHILI-Specific-Topics">¶</a></h1><ul>
<h3 id="2.1.1.-Do-you-work-with-personal,-sensitive-data-?">2.1.1. Do you work with personal, sensitive data ?<a class="anchor-link" href="#2.1.1.-Do-you-work-with-personal,-sensitive-data-?">¶</a></h3><p><img src="Images/question.png" alt=""></p>
<ul>
<li>Does your research practice involve collecting, processing and storing information on persons?<ul>
<p>If you answered <strong>yes</strong> to any of the above questions, ethical and legal issues apply.</p>
<p>You should check the Research Office Checklist: <a href="http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment">Research Office Ethics Assessment</a>.</p>
<h3 id="2.1.2.-When-human-beings-are-involved...">2.1.2. When human beings are involved...<a class="anchor-link" href="#2.1.2.-When-human-beings-are-involved...">¶</a></h3><p><img src="Images/humanbeing.png" alt="Human Beings"></p>
<p>For instance in such cases : "...collection of personal data, interviews, observations, questionnaires, recordings, tracking or the secondary use of information provided for other purposes, e.g. social media sites, other research projects etc.</p>
<p>In such cases the Human Research Ethics Committee at EPFL (HREC) should be consulted.</p>
<p>The role of the HREC is to review any research project carried out at EPFL involving non-invasive human research from an ethical point of view, before the beginning of the project."</p>
<h3 id="2.1.3.-Data-?-What-data-?-Personal-data-?">2.1.3. Data ? What data ? Personal data ?<a class="anchor-link" href="#2.1.3.-Data-?-What-data-?-Personal-data-?">¶</a></h3><p><img src="Images/personaldata.png" alt=""></p>
<p><strong>personal data (data)</strong></p>
<ul>
<li>all information relating to an identified or identifiable person (Swiss FADP, article 3 a.)</li>
<li>examples: name, address, identification number, e-mail, phone number, medical records... There are various potential identifiers, including full name, pseudonyms, occupation, address or any combination of these.</li>
</ul>
<p><strong>sensitive personal data</strong></p>
<p>According to the Swiss FADP (article 3 c.) data on:</p>
<ol>
<li>religious, ideological, political or trade union-related views or activities,</li>
<li><strong>health, the intimate sphere or the racial origin</strong>,</li>
<li>social security measures,</li>
<li>administrative or criminal proceedings and sanctions;</li>
<h3 id="2.3.1.-Doing-what-with-data-?">2.3.1. Doing what with data ?<a class="anchor-link" href="#2.3.1.-Doing-what-with-data-?">¶</a></h3><p><img src="Images/dataanalysis.png" alt=""></p>
<li>Information that is simple, understandable and in a language adapted to their age</li>
<li>See form on <a href="http://research-office.epfl.ch/research-ethics-integrity/research-ethics-assessment">Research Office Ethics Assessment</a>.</li>
any operation with personal data, irrespective of the means applied and the procedure, and in particular the collection, storage, use, revision, disclosure, archiving or destruction of data;</p>
<p>Notably:</p>
<ul>
<li>carried out in good faith</li>
<li>only for the purpose indicated at the time of collection (...) </li>
<li>consent must be given expressly in the case of processing of sensitive personal data or personality profiles.</li>
<li>Anyone who processes personal data must make certain that it is correct. He must take all reasonable measures to ensure that data that is incorrect or incomplete in view of the purpose of its collection is either corrected or destroyed.</li>
<li>Any data subject may request that incorrect data be corrected.</li>
<h5 id="Right-to-information">Right to information<a class="anchor-link" href="#Right-to-information">¶</a></h5><p>Any person may request information from the controller of a data file as to whether data concerning them is being processed.</p>
<ul>
<li>of all available data concerning the subject (...), </li>
<li>including the available information on the source of the data (...) as well as the categories of the personal data processed, the other parties involved with the file and the data recipient.</li>
<li>(...) The information must normally be provided in writing, in the form of a printout or a photocopy, and is free of charge. </li>
<h5 id="Protecting">Protecting<a class="anchor-link" href="#Protecting">¶</a></h5><p>(Swiss FADP, article 7) Personal data must be protected against unauthorised processing through adequate technical and organisational measures.</p>
<p><strong>Technical measures: notably, it is forbidden to store personal data in countries whose legislation is not compatible with Swiss law, such as the US</strong>.</p>
<p>This rules out many cloud services: Dropbox, Google Drive, Microsoft Azure, Amazon S3...</p>
<h3 id="2.1.4.-Disclosing-personal-data">2.1.4. Disclosing personal data<a class="anchor-link" href="#2.1.4.-Disclosing-personal-data">¶</a></h3><p>Personal Data collection and processing implies compliance with the law on privacy and data protection:</p>
<p>Personal data may not be disclosed abroad if the privacy of the data subjects would be seriously endangered thereby, in particular due to the absence of legislation that guarantees adequate protection.</p>
<p>Art. 6 Cross-border disclosure
Personal data must be protected against unauthorised processing through adequate technical and organisational measures.</p>
<li>Federal bodies may process personal data for purposes not related to specific persons, and in particular for research, planning and statistics, if:<br />
a. the data is rendered anonymous, as soon as the purpose of the processing permits;<br />
b. the recipient only discloses the data with the consent of the federal body and<br />
c. the results are published in such a manner that the data subjects may not be identified.</li>
<li><p>Federal Act on Data Protection (FADP) of 19 June 1992 (Status as of 1 January 2014) (235.1).</p>
</li>
<li><p>Directive 95/46/EC of the European Parliament and of the Council, of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (OJ L 281, 23.11.1995, p. 31).</p>
<li><a href="http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-self-assess_en.pdf">H2020 Program Guidance: how to complete your ethics self assessment</a>, 12.7.2016</li>
<h4 id="k-anonymity">k-anonymity<a class="anchor-link" href="#k-anonymity">¶</a></h4><h5 id="Definition">Definition<a class="anchor-link" href="#Definition">¶</a></h5><p>"A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appear in the release" (<a href="https://en.wikipedia.org/wiki/K-anonymity">Source</a>).</p>
<h5 id="Illustration">Illustration<a class="anchor-link" href="#Illustration">¶</a></h5><p>Example including removal and generalization (same source):</p>
<p>To (name and religion were removed, age was generalized):</p>
<table>
<thead><tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
<th>State of domicile</th>
<th>Religion</th>
<th>Disease</th>
</tr>
</thead>
<tbody>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>Cancer</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Kerala</td>
<td>*</td>
<td>Viral infection</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>TB</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>No illness</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Kerala</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>TB</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Cancer</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Male</td>
<td>Karnataka</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Heart-related</td>
</tr>
<tr>
<td>*</td>
<td>Age ≤ 20</td>
<td>Male</td>
<td>Kerala</td>
<td>*</td>
<td>Viral infection</td>
</tr>
</tbody>
</table>
<p>This data has 2-anonymity with respect to the attributes 'Age', 'Gender' and 'State of domicile' since for any combination of these attributes found in any row of the table there are always at least 2 rows with those exact attributes.</p>
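<p>The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name <code>k_anonymity</code>, the field names and the age labels are assumptions made for this example. It counts the rows in each combination of quasi-identifiers and reports the smallest group size.</p>

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the rows are k-anonymous."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# The ten rows of the table above, reduced to (age range, gender, state).
# The age labels are illustrative spellings of the generalized ranges.
table = [
    ("21 to 30", "Female", "Tamil Nadu"),
    ("21 to 30", "Female", "Kerala"),
    ("21 to 30", "Female", "Tamil Nadu"),
    ("21 to 30", "Male", "Karnataka"),
    ("21 to 30", "Female", "Kerala"),
    ("21 to 30", "Male", "Karnataka"),
    ("up to 20", "Male", "Kerala"),
    ("21 to 30", "Male", "Karnataka"),
    ("up to 20", "Male", "Kerala"),
    ("up to 20", "Male", "Kerala"),
]
rows = [dict(zip(("age", "gender", "state"), r)) for r in table]

k = k_anonymity(rows, ["age", "gender", "state"])
print(k)
```

<p>On these ten rows the smallest group contains two rows, which matches the 2-anonymity stated above.</p>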
<h4 id="l-diversity---motivation">l-diversity - motivation<a class="anchor-link" href="#l-diversity---motivation">¶</a></h4><p>An extension of k-anonymity. Why? To overcome weaknesses of that model, notably:</p>
<ul>
<li><strong>homogeneity attacks</strong>: when all the rows of a group share the same sensitive value,</li>
<li><strong>background knowledge attacks</strong>: when knowledge about a field reduces the set of possible sensitive values (e.g. knowing that heart attacks are not frequent in Japanese patients) (<a href="https://en.wikipedia.org/wiki/K-anonymity">source</a>). </li>
</ul>
<p>Imagine the following group, or equivalence class, extracted from the whole dataset (table adapted from the one above):</p>
<table>
<thead><tr>
<th>Name</th>
<th>Age</th>
<th>Gender</th>
<th>State of domicile</th>
<th>Religion</th>
<th>Disease</th>
</tr>
</thead>
<tbody>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
<tr>
<td>*</td>
<td>20 < Age ≤ 30</td>
<td>Female</td>
<td>Tamil Nadu</td>
<td>*</td>
<td>AIDS</td>
</tr>
</tbody>
</table>
<p>If it is known that Miss Smith was part of the study, is aged between 20 and 30 and lives in Tamil Nadu, then it is certain that she has AIDS, even though the data has 3-anonymity.</p>
<h5 id="l-diversity---definition">l-diversity - definition<a class="anchor-link" href="#l-diversity---definition">¶</a></h5><p><strong>The l-diversity Principle</strong> : An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity.</p>
<p>There are several definitions of "well-represented" (<a href="https://en.wikipedia.org/wiki/L-diversity">source</a>).</p>
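<p>As a hedged illustration, the sketch below assumes the simplest reading of "well-represented", often called distinct l-diversity: it counts the distinct sensitive values in each equivalence class and reports the smallest count. The function name, the field names and the second "diverse" class are hypothetical and only serve the example.</p>

```python
from collections import defaultdict


def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any class."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].add(row[sensitive])
    return min(len(values) for values in classes.values())


quasi = ["age", "gender", "state"]

# The equivalence class from the example: three rows, a single disease.
homogeneous = [
    {"age": "21 to 30", "gender": "Female", "state": "Tamil Nadu", "disease": "AIDS"},
] * 3

# Hypothetical variant of the same class with three different diseases.
diverse = [{**homogeneous[0], "disease": d} for d in ("AIDS", "TB", "Cancer")]

l_homogeneous = distinct_l_diversity(homogeneous, quasi, "disease")
l_diverse = distinct_l_diversity(diverse, quasi, "disease")
print(l_homogeneous, l_diverse)
```

<p>The homogeneous class is 3-anonymous but only 1-diverse, which is exactly the weakness exploited in the Miss Smith example.</p>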
<p>By the way, l-diversity has weaknesses too, which is why people invented <strong>t-closeness</strong>.</p>
<h5 id="t-closeness---motivation">t-closeness - motivation<a class="anchor-link" href="#t-closeness---motivation">¶</a></h5><p>The l-diversity requirement ensures “diversity” of sensitive values in each group, but it does not recognize that values may be semantically close. For example, an attacker could deduce that an individual has a stomach disease if the group containing that individual only listed three different stomach diseases (adapted from <a href="https://en.wikipedia.org/wiki/T-closeness">source</a>).</p>
<h5 id="t-closeness---definition">t-closeness - definition<a class="anchor-link" href="#t-closeness---definition">¶</a></h5><p><strong>The t-closeness Principle</strong>: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness (<a href="https://en.wikipedia.org/wiki/T-closeness">source</a>).</p>
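<p>The definition above leaves the choice of distance open. The sketch below assumes the total variation distance between the two distributions, which is a simplification: the t-closeness literature typically uses the Earth Mover's Distance, and other distances give other values of t. All names and the four-row dataset are hypothetical.</p>

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def t_closeness(rows, quasi_identifiers, sensitive):
    """Smallest t for which every equivalence class is t-close (assumed distance)."""
    overall = distribution([row[sensitive] for row in rows])
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row[sensitive])
    return max(
        total_variation(distribution(values), overall)
        for values in classes.values()
    )


# Hypothetical data: two equivalence classes of two rows each.
rows = [
    {"group": "A", "disease": "Flu"},
    {"group": "A", "disease": "Flu"},
    {"group": "B", "disease": "Flu"},
    {"group": "B", "disease": "Cancer"},
]
t = t_closeness(rows, ["group"], "disease")
print(t)
```

<p>A small t means that knowing someone's equivalence class reveals little more about the sensitive attribute than the whole table already does.</p>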
<h5 id="differential-privacy">differential privacy<a class="anchor-link" href="#differential-privacy">¶</a></h5><p><strong>By linking with another database</strong>: Linking the anonymized GIC database (which retained the birthdate, sex, and ZIP code of each patient) with voter registration records made it possible to identify the medical record of the governor of Massachusetts.</p>
<p><em>Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. DOI=10.1007/11787006_1</em> (<a href="https://en.wikipedia.org/wiki/Differential_privacy">source</a>).</p>
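<p>The linkage attack mentioned above can be illustrated with a toy join on the shared quasi-identifiers (birthdate, sex, ZIP code). This is only a sketch: every record, name and value below is invented for the example, and the function <code>link</code> is a hypothetical helper, not a tool used in the original study.</p>

```python
def link(released, register, keys):
    """Join two datasets on shared keys and keep only unambiguous matches."""
    index = {}
    for person in register:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)
    matches = []
    for record in released:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one person fits: re-identified
            matches.append({**candidates[0], **record})
    return matches


# Hypothetical "anonymized" medical records (names removed).
medical = [
    {"birthdate": "1950-01-01", "sex": "M", "zip": "00001", "diagnosis": "A"},
    {"birthdate": "1960-02-02", "sex": "F", "zip": "00002", "diagnosis": "B"},
]

# Hypothetical public register containing names.
voters = [
    {"name": "Person One", "birthdate": "1950-01-01", "sex": "M", "zip": "00001"},
    {"name": "Person Two", "birthdate": "1960-02-02", "sex": "F", "zip": "00002"},
    {"name": "Person Three", "birthdate": "1960-02-02", "sex": "F", "zip": "00002"},
]

reidentified = link(medical, voters, ["birthdate", "sex", "zip"])
print(reidentified)
```

<p>Only the first record is re-identified here, because two people in the register share the quasi-identifiers of the second one. Removing names is therefore not sufficient when the remaining attributes are unique in another dataset.</p>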
<h4 id="Anonymization---theory-and-tools">Anonymization - theory and tools<a class="anchor-link" href="#Anonymization---theory-and-tools">¶</a></h4><p><img src="Images/sdc.jpg" alt=""></p>
<p>Hundepool et al., <em>Statistical Disclosure Control</em>, 2012.</p>
<p>Ebook <a href="http://proquest.safaribooksonline.com/9781118348215">provided by the EPFL library</a>.</p>
<h2 id="2.3---Reproducibility">2.3 - Reproducibility<a class="anchor-link" href="#2.3---Reproducibility">¶</a></h2><p>According to a comment published in Nature in 2012, the findings of <strong>47 out of 53</strong> landmark preclinical cancer studies could not be reproduced (1).</p>
<p><font size="1">(1) Begley, C. G.; Ellis, L. M. (2012). "Drug development: Raise standards for preclinical cancer research". Nature 483 (7391): 531–533.<br /> (2) Ioannidis JPA, Allison DB, Ball CA, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41(2):149–55.<br /> (3) Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. "Reproducible research in signal processing." Signal Processing Magazine, IEEE 26.3 (2009): 37-47 </font><br /></p>
<p><font size="1">[Slide inspired by https://github.com/saloot/IPythonClass, Amir Hessam Salavati and Robin Schiebler, 2015]</font></p>
<h3 id="A-workflow-for-reproducible-research">A workflow for reproducible research<a class="anchor-link" href="#A-workflow-for-reproducible-research">¶</a></h3><p>Researchers often start to think about reproducibility at the end of a project. That is sometimes too late: by then, numerous versions of code and datasets may be spread across various places (folders, Dropbox, USB drives...).</p>
<h3 id="2.4.2---Collaborative-writing">2.4.2 - Collaborative writing<a class="anchor-link" href="#2.4.2---Collaborative-writing">¶</a></h3><h4 id="File-sharing-is-not-enough">File sharing is not enough<a class="anchor-link" href="#File-sharing-is-not-enough">¶</a></h4><p>People increasingly need to collaborate at a finer level.</p>
<p>The comment and revision-mode features of <strong>word processors</strong> are not sufficient for good collaboration.</p>
<p><strong>Google Docs</strong> and related tools are not oriented towards scientific writing, particularly regarding figures, references, citations, bibliography management and interactive figures.</p>
<p><strong> $\Rightarrow$ we need something else! </strong></p>
<h4 id="Share-LaTeX">Share LaTeX<a class="anchor-link" href="#Share-LaTeX">¶</a></h4><p><strong><a href="https://de.sharelatex.com/">Share LaTeX</a></strong> is an alternative to Authorea: collaborative writing based on LaTeX. Suited for LaTeX power users. <img src="Images/ShareLaTeX.png" alt="."></p>
<p>Good, but only if all partners are LaTeX users.</p>
<p>Git is a <strong>multi-platform</strong> (Windows, Mac, GNU/Linux) version control tool.</p>
<p>Git Servers</p>
<ul>
<li><a href="https://github.com/">GitHub</a>: very popular, some data hosted in the US. Private repositories are limited (paid or subject to other conditions).</li>
<li><a href="https://c4science.ch/">c4science</a> is the Swiss collaborative development platform. Unlimited number of repositories (public or private). </li>
<h4 id="Git-and-GitHub-are-not-suited-for-long-term-preservation">Git and GitHub are not suited for long term preservation<a class="anchor-link" href="#Git-and-GitHub-are-not-suited-for-long-term-preservation">¶</a></h4><p><img src="Images/git.png" alt="."></p>
<ul>
<li>Some git commands can delete data (notably <em>rebase</em> and <em>reset --hard</em>)</li>
<li>Repositories can be deleted (including on GitHub)</li>
<li>A link GitHub $\Rightarrow$ Zenodo can be set, so each release will be automatically made citable through a DOI and preserved in Zenodo.</li>
</ul>
<p>Guide : <a href="https://guides.github.com/activities/citable-code/">Making your code citable</a></p>
<li>Can be visualized online using <a href="http://norvig.com/ipython/Economics.ipynb">nbviewer</a> (e.g. <a href="http://norvig.com/ipython/Economics.ipynb">http://norvig.com/ipython/Economics.ipynb</a>).<ul>
<li>Nbviewer is integrated in GitHub and Zenodo</li>
<li><a href="http://pandas.pydata.org/">Pandas</a> is a powerful library providing high-performance, easy-to-use data structures and data analysis tools. <a href="http://pandas.pydata.org/pandas-docs/stable/visualization.html">Examples</a>.</li>
<li><a href="http://www.numpy.org/">Numpy</a> is the fundamental package for scientific computing with Python:<ul>
<li>N-dimensional array object</li>
<li>sophisticated (broadcasting) functions</li>
<li>tools for integrating C/C++ and Fortran code</li>
<li>useful linear algebra, Fourier transform, and random number capabilities</li>
</ul>
</li>
<li><a href="http://matplotlib.org/">Matplotlib</a> is a plotting library with great flexibility. It has features comparable to Matlab plotting. <a href="http://matplotlib.org/gallery.html">Examples</a>.</li>
<li><a href="https://stanford.edu/~mwaskom/software/seaborn/">Seaborn</a> relies on Pandas (see below). <a href="https://stanford.edu/~mwaskom/software/seaborn/examples/">Examples</a>.</li>
<li><a href="https://networkx.github.io/">NetworkX</a> is suited for complex networks analysis and representation. <a href="http://networkx.github.io/documentation/latest/gallery.html">Examples</a>.</li>
<li><a href="http://rpy2.bitbucket.org/">rpy2</a> is an interface to R running embedded in a Python process. </li>
<h4 id="And-web-libraries">And web libraries<a class="anchor-link" href="#And-web-libraries">¶</a></h4><ul>
<li><a href="http://bokeh.pydata.org/en/latest/">Bokeh</a> is a Python interactive visualization library that targets modern web browsers for presentation. </li>
<li><a href="https://d3js.org/">D3.js</a> is an open source JavaScript library for creating interactive documents based on data. D3 helps bring data to life using HTML, SVG, and CSS. As mentioned above, it can be used in Jupyter with matplotlib via <a href="http://mpld3.github.io/">mpld3</a>. </li>
<h3 id="2.4.5---R,-RStudio-and-RStudio-server">2.4.5 - R, RStudio and RStudio server<a class="anchor-link" href="#2.4.5---R,-RStudio-and-RStudio-server">¶</a></h3><h4 id="R">R<a class="anchor-link" href="#R">¶</a></h4><p>R is a free software environment for statistical computing and graphics. <a href="https://en.wikipedia.org/wiki/R_(programming_language">One of the best</a>.</p>
<p>Platforms:</p>
<ul>
<li>wide variety of GNU/Linux and UNIX platforms, </li>
<li>Windows</li>
<li>MacOS</li>
</ul>
<p>Strength: The diversity of quality open extensions (easily installable with <a href="https://cran.r-project.org/">CRAN</a>).</p>
<h4 id="RStudio">RStudio<a class="anchor-link" href="#RStudio">¶</a></h4><p>RStudio is a free and open-source integrated development environment (IDE) for R.</p>
<h4 id="R-and-reproducible-research">R and reproducible research<a class="anchor-link" href="#R-and-reproducible-research">¶</a></h4><p><img src="Images/RStudio_ReprodResearch.png" alt="."></p>
<h4 id="Reproducible-research-and-documents">Reproducible research and documents<a class="anchor-link" href="#Reproducible-research-and-documents">¶</a></h4><ul>
<li><em>knitr</em> and <em>rmarkdown</em></li>
<li>tying together results and their presentation in articles (pdf, word), presentations or web sites </li>
<li>notably in $\LaTeX$ (.Rtex) or Markdown (.Rmarkdown)</li>
<h2 id="2.5---Computational-workflow-management">2.5 - Computational workflow management<a class="anchor-link" href="#2.5---Computational-workflow-management">¶</a></h2><p>Scientific results are often the outcome of complex workflows. The computation operations form a graph, which may be difficult to reproduce.</p>
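<p>To make the idea of a computation graph concrete, here is a minimal sketch using the <code>graphlib</code> module from the Python standard library (Python 3.9 or later). The step names are hypothetical, and this does not reflect how any particular workflow manager works: it only shows that recording the dependencies explicitly lets a valid execution order be recomputed at any time.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "clean": {"download"},
    "analyse": {"clean"},
    "plot": {"analyse"},
    "report": {"analyse", "plot"},
}

# static_order() yields the steps so that dependencies always come first.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

<p>Dedicated workflow tools go further by also recording the inputs, parameters and outputs of each step, but the dependency graph is the core of what needs to be preserved.</p>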
<h3 id="2.5.1---AiiDA">2.5.1 - AiiDA<a class="anchor-link" href="#2.5.1---AiiDA">¶</a></h3><p>AiiDA is free software developed at EPFL (in materials science): <a href="http://www.aiida.net/">http://www.aiida.net/</a></p>
<p>Another tool: Taverna, which includes the desktop-oriented <a href="https://taverna.incubator.apache.org/download/">Taverna Workbench</a> (multi-platform and open source), command-line and server applications.</p>
<p>Finally, <strong>myExperiment</strong> is a platform for sharing scientific workflows; it is fully supported by Taverna.</p>
<li>Numerous specialized metadata formats are available for most disciplines; the Research Data Alliance <a href="http://rd-alliance.github.io/metadata-directory/">Metadata Directory</a> is a good starting point.</li>
<h3 id="Some-open-formats-to-take-into-account">Some open formats to take into account<a class="anchor-link" href="#Some-open-formats-to-take-into-account">¶</a></h3><ul>
<li>Portable Document Format <strong>PDF/A, ISO standard</strong>, text [PDF for archiving: no encryption, embedded fonts...]</li>
<li><strong>Text</strong>: a simple way to encode data. Can be read by most software.<ul>
<li>CSV tables, can be read by most software, and extended using <a href="https://www.w3.org/standards/techs/csv">CSV on the Web</a> (metadata, datatypes, relation...)</li>
<li>JSON: Simply structured, less bulky than XML, ideal for data exchange.</li>
<h4 id="Data-formats-list">Data formats list<a class="anchor-link" href="#Data-formats-list">¶</a></h4><p>The US Library of Congress maintains a list on the sustainability of digital formats. <a href="http://www.digitalpreservation.gov/formats/">This list</a> is categorized by data type (text, audio, image, video, geospatial, dataset, etc.).</p>
<h4 id="Data-access-sustainability">Data access sustainability<a class="anchor-link" href="#Data-access-sustainability">¶</a></h4><p>A PLOS ONE study showed in 2014 that <strong>more than 60% of links to datasets are broken after 10 years</strong> (1).</p>
<p>EPFL offers many storage options, as described on the VPSI page <a href="https://it.epfl.ch/business_service.do?sysparm_document_key=cmdb_ci_service,90cbd58e0ff121009f8579f692050eb7&sysparm_service=Bases_de_donnees_et_Stockage_Serveurs">Databases, Storage and Virtualization</a>.</p>
<h4 id="Why-publish-in-a-data-archive?">Why publish in a data archive?<a class="anchor-link" href="#Why-publish-in-a-data-archive?">¶</a></h4><p><strong>Accelerate science and careers</strong></p>
<p>Many studies show there are significant advantages for articles that share their code or data.</p>
<h2 id="2.7----Licences">2.7 - Licences<a class="anchor-link" href="#2.7----Licences">¶</a></h2><p>A licence lets you define how your data can be reused. For instance:</p>
<p>Creative Commons (<strong>CC0</strong> and <strong>CC-BY</strong>) <a href="http://creativecommons.org/">http://creativecommons.org/</a>. Since version 4.0, the licences take into account the sui generis right protecting database content (in addition to the form, which is protected by copyright): <a href="https://wiki.creativecommons.org/wiki/Data">https://wiki.creativecommons.org/wiki/Data</a></p>