class: title-slide, center, bottom

# Québec-Océan Training Workshop<br>Data Management and Archiving

### _Data Management Best Practices_

<img src="img/logo_quebec_ocean.png" height="150"/>

## Philippe Massicotte January 31, 2024 (updated: January 30, 2024)

---

<br>

<center><img src="img/myname.png" alt="drawing" width="400"/></center>

<p align="left">
<b>Research assistant at Takuvik (Laval University)</b><br>
- Remote sensing, modelling, data science, data visualization, programming<br>
<br>
<small>https://github.com/PMassicotte</small> <br>
<small>philippe.massicotte@takuvik.ulaval.ca</small> <br>
<small>@philmassicotte</small> <br>
<small>https://fosstodon.org/@philmassicotte</small> <br>
<small>www.pmassicotte.com</small>
</p>

---

# Outline

--

- Open file formats for your data.
  - Tabular data.
  - Geographical data.

--

- Choosing the tools to read and manipulate your data.

--

- Files and data organization.

--

- Tidying and formatting data.

--

- Backups.

--

- Publishing your data.

---

class: inverse, center, middle

# File formats

## Open file formats for your data.

<div id="container">
<div>
</div> <div>
</div> <div>
</div> <div>
</div>
</div>

---

# File formats

The file format used to store data has important implications:

- It determines whether you can .background-highlight[re-open] and .background-highlight[re-use] your data in the future:
  - Software might not be cross-platform (Windows/Mac/Linux).
  - Proprietary file formats can become obsolete or unsupported.

<center>
<div id="container">
<div>
</div> <div>
</div> <div>
</div> </div> </center> -- .left-column[
]

.right-column[**Example**: `.xlsx` files cannot be opened natively in versions of Microsoft Excel older than 2007.]

---

# File formats

.pull-left[

Laboratory computer programs often use proprietary file formats. This likely means that:

1. You will be forced to buy a license, .background-highlight[which can be expensive].

2. You depend on the commitment of the company to support the file format in the future.

]

.pull-right[
<center>
<figure>
<img style="margin:0px auto;display:block" src="img/unsplash/cdc-p33DqVXhWvs-unsplash.jpg" alt="Image showing a woman working in a lab." width = "500"/>
</figure>
<figcaption>Photo by <a href="https://unsplash.com/@cdc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">CDC</a> on <a href="https://unsplash.com/s/photos/lab-computer?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>
]

---

# Old-school computing in laboratories

When you depend on for-profit companies.

.pull-left[

> At the Bodega Marine Laboratory at the University of California, Davis, some computers still run on **Microsoft Windows XP (released in 2001)**, because of the need to maintain compatibility with a scanning laser confocal microscope and other imaging equipment, says lab director Gary Cherr.
>
> **To work with current Windows versions, the team would have to replace the whole microscope. The marginal potential gains aren’t yet worth the US$400,000 expense**, Cherr reasons.

Source: [Old-school computing: when your lab PC is ancient](https://www.nature.com/articles/d41586-021-01431-y)

]

.pull-right[
<center>
<figure>
<img style="margin:0px auto;display:block" src="img/unsplash/misael-moreno-fN6K30xtiKE-unsplash.jpg" alt="Photo showing a microscope."
width = "250"/>
</figure>
<figcaption>Photo by <a href="https://unsplash.com/@moreno303?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Misael Moreno</a> on <a href="https://unsplash.com/s/photos/microscope?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>
]

---

# File formats

Ideally, the chosen file format should have these characteristics:

--

1. .background-highlight[Non-proprietary]: openly documented and not tied to a single vendor.

--

2. .background-highlight[Unencrypted]: unless it contains personal or sensitive data.

--

3. .background-highlight[Human-readable]: the file should be human-readable **or** have tools available for reading and writing.

--

4. .background-highlight[Performant]: efficient read and write operations matter, especially for large datasets (less important if you work with small datasets).

---

# Common open-source text file formats

Tabular plain text file formats (.background-highlight[standard text documents that contain unformatted text]):

- `.CSV`: Comma (or semicolon) separated values.

- `.TAB` or `.TSV`: Tab-separated values.

- `.TXT` and `.DAT`: Plain text files (.background-highlight[the extension does not tell you which delimiter is used]).

All these file formats can be opened using a simple text editor.

---

# Examples of CSV and TSV files

<!-- Screenshots made using maim -->
<!-- maim ~/Desktop/penguins_tsv_format.png -g 1300x725+4070+210 -->

This dataset contains 4 variables (columns). .background-highlight[The first line generally contains the names of the variables.]

.pull-left[
A comma-separated values file (`.csv`).

<figure>
<img src="img/penguins_csv_format.png" alt="Image showing a comma-separated values file."/>
</figure>
]

.pull-right[
A tab-separated values file (`.tsv`).

<figure>
<img src="img/penguins_tsv_format.png" alt="Image showing a tab-separated values file."/>
</figure>
]

<small>
Data source: Horst AM, Hill AP, Gorman KB (2020).
palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/.
</small>

---

## Common open-source geographic file formats

These files contain information on geographic features such as .background-highlight[points], .background-highlight[lines] or .background-highlight[polygons].

There are many [geographical file formats](https://gisgeography.com/gis-formats/), but here are some that are particularly popular.

--

- ESRI shapefile (`.SHP`)
  - Technically, the shapefile format is not open. .background-highlight[It is however widely used and often considered the standard].

--

- The GeoPackage format (`.gpkg`) [is an interesting open format](https://geocompr.robinlovelace.net/read-write.html?q=geopack#file-formats).

--

- GeoJSON (`.json`, `.geojson`, JSON variant with simple geographical features)

--

- GeoTIFF (`.tif`, `.tiff`, TIFF variant enriched with GIS relevant metadata)

--

- GeoParquet (`.parquet`) is an incubating [Open Geospatial Consortium (OGC) standard](https://geoparquet.org/) that adds interoperable geospatial types (Point, Line, Polygon) to [Apache Parquet](https://parquet.apache.org/).

---

# The GeoJSON format (Polygons)

.pull-left[

This is a simple GeoJSON file defining 3 points that form a polygon (the first point is repeated at the end to close the ring).
<small>

```json
{
  "type": "Polygon",
  "coordinates": [
    [
      [30, 10],
      [40, 40],
      [10, 30],
      [30, 10]
    ]
  ]
}
```

</small>

**Create your own GeoJSON file online:**

- [https://geojson.io/](https://geojson.io/)

]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" />
]

---

# The GeoJSON format

<img src="index_files/figure-html/geojson-1.svg" width="85%" style="display: block; margin: auto;" />

---

# The GeoTIFF format

> GeoTIFF is a public domain metadata standard that allows **georeferencing information to be embedded within a TIFF file.** The potential additional information includes map projection, coordinate systems, ellipsoids, datums, and everything else necessary to establish the exact spatial reference for the file.
>
> [Wikipedia](https://en.wikipedia.org/wiki/GeoTIFF)

<center>
<figure>
<img style="margin:0px auto;display:block" src="img/unsplash/nasa-HWIOLU7_O6w-unsplash.jpg" alt="Photo showing a coastal view taken by satellite." width = "450"/>
</figure>
<figcaption>Photo by <a href="https://unsplash.com/@nasa?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">NASA</a> on <a href="https://unsplash.com/s/photos/satellite?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# The GeoTIFF format (SST)

A GeoTIFF can contain information such as the Sea Surface Temperature (SST).

<img src="index_files/figure-html/geotiff3-1.png" style="display: block; margin: auto;" />

---

# The GeoTIFF format (SST)

A closer look allows us to better visualize the values (i.e. water temperature) within each pixel.

<img src="index_files/figure-html/geotiff4-1.png" style="display: block; margin: auto;" />

---

# A note on geospatial data

.background-highlight[It is usually a better idea to work with spatial objects (e.g., GeoTIFF) rather than tabular data.]

.pull-left[

**Geographic data presented in a tabular form:**

| longitude | latitude | sst   |
|-----------|----------|-------|
| -66.425   | 49.975   | −1.60 |
| -66.375   | 49.975   | −1.57 |
| -66.325   | 49.975   | −1.52 |
| -66.275   | 49.975   | −1.47 |
| -66.225   | 49.975   | −1.42 |
| -66.175   | 49.975   | −1.38 |

]

.pull-right[

**It is much easier to work with _spatial_ data:**

- Geometric operations
- Geographic projection
- Data extraction
- Joining
- And much more!

]

---

# Suggested readings

.pull-left[
<figure>
<img src="img/geocomputation_with_r.png" alt="Book cover of geocomputation with R." height="450">
</figure>
]

.pull-right[
<figure>
<img src="img/spatial_data_science_with_r.jpg" alt="Book cover of spatial data science with R." height="450">
</figure>
]

---

class: inverse, center, middle

# Efficient tools for reading large datasets in R

<center>
<figure>
<img src="https://media.giphy.com/media/B1uajA01vvL91Urtsp/giphy.gif" width = "350"/>
</figure>
</center>

---

# Efficient tools for reading large datasets

- .background-highlight[Data analysis is an iterative process that can be time-consuming when working with large datasets.]

- It is worth spending some time to find efficient tools to work with such data.

.left-column[
<center>
</center>
]

.right-column[

- R is my main programming environment, so here are some recommendations to be efficient when reading files.

- However, you can easily read all these file formats in your preferred programming language.

]

---

# R data import tools

- For tabular data (`.CSV`, `.TXT`, `.TAB`, `.DAT`):
  - `readr`: https://readr.tidyverse.org/index.html
  - `data.table`: https://rdatatable.gitlab.io/data.table/

<br>

- For geographic data:
  - Shapefiles, gpkg, KMZ and KML: `sf`
  - GeoJSON: `jsonlite`, `sf`, `geojson` and `geojsonsf`
  - GeoTIFF: `terra` and `stars`

- For NetCDF: `terra`, `ncdf4`, `tidync` and `stars`

---

# Efficient reading tools

<img src="index_files/figure-html/benchmark-1.svg" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# File naming and project organization

<br>

<div id="container">
<div>
</div> <div>
</div> <div>
</div>
</div>

---

# File naming: who can relate to this?

<center>
<figure>
<img style="margin:0px auto;display:block" src="img/files_naming.png" width = "700" alt="Laptop showing a file explorer with files badly named."/>
</figure>
</center>

---

# File naming basic rules

There are a few rules to adopt when naming files:

- Do not use special characters: **~ ! @ # $ % ^ & * ( ) ; < > ? , [ ] { } é è à**

- No spaces.

This ensures that your files are recognized by most operating systems and software.

<center>
<figure>
<img style="margin:0px auto;display:block" src="img/unsplash/hafidh-satyanto-1lDajtn_r7E-unsplash.jpg" alt="Photo made by Hafidh Satyanto showing old electrical wires piled on a table." width = "400"/>
</figure>
<figcaption>Photo by: <a href="https://unsplash.com/@satyanto?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Hafidh Satyanto</a> on <a href="https://unsplash.com/s/photos/old-tech?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# File naming basic rules

Why using special characters and spaces is a bad idea:

<img src="img/invalid_multibyte_string.png" alt="Screenshot of an Excel file with invalid code characters." width="650">

--

<small>

```r
r$> read_csv("myfile.csv")
Rows: 104937 Columns: 1
Error in nchar(x, "width") : invalid multibyte string, element 1
```

</small>

---

# File naming basic rules

For sequential numbering, .background-highlight[use leading zeros to ensure files sort properly].

- For example, use `0001`, `0002`, `1001` instead of `1`, `2`, `1001`.

<br>

<center>
<figure>
<img style="margin:0px auto;display:block" src="img/files_sorting.png" width = "800" alt="Laptop showing a file explorer with files badly named."/>
</figure>
</center>

---

# When file naming goes wrong!

.pull-left[
<center>
<figure>
<img style="margin:0px auto;display:block" src="img/python_file_naming.png" alt="Image showing a scientist that looks confused over his experiment." width = "500"/>
</figure>
<figcaption>Source: <a href="https://bit.ly/2M8cViI">https://bit.ly/2M8cViI</a></figcaption>
</center>
]

.pull-right[

> The glitch caused results of a common chemistry computation to vary depending on the operating system used, causing discrepancies among **Mac**, **Windows**, and **Linux** systems.
>
> ...the glitch, had to do with how different operating systems sort files.

]

---

# When file naming goes wrong!

Data files were sorted differently depending on the operating system on which the Python scripts were executed.

<center>
<figure>
<img style="margin:0px auto;display:block" src="img/files_sorting2.png" width = "500" alt="Screenshot showing how the same file names are sorted differently on Windows and on Linux."/>
</figure>
<figcaption><b>Original image from:</b> Bhandari Neupane, J. et al. Characterization of Leptazolines A-D, Polar Oxazolines from the Cyanobacterium Leptolyngbya sp., Reveals a Glitch with the “Willoughby-Hoye” Scripts for Calculating NMR Chemical Shifts. Org. Lett. 21, 8449-8453 (2019).</figcaption>
</center>

---

# File naming basic rules

--

- Be consistent and descriptive when naming your files.

--

- Separate the parts of a file name with `_` or `-` to add useful information about the data:

--

  - Project name.

--

  - The sampling locations.

--

  - Type of data/variable.

--

  - Date (YYYY-MM-DD).

--

.background-highlight[Always use the ISO format:] <big>**YYYY**</big>-<medium>**MM**</medium>-<small>**DD**</small> (large → small).

--

12-04-09 (2012-04-09 _or_ 2004-12-09 _or_ 2009-04-12 _or_ ..., 6 possible combinations in total)

--

2012-04-09 (April 9th, 2012)

---

# File naming basic rules (examples)

--

`data.csv` (not descriptive enough)

--

`temperature_1.csv` (what does **1** mean? No number padding!)

--

`temperature_20160708` (no file extension provided)

--

`station01_temperature_20160708.csv`

--

**Interesting resources:**

[How to name files - Jennifer Bryan (YouTube)](https://www.youtube.com/watch?v=ES1LTlnpLMk)

[How to name files - Jennifer Bryan (Slides)](https://speakerdeck.com/jennybc/how-to-name-files-the-sequel)

---

class: inverse, center, middle

# Working with data from other people

<center>
<figure>
<img src="https://media.giphy.com/media/3oxRmGXbquXKz6DNPq/giphy.gif" width = "400"/>
</figure>
</center>

---

## Preserve information: keep your raw data raw

Basic recommendations to preserve the raw data for future use:

--

- Do not make any changes or corrections to the original raw data file.

--

- .background-highlight[Use a scripting language (R, Python, Matlab, etc.) to perform analyses or make corrections, and save the results in a separate file.]

--

- If you want to do some analyses in Excel, make a copy of the file and do your calculations and graphs in the copy.

<small>Source: https://dataoneorg.github.io/Education/bestpractices/preserve-information-keep</small>

---

## Preserve information: keep your raw data raw

If a script changes the content of a raw data file and **saves it in the same file**, .background-highlight[the script will likely fail the second time it is run because the structure of the file has changed].

<center>
<figure>
<img src="img/keep_raw_data_raw_01.png" width = "950" alt="Drawing showing that data can change after the data is cleaned."/>
</figure>
</center>

---

# Project directory structure

--

- Choosing a logical and consistent way to organize your data files makes it easier for you and your colleagues to find and use your data.

--

- Consider using a specific folder to store raw data files.

--

- In my workflow, I use a folder named `raw` in which I consider files as .background-highlight[read-only].

--

- Data files produced by code are placed in a folder named `clean`.
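
The naming rules and the `raw`/`clean` workflow above can be sketched in base R. This is only a minimal illustration: the folder names `raw` and `clean` come from the slides, whereas the project location (a temporary directory), the file content and the column names are assumptions made up for the example.

```r
# Sketch of a raw -> clean workflow. The folder names `raw` and `clean` follow
# the slides; the project location, file content and column names are made up.
project_dir <- file.path(tempdir(), "my_project")
raw_dir <- file.path(project_dir, "raw")
clean_dir <- file.path(project_dir, "clean")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)

# Descriptive file name: zero-padded station number and an ISO-ordered date.
file_name <- sprintf(
  "station%02d_temperature_%s.csv",
  1L,
  format(as.Date("2016-07-08"), "%Y%m%d")
)

# Simulate a raw file received from a collaborator (-999 codes a missing value).
raw_file <- file.path(raw_dir, file_name)
writeLines(c("depth_m,temperature_c", "0,12.1", "5,-999", "10,9.8"), raw_file)

# Read the raw file, but never write to it.
raw <- read.csv(raw_file)

# Corrections live in the script, so they are documented and repeatable.
clean <- raw
clean$temperature_c[clean$temperature_c == -999] <- NA

# Save the corrected data in a separate folder.
clean_file <- file.path(clean_dir, file_name)
write.csv(clean, clean_file, row.names = FALSE)
```

Because the script only reads from `raw` and only writes to `clean`, it can be run again at any time and gives the same result.
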
---

# Project directory structure

<center>
<figure>
<img src="img/project_structure.png" height = "450" alt="Image showing how I organize my files in a data project."/>
</figure>
</center>

---

class: inverse, center, middle

# Tidy data

<br>

<center>
<img src="img/unsplash/alevision-co-d-1FY75fh_s-unsplash.jpg" class="centerImage" height="400" alt="Many pairs of shoes organized on a wall.">
<figcaption>
Photo by <a href="https://unsplash.com/@alevisionco?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">alevision.co</a> on <a href="https://unsplash.com/s/photos/organized?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# Why do we want tidy data?

--

- It is often said that .background-highlight[80% of data analysis is spent on cleaning and preparing the data!]

--

- Well-formatted data allows for quicker .background-highlight[visualization], .background-highlight[modeling], .background-highlight[manipulation] and .background-highlight[archiving].

<center>
<figure>
<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/tidydata_3.jpg" width="600" alt="On the left is a happy cute fuzzy monster holding a rectangular data frame with a tool that fits the data frame shape. On the workbench behind the monster are other data frames of similar rectangular shape, and neatly arranged tools that also look like they would fit those data frames. The workbench looks uncluttered and tidy. The text above the tidy workbench reads “When working with tidy data, we can use the same tools in similar ways for different datasets…” On the right is a cute monster looking very frustrated, using duct tape and other tools to haphazardly tie data tables together, each in a different way. The monster is in front of a messy, cluttered workbench.
The text above the frustrated monster reads “...but working with untidy data often means reinventing the wheel with one-time approaches that are hard to iterate or reuse.”">
<figcaption>
<b>Artwork by</b> <a href="https://twitter.com/allison_horst?s=20">@allison_horst</a>
</figcaption>
</figure>
</center>

---

# Tidy data

The main idea is that data should be organized in columns with .background-highlight[each column representing only a single type of data] (character, numerical, date, etc.).

<center>
<figure>
<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/tidydata_1.jpg" width="700" alt="Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.">
<figcaption>
<b>Artwork by</b> <a href="https://twitter.com/allison_horst?s=20">@allison_horst</a>
</figcaption>
</figure>
</center>

---

# How data is often structured

- Many researchers structure their data in such a way that it is easily manipulated by a human, .background-highlight[but not so much programmatically].

- A common problem is that the column headers are values, not variable names.
  - This often occurs with datasheets containing species abundance.

<center>
<figure>
<img src="img/species_wide.png" width = "800" alt="Image showing a frequent method to enter count data in spreadsheet software."/>
</figure>
</center>

---

# How data should be structured

After proper transformations, the data is now tidy ([or in normal form](https://en.wikipedia.org/wiki/Database_normalization)). .background-highlight[Each column is a variable, each row is an observation.]
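
The wide-to-long transformation described above can be sketched in base R. The `site` and `species_*` columns and their counts are made up for illustration; they are not the data shown in the figures.

```r
# Made-up abundance data in the "wide" layout: one column per species.
wide <- data.frame(
  site = c("A", "B"),
  species_1 = c(3, 0),
  species_2 = c(5, 2)
)

# Columns whose names are really values of a `species` variable.
species_cols <- setdiff(names(wide), "site")

# Stack the species columns: one row per site and species.
long <- data.frame(
  site = rep(wide$site, times = length(species_cols)),
  species = rep(species_cols, each = nrow(wide)),
  count = unlist(wide[species_cols], use.names = FALSE)
)

print(long)
```

With the tidyverse, `tidyr::pivot_longer()` performs the same reshaping in a single call.
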

<center>
<figure>
<img src="img/species_wide_to_long.png" style="max-width: 1024px; height: auto; " alt="Image showing count data converted from a wide to a long format."/>
</figure>
</center>

---

# Keep your data as rectangular tables

If you use a spreadsheet program, .background-highlight[keep your data arranged as rectangular tables]. Otherwise, .background-highlight[importing the data becomes difficult].

<center>
<img src="img/unsplash/lukas-blazek-mcSDtbWXUZU-unsplash.jpg" class="centerImage" height="400" alt="An open laptop with a spreadsheet program.">
<figcaption>
Photo by <a href="https://unsplash.com/@goumbik?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Lukas Blazek</a> on <a href="https://unsplash.com/s/photos/spreadsheet?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# Keep your data as rectangular tables

These two examples show the same data. One is arranged as two tables, whereas the other is correctly formatted as a single rectangular table.

<center>

.pull-left[
This sheet has two tables.

<img src="img/data_rectangle_1.png" class="centerImage" width="850" alt="Data in a computer spreadsheet program.">
]

.pull-right[
This sheet has one table.

<img src="img/data_rectangle_2.png" class="centerImage" width="250" alt="Data in a computer spreadsheet program.">
]

</center>

---

# Keep your data as rectangular tables

Do not be that person 😩😖😠😤💢😣🤦♀️🤦♂️😑😓

<center>
<img src="img/data_rectangle_3.png" class="centerImage" width="800" alt="Data in a computer spreadsheet program.">
</center>

---

class: inverse, center, middle

# Variable names

## How to choose variable names when creating data files?

<br>

<center>
<img src="img/unsplash/chris-ried-ieic5Tq8YMk-unsplash.jpg" class="centerImage" height="350" alt="Image of a laptop screen displaying Python code.">
<figcaption>
Photo by <a href="https://unsplash.com/@cdr6934?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Chris Ried</a> on <a href="https://unsplash.com/s/photos/programming-python?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# Variable names

--

.background-highlight[Be consistent with variable name capitalization:]

`temperature`, `precipitation`

`Temperature`, `Precipitation`

--

.background-highlight[Avoid mixing name capitalization:]

`temperature`, `Precipitation`

`temperature_min`, `TemperatureMax`

---

# Variable names

--

- Do not forget to provide information about abbreviations.
  - `tmin` vs `temperature_minimum`

--

- Do not use special characters or spaces (same as for file names).

--

- Explicitly state the unit of each variable:
  - `depth_m`, `chla_mg_m2`

--

- Be consistent with variable names across files:
  - `temp` vs `temperature`

---

# Missing values

--

- .background-highlight[Missing values should simply be represented by an empty cell in your data files.]

--

- R, Python, Matlab and other programming languages deal well with this.

--

- If not possible, use a standardized code to represent missing values:
  - `NA`, `NaN`

--

- .background-highlight[Do not use a numerical value (e.g., **-999**) to indicate missing values.]
  - This can create situations where missing values are included in calculations.
  - E.g., in R, `mean(c(1, NA, 3), na.rm = TRUE)` is 2, whereas `mean(c(1, -999, 3))` is about -331.7.

---

# Visualization

--

- Once data is tidy, .background-highlight[perform a visual inspection] to make sure there are no obvious errors in your data.

--

- A picture is worth a thousand words.
  - <span class = "background-highlight">Always, always, always plot the data!</span>

--

- A histogram can be used to represent the distribution of numerical data.

---

# Visualization

In this example, we see that there is an outlier in the data. Measuring device fault? Manual entry error?

<img src="index_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# Backups

## It is not _if_, but _when_ your hard drive will fail.

<br>

<center>
<img src="img/unsplash/art-wall-kittenprint-9Wq1HpghQ4A-unsplash.jpg" class="centerImage" height="300" alt="An open hard drive on a table.">
<figcaption>
Photo by <a href="https://unsplash.com/@artwall_hd?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Art Wall - Kittenprint</a> on <a href="https://unsplash.com/s/photos/backup?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# Backups vs Archives

--

**Backups**

> Backup is a copy of data created to restore said data in case of damage or loss. **The original data is not deleted after a backup is made.**

--

**Archives**

> An archive is a copy of data created for reference purposes.
**Although not required, the original is often deleted after an archive is made.**

<credit>Source: <a href="https://www.networkworld.com/article/3285652/backup-vs-archive-why-its-important-to-know-the-difference.html">https://www.networkworld.com/article/3285652/backup-vs-archive-why-its-important-to-know-the-difference.html</a></credit>

---

# Importance of backups

--

- .background-highlight[**Disk space is much cheaper than the time you invested in collecting, cleaning and analyzing your data.**]

--

- It is important to have .background-highlight[redundancy] in your data.
  - **A copy of your working directory in another directory on the same hard drive is not redundancy!**

--

- Backups should not be stored only on your computer (use cloud services):
  - Google Drive
  - Microsoft OneDrive (1TB of space for students at Université Laval)
  - Dropbox
  - MEGA

---

# Importance of backups

- Use an incremental strategy to back up your data (.background-highlight[ideally daily]).
  - [rsync](https://fr.wikipedia.org/wiki/Rsync)
  - [SyncBack](https://www.2brightsparks.com/syncback/syncback-hub.html)
  - [Duplicati](https://www.duplicati.com/)
  - [Syncthing](https://syncthing.net/)

- I keep 3 months of backup history.

---

# Source code management

- Backups of the source code used to generate data are also important.

- Git is a version control system used to keep track of changes in computer files.
  - Primarily used for source code management in software development.
  - Helps coordinate work on those files among multiple people.

<div id = "container">
<div><img src="https://miro.medium.com/max/1812/1*pXPseZOkwPqHGKaBXGS3sQ.png" height="100"/></div>
<div><img class="middle-img" src="https://about.gitlab.com/images/press/logo/png/old-logo-no-bkgrd.png" height="100"/></div>
<div><img src="https://i2.wp.com/wptavern.com/wp-content/uploads/2016/10/bitbucket-logo.png?ssl=1" height="100"/></div>
</div>

---

class: inverse, center, middle

# Publishing your data

## Making your data available to the community

<br>

<center>
<img src="img/unsplash/tim-mossholder-ZYBl6VnUd_0-unsplash.jpg" class="centerImage" height="300" alt="A wooden door with an open sign.">
<figcaption>
Photo by <a href="https://unsplash.com/@timmossholder?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Tim Mossholder</a> on <a href="https://unsplash.com/s/photos/open-data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# Publishing your data

Many journals and [funding agencies](http://www.science.gc.ca/eic/site/063.nsf/fra/h_97610.html)
now require a data archiving strategy. Why?

--

- Makes your data shareable (do not forget that research is funded with public money).

--

- Makes your data discoverable.

--

- Makes your data citable (.background-highlight[DOI, Digital Object Identifier]).
  - Collecting and producing data is difficult and requires a lot of resources (technical and financial).
  - Publishing your data allows other people to credit you for your hard work.

--

- Others can find and correct errors in your data.

--

- Data can be reused in other studies to build up knowledge.

---

# Publishing your data

There are at least two different ways to make your data available:

1. In a dedicated data paper.

2. In an appendix along with your paper (.background-highlight[assuming that your paper is published in an open-access journal]).
  - [The Directory of Open Access Journals](https://www.doaj.org/) is useful to search for open access journals.

<br>

<center>
<figure>
<img src="img/doaj_logo.png" width = "600"/>
</figure>
<figcaption>https://www.doaj.org/</figcaption>
</center>

---

# Public announcement

--

<p style="font-size:32px">.background-highlight[**Summary tables in a PDF article are not very useful!**]</p>

<center>
<figure>
<img src="https://media.giphy.com/media/j9Y9vsklHWtjgHOtLk/giphy.gif" width = "400"/>
</figure>
</center>

--

Instead, provide the data as supplementary information in a form that is easy to import into a programming language (for example, a `CSV` file).

---

# What is a data paper?

- .background-highlight[Data presented in an appendix are rarely reviewed by peers.]

- Data papers are interesting alternatives to publish data:
  - **Peer-reviewed** (high-quality data).
  - Generally open access (obviously!).
  - Data are citable with a DOI.

> A data paper is a **peer-reviewed document** describing a dataset, published in a peer-reviewed journal. It takes effort to prepare, curate and describe data.
Data papers provide recognition for this effort by means of a scholarly article.
>
> https://www.gbif.org/data-papers

---

# What is a data paper?

A data paper is similar to a traditional scientific paper.

<center>
<figure>
<img src="img/essd.png" width = "450"/>
</figure>
</center>

---

# What is a data paper?

The data associated with the paper is available online.

<center>
<figure>
<img src="img/seanoe.png" width = "450"/>
</figure>
</center>

---

<img src="index_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---

# Open repositories

There are many options available to publish your data.

- Polar Data Catalogue (https://www.polardata.ca/)
- Scholars Portal Dataverse (https://dataverse.scholarsportal.info/)
- Federated Research Data Repository (https://www.frdr-dfdr.ca/repo/?locale=fr)
- Pangaea (https://www.pangaea.de/)
- Dryad (https://datadryad.org)
- OGSL open data catalogue (https://ogsl.ca/fr/)
- Zenodo (https://zenodo.org/)
- Figshare (https://figshare.com/)
- Seanoe (https://www.seanoe.org/)
- NSF Arctic Data Center (https://arcticdata.io/)
- The Dataverse Project (https://dataverse.org/)

---

class: inverse, center, middle

# Take home messages

<br>

<center>
<img src="img/unsplash/glenn-carstens-peters-RLw-UC03Gwc-unsplash.jpg" class="centerImage" height="400" alt="A hand with a pencil writing in a book.">
<figcaption>
Photo by <a href="https://unsplash.com/@glenncarstenspeters?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Glenn Carstens-Peters</a> on <a href="https://unsplash.com/s/photos/hand-writing?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>

---

# Take home messages

--

- Choose non-proprietary file formats (e.g., `CSV`).

--

- Give your files and variables meaningful names.

--

- Tidy and visually explore your data to remove obvious errors.

--

- .background-highlight[**Back up your data externally as often as possible.**]
  - Your hard drive will eventually crash, for sure!

--

- Use a version control system (Git) for your analysis scripts.

--

- When possible, share the data and the scripts that were used in your research papers.

---

class: inverse, center, middle

<br>

<center>
<img src="img/unsplash/wilhelm-gunkel-AKQlYooS72w-unsplash.jpg" class="centerImage" height="500" alt="Paper sheet in an old typewriter with thank you written in a different language.">
<figcaption>
Photo by <a href="https://unsplash.com/@wilhelmgunkel?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Wilhelm Gunkel</a> on <a href="https://unsplash.com/s/photos/thank-you?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>
</center>