
# Québec-Océan Training Workshop<br>Data Management and Archiving

### _Data Management Best Practices_

## Philippe Massicotte

January 31, 2024 (updated: January 30, 2024)

---

<br>

<b>Research assistant at Takuvik (Laval University)</b><br>

- Remote sensing, modelling, data science, data visualization, programming<br>

<br>


---

# Outline

- Open file formats for your data.
  - Tabular data.
  - Geographical data.

- Choosing the tools to read and manipulate your data.

- Files and data organization.

- Tidying and formatting data.

- Backups.

- Publishing your data.

---

# File formats

## Open file formats for your data.

---

# File formats

The file format used to store data has important implications:

- It determines whether you can .background-highlight[re-open] and .background-highlight[re-use] your data in the future:

  - Software might not be cross-platform (Windows/Mac/Linux).

  - Proprietary file formats can become obsolete or unsupported.

.left-column[<svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:100px;width:75px;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:#3c3c3c;overflow:visible;position:relative;"><path d="M48 448V64c0-8.8 7.2-16 16-16H224v80c0 17.7 14.3 32 32 32h80V448c0 8.8-7.2 16-16 16H64c-8.8 0-16-7.2-16-16zM64 0C28.7 0 0 28.7 0 64V448c0 35.3 28.7 64 64 64H320c35.3 0 64-28.7 64-64V154.5c0-17-6.7-33.3-18.7-45.3L274.7 18.7C262.7 6.7 246.5 0 229.5 0H64zm90.9 233.3c-8.1-10.5-23.2-12.3-33.7-4.2s-12.3 23.2-4.2 33.7L161.6 320l-44.5 57.3c-8.1 10.5-6.3 25.5 4.2 33.7s25.5 6.3 33.7-4.2L192 359.1l37.1 47.6c8.1 10.5 23.2 12.3 33.7 4.2s12.3-23.2 4.2-33.7L222.4 320l44.5-57.3c8.1-10.5 6.3-25.5-4.2-33.7s-25.5-6.3-33.7 4.2L192 280.9l-37.1-47.6z"/></svg>]
.right-column[**Example**: `.xlsx` files cannot be opened in older versions of Microsoft Excel (before Excel 2007).]

---

# File formats

Laboratory computer programs often use proprietary file formats.

This likely means that:

1. You will be forced to buy a license .background-highlight[which can be expensive].

2. You depend on the commitment of the company to support the file format in the future.


<center>
    <figure>
      <img style="margin:0px auto;display:block" src="img/unsplash/cdc-p33DqVXhWvs-unsplash.jpg" alt="Image showing a woman working in a lab." width = "500"/>
    </figure>
    <figcaption>Photo by <a href="https://unsplash.com/@cdc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">CDC</a> on <a href="https://unsplash.com/s/photos/lab-computer?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
    </figcaption> 
</center>

---

# Old-school computing in laboratories

When you depend on for-profit companies.

> At the Bodega Marine Laboratory at the University of California, Davis, some computers still run on **Microsoft Windows XP (released in 2001)**, because of the need to maintain compatibility with a scanning laser confocal microscope and other imaging equipment, says lab director Gary Cherr.

> **To work with current Windows versions, the team would have to replace the whole microscope. The marginal potential gains aren’t yet worth the US$400,000 expense**, Cherr reasons.

Source: [Old-school computing: when your lab PC is ancient](https://www.nature.com/articles/d41586-021-01431-y)


<center>
    <figure>
      <img style="margin:0px auto;display:block" src="img/unsplash/misael-moreno-fN6K30xtiKE-unsplash.jpg" alt="Photo showing a microscope." width = "250"/>
    </figure>
    <figcaption>Photo by <a href="https://unsplash.com/@moreno303?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Misael Moreno</a> on <a href="https://unsplash.com/s/photos/microscope?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
    </figcaption> 
</center>

---

# File formats

Ideally, the chosen file format should have these characteristics:

1. .background-highlight[Non-proprietary]: an open, publicly documented format.

2. .background-highlight[Unencrypted]: unless it contains personal or sensitive data.

3. .background-highlight[Human-readable]: the file should be human-readable **or** have tools available for reading and writing.

4. .background-highlight[Performance]: efficient read and write operations matter for large datasets (less important if you work with small datasets).

---

# Common open-source text file formats

Tabular plain text file formats (.background-highlight[standard text documents that contain unformatted text]):

- `.CSV`: Comma-separated (or semicolon-separated) values.

- `.TSV` or `.TAB`: Tab-separated values.

- `.TXT` and `.DAT`: Plain text files (.background-highlight[the delimiter cannot be inferred from the extension]).

All these file formats can be opened using a simple text editor.
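As a minimal sketch of how little tooling these formats require, the following Python snippet (standard library only, used here purely for illustration) parses the same records once as comma-separated and once as tab-separated text. The sample values are made up for the example.

```python
import csv
import io

# Illustrative sample records (hypothetical values), stored in two formats.
csv_text = (
    "species,island,bill_length_mm,year\n"
    "Adelie,Torgersen,39.1,2007\n"
    "Gentoo,Biscoe,46.1,2007\n"
)
tsv_text = csv_text.replace(",", "\t")


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows_from_csv = read_delimited(csv_text, ",")
rows_from_tsv = read_delimited(tsv_text, "\t")

print(rows_from_csv[0]["species"])      # first record, first variable
print(rows_from_csv == rows_from_tsv)   # same content, different delimiter
```

The only thing the reader needs to know in advance is the delimiter, which is why an explicit `.CSV` or `.TSV` extension is more helpful than a generic `.TXT`.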

---

# Examples of CSV and TSV files

This dataset contains 4 variables (columns). .background-highlight[The first line generally contains the names of the variables.]

<figure>
      <img src="img/penguins_csv_format.png" alt="Image showing a commas separated values file."/>
  </figure>

<figure>
      <img src="img/penguins_tsv_format.png" alt="Image showing a tabs separated values file."/>
  </figure>

<small>
Data source: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/.
</small>

---

## Common open-source geographic file formats

These files contain information on geographic features such as .background-highlight[points], .background-highlight[lines] or .background-highlight[polygons]. There are many [geographical file formats](https://gisgeography.com/gis-formats/), but here are some that are particularly popular.

- ESRI shapefile (`.SHP`)

  - Technically, the shapefile format is not open. .background-highlight[It is, however, widely used and often considered the standard].

- The GeoPackage format (`.gpkg`) [is an interesting open format](https://geocompr.robinlovelace.net/read-write.html?q=geopack#file-formats).

- GeoJSON (`.json`, `.geojson`, JSON variant with simple geographical features)

- GeoTIFF (`.tif`, `.tiff`, TIFF variant enriched with GIS relevant metadata)

- GeoParquet (`.parquet`) is an incubating [Open Geospatial Consortium (OGC) standard](https://geoparquet.org/) that adds interoperable geospatial types (Point, Line, Polygon) to [Apache Parquet](https://parquet.apache.org/).

---

# The GeoJSON format (Polygons)

This is a simple GeoJSON file defining 3 points that form a polygon. The first point is repeated at the end to close the ring.

<small>
```json
{
    "type": "Polygon",
    "coordinates": [
        [
            [30, 10],
            [10, 30],
            [40, 40],
            [30, 10]
        ]
    ]
}
```
</small>
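
Because GeoJSON is plain JSON, any JSON parser can read it. The following is a minimal Python sketch (standard library only), assuming the triangle from this slide written with the ring nesting that the GeoJSON specification (RFC 7946) uses for polygons. The variable names are illustrative.

```python
import json

# A GeoJSON Polygon: "coordinates" is a list of linear rings,
# and each ring is a list of [longitude, latitude] positions.
geojson_text = """
{
    "type": "Polygon",
    "coordinates": [
        [[30, 10], [10, 30], [40, 40], [30, 10]]
    ]
}
"""

polygon = json.loads(geojson_text)
exterior_ring = polygon["coordinates"][0]

# A ring is closed when its first and last positions are identical.
ring_is_closed = exterior_ring[0] == exterior_ring[-1]
distinct_points = len(exterior_ring) - 1

print(polygon["type"], ring_is_closed, distinct_points)
```

For real work, a dedicated spatial library is a better choice than manual parsing; this only shows that the format stays readable without one.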

**Create your own GeoJSON file online:**

- [https://geojson.io/](https://geojson.io/)

---

# The GeoJSON format

---

# The GeoTIFF format

> GeoTIFF is a public domain metadata standard that allows **georeferencing information to be embedded within a TIFF file.** The potential additional information includes map projection, coordinate systems, ellipsoids, datums, and everything else necessary to establish the exact spatial reference for the file.
>
> [Wikipedia](https://en.wikipedia.org/wiki/GeoTIFF)

<center>
    <figure>
      <img style="margin:0px auto;display:block" src="img/unsplash/nasa-HWIOLU7_O6w-unsplash.jpg" alt="Photo showing a coastal view taken by satellite." width = "450"/>
    </figure>
    <figcaption>Photo by <a href="https://unsplash.com/@nasa?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">NASA</a> on <a href="https://unsplash.com/s/photos/satellite?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
    </figcaption> 
</center>

---

# The GeoTIFF format (SST)

A GeoTIFF can contain information such as the Sea Surface Temperature (SST).

---

# The GeoTIFF format (SST)

A closer look allows us to better visualize the values (i.e. water temperature) within each pixel.

---

# A note on geospatial data

.background-highlight[It is usually a better idea to work with spatial objects (e.g., GeoTIFF) rather than tabular data.]

**Geographic data presented in a tabular form:**

<div id="hxuzubnuiy" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>#hxuzubnuiy table {
  font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

#hxuzubnuiy thead, #hxuzubnuiy tbody, #hxuzubnuiy tfoot, #hxuzubnuiy tr, #hxuzubnuiy td, #hxuzubnuiy th {
  border-style: none;
}

#hxuzubnuiy p {
  margin: 0;
  padding: 0;
}

#hxuzubnuiy .gt_table {
  display: table;
  border-collapse: collapse;
  line-height: normal;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 20px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: 50%;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#hxuzubnuiy .gt_caption {
  padding-top: 4px;
  padding-bottom: 4px;
}

#hxuzubnuiy .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#hxuzubnuiy .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 3px;
  padding-bottom: 5px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#hxuzubnuiy .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#hxuzubnuiy .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#hxuzubnuiy .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#hxuzubnuiy .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: bold;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#hxuzubnuiy .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: bold;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#hxuzubnuiy .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#hxuzubnuiy .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#hxuzubnuiy .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#hxuzubnuiy .gt_spanner_row {
  border-bottom-style: hidden;
}

#hxuzubnuiy .gt_group_heading {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  text-align: left;
}

#hxuzubnuiy .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#hxuzubnuiy .gt_from_md > :first-child {
  margin-top: 0;
}

#hxuzubnuiy .gt_from_md > :last-child {
  margin-bottom: 0;
}

#hxuzubnuiy .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#hxuzubnuiy .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
}

#hxuzubnuiy .gt_stub_row_group {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
  vertical-align: top;
}

#hxuzubnuiy .gt_row_group_first td {
  border-top-width: 2px;
}

#hxuzubnuiy .gt_row_group_first th {
  border-top-width: 2px;
}

#hxuzubnuiy .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#hxuzubnuiy .gt_first_summary_row {
  border-top-style: solid;
  border-top-color: #D3D3D3;
}

#hxuzubnuiy .gt_first_summary_row.thick {
  border-top-width: 2px;
}

#hxuzubnuiy .gt_last_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#hxuzubnuiy .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#hxuzubnuiy .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#hxuzubnuiy .gt_last_grand_summary_row_top {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: double;
  border-bottom-width: 6px;
  border-bottom-color: #D3D3D3;
}

#hxuzubnuiy .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#hxuzubnuiy .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#hxuzubnuiy .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#hxuzubnuiy .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#hxuzubnuiy .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#hxuzubnuiy .gt_sourcenote {
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#hxuzubnuiy .gt_left {
  text-align: left;
}

#hxuzubnuiy .gt_center {
  text-align: center;
}

#hxuzubnuiy .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#hxuzubnuiy .gt_font_normal {
  font-weight: normal;
}

#hxuzubnuiy .gt_font_bold {
  font-weight: bold;
}

#hxuzubnuiy .gt_font_italic {
  font-style: italic;
}

#hxuzubnuiy .gt_super {
  font-size: 65%;
}

#hxuzubnuiy .gt_footnote_marks {
  font-size: 75%;
  vertical-align: 0.4em;
  position: initial;
}

#hxuzubnuiy .gt_asterisk {
  font-size: 100%;
  vertical-align: 0;
}

#hxuzubnuiy .gt_indent_1 {
  text-indent: 5px;
}

#hxuzubnuiy .gt_indent_2 {
  text-indent: 10px;
}

#hxuzubnuiy .gt_indent_3 {
  text-indent: 15px;
}

#hxuzubnuiy .gt_indent_4 {
  text-indent: 20px;
}

#hxuzubnuiy .gt_indent_5 {
  text-indent: 25px;
}
</style>
<table class="gt_table" data-quarto-disable-processing="false" data-quarto-bootstrap="false">
  <thead>
    
    <tr class="gt_col_headings">
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="longitude">longitude</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="latitude">latitude</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="sst">sst</th>
    </tr>
  </thead>
  <tbody class="gt_table_body">
    <tr><td headers="longitude" class="gt_row gt_right">-66.425</td>
<td headers="latitude" class="gt_row gt_right">49.975</td>
<td headers="sst" class="gt_row gt_right">−1.60</td></tr>
    <tr><td headers="longitude" class="gt_row gt_right">-66.375</td>
<td headers="latitude" class="gt_row gt_right">49.975</td>
<td headers="sst" class="gt_row gt_right">−1.57</td></tr>
    <tr><td headers="longitude" class="gt_row gt_right">-66.325</td>
<td headers="latitude" class="gt_row gt_right">49.975</td>
<td headers="sst" class="gt_row gt_right">−1.52</td></tr>
    <tr><td headers="longitude" class="gt_row gt_right">-66.275</td>
<td headers="latitude" class="gt_row gt_right">49.975</td>
<td headers="sst" class="gt_row gt_right">−1.47</td></tr>
    <tr><td headers="longitude" class="gt_row gt_right">-66.225</td>
<td headers="latitude" class="gt_row gt_right">49.975</td>
<td headers="sst" class="gt_row gt_right">−1.42</td></tr>
    <tr><td headers="longitude" class="gt_row gt_right">-66.175</td>
<td headers="latitude" class="gt_row gt_right">49.975</td>
<td headers="sst" class="gt_row gt_right">−1.38</td></tr>
  </tbody>
  
  
</table>
</div>


**It is much easier to work with _spatial_ data:**

- Geometric operations

- Geographic projection

- Data extraction

- Joining

- And much more!

---

# Suggested readings

<figure>
      <img src="img/geocomputation_with_r.png" alt="Book cover of Geocomputation with R." height="450">
  </figure>

<figure>
      <img src="img/spatial_data_science_with_r.jpg" alt="Book cover of Spatial Data Science with R." height="450">
  </figure>

---

# Efficient tools for reading large datasets in R

---

# Efficient tools for reading large datasets

- .background-highlight[Data analysis is an iterative process that can be time-consuming when working with large datasets.]

- It is worth spending some time to find efficient tools to work with such large data.

<center>
  <svg aria-hidden="true" role="img" viewBox="0 0 581 512" style="height:200px;width:226.95px;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:#3c3c3c;overflow:visible;position:relative;"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg>
</center>

- R is my main programming environment, so here are some recommendations to be efficient when reading files.

- However, you can easily read all these file formats in your preferred programming language.

---

# R data importation tools

- For tabular data (`.CSV`, `.TXT`, `.TAB`, `.DAT`):

  - `readr`: https://readr.tidyverse.org/index.html

  - `data.table`: https://rdatatable.gitlab.io/data.table/

<br>

- For geographic data:

  - Shapefiles, GeoPackage, KMZ and KML: `sf`

  - GeoJSON: `jsonlite`, `sf`, `geojson` and `geojsonsf`

  - GeoTIFF: `terra` and `stars`

- For NetCDF: `terra`, `ncdf4`, `tidync` and `stars`

---

# Efficient reading tools

---

# File naming and project organization

<br>

---

# File naming: who can relate to this?

<center>
  <figure>
      <img style="margin:0px auto;display:block" src="img/files_naming.png" width = "700" alt="Laptop showing a file explorer with files badly named."/>
  </figure>
</center>
  
---
  
# File naming basic rules
  
There are a few rules to adopt when naming files:
  
- Do not use special characters: **~ ! @ # $ % ^ & * ( ) ; < > ? , [ ] { } é è à**
  
- No spaces.

This ensures that your files are recognized by most operating systems and software.
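
These rules are easy to check automatically. The following Python sketch is one possible check, under the assumption that a "safe" name contains only ASCII letters, digits, `_`, `-` and `.`; the function name and the example file names are hypothetical.

```python
import re

# Assumption: a "safe" name uses only ASCII letters, digits, "_", "-" and ".".
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_safe_filename(name):
    """Return True if the name has no spaces, accents, or special characters."""
    return SAFE_NAME.fullmatch(name) is not None


for name in ["station_01_ctd.csv", "données été (final).csv"]:
    print(name, is_safe_filename(name))
```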

<center>
    <figure>
      <img style="margin:0px auto;display:block" src="img/unsplash/hafidh-satyanto-1lDajtn_r7E-unsplash.jpg" alt="Photo made by Hafidh Satyanto showing old electrical wires piled on a table." width = "400"/>
    </figure>
    <figcaption>Photo by: <a href="https://unsplash.com/@satyanto?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Hafidh Satyanto</a> on <a href="https://unsplash.com/s/photos/old-tech?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>  
</center>

---

# File naming basic rules

Why using special characters and spaces is a bad idea.

<small>
```r
r$> read_csv("myfile.csv")
Rows: 104937 Columns: 1
Error in nchar(x, "width") : invalid multibyte string, element 1
```
</small>

---

# File naming basic rules

For sequential numbering, .background-highlight[use leading zeros to ensure files sort properly].

- For example, use `0001`, `0002`, `1001` instead of `1`, `2`, `1001`.
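
The effect of leading zeros is easy to demonstrate. In this short Python sketch (the `profile_` prefix is a made-up example), file names are sorted as text, which is what most file browsers and scripts do.

```python
# Without padding, names sort as text: "10" comes before "2".
unpadded = [f"profile_{i}.csv" for i in (1, 2, 10, 1001)]
print(sorted(unpadded))

# With zero padding to a fixed width, text order matches numeric order.
padded = [f"profile_{i:04d}.csv" for i in (1, 2, 10, 1001)]
print(sorted(padded))
```

Pick a width that covers the largest number you expect, since adding a digit later means renaming every file.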

<br>

---

# When file naming goes wrong!

<center>
    <figure>
      <img style="margin:0px auto;display:block" src="img/python_file_naming.png" alt="Image showing a scientist that looks confused over his experiment." width = "500"/>
    </figure>
    <figcaption>Source: <a href="https://bit.ly/2M8cViI">https://bit.ly/2M8cViI</a></figcaption>  
</center>

> The glitch caused results of a common chemistry computation to vary depending on the operating system used, causing discrepancies among **Mac**, **Windows**, and **Linux** systems.

> ...the glitch, had to do with how different operating systems sort files.

---

# When file naming goes wrong!

Data files were sorted differently depending on the operating system where the Python scripts were executed.

<center>
    <figure>
      <img style="margin:0px auto;display:block" src="img/files_sorting2.png" width = "500" alt="Screenshot showing how the same file names are sorted differently on Windows and on Linux."/>
    </figure>
    <figcaption><b>Original image from:</b> Bhandari Neupane, J. et al. Characterization of Leptazolines A-D, Polar Oxazolines from the Cyanobacterium Leptolyngbya sp., Reveals a Glitch with the “Willoughby-Hoye” Scripts for Calculating NMR Chemical Shifts. Org. Lett. 21, 8449-8453 (2019).</figcaption>  
</center>
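
One defensive habit, sketched here in Python, is to never rely on the order in which the operating system lists files and to sort explicitly instead. The file names are hypothetical, and this is not the code from the affected scripts.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create hypothetical data files in an arbitrary order.
    for name in ["run_0003.out", "run_0001.out", "run_0002.out"]:
        (folder / name).write_text("")

    # glob() makes no promise about ordering, so sort before processing.
    ordered_names = sorted(path.name for path in folder.glob("*.out"))

print(ordered_names)
```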

---

# File naming basic rules

- Be consistent and descriptive when naming your files.

- Separate the parts of a file name with `_` or `-` to add useful information about the data:

  - Project name.

  - Sampling location.

  - Type of data/variable.

  - Date (YYYY-MM-DD).

.background-highlight[Always use the ISO format:] <big>**YYYY**</big>-<medium>**MM**</medium>-<small>**DD**</small> (large <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:#d5695d;overflow:visible;position:relative;"><path d="M438.6 278.6c12.5-12.5 12.5-32.8 0-45.3l-160-160c-12.5-12.5-32.8-12.5-45.3 0s-12.5 32.8 0 45.3L338.8 224 32 224c-17.7 0-32 14.3-32 32s14.3 32 32 32l306.7 0L233.4 393.4c-12.5 12.5-12.5 32.8 0 45.3s32.8 12.5 45.3 0l160-160z"/></svg> small).
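
As an illustration, a file name following these rules can be assembled programmatically. This Python sketch assumes a hypothetical helper, `build_filename`, and made-up project, station, and variable names.

```python
from datetime import date


def build_filename(project, location, variable, sampling_date, extension="csv"):
    """Join the descriptive parts with "_" and an ISO 8601 date (YYYY-MM-DD)."""
    parts = [project, location, variable, sampling_date.isoformat()]
    return "_".join(parts) + "." + extension


# Hypothetical project, station, and variable names.
name = build_filename("arctic", "station-12", "chla", date(2016, 6, 9))
print(name)  # arctic_station-12_chla_2016-06-09.csv
```

Because the ISO format goes from the largest unit to the smallest, sorting such names alphabetically also sorts them chronologically.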

---

# File naming basic rules (examples)

**Interesting resources:**

[How to name files - Jennifer Bryan (YouTube)](https://www.youtube.com/watch?v=ES1LTlnpLMk)

[How to name files - Jennifer Bryan (Slides)](https://speakerdeck.com/jennybc/how-to-name-files-the-sequel)

---

# Working with data from other people

<center>
  <figure>
      <img src="https://media.giphy.com/media/3oxRmGXbquXKz6DNPq/giphy.gif" width = "400"/>
  </figure>
</center>
  
---
  
## Preserve information: keep your raw data raw
  
Basic recommendations to preserve the raw data for future use:
  
--
  
- Do not make any changes or corrections to the original raw data file.

- .background-highlight[Use a scripting language (R, Python, MATLAB, etc.) to perform analyses or make corrections, and save that information in a separate file.]

- If you want to do some analyses in Excel, make a copy of the file and do your calculations and graphs in the copy.

<small>Source: https://dataoneorg.github.io/Education/bestpractices/preserve-information-keep</small>

---

## Preserve information: keep your raw data raw

If a script changes the content of a raw data file and **saves it in the same file**, .background-highlight[the script will likely fail the second time it runs because the structure of the file has changed].

---

# Project directory structure

- Choosing a logical and consistent way to organize your data files makes it easier for you and your colleagues to find and use your data.

- Consider using a specific folder to store raw data files.

- In my workflow, I use a folder named `raw` in which I consider files as .background-highlight[read-only].

- Data files produced by code are placed in a folder named `clean`.
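
The `raw`/`clean` convention can be sketched as follows in Python. Everything here is illustrative: the file names, the `-999` placeholder, and the temporary project folder are assumptions made for the example. The script only reads from `raw` and only writes to `clean`, so it can be re-run at any time.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    raw_dir = project / "raw"
    clean_dir = project / "clean"
    raw_dir.mkdir()
    clean_dir.mkdir()

    # Stand-in for a raw file received from an instrument or a colleague.
    raw_file = raw_dir / "ctd.csv"
    raw_file.write_text("depth_m,temperature\n5,1.2\n10,-999\n")
    raw_before = raw_file.read_text()

    # Read from raw/, correct in memory, and write the result to clean/.
    with raw_file.open(newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        if row["temperature"] == "-999":  # hypothetical missing-value code
            row["temperature"] = ""
    clean_file = clean_dir / "ctd_clean.csv"
    with clean_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["depth_m", "temperature"])
        writer.writeheader()
        writer.writerows(rows)

    raw_unchanged = raw_file.read_text() == raw_before
    clean_lines = clean_file.read_text().splitlines()

print(raw_unchanged)
print(clean_lines)
```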

---

# Project directory structure

---

# Tidy data

<br>

<figcaption>
Photo by <a href="https://unsplash.com/@alevisionco?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">alevision.co</a> on <a href="https://unsplash.com/s/photos/organized?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>

---

# Why do we want tidy data?

- It is often said that .background-highlight[80% of data analysis is spent cleaning and preparing data!]

- Well-formatted data allows for quicker .background-highlight[visualization], .background-highlight[modeling], .background-highlight[manipulation] and .background-highlight[archiving].

<center>
<figure>
  <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/tidydata_3.jpg" width="600" alt="On the left is a happy cute fuzzy monster holding a rectangular data frame with a tool that fits the data frame shape. On the workbench behind the monster are other data frames of similar rectangular shape, and neatly arranged tools that also look like they would fit those data frames. The workbench looks uncluttered and tidy. The text above the tidy workbench reads “When working with tidy data, we can use the same tools in similar ways for different datasets…” On the right is a cute monster looking very frustrated, using duct tape and other tools to haphazardly tie data tables together, each in a different way. The monster is in front of a messy, cluttered workbench. The text above the frustrated monster reads “...but working with untidy data often means reinventing the wheel with one-time approaches that are hard to iterate or reuse.”">
<figcaption>
    <b>Artwork by</b> <a href="https://twitter.com/allison_horst?s=20">@allison_horst</a>
</figcaption>
</figure>
</center>

---

# Tidy data

The main idea is that data should be organized in columns with .background-highlight[each column representing only a single type of data] (character, numerical, date, etc.).

<center>
<figure>
  <img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/tidydata_1.jpg" width="700" alt="Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.">
<figcaption>
    <b>Artwork by</b> <a href="https://twitter.com/allison_horst?s=20">@allison_horst</a>
</figcaption>
</figure>
</center>

---

# How data is often structured

- Many researchers structure their data in such a way that it is easily manipulated by a human, .background-highlight[but not so much programmatically].

- A common problem is that column headers are values, not variable names.
  - This often occurs with datasheets containing species abundance.

---

# How data should be structured

After proper transformations, the data is now tidy ([or in normal form](https://en.wikipedia.org/wiki/Database_normalization)). .background-highlight[Each column is a variable, each row is an observation.]
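
The transformation from a "one column per species" sheet to a tidy table can be sketched in a few lines of Python. The stations, species, and counts below are hypothetical; in practice a dedicated function from a data-manipulation library would do the same job.

```python
# Hypothetical abundance sheet: one column per species (column names are values).
wide = [
    {"station": "A", "copepods": 12, "krill": 3},
    {"station": "B", "copepods": 7, "krill": 0},
]

# Tidy version: one row per (station, species) observation.
tidy = [
    {"station": row["station"], "species": species, "abundance": count}
    for row in wide
    for species, count in row.items()
    if species != "station"
]

for record in tidy:
    print(record)
```

The species names moved from the header into a `species` column, so adding a new species adds rows rather than columns.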

---

# Keep your data as rectangular tables

If you use a spreadsheet program, .background-highlight[keep your data arranged as rectangular tables]. Otherwise, .background-highlight[it makes data importation difficult].

<figcaption>
Photo by <a href="https://unsplash.com/@goumbik?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Lukas Blazek</a> on <a href="https://unsplash.com/s/photos/spreadsheet?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>

---

# Keep your data as rectangular tables

These two examples show the same data. One is arranged as two tables, whereas the other is correctly formatted as a single rectangular table.

<center>
.pull-left[
This sheet has two tables.
<img src="img/data_rectangle_1.png" class="centerImage" width="850" alt="Data in a computer spreadsheet program.">
]

.pull-right[
This sheet has one table.
<img src="img/data_rectangle_2.png" class="centerImage" width="250" alt="Data in a computer spreadsheet program.">
]

</center>

---

# Keep your data as rectangular tables

Do not be that person 😩😖😠😤💢😣🤦‍♀️🤦‍♂️😑😓

---

# Variable names

## How to choose variable names when creating data files?

<br>

<figcaption>
Photo by <a href="https://unsplash.com/@cdr6934?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Chris Ried</a> on <a href="https://unsplash.com/s/photos/programming-python?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</figcaption>

---

# Variable names

---

# Variable names

- Do not forget to provide information about abbreviations.
  - `tmin` vs `temperature_minimum`

- Do not use special characters or spaces (same as for file names).

- Explicitly state the unit of each variable:
  - `depth_m`, `chla_mg_m2`

- Be consistent with variable names across files:
  - `temp` vs `temperature`
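
These rules can be checked automatically before sharing a file. A minimal sketch in Python, assuming a lowercase `snake_case` convention (letters, digits and underscores only); the helper `invalid_names` and the column names are hypothetical:

```python
import re

# Assumed convention: starts with a lowercase letter, then lowercase
# letters, digits or underscores (no spaces, no special characters).
VALID_NAME = re.compile(r"[a-z][a-z0-9_]*")


def invalid_names(names):
    """Return the names that do not follow the assumed convention."""
    return [name for name in names if not VALID_NAME.fullmatch(name)]


columns = ["depth_m", "chla_mg_m2", "Temp (°C)", "station id"]
print(invalid_names(columns))  # ['Temp (°C)', 'station id']
```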

---

# Missing values

- .background-highlight[Missing values should simply be left as empty cells in your data files.]
  - R, Python, Matlab and other programming languages handle this well.

- If this is not possible, use a standardized code to represent missing values:
  - `NA`, `NaN`

- <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:#ffae00;overflow:visible;position:relative;"><path d="M256 32c14.2 0 27.3 7.5 34.5 19.8l216 368c7.3 12.4 7.3 27.7 .2 40.1S486.3 480 472 480H40c-14.3 0-27.6-7.7-34.7-20.1s-7-27.8 .2-40.1l216-368C228.7 39.5 241.8 32 256 32zm0 128c-13.3 0-24 10.7-24 24V296c0 13.3 10.7 24 24 24s24-10.7 24-24V184c0-13.3-10.7-24-24-24zm32 224a32 32 0 1 0 -64 0 32 32 0 1 0 64 0z"/></svg> .background-highlight[Do not use a numerical value (ex.: **-999**) to indicate missing values.]

- A numerical code can silently end up in calculations as if it were a real measurement.
  - For example, in R, `mean(c(1, NA, 3), na.rm = TRUE)` is 2, whereas `mean(c(1, -999, 3))` is about -331.7.
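
The same comparison can be sketched in Python with the standard library only. The helper `mean_ignoring_missing` is hypothetical; it plays the role of `na.rm = TRUE` in R:

```python
import math
from statistics import mean


def mean_ignoring_missing(values):
    """Average the values, skipping missing entries encoded as NaN."""
    present = [v for v in values if not math.isnan(v)]
    return mean(present)


with_nan = [1.0, math.nan, 3.0]     # missing value encoded as NaN
with_sentinel = [1.0, -999.0, 3.0]  # missing value encoded as -999

print(mean_ignoring_missing(with_nan))  # 2.0
print(round(mean(with_sentinel), 2))    # -331.67
```

With `NaN`, the missing value can be detected and skipped. With `-999`, nothing distinguishes it from a real measurement.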

---

# Visualization

- Once data is tidy, .background-highlight[perform a visual inspection] to make sure there are no obvious errors in your data.

- A picture is worth a thousand words.
  - <span class = "background-highlight">Always, always, always plot the data!</span>

- A histogram can be used to represent the distribution of numerical data.
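
Even without a plotting library, a text histogram makes a suspicious value stand out. A minimal sketch in Python, assuming made-up temperature values; `text_histogram` is a hypothetical helper, not part of any library:

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Count values per bin and return one text bar per non-empty bin."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return [f"{start:>5} | {'#' * counts[start]}" for start in sorted(counts)]


# Hypothetical temperatures (°C); 215 is probably 21.5 typed without the decimal point.
temperatures = [20.1, 21.5, 19.8, 22.3, 20.7, 215, 21.1, 18.9]

for line in text_histogram(temperatures, bin_width=5):
    print(line)
```

The bin starting at 215 sits far away from the others, which is the kind of value worth checking against the field notes.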

---

# Visualization

In this example, we see that there is an outlier in the data. Measuring device fault? Manual entry error?

---

# Backups

## It is not _if_, but _when_ your hard drive will fail.

<br>

Photo by <a href="https://unsplash.com/@artwall_hd?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Art Wall - Kittenprint</a> on <a href="https://unsplash.com/s/photos/backup?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

</figcaption>

</center>

---

# Backups vs Archives

**Backups**

> Backup is a copy of data created to restore said data in case of damage or loss. **The original data is not deleted after a backup is made.**

**Archives**

> An archive is a copy of data created for reference purposes. **Although not required, the original is often deleted after an archive is made.**

<credit>Source: <a href="https://www.networkworld.com/article/3285652/backup-vs-archive-why-its-important-to-know-the-difference.html">https://www.networkworld.com/article/3285652/backup-vs-archive-why-its-important-to-know-the-difference.html</a></credit>

---

# Importance of backups

- .background-highlight[**Disk space is much cheaper than the time you invested in collecting, cleaning and analyzing your data.**]

- It is important to have .background-highlight[redundancy] in your data.
  - <svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:#ffb82a;overflow:visible;position:relative;"><path d="M256 32c14.2 0 27.3 7.5 34.5 19.8l216 368c7.3 12.4 7.3 27.7 .2 40.1S486.3 480 472 480H40c-14.3 0-27.6-7.7-34.7-20.1s-7-27.8 .2-40.1l216-368C228.7 39.5 241.8 32 256 32zm0 128c-13.3 0-24 10.7-24 24V296c0 13.3 10.7 24 24 24s24-10.7 24-24V184c0-13.3-10.7-24-24-24zm32 224a32 32 0 1 0 -64 0 32 32 0 1 0 64 0z"/></svg> **A copy of your working directory in another directory on the same hard drive is not redundancy!**

- Backups should not be kept only on your computer (use cloud services):
  - Google Drive
  - Microsoft OneDrive (1 TB of space if you are a student at Université Laval)
  - Dropbox
  - MEGA

---

# Importance of backups

- Use an incremental strategy to back up your data (.background-highlight[ideally daily]).
  - [rsync](https://fr.wikipedia.org/wiki/Rsync)
  - [SyncBack](https://www.2brightsparks.com/syncback/syncback-hub.html)
  - [Duplicati](https://www.duplicati.com/)
  - [Syncthing](https://syncthing.net/)

- I keep 3 months of backups.
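
To illustrate what "incremental" means, here is a minimal sketch in Python: only files that are new or modified since the last run are copied. This is a simplification based on modification times, not how the tools listed above are implemented; the function `incremental_backup` and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def incremental_backup(source, destination):
    """Copy files that are new or modified since the last run; return what was copied."""
    source, destination = Path(source), Path(destination)
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        dst = destination / src.relative_to(source)
        # Copy when the file is missing from the backup or the source is more recent.
        if not dst.exists() or src.stat().st_mtime > dst.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
            copied.append(src.relative_to(source).as_posix())
    return copied


# Demonstration in a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    work, backup = Path(tmp) / "work", Path(tmp) / "backup"
    (work / "data").mkdir(parents=True)
    (work / "data" / "ctd.csv").write_text("depth_m,temperature_c\n1,4.2\n")
    print(incremental_backup(work, backup))  # first run: the file is copied
    print(incremental_backup(work, backup))  # second run: nothing has changed
```

In practice, the destination should be an external drive or a synchronized cloud folder, not another directory on the same hard drive.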

---

# Source code management

- Backups of the source code used to generate data are also important.

- Git is a version control system used to keep track of changes in computer files.
  - It is primarily used for source code management in software development.
  - It helps coordinate work on those files among multiple people.

---

# Publishing your data

## Making your data available to the community

<br>

Photo by <a href="https://unsplash.com/@timmossholder?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Tim Mossholder</a> on <a href="https://unsplash.com/s/photos/open-data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

</figcaption>

</center>

---

# Publishing your data

Many journals and [funding agencies](http://www.science.gc.ca/eic/site/063.nsf/fra/h_97610.html) now require a data archiving strategy. Why?

- Makes your data shareable (do not forget that research is funded with public money).

- Makes your data discoverable.

- Makes your data citable (.background-highlight[DOI, Digital Object Identifier]).
  - Collecting and producing data is difficult and requires a lot of resources (technical and financial).
  - Publishing your data allows other people to credit you for your hard work.

- Others can find and correct errors in your data.

- Data can be reused in other studies to build up knowledge.

---

# Publishing your data

There are at least two different ways to make your data available:

1. In a dedicated data paper.

2. In an appendix along with your paper (.background-highlight[assuming that your paper is published in an open-access journal]).

- [The Directory of Open Access Journals](https://www.doaj.org/) is useful to search for open access journals.

<br>

<center>
  <figure>
      <img src="img/doaj_logo.png" width = "600"/>
  </figure>
  <figcaption>https://www.doaj.org/</figcaption>
</center>

---

# Public announcement

<p style="font-size:32px">.background-highlight[**Summary tables in a PDF article are not very useful!**]</p>

Instead, provide the data as supplementary information in a format that is easy to import into a programming language (for example, a `CSV` file).
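
As an illustration, a summary table can be exported to CSV and read back in a few lines. A minimal sketch in Python using the standard `csv` module; the table content and the helper `to_csv` are hypothetical:

```python
import csv
import io

# Hypothetical summary table that might otherwise only appear in a PDF.
rows = [
    {"station": "A1", "depth_m": 5, "chla_mg_m2": 12.4},
    {"station": "A2", "depth_m": 10, "chla_mg_m2": None},  # missing value
]


def to_csv(rows):
    """Serialize rows to CSV text; missing values (None) become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)

# Anyone can read the table back without retyping numbers from a PDF.
records = list(csv.DictReader(io.StringIO(csv_text)))
print(records[0]["station"])  # A1
```

To produce an actual file, the same writer can be pointed at `open("summary.csv", "w", newline="")` (hypothetical file name) instead of the in-memory buffer.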

---

# What is a data paper?

- .background-highlight[Data presented in an appendix are rarely reviewed by peers.]
- Data papers are an interesting alternative for publishing data:
  - **Peer-reviewed** (high-quality data).
  - Generally open access (obviously!).
  - Data are citable with a DOI.

> A data paper is a **peer-reviewed document** describing a dataset, published in a peer-reviewed journal. It takes effort to prepare, curate and describe data. Data papers provide recognition for this effort by means of a scholarly article.
>
> https://www.gbif.org/data-papers

---

# What is a data paper?

A data paper is similar to a traditional scientific paper.

<center>
  <figure>
      <img src="img/essd.png" width = "450"/>
  </figure>
</center>
  
---
  
# What is a data paper?
  
The data associated with the paper is available online.

---

# Open repositories

There are many options available to publish your data.

- Polar Data Catalogue (https://www.polardata.ca/)
- Scholars Portal Dataverse (https://dataverse.scholarsportal.info/)
- Federated Research Data Repository (https://www.frdr-dfdr.ca/repo/?locale=fr)
- Pangaea (https://www.pangaea.de/)
- Dryad (https://datadryad.org)
- Catalogue de données ouverte OGSL (https://ogsl.ca/fr/)
- Zenodo (https://zenodo.org/)
- Figshare (https://figshare.com/)
- Seanoe (https://www.seanoe.org/)
- NSF Arctic Data Center (https://arcticdata.io/)
- The Dataverse Project (https://dataverse.org/)

---

# Take home messages

<br>

Photo by <a href="https://unsplash.com/@glenncarstenspeters?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Glenn Carstens-Peters</a> on <a href="https://unsplash.com/s/photos/hand-writing?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

</figcaption>

</center>

---

# Take home messages

- Choose non-proprietary file formats (ex.: `CSV`).

- Give your files and variables meaningful names.

- Tidy and visually explore your data to remove obvious errors.

- .background-highlight[**Back up your data externally as often as possible.**]
  - Your hard drive will eventually crash, for sure!

- Use a version control system (Git) for your analysis scripts.

- When possible, share the data and the scripts that were used in your research papers.

---

<br>

Photo by <a href="https://unsplash.com/@wilhelmgunkel?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Wilhelm Gunkel</a> on <a href="https://unsplash.com/s/photos/thank-you?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

</figcaption>

</center>