Title Slide

Data Management Best Practices

Conference on data management

Philippe Massicotte

April 8, 2025


Logos: CEISM (Laval University), Laval University, Takuvik.

Badge showing the name of the presenter (Philippe Massicotte).

Research assistant at Takuvik (Laval University)

Remote sensing, modeling, data science, data visualization, programming

https://github.com/PMassicotte

philippe.massicotte@takuvik.ulaval.ca

@philmassicotte

https://fosstodon.org/@philmassicotte

www.pmassicotte.com

Outline

  • Open file formats for your data.

    • Tabular data.
    • Geographical data (very briefly).
  • Files and data organization.

  • Tidying and formatting data.

  • Backups.

  • Publishing your data.

File formats

File formats

The file format used to store data has important implications:

  • It determines whether you can re-open and re-use your data in the future:

    • Software might not be cross-platform (Windows/Mac/Linux).

    • Proprietary file formats can become obsolete or unsupported.

Old-school computing in laboratories

Laboratory computer programs often use proprietary file formats.

This likely means that:

  1. You are forced to buy a license, which can be expensive.

  2. You depend on the commitment of the company to support the file format in the future.

Old-school laboratory

When you depend on for-profit companies

At the Bodega Marine Laboratory at the University of California, Davis, some computers still run on Microsoft Windows XP (released in 2001), because of the need to maintain compatibility with a scanning laser confocal microscope and other imaging equipment, says lab director Gary Cherr.

To work with current Windows versions, the team would have to replace the whole microscope. The marginal potential gains aren’t yet worth the US$400,000 expense, Cherr reasons.

Old-school microscope

File formats

Ideally, the chosen file format should have these characteristics:

  1. Non-proprietary: open source.

  2. Unencrypted: unless it contains personal or sensitive data.

  3. Human-readable: the file should be human-readable or have open source tools available for reading and writing.

  4. Performance: the format should allow efficient read and write operations, especially for large datasets.

Common open-source text file formats

Tabular plain text file formats:

  • .CSV: Comma-separated (or semicolon-separated) values.

  • .TSV (or .TAB): Tab-separated values.

  • .TXT and .DAT: Plain text files (the extension does not indicate the delimiter).

All these file formats can be opened using a simple text editor.

Examples of CSV and TSV files

The Palmer penguin representation by Allison Horst. The image shows three penguins: Adelie, Chinstrap, and Gentoo.

This dataset contains 4 variables (columns). The first line generally includes the names of the variables.

A comma-separated values file (.csv).

Screenshot of a CSV file with the content of the Palmer penguins dataset

A tab-separated values file (.tsv).

Screenshot of a TSV file with the content of the Palmer penguins dataset
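
Because both formats are plain text, they can be read with very little code. The following is a minimal sketch using Python's standard csv module. The column names and values are hypothetical stand-ins for the penguin data, not the actual content of the files shown on the slide.

```python
import csv
import io

# Hypothetical sample content; in practice this would be read from a file on disk.
csv_text = (
    "species,island,bill_length_mm,year\n"
    "Adelie,Torgersen,39.1,2007\n"
    "Gentoo,Biscoe,46.1,2007\n"
)
# The same table with tabs instead of commas.
tsv_text = csv_text.replace(",", "\t")


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

print(csv_rows[0]["species"], csv_rows[0]["bill_length_mm"])
print(csv_rows == tsv_rows)  # both files carry the same information
```

Only the delimiter changes between the two calls; every value is read as text and has to be converted to a number or date afterwards.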

Geographic data

In science, researchers often collect or use spatial data to analyze patterns, relationships, and distributions across geographic areas. Examples from mental health research:

  1. Mental health symptoms: Mapping areas with higher anxiety or depression prevalence.

  2. Access to services: Proximity to clinics or mental health resources.

  3. Environmental factors: Urban density, noise, or socioeconomic data.

Geographic file formats

  • Spatial data differs from tabular data in that it contains information on geographic features such as points, lines or polygons.

  • There are many geographical file formats, but here are some that are particularly popular.

  • ESRI shapefile (.SHP)

    • Technically, the shapefile format is not open. It is however widely used.
  • GeoJSON (.json, .geojson, JSON variant with simple geographical features)

  • GeoPackage (.gpkg)

Geographical data

The .geojson format is an open-source JSON-based format for storing geographical features and their attributes.

Map file giving the locations of the facilities of the health and social services network.

[
  {
    "ETAB_NOM_A": "CIUSSS de la Capitale-Nationale",
    "ADRESSE": "2915, avenue du Bourg-Royal",
    "geometry": { "type": "Point", "coordinates": [249916.9008, 5191041.7226] }
  },
  {
    "ETAB_NOM_A": "CHU de Québec – UL",
    "ADRESSE": "11, côte du Palais",
    "geometry": { "type": "Point", "coordinates": [250544.4158, 5186451.0535] }
  },
  {
    "ETAB_NOM_A": "Centre d'hébergement et de soins de longue durée Côté-J",
    "ADRESSE": "880, avenue Painchaud",
    "geometry": { "type": "Point", "coordinates": [246695.8913, 5183458.8756] }
  }
]

https://publications.msss.gouv.qc.ca/msss/fichiers/statistiques/cartes/Etablissements.zip
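
Since GeoJSON is ordinary JSON, it can be read without specialized software. The sketch below uses Python's standard json module on one facility copied from the excerpt above. The surrounding "FeatureCollection" and "Feature" wrapper is what the GeoJSON specification (RFC 7946) describes and is an assumption about how the real file is laid out, since the excerpt is simplified.

```python
import json

# One facility from the excerpt, wrapped in the standard GeoJSON structure.
# The wrapper is an assumption; only the attribute names and values come
# from the excerpt.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "ETAB_NOM_A": "CHU de Québec – UL",
        "ADRESSE": "11, côte du Palais"
      },
      "geometry": {"type": "Point", "coordinates": [250544.4158, 5186451.0535]}
    }
  ]
}
"""

collection = json.loads(geojson_text)

for feature in collection["features"]:
    # A Point geometry stores its position as [x, y].
    x, y = feature["geometry"]["coordinates"]
    print(feature["properties"]["ETAB_NOM_A"], x, y)
```

Dedicated geospatial libraries add projections and spatial operations on top of this, but the attributes and coordinates remain readable with any JSON parser.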

File naming and project organization

File naming: who can relate?

Laptop showing a file explorer with files badly named.

File naming basic rules

There are a few rules to adopt when naming files:

  • Do not use special characters: ~ ! @ # $ % ^ & * ( ) ; < > ? , [ ] { } é è à
  • No spaces.

This will ensure that the files are recognized by most operating systems and software.

  • Bad: meeting notes(jan2023).docx
  • Good: meeting_notes_jan2023.docx
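
The renaming shown above can be automated. Here is a minimal sketch in Python; the function name sanitize_filename and the choice of "_" as the replacement character are illustrative assumptions, not a standard.

```python
import re
import unicodedata


def sanitize_filename(name):
    """Return a file name that uses only letters, digits, '_', '-' and '.'."""
    # Decompose accented characters (é becomes e + accent) and drop the accents.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of disallowed characters (spaces, parentheses, ...) with '_'.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_name)
    # Drop underscores left dangling just before a dot, then at both ends.
    cleaned = re.sub(r"_+(?=\.)", "", cleaned)
    return cleaned.strip("_")


print(sanitize_filename("meeting notes(jan2023).docx"))
print(sanitize_filename("données été.csv"))
```

Renaming should be applied to copies or to new files; existing raw files are best left with the names they were delivered with.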

File naming basic rules

For sequential numbering, use leading zeros to ensure files sort properly.

  • For example, use 0001, 0002, 1001 instead of 1, 2, 1001.

Image showing how files are sorted in a file explorer.
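
The effect of leading zeros on sorting can be checked in a few lines of Python. The file names below are hypothetical.

```python
numbers = (1, 2, 10, 1001)

# Without padding, names are compared character by character.
names_unpadded = [f"sample_{i}.csv" for i in numbers]
# With padding to four digits, text order matches numeric order.
names_padded = [f"sample_{i:04d}.csv" for i in numbers]

sorted_unpadded = sorted(names_unpadded)
sorted_padded = sorted(names_padded)

print(sorted_unpadded)
print(sorted_padded)
```

In the unpadded list, sample_2.csv ends up after sample_1001.csv, whereas the padded list keeps the numeric order.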

When file naming goes wrong!

The glitch caused results of a common chemistry computation to vary depending on the operating system used, causing discrepancies among Mac, Windows, and Linux systems.

…the glitch, had to do with how different operating systems sort files.

When file naming goes wrong!

Data files were sorted differently depending on the operating system where the Python scripts were executed.

Image showing how files are sorted in a file explorer on Windows and Linux.
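
One way to guard against this kind of problem is to never rely on the order in which the operating system lists files. The sketch below, with hypothetical file names created in a temporary folder, sorts the list explicitly before using it.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create three small hypothetical data files.
    for name in ("run_b.csv", "run_a.csv", "run_c.csv"):
        (folder / name).write_text("value\n1\n")

    # The order returned by glob() depends on the file system and is not
    # guaranteed, so the list is sorted explicitly before it is used.
    data_files = sorted(path.name for path in folder.glob("*.csv"))

print(data_files)
```

With the explicit sort, the script processes the files in the same order on Windows, Mac, and Linux.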

File naming basic rules

  • Be consistent and descriptive when naming your files.

  • Separate parts of file names with _ or - to add useful information about the data:

    • Project name.

    • The sampling locations.

    • Type of data/variable.

    • Date (YYYY-MM-DD).

  • Always use the ISO 8601 format: YYYY-MM-DD (from the largest unit to the smallest).

  • Ambiguous: 12-04-09 (2012-04-09, 2004-12-09, 2009-04-12, ...; there are 6 possible interpretations).

  • Unambiguous: 2012-04-09 (April 9, 2012).
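
A practical benefit of ISO 8601 is that dates sort chronologically even when they are treated as plain text, for example in file names. The short Python sketch below illustrates this with hypothetical sampling dates.

```python
from datetime import date

sampling_dates = [date(2012, 4, 9), date(2009, 12, 4), date(2012, 1, 15)]

# ISO 8601 labels (YYYY-MM-DD).
iso_labels = [d.isoformat() for d in sampling_dates]
# The same dates written day-month-year.
dmy_labels = [d.strftime("%d-%m-%Y") for d in sampling_dates]

# Expected labels when the dates are in true chronological order.
chronological_iso = [d.isoformat() for d in sorted(sampling_dates)]
chronological_dmy = [d.strftime("%d-%m-%Y") for d in sorted(sampling_dates)]

print(sorted(iso_labels) == chronological_iso)  # text order matches time order
print(sorted(dmy_labels) == chronological_dmy)  # text order does not
```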

File naming basic rules (examples)

Imagine that you have to create a file containing temperature data from a weather station.

  • data.csv (not descriptive enough)

  • temperature_1 (what does 1 mean? No zero padding!)

  • temperature_20160708 (no file extension provided)

  • station01_temperature_20160708.csv
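
File names following this pattern can be generated by code rather than typed by hand, which keeps them consistent. A minimal sketch in Python follows; the function build_filename and its parameters are hypothetical.

```python
from datetime import date


def build_filename(station_number, variable, sampling_date, extension="csv"):
    """Assemble station<NN>_<variable>_<YYYYMMDD>.<extension>."""
    # :02d pads the station number with a leading zero; %Y%m%d formats the date.
    return f"station{station_number:02d}_{variable}_{sampling_date:%Y%m%d}.{extension}"


filename = build_filename(1, "temperature", date(2016, 7, 8))
print(filename)
```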

Working with data from other people

Preserve information: keep your raw data raw

Basic recommendations to preserve the raw data for future use:

  • Do not make any changes or corrections to the original raw data files.

  • Use a scripting language (R, Python, Matlab, etc.) to perform analyses or make corrections, and save the results in a separate file.

  • If you want to do some analyses in Excel, make a copy of the file and do your calculations and graphs in the copy.

Preserve information: keep your raw data raw

If a script modifies a raw data file and saves the result under the same name, the script will likely fail the second time it runs because the structure of the file has changed.

Image showing that data can change after the data is cleaned. The column "name" is split into two columns, "firstname" and "lastname".

Project directory structure

  • Choosing a logical and consistent way to organize your data files makes it easier for you and your colleagues to find and use your data.

  • Consider using a specific folder to store raw data files.

  • In my workflow, I use a folder named raw in which I consider files as read-only.

  • Data files produced by code are placed in a folder named clean.
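
The raw and clean folders can be combined with a cleaning script that only reads from raw and only writes to clean. The sketch below is one possible layout, not the presenter's actual code: the parent folder data, the file participants.csv, and its content are hypothetical, and a temporary directory stands in for the project folder.

```python
import csv
import tempfile
from pathlib import Path


def clean_names(raw_file, clean_file):
    """Read raw_file, split 'name' into two columns, write the result to clean_file."""
    with open(raw_file, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))
    clean_file.parent.mkdir(parents=True, exist_ok=True)
    with open(clean_file, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["firstname", "lastname"])
        writer.writeheader()
        for row in rows:
            firstname, lastname = row["name"].split(" ", 1)
            writer.writerow({"firstname": firstname, "lastname": lastname})


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    raw_file = project / "data" / "raw" / "participants.csv"
    raw_file.parent.mkdir(parents=True)
    raw_file.write_text("name\nAda Lovelace\nAlan Turing\n", encoding="utf-8")
    raw_before = raw_file.read_text(encoding="utf-8")

    clean_file = project / "data" / "clean" / "participants.csv"
    # Running the script twice works because the raw file is never modified.
    clean_names(raw_file, clean_file)
    clean_names(raw_file, clean_file)

    raw_unchanged = raw_file.read_text(encoding="utf-8") == raw_before
    clean_lines = clean_file.read_text(encoding="utf-8").splitlines()

print(raw_unchanged)
print(clean_lines)
```

Had the script overwritten the raw file, the second run would have failed because the name column would no longer exist.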

Project directory structure

Schematic showing a project directory structure.

Project directory structure

Schematic showing a project directory structure. Same as the previous slide, but with faded colors.

Tidy data

AI generated image of a book and pencils on a desk.

Why do we want tidy data?

  • It is often said that 80% of data analysis is spent cleaning and preparing the data!

  • Well-formatted data allows for quicker visualization, modeling, manipulation and archiving.

Artwork by @allison_horst

Tidy data

The main idea is that data should be organized in columns with each column representing only a single type of data (character, numerical, date, etc.).

Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.

Artwork by @allison_horst

How data is often structured

  • Many researchers structure their data in a way that is easy for a human to manipulate, but much less so programmatically.

  • A common problem is that the column headers are values, not variable names.

Example: mental health survey across various locations.

Image showing a frequent method to enter count data in spreadsheet software.

How data should be structured

After proper transformations, the data is now tidy (or in normal form). Each column is a variable, and each row is an observation.

Image showing a frequent method to enter count data in spreadsheet software.
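
The transformation from the spreadsheet-style layout to the tidy layout can be sketched in plain Python. The table below is hypothetical (the slide's actual values are only available as an image), and to_long is an illustrative helper, not a library function.

```python
# Hypothetical wide table: one row per location, one column per severity level.
wide_rows = [
    {"location": "site_a", "low": 12, "moderate": 7, "high": 3},
    {"location": "site_b", "low": 9, "moderate": 11, "high": 5},
]


def to_long(rows, id_column, names_to, values_to):
    """Turn every non-identifier column into its own row (wide to long)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], names_to: column, values_to: value}
            )
    return long_rows


tidy_rows = to_long(wide_rows, "location", "severity", "count")
for row in tidy_rows:
    print(row)
```

Each output row is now one observation (a location and a severity level) with its count. Data analysis libraries in R and Python provide equivalent reshaping functions.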

Multiple variables in one column

The column "type" contains both the gender and the age group of the individuals.

Multiple variables stored in one column.

It is better to have two separate columns: gender and age_group.

Multiple variables stored in two columns.
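
Splitting such a column is straightforward once the separator is known. In the sketch below, the values of type and the underscore separator are assumptions made for illustration; the real data may be encoded differently.

```python
# Hypothetical records where 'type' mixes gender and age group.
records = [
    {"type": "female_18-25", "count": 14},
    {"type": "male_26-40", "count": 9},
]


def split_type(record, separator="_"):
    """Split the combined 'type' value into 'gender' and 'age_group'."""
    gender, age_group = record["type"].split(separator, 1)
    return {"gender": gender, "age_group": age_group, "count": record["count"]}


tidy_records = [split_type(record) for record in records]
for record in tidy_records:
    print(record)
```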

Keep your data as rectangular tables

If you use a spreadsheet program, keep your data arranged as rectangular tables. Otherwise, it makes data importation difficult.

AI generated image of a piece of paper showing a spreadsheet.

Keep your data as rectangular tables

These two examples show the same data. One is arranged as two tables whereas the other is correctly formatted as a single rectangular table.

This sheet has two tables

Data in a computer spreadsheet program with two blocks of data.

This sheet has one table

Data in a computer spreadsheet program with one block of data.

Keep your data as rectangular tables

Do not be that person 😩😖😠😤😣🤦‍♀️🤦‍♂️😑😓

Data in a computer spreadsheet program. Data is not arranged in a rectangular table and there is a graphic placed inside the sheet.

Variable names

AI generated image of a cartoonish man in front of what looks like a computer with variable names on the screen.

Variable names

Be consistent with variable name capitalization:

temperature, precipitation

Temperature, Precipitation

Avoid mixing name capitalization:

temperature, Precipitation

temperature_min, TemperatureMax

Variable names

  • Try to avoid abbreviations.

    • anx vs anxiety_score
    • dep vs depression_severity
  • Explicitly state the unit of each variable:

    • consultation_time_hrs, medication_dose_mg, treatment_weeks
  • Be consistent with variable names across files:

    • patient_id vs participant_id
    • assessment_date vs evaluation_date

Variable names

Do not use special characters or spaces (same as for file names).

Screenshot showing an error message in R when trying to read a file with a special character.

r$> read_csv("myfile.csv")
Rows: 104937 Columns: 1
Error in nchar(x, "width") : invalid multibyte string, element 1
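
One common cause of this kind of error is a file saved in an encoding other than the one the reader expects, which only matters once non-ASCII characters such as accents appear. The Python sketch below reproduces the same class of problem with a hypothetical header line; it is an illustration, not a diagnosis of the file in the screenshot.

```python
# A hypothetical header with accented variable names, saved as Latin-1.
header = "température,précipitation\n"
latin1_bytes = header.encode("latin-1")

# Reading those bytes as UTF-8 fails because 'é' is stored differently.
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Plain ASCII names are stored identically in both encodings.
ascii_header = "temperature,precipitation\n"
same_in_both = ascii_header.encode("latin-1") == ascii_header.encode("utf-8")

print(decoded_as_utf8)
print(same_in_both)
```

Variable names restricted to unaccented letters, digits, and underscores avoid the issue regardless of how the file was saved.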

Missing values

  • Missing values should simply be left blank (empty cells) in your data files.

  • R, Python, Matlab and other programming languages deal well with this.

  • If not possible, use a standardized code to represent missing values:

    • NA, NaN
  • Do not use a numerical value (ex.: -999) to indicate missing values.

    • This can create situations where missing values will be silently included in calculations.
    • Ex.: the averages of these two vectors differ:
      • mean of [1, NA, 3], ignoring NA = 2
      • mean of [1, -999, 3] ≈ -331.67
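
The two averages above can be verified with a short Python sketch. NaN stands in for NA here, and mean_ignoring_missing is an illustrative helper.

```python
import math
from statistics import mean


def mean_ignoring_missing(values):
    """Average only the values that are not NaN."""
    present = [v for v in values if not math.isnan(v)]
    return mean(present)


with_nan = [1.0, math.nan, 3.0]
with_sentinel = [1.0, -999.0, 3.0]

mean_nan = mean_ignoring_missing(with_nan)
# -999 is an ordinary number, so nothing flags it as missing.
mean_sentinel = mean_ignoring_missing(with_sentinel)

print(mean_nan)                 # 2.0
print(round(mean_sentinel, 2))  # -331.67
```

The sentinel value passes through the missing-value filter unnoticed, which is exactly how it ends up silently included in calculations.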

Visualization

  • Once data is tidy, perform a visual inspection to make sure there are no obvious errors in your data.

  • A picture is worth a thousand words.

    • Always, always, always plot the data!
  • A histogram can be used to represent the distribution of numerical data.

Visualization

In this example, we see that there is an outlier in the data. Measuring device fault? Manual entry error?

Histogram of air temperature. A histogram is an accurate representation of the distribution of numerical data. Here we can observe that there is an outlier in the data.
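
Even without a plotting library, a quick text histogram is enough to reveal such an outlier. The temperatures below are hypothetical, with 250.0 standing in for an entry error; the 10-degree bin width is an arbitrary choice.

```python
from collections import Counter

# Hypothetical air temperatures in °C; 250.0 mimics a data entry error.
temperatures = [18.2, 19.5, 21.1, 20.4, 22.8, 19.9, 21.7, 250.0, 20.1, 18.8]

bin_width = 10
# Assign each value to the lower bound of its bin and count values per bin.
counts = Counter(int(t // bin_width) * bin_width for t in temperatures)

for bin_start in sorted(counts):
    bar = "#" * counts[bin_start]
    print(f"{bin_start:>4} to {bin_start + bin_width:<4} {bar}")
```

The isolated bin far from the others stands out immediately and prompts a check of the original record.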

Backups

It is not if, but when your hard drive will fail.

AI generated image of a hard drive failing with a person crying because of losing data.

Backups vs Archives

Backups

A backup is a copy of data created to restore said data in case of damage or loss. The original data is not deleted after a backup is made.[1]

Archives

The series of managed activities necessary to ensure continued access to digital materials for as long as necessary. Digital preservation is defined very broadly and refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological change.

Those materials may be records created during the day-to-day business of an organization; ‘born-digital’ materials created for a specific purpose (e.g., teaching resources); or the products of digitization projects. This definition specifically excludes the potential use of digital technology to preserve the original artefacts through digitization.[2]

Importance of backups

  • Disk space is much cheaper than the time you invested in collecting, cleaning and analyzing your data.

  • It is important to have redundancy in your data.

  • A copy of your working directory in another directory on the same hard drive is not redundancy!

  • Backups should not be kept only on your computer; also use a cloud service, such as:

  • Google Drive

  • Microsoft OneDrive (1TB of space if working at Laval University)

  • Dropbox

Important

Check with your institution or funding agency to see if they have a policy on data storage and backup. You may be required to use a specific service for sensitive data.

Importance of backups

The 3-2-1 backup rule

  • 3 total copies of your data (the original and two backups).

  • 2 different media types for the backups (e.g., an external hard drive and cloud storage).

  • 1 copy stored offsite, to protect against local disasters like fire or theft.

Importance of backups

Use an incremental strategy to back up your data (ideally daily):

I keep three months of data at three different locations:

  1. On my computer.

  2. On an external hard drive.

  3. On a cloud service provided by my university.
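
The idea behind an incremental backup is to copy only what is new or has changed since the previous run. The sketch below illustrates that idea with Python's standard library; it is a simplified illustration rather than a replacement for a dedicated backup tool, it keeps no history of older versions, and the folder names are hypothetical.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def incremental_backup(source, destination):
    """Copy files that are new or whose content changed; return what was copied."""
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        dst = destination / src.relative_to(source)
        # Copy when the backup is missing or its content differs from the source.
        if not dst.exists() or not filecmp.cmp(src, dst, shallow=False):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(src.relative_to(source).as_posix())
    return copied


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "project"
    destination = Path(tmp) / "backup"
    (source / "raw").mkdir(parents=True)
    (source / "raw" / "data.csv").write_text("value\n1\n")

    first_run = incremental_backup(source, destination)   # everything is new
    second_run = incremental_backup(source, destination)  # nothing has changed

    (source / "raw" / "data.csv").write_text("value\n1\n2\n")
    third_run = incremental_backup(source, destination)   # one file changed

print(first_run, second_run, third_run)
```

In practice the destination would be an external drive or a synchronized cloud folder rather than a directory on the same disk.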

Restoring from an incremental backup

List of Duplicati backup jobs.

Source code management

  • Backups of the source code used to generate data are also important.

  • Git is a version control system used to keep track of changes in computer files.

    • Primarily used for source code management in software development.
    • Also helps coordinate work on those files among multiple people.

Publishing your data

Publishing your data

Many journals and funding agencies now require an archiving strategy. Why?

  • Share your data (publicly funded research should be accessible).

  • Make your data discoverable.

  • Make your data citable (using DOI, Digital Object Identifier).

    • Data collection is resource-intensive.
    • Publishing allows others to credit your work.
  • Others can find and fix errors in your data.

  • Data can be reused in other studies.

Publishing your data

The traditional way to publish data is to include it as supplementary information with your paper.

  • In an appendix along with your paper (assuming that your paper is published in an open-access journal).
  • Data presented in an appendix are rarely reviewed by peers.

The Directory of Open Access Journals is useful for searching for open access journals.

Directory of Open Access Journals logo. The logo has three shapes on the left side and the text 'DOAJ' on the right side.

Public announcement

Summary tables in a PDF article are not very useful!

GIF showing a person throwing a book over his shoulder.

Instead, provide the data as supplementary information in a format that is easy to import into a programming language (for example, a CSV file).

What is a data paper?

Another way to publish data is to write a data paper.

  • Data papers are an interesting alternative for publishing data:

    • Peer-reviewed (high-quality data, in theory!).
    • Generally open access (obviously!).
    • Data are citable with a DOI.

A data paper is a peer-reviewed document describing a dataset, published in a peer-reviewed journal. It takes effort to prepare, curate and describe data. Data papers provide recognition for this effort by means of a scholarly article.

https://www.gbif.org/data-papers

What is a data paper?

A data paper is similar to a traditional scientific paper.

Screenshot of the Earth System Science Data journal with an article featuring many authors.

What is a data paper?

The data associated with the paper is available online with an associated DOI.

Screenshot of the Seanoe website showing a dataset with a DOI.

Bar plot showing the number of downloads per country for each dataset.

Open data repositories

Take home messages

Take home messages

  • Choose non-proprietary file formats (ex.: CSV).

  • Give your files and variables meaningful names.

  • Tidy and visually explore your data to remove obvious errors.

  • Back up your data externally as often as possible.

    • Your hard drive will eventually crash, for sure!
  • Use a version control system (git) for your analysis scripts.

  • When possible, share the data and the scripts that were used in your research papers.