class: center, middle, inverse, title-slide

# The Green Edge ice camp data paper
## Green Edge legacy meeting (Nice, France)
### Philippe Massicotte et al.
### 2019-11-06

---

<br>

<center><img src="img/myname.png" class="centerImage" width="500"></center>

Yes, I am the person who sent you all those emails about the ice camp data paper.

**By the way, thank you to those who answered.**

---
class: my-one-page-font, inverse, center, middle

# Working with data from others

<center><img src="https://media.giphy.com/media/3oxRmGXbquXKz6DNPq/giphy.gif" class="centerImage" width="400"></center>

---

# Why is this tedious?

--

- Many sources of data.

--

- Different formats (xls, csv, mat, dat, etc.).

--

- Acquired at different temporal/spatial scales.

--

- Metadata rarely available.

--

- **Data structure varies greatly among researchers.**

---

# The consequences of working with unstructured data

> Most data scientists spend only 20% of their time on actual data analysis and 80% of their time finding, cleaning, and reorganizing huge amounts of data, which is an inefficient data strategy.

- Standardizing data is time-consuming *(and, unfortunately, rarely recognized)*.

- This is why we worked hard to standardize the Green Edge data and **publish it in a data paper**.

---
class: my-one-page-font, inverse, center, middle

# What is a data paper?

*A peer-reviewed document describing a dataset, published in a peer-reviewed journal.*

---

# Main goals of a *(the ice camp)* data paper

--

1. **Compile** and **structure** the data to make it **easily reusable** by other researchers.

--

2. Make the data **discoverable** for the community.

--

3. Make the data **citable**.

---
class: my-one-page-font, inverse, center, middle

# The Green Edge ice camp data

A tale in numbers: how to navigate *(almost)* blindly through other people's data.

---

# The data: a few statistics

--

- 3 000+ raw data files that have been consolidated into `\(\approx\)` 300 CSV files.

--

- 5 000+ lines of code just to *tidy* the data.
--

- **25 000 000+ observations!**

---

# Data availability

The data were deposited online at [Sea scientific open data publication (SEANOE)](https://www.seanoe.org/).

DOI: 10.17882/59892

<center><img src="img/seanoe.png" class="centerImage" height="350"></center>

---
class: my-one-page-font, inverse, center, middle

# The Green Edge data papers

*I wish I had a great subtitle to write here...*

---

# Two data papers in preparation

1. Ice camp 2015-2016: Philippe Massicotte

2. Amundsen 2016: Flavienne Bruyant

<center><br>
<svg style="height:200;fill:#B2CCE5;" viewBox="0 0 384 512"><path d="M224 136V0H24C10.7 0 0 10.7 0 24v464c0 13.3 10.7 24 24 24h336c13.3 0 24-10.7 24-24V160H248c-13.2 0-24-10.8-24-24zm57.1 120H305c7.7 0 13.4 7.1 11.7 14.7l-38 168c-1.2 5.5-6.1 9.3-11.7 9.3h-38c-5.5 0-10.3-3.8-11.6-9.1-25.8-103.5-20.8-81.2-25.6-110.5h-.5c-1.1 14.3-2.4 17.4-25.6 110.5-1.3 5.3-6.1 9.1-11.6 9.1H117c-5.6 0-10.5-3.9-11.7-9.4l-37.8-168c-1.7-7.5 4-14.6 11.7-14.6h24.5c5.7 0 10.7 4 11.8 9.7 15.6 78 20.1 109.5 21 122.2 1.6-10.2 7.3-32.7 29.4-122.7 1.3-5.4 6.1-9.1 11.7-9.1h29.1c5.6 0 10.4 3.8 11.7 9.2 24 100.4 28.8 124 29.6 129.4-.2-11.2-2.6-17.8 21.6-129.2 1-5.6 5.9-9.5 11.5-9.5zM384 121.9v6.1H256V0h6.1c6.4 0 12.5 2.5 17 7l97.9 98c4.5 4.5 7 10.6 7 16.9z"/></svg>
<svg style="height:200;fill:#B2CCE5;" viewBox="0 0 576 512"><path d="M402.6 83.2l90.2 90.2c3.8 3.8 3.8 10 0 13.8L274.4 405.6l-92.8 10.3c-12.4 1.4-22.9-9.1-21.5-21.5l10.3-92.8L388.8 83.2c3.8-3.8 10-3.8 13.8 0zm162-22.9l-48.8-48.8c-15.2-15.2-39.9-15.2-55.2 0l-35.4 35.4c-3.8 3.8-3.8 10 0 13.8l90.2 90.2c3.8 3.8 10 3.8 13.8 0l35.4-35.4c15.2-15.3 15.2-40 0-55.2zM384 346.2V448H64V128h229.8c3.2 0 6.2-1.3 8.5-3.5l40-40c7.6-7.6 2.2-20.5-8.5-20.5H48C21.5 64 0 85.5 0 112v352c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V306.2c0-10.7-12.9-16-20.5-8.5l-40 40c-2.2 2.3-3.5 5.3-3.5 8.5z"/></svg>
</center>

---

# The Green Edge ice camp data paper is...

--

- 109 authors (≈12-15 commented on the paper, thank you!)
--

- 43 affiliations/institutions

--

- **Too many emails to figure out the data!**

---

# Many institutions involved

<img src="index_files/figure-html/unnamed-chunk-2-1.svg" width="100%" style="display: block; margin: auto;" />

---

# Data papers

Submitted to **Earth System Science Data (ESSD)**

> Earth System Science Data (ESSD) is an international, interdisciplinary journal for the publication of articles on original research data (sets), furthering the reuse of high-quality data of benefit to Earth system sciences.

Impact factor: **10.951**

Impact factor (5-year): **9.899**

---

# The structure of the ice camp data paper

Divided into sections to present:

1. Physical data

2. Underwater bio-optical data

3. Nutrients

4. Bacteria and phytoplankton

5. Zooplankton

---

# One more thing...

--

<SPAN STYLE="color: #FCD116; font-size: 50pt;"><b>The paper has been accepted!</b></SPAN><br><br>

<center><img src="https://media.giphy.com/media/j0P7Dxb4Azvdkm0ziR/giphy.gif" class="centerImage" width="500"></center>

---

# The reviews are very positive

> The quality checks are appropriate and the process of reviewing the data is up-to-date and grants for **usefulness of the data** to other potential scientists.

> **The presentation is of high quality** and I don't see any inconsistencies that could raise suspects that the data are erroneous.

> ... the data presented hold potential for being reused in the future for comparison and further elaboration.

> The authors produced **an impressive, integrated data set**.

---
class: my-one-page-font, inverse, center, middle

# Lessons learned and recommendations

After all, it would not be fun otherwise!

---

# Lessons learned (1/2)

Although initial recommendations on good data collection practices were communicated to all scientists, **considerable effort was required to assemble the data**.

--

Thanks to Marie-Pier Amyot!
<center><img src="https://media.giphy.com/media/3oz8xSXvzK6P9lNPby/giphy.gif" class="centerImage" width="300"></center>

---

# Lessons learned (2/2)

To reduce possible errors and increase the overall efficiency of science:

--

1. **A uniform data management plan** should be prepared and distributed (**and respected!**) before each mission.

--

2. Dedicated data management specialists should be involved from the beginning of the project.

---

# Basic recommendations for data management (1/2)

--

- Whenever possible, **do not use proprietary file formats** such as MATLAB or Excel:

    - Use plain text files (e.g., CSV, TXT, or TAB).

---

# Basic recommendations for data management (2/2)

If you use a spreadsheet program (e.g., Excel), keep your data arranged as rectangular tables. Otherwise, **importing the data becomes difficult.**

.pull-left[
<center><img src="https://exceljet.net/sites/default/files/styles/function_screen/public/images/formulas/join%20tables%20with%20INDEX%20and%20MATCH%202.png?itok=HK29obOJ" class="centerImage" height="300"></center>
]

.pull-right[
<center><img src="https://www.excel-easy.com/examples/images/split/split-worksheet.png" class="centerImage" height="300"></center>
]

---

<br>

<div class="holder">
<img src="img/greenedge.png" />
</div>

<br>

<SPAN STYLE="color: #B2CCE5; font-size: 45pt;">Many thanks to those who helped directly or indirectly with the data!</SPAN>
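---

# Appendix: checking that a table is rectangular

The recommendation to keep data as rectangular plain-text tables can be checked automatically before files are shared. The following is a minimal Python sketch, not the code used for the data paper: the column names, station labels, values, and the `read_rectangular` helper are all made up for illustration.

```python
import csv
import io

# Hypothetical measurements (invented for illustration only):
# one header row, then one observation per row.
RAW = (
    "station,date,depth_m,chla_mg_m3\n"
    "S1,2016-06-01,1.5,0.42\n"
    "S1,2016-06-01,5.0,0.77\n"
    "S2,2016-06-03,1.5,0.31\n"
)


def read_rectangular(text):
    """Parse CSV text and fail loudly if any row breaks the rectangle."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        rows.append(dict(zip(header, row)))
    return header, rows


header, rows = read_rectangular(RAW)
print(header)
print(len(rows), "observations")
```

A file that passes this check has one header row and the same number of fields on every line, so most CSV readers should be able to import it without manual cleanup.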