class: center, middle, inverse, title-slide

.title[
# Open Science, Open Methods, and Open Source communities
]
.subtitle[
## Serrapilheira/ICTP-SAIFR Training Program in Quantitative Biology and Ecology
]
.author[
### Andrea Sánchez-Tapia & Sara Mortara
]
.date[
### August 11 2022
]

---

## In this Introduction to Scientific Programming module we:

<svg viewBox="0 0 512 512" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> kept the integrity of the raw data and made a distinction between raw data and processed data

--

<svg viewBox="0 0 512 512" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> executed data analysis using scripts, allowing us to have a record of data cleaning and processing, and a transparent methodology

--

<svg viewBox="0 0 512 512" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> used version control for better collaboration and error tracking

--

<svg viewBox="0 0 512 512" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M173.898 439.404l-166.4-166.4c-9.997-9.997-9.997-26.206 0-36.204l36.203-36.204c9.997-9.998 26.207-9.998 36.204 0L192 312.69 432.095 72.596c9.997-9.997 26.207-9.997 36.204 0l36.203 36.204c9.997 9.997 9.997 26.206 0 36.204l-294.4 294.401c-9.998 9.997-26.207 9.997-36.204-.001z"></path></svg> used reproducible tools for writing

---

class: center, middle

# what is open science (and why do we talk about it in an R for biology/ecology course)?

---

## as scientists, we

+ Design experiments
+ Collect data in the field and the lab
+ Analyze them
+ Discuss, write, submit, publish manuscripts
+ Write grants, apply for funding, and manage these resources
+ Review manuscripts
+ Mentor and train other scientists, and collaborate with them
+ Carry out extension and science communication activities

---

class: middle, center

## The way we work can help us meet all these demands or be an obstacle: __good practices__

## Open Science can be a framework to think about these good practices

---

## what is Open Science

A set of practices that aim to make all research products publicly available, __from the original data and the methods used to their publication__

Open data and content can be __used__, __modified__, and __shared__ openly/freely by any\* person, for any\* purpose <!--this "any" part could be problematic-->

---

## why open science?

+ __transparency__, __quality__, __reproducibility__

--

+ more robust results, available for review and correction by your peers and (potentially) any citizen

--

+ __collaboration__, reanalysis, replication of results

--

+ return to society the investment made in scientific activities - be ethical

---

## Six open science principles

1. Open __data__

--

1. Open source __tools__ - not only free but also with open code

--

1. Open __methods__: sharing the details and tools needed to guarantee transparency and reproducibility

--

1. Open __access__ publication

--

1. Open __peer review__

--

1. Open __educational resources__ + diversity, equity, inclusion, accessibility (__Open scholarship__)

---

class: middle, center

## 1. __Open data__

## Scientific data acquisition, management, and maintenance

---

## what constitutes scientific data?

--

+ _raw_ data (measurements, recordings, images, files)

--

+ __experimental protocols__ (plans, procedures, instrument calibration)

--

+ data __cleaning__, __processing__, and __analysis__

--

+ _processed_ data

---

## data and metadata recording

+ in the field, the lab, the herbarium...

--

+ lab/field notebooks have to be a __permanent__, __well organized__, __understandable__, and __complete__ record that allows replication by others -> __open lab notebook__

--

+ lab notebooks should be kept in a safe place

--

+ backup! __data management plans__

---

## maintenance and sharing

+ was the __digitization__ correct?

--

+ local _backups_, web repositories, institutional repositories

--

+ associated with publications (e.g. Dryad, https://www.datadryad.org/)

--

+ long-term reference -> `DOI`

<!-- Data must be __shareable__: for correction, repetition, replication, reproduction, and reanalysis of experiments, and for collaboration on new work. -->

---

## who is responsible for data _integrity_?

__Any individual involved in the development and execution of the research and data processing__

+ The principal investigator
+ Advisors
+ Students
+ Lab assistants
+ Field assistants and people who take the measurements in the field
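---

## a sketch: checking raw data integrity

Keeping raw data intact is easier to guarantee when it can be checked. The block below is a minimal sketch in Python, not part of the course material: it assumes the raw data live in one folder, and the function names `sha256_of`, `build_manifest`, and `changed_files` are hypothetical names chosen for illustration. It records a checksum for every raw file and later reports which files no longer match.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map every file under raw_dir (by relative path) to its checksum."""
    return {
        p.relative_to(raw_dir).as_posix(): sha256_of(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(raw_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files modified, removed, or added since the manifest was built."""
    current = build_manifest(raw_dir)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

A manifest built right after data collection could be stored next to the backups; an empty list from `changed_files` then suggests that the raw files are still the ones originally recorded.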
---

## metadata: data about data

The __what__, __where__, __how__, __when__, __who__, and __why__ of data collection

+ The __methods__ and the __rationale__ that led to the data treatment methods <!--details about the processing of the raw data -->
+ __Materials used__
+ Locality
+ Additional observations and notes
+ Adequate sample labels and IDs for every data point collected

---

background-image: url(https://www.go-fair.org/wp-content/themes/go-fair/images/logo.svg)
background-position: 80% 80%

## a framework for open data: FAIR criteria

+ Wilkinson et al 2016 [The FAIR Guiding Principles for scientific data management and stewardship](https://www.nature.com/articles/sdata201618)
+ https://www.go-fair.org/fair-principles/
+ FAIR data:
    + __F__indable
    + __A__ccessible
    + __I__nteroperable
    + __R__eusable

---

## Findable (F):

+ F1. (Meta)data are assigned a globally unique and persistent identifier
+ F2. Data are described with rich metadata (defined by R1 below)
+ F3. Metadata clearly and explicitly include the identifier of the data they describe
+ F4. (Meta)data are registered or indexed in a searchable resource
+ __global, unique, and persistent identifier__ (DOI, ORCID)

---

## accessible (A):

+ A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
    + A1.1 The protocol is open, free, and universally implementable
    + A1.2 The protocol allows for an authentication and authorisation procedure, where necessary
+ A2. Metadata are accessible, even when the data are no longer available

---

## interoperable (I)

+ I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
+ I2. (Meta)data use vocabularies that follow FAIR principles
+ I3. (Meta)data include qualified references to other (meta)data

---

## reusable (R)

+ R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
    + R1.1. (Meta)data are released with a clear and accessible data usage license
    + R1.2. (Meta)data are associated with detailed provenance
    + R1.3. (Meta)data meet domain-relevant community standards

#### There are __standard protocols__ in Biology and Ecology (Darwin Core, EML - Ecological Metadata Language)

---

background-image: url(https://images.squarespace-cdn.com/content/v1/5d3799de845604000199cd24/1567592828276-IZWQDX1H6DRCD85GRSWJ/CARE+Principles.png?format=1000w)
background-position: 90% 10%
background-size: 30%

## Openness is not enough

https://www.gida-global.org/care

+ __C__ollective Benefit: Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data.
+ __A__uthority to Control: Indigenous Peoples’ rights and interests in Indigenous data must be recognised and their authority to control such data be empowered.
+ __R__esponsibility: Those working with Indigenous data have a responsibility to share how those data are used to support Indigenous Peoples’ self-determination and collective benefit.
+ __E__thics: Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem.

### Data should be __as open as possible, and as closed as necessary__

---

class: center, middle

## 2. __*Open source* tools__

---

## _libre_ software

+ free as in freedom and open code

--

+ anyone should be able to __use__, __copy__, __study__, __modify__, and __redistribute__ it

--

+ __without restrictions__

--

+ __without discrimination__ against groups of people or fields

--

+ respecting and keeping original __licenses__

---

class: middle, center

# Can you replace all your workflow tools with open source software?

--

... most probably not

---

class: center, middle

# 3. __Open methods__

---

## open methods

+ data recording, processing, and analysis

--

+ each step should be described

--

+ publications should include all the necessary information to

--

    + be understood by the reader (trust and robustness)

--

    + allow other scientists to (attempt to) replicate the results

--

+ __metadata__ are key here

---

## replication and reproducibility

[Nature](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970) asked 1500 scientists: is there a reproducibility crisis?

.pull-left[
<img src="figs/reproducibility-graphic-online1.jpeg" width="400" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="figs/reproducibility-graphic-online3a.png" width="400" style="display: block; margin: auto;" />
]

---

class: middle

.pull-left[
<img src="figs/reproducibility-graphic-online4.jpg" width="350" />
]

--

.pull-right[
<img src="figs/reproducibility-graphic-online3b.png" width="350" />

+ malpractice and pressure to publish new and successful results (bias)
+ lack of transparency in data and methods/code
]

---

## some basic rules

+ don't edit or modify your graphs to change the results

--

+ don't tweak analysis methods to obtain desired results

--

+ don't omit data that do not support your (desired) conclusions

--

+ don't fabricate data

--

+ don't modify your data

--

+ don't report the same results in different publications (_"salami science"_)

---

.pull-left[
<img src="figs/reproducibility-graphic-online5.jpg" width="350" />
]

--

.pull-right[
+ learn more about __methods__ and __experimental design__
+ better __mentoring/supervision__ and __teaching__
+ data and code __quality control__ (peer review!)
+ __incentives*__ for better practices
]

---

## __but not everyone was taking decisions about this...__

<img src="figs/reproducibility-graphic-online6.jpg" width="400" style="display: block; margin: auto;" />

---

## steps towards open methods

+ give priority to script-based tools

--

+ use version control systems like `git`

--

+ document every step

--

+ publish protocols and code within the ethical limits

--

+ promote peer review of methods and code

--

## does this solve all the problems? no (but it's a good first step)

---

## Open tools or methods?

| | open tools| closed tools|
|--:|--:|--:|
| open methods| __ideal__| ...|
| closed methods| ...| the worst|

--

+ Open software and tools can be used in "closed" and non-reproducible ways.

--

+ Users of closed software can take steps to open their methods

---

class: center, middle

# 4. Open publishing

---

background-image: url(https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/Open_Access_logo_PLoS_transparent.svg/142px-Open_Access_logo_PLoS_transparent.svg.png)
background-size: 20%
background-position: 80% 20%

## open access journals

1. __Diamond__: available for free, without charging the authors (no article processing charges, APCs)

--

1. __Gold__: available for free, charging the authors

--

1. __Green__: allows authors to deposit versions in their own repositories (self-archiving)

--

1. __Hybrid__: mix of open and closed manuscripts

--

#### Won't be a topic today, but beware of predatory journals!

---

### Open access articles are cited more often

<img src="https://the-turing-way.netlify.app/_images/open-access-citations.jpg" style="display: block; margin: auto;" />

(McKiernan et al 2016)

---

## New publishing modes

4. _Pre-prints_ and journals that accept them <!--, versions previously published online but without peer review -->

--

3. Journals that require data to be in a repository (e.g. __Dryad__)

--

5. Journals that accept -> require the code used in the analysis

--

6. __Reproducible manuscripts__

--

7. __Open peer-review__ (that's the whole __fifth__ pillar of Open Science), e.g. __PeerJ__

---

# 6. Open Education

The sixth pillar of open science refers to __open education__.

Create __open teaching materials__, available to all irrespective of their background, origin, or social and economic situation.

Expansion towards Open scholarship. Diversity, Equity, Inclusion, Accessibility.

---

## How is your workflow today?

--

+ Would it benefit from changes towards openness?

--

+ Who is responsible for your data integrity?

--

+ How are your data being saved, backed up, and managed?

--

+ Have collaborations come from sharing your data? Could you share the data in their current state?

--

+ Do you work with sensitive data? Are there contexts in your work in which you have to take data privacy into account, or in which you shouldn't share your data?

--

+ Do you document your data and methods thoroughly?

--

+ When you use point-and-click methods, do you make sure to document every step?

---

## Open source and open science communities

+ Communities of practice

--

+ How to __get help__ (how to __ask questions__, give reproducible examples)

--

+ How to __contribute__ (e.g. scikit-learn [contributing guidelines](https://scikit-learn.org/stable/developers/contributing.html?highlight=contributing), rOpenSci [contributing guide](https://contributing.ropensci.org/))

--

+ How to interact in general. Codes of conduct (e.g. [Carpentries](https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html))

--

+ How to teach others (e.g. become a __Certified Carpentries Instructor__)

--

+ How to __develop packages__, do __package review__ (e.g. rOpenSci https://devguide.ropensci.org/)

--

+ How to contribute to __CRAN Task Views__ https://cran.r-project.org/web/views/ https://github.com/cran-task-views/ctv

---

## references

+ Gorgolewski KJ, Poldrack RA (2016) A Practical Guide for Improving Transparency and Reproducibility in Neuroimaging Research. PLoS Biol 14(7): e1002506. https://doi.org/10.1371/journal.pbio.1002506
+ Kraker, P., Leony, D., Reinhardt, W., & Beham, G. (2011). The Case for an Open Science in Technology Enhanced Learning. International Journal of Technology Enhanced Learning, 6(3), 643–654
+ http://opendefinition.org/od/2.1/en/
+ https://nikokriegeskorte.org/2016/02/15/the-four-pillars-of-open-science/
+ Open Science Framework https://osf.io/

---

## references

+ Blog OpenScience.com https://openscience.com/
+ NIH - Instruction in Responsible Conduct of Research https://oir.nih.gov/sourcebook/ethical-conduct/responsible-conduct-research-training/
+ Zook, Matthew, et al. “Ten Simple Rules for Responsible Big Data Research”. PLOS Computational Biology, vol. 13, no. 3, March 2017, p. e1005399. PLoS Journals, doi:10.1371/journal.pcbi.1005399.
+ Erin C. McKiernan, Philip E. Bourne, C. Titus Brown, Stuart Buck, Amye Kenall, Jennifer Lin, Damon McDougall, Brian A. Nosek, Karthik Ram, Courtney K. Soderberg, Jeffrey R. Spies, Kaitlin Thaney, Andrew Updegrove, Kara H. Woo, and Tal Yarkoni. Point of View: How open science helps researchers succeed. eLife, Jul 2016. doi:10.7554/eLife.16800.