According to Marwick (2017a), the field of archaeology has had a long-term commitment to empirical tests of reproducibility by returning to excavation and survey sites, but has only recently started to make progress in testing the reproducibility of statistical and computational results. As in many other fields, data sharing has received increased attention over the past decade, while sharing of code and analytical materials has gained attention only in the last few years. As noted by Marwick and Birch (2018), the Journal of Archaeological Science adopted a “data disclosure” policy in 2013, and its author guidelines were updated only in 2018 to encourage sharing of “software, code, models, algorithms, protocols, methods and other useful materials related to the project.”
Open access to archaeological data is sometimes problematic due to cultural sensitivities, issues of ownership (copyright or international stakeholders), and the impact of exposure (e.g., risks of looting). Data publishing is also limited by costs (key discipline repositories are fee-based) and researcher motivations. Community norms do not encourage or reward data and code publishing, and no journals require archaeologists to make code and data available by default. Discipline-specific repositories include the Archaeology Data Service, the Digital Archaeological Record (tDAR), and Open Context.
Marwick (2017a) outlines a set of basic principles to improve computational reproducibility in archaeological research. These are similar to guidelines provided in other fields:
- Make data and code openly available
- Publish only the data used in the analysis
- Use a programming language to write scripts for analysis and visualization
- Use version control
- Document and share the computational environment
- Archive in online repositories that issue persistent identifiers
- Use open licenses
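As a rough illustration of the principle "Document and share the computational environment", the following minimal sketch records interpreter and package versions in a form that can be archived alongside the code. It is written in Python purely for illustration; the case study described below uses R and Docker instead. The function name `describe_environment` and the package names passed to it are hypothetical.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a JSON-serializable record of interpreter and package versions.

    `packages` is a list of distribution names chosen by the caller
    (hypothetical here); names that are not installed are recorded as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record the absence explicitly
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Example package names only; substitute the ones the analysis imports.
    record = describe_environment(["pip", "setuptools"])
    print(json.dumps(record, indent=2, sort_keys=True))
```

Writing this record to a file under version control gives readers a plain-text description of the environment even when a container image is not available.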
Marwick and the archaeology community have adopted the concept of a “research compendium” to refer to data/code packages. This concept originated with Gentleman and Temple Lang (2004): “We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,…), and as a means for distributing, managing and updating the collection.”
Marwick (2017a) describes a specific case study to illustrate his principles for reproducibility and demonstrate the research compendium concept:
- Using a combination of R + knitr/Rmarkdown + Github + Docker + Figshare
- Licenses: CC0 for data, MIT for code
- Figshare was selected because discipline repositories either charge a fee to publish or have technical limitations integrating with Github.
Marwick (2017a) suggests that R is widely used by archaeologists “who share code with their publications” in part because it is free, is widely used in academic research (including statistics), and supports experimental packages. He selected Git because commits can be used to indicate the exact version of the code used at each submission (note that he started with a private repository that was opened after publication). He selected Docker for convenience, building his image on the existing rOpenSci image.
He found that the primary issue is the time required to learn the various tools and recommends incentivizing training in and practice of reproducible research. He also recommends changing editorial standards of journals by requiring submission of “research compendia”.
The Archaeology community has organized “reproducible research” Carpentry workshops and contributed to the open source curriculum. For example:
Example research compendia
This research compendium has been published by d’Alpoim Guedes and Bocinsky via a combination of Github and Zenodo. The paper, compendium, and data are each published as separate citable artifacts. The data package includes all raw (downloaded) and derived data generated by the analysis (~3GB). The code is packaged as an R package. The environment is provided via a Dockerfile that adds packages on top of the rocker/geospatial:3.5.1 image. The image has been pushed to Dockerhub and is therefore immediately re-runnable.
The authors provide multiple methods of re-running the analysis: by cloning and running the Github repository locally, via the published Docker image, or by building and running the Docker image locally. The primary entrypoint is a single R-Markdown script.
Data is downloaded from multiple sources during execution.
- The R FedData package is used to dynamically download data published by the NOAA Global Historical Climatology Network based on spatial constraints
- Instrument data published via NOAA FTP server (URL)
- An Excel spreadsheet published as supplemental data via Science (URL)
- They also use data from The Digital Archaeological Record (tDAR) (requires authentication)
- Elevation data via the Google Elevation API
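Because these sources are fetched at run time, a re-run can silently pick up changed data. One way to detect this is to record checksums of the downloaded files and compare them on later runs. The following is a minimal sketch of that idea in Python; it is not part of the compendium described here, and the function names and directory layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file under data_dir (relative POSIX path) to its digest."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir, manifest):
    """List files that were added, removed, or modified since `manifest`."""
    current = build_manifest(data_dir)
    return sorted(
        name
        for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )
```

A manifest built after the first successful run can be committed next to the code; a non-empty result from `changed_files` on a later run signals that an upstream source no longer matches what the published analysis used.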
This compendium suggests the following use cases:
- Support for rocker-project images
- Ability for researchers to dynamically and programmatically register immutable published datasets
- Support for authenticated data sources
- Ability to register data from FTP services
- Ability to store arbitrary credential information (e.g., in Home)
- Support for projects where Github is the active working environment
- Support for re-using the Github README for Tale description
- Association and display of citation information for associated materials
- Automatic citation of source data, where possible
- Separate licenses for code and data
The following are examples of “research compendia” from Archaeology:
Marwick and Birch (2018) report on three pilot studies exploring data sharing in archaeology. They discuss the ethics of data sharing arising from work with local and indigenous communities and other stakeholders, and describe archaeology as a “restricted data-sharing and data-poor field.”
Archaeology Data Service/Digital Antiquity 2011 Guides to Good Practice. Electronic document, http://guides.archaeologydataservice.ac.uk/
Journal of Archaeological Science 2018 Guide for Authors. Journal of Archaeological Science. Electronic document; https://www.elsevier.com/journals/journal-of-archaeological-science/0305-4403/guide-for-authors (via wayback)
Kansa, Eric C., and Kansa, Sarah W. 2013 Open Archaeology: We All Know That a 14 Is a Sheep: Data Publication and Professionalism in Archaeological Communication. Journal of Eastern Mediterranean Archaeology and Heritage Studies 1 (1):88–97
Marwick, B. J. (2017a) Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation. Journal of Archaeological Method and Theory 24: 424. https://doi.org/10.1007/s10816-015-9272-9
Marwick, B. et al. (2017b) Open science in archaeology. SAA Archaeological Record, 17(4), pp. 8-14.
Marwick, B. (2017c) Using R and Related Tools for Reproducible Research in Archaeology. In Kitzes, J., Turek, D., & Deniz, F. (Eds.) The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press. https://www.practicereproducibleresearch.org/case-studies/benmarwick.html
Marwick, B., & Birch, S. 2018 A Standard for the Scholarly Citation of Archaeological Data as an Incentive to Data Sharing. Advances in Archaeological Practice 1-19. https://doi.org/10.1017/aap.2018.3 https://doi.org/10.17605/OSF.IO/KSRUZ (code/data)
Marwick, B., Boettiger, C., & Mullen, L. (2017d). Packaging data analytical work reproducibly using R (and friends). The American Statistician https://doi.org/10.1080/00031305.2017.1375986
Nüst, Daniel, Carl Boettiger, and Ben Marwick. 2018. “How to read a research compendium.” arXiv:1806.09525
Open Digital Archaeology Textbook. https://o-date.github.io/draft/book/