2018-11-28: Tale Format design

Tommy, Kacper, Craig

Tale format discussion:

Review of Tommy’s draft:

  • Proposal:
    • Tales described in RDF and serialized via JSON-LD
      • Q. CW: Are we pushing a JSON-LD file to DataONE/Dataverse?
    • Toss it into a “Bagit” zip
  • BagIt
    • Archival
    • Directory structure w/ manifest (MDF5s), bagit.txt with version/character encoding
    • Every Bagit has a data folder
    • DataONE uses Bagit when downloading packages
    • Q. CW Are we talking about publishing BagIts to dataone?
      • TT: This would be used on Export
  • Discussion of use cases:
    • Export from WT to laptop
    • Publish to DataONE
      • EML
      • Write prov information to package resource map
    • Publish to Dataverse
      • SWORD + XML
      • Native API
      • []Write prov information to a file](http://guides.dataverse.org/en/latest/api/native-api.html#id65)
    • Interoperability with Odum CoRe2
      • There are real “tales” in the wild that do not conform to a strict specification
      • https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29911
      • “Tale by convention”
    • Interoperability with Binders/Capsules/O2R ERC
      • Comparing Binder to Tale
        • Publish a Git repo to a remote repository ala Zenodo
        • Get a list of files and directories from remote repo via repo2docker
        • Git repo = filesystem/directory/files
          • Environment description: apt.txt, etc
          • binder.yml: repo2docker version etc.
      • Capsules
        • zip file with “uuid name.version”
        • Predefined structure
          • code, environment (Dockerfile)
          • YML format, simpler than ours
    • Kacper:
      • We need to decide if it’s declarative or imperitive
        • imperitive: repo2docker. Binder specification is in code.
        • declarative:
    • Import a Capsule:
      • Looks like a Tale, but with many tags and authors
      • Build Docker image from environment
    • TT: discussion of DataONE package
      • We have EML document and Tale yaml
      • When we ingest data, we use the EML document (title, authors, etc)
      • Different than importing a capsule
    • CW: what do we/they have:
      • Externally referenced/registered datasets
        • Datalad git-annex could do this for Binder
      • We have knowledge of the compute environment
        • Capsules have Dockerfiles
        • Binders implicitly are Git repos with env spec baked in
      • Q. What distinguishes a “Tale” from a data package:
        • Prov information
      • CodeOcean enforces a directory structure
        • data, code, environment
      • WT can support directory hierarchy
        • DataONE and Dataverse cannot
    • CW: Does a Tale have to have an environment?
      • KK: For Binder, you get a Jupyter notebook by default.
      • TT: It needs at least a reference
      • What if https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/29911 has a Dockerfile?
        • Is the Dockerfile in data?
        • Do we build the image?
          • KK: With current import, we treat it as data.
            • Need to detect whether it is data or a tale
            • If data + code, import as data
            • If environment, it’s a tale
          • Tale by convention vs specification
    • CW: Should we impose structure/
      • code, data, environment, etc.
      • TT: It isn’t BagIt
  • codemeta?
    • https://codemeta.github.io/
  • Discussion of adapters during publication
    • TT: We need to commit to support them
  • What does export look like
    • Current: curl -X GET –header ‘Accept: application/zip’ –header ‘Girder-Token: 7xNaPOb1xweww1QCNrM8WRGuf7gwDkF4ld0D2jid4jZiDKuSEyovDw7aTid8UeLE’ ‘https://girder.dev.wholetale.org/api/v1/tale/59f0b91584b7920001b46f2e/export’
    • Download zip file
    • Unzip
    • runtale.sh
      • cd data
      • docker run repo2docker:v .
      • go get the data… and mount it…
      • docker run -v $(pwd)/data:/data -P my/image
      • Open browser to port?
    • What about the data?
      • Ala CodeOcean – put it in the Zip?
      • Can we use DataLad with WT as a provider?
    • Who cares/What do I do with a prov trace?
      • Verify that the runs are correct
      • Check that your inputs/outputs match their inputs outputs
      • Visualize the prov trace
      • User wants to verify their own stuff
      • Q. Did I reproduce the experiment?
        • Diffing the prov graph
        • DataLad does this too with Git diff
    • What metadata do I care about?
    • Is datasets.txt a new feature of repo2docker
  • Contraints
    • File path information
  • Goal:
    • Create slides/description consumable by PI team
    • Relate to existing initiatives
    • Inputs:
      • RDA package specification
      • codemeta
      • schema.org
      • BagIt
      • Dataverse
      • DataONE
      • Binder
      • CodeOcean
      • bdbags
      • Prov
      • O2R/ERC?
      • DataLad?