2018-12-05: Tale Format design

Tommy, Kacper, Craig

  • Tommy
    • Looked at Dataverse main metadata file/DC vocab terms covering licenses (type/file), name, author, etc.
    • Still shooting for an RDF representation
  • Review of notes:
    • Distinction between export and publishing
    • TT: When we do publish, we will have some sort of metadata document
  • Craig:
    • 12/04 Dataverse community call covered publishing Git repos and file hierarchy support
      • Long-standing issue to support Zenodo-like Github publication
      • Gigantum, Binder, WT talking about publishing to them
      • Phase 1 recommendation is to publish Git repos as zipfiles
        • Gigantum is publishing Git repos to dataverse as zipfiles to preserve hierarchy because Dataverse does not support file hierarchy.
      • Re-iterate the Odum CoRE2 case, which is adding a Dockerfile to an existing Dataverse collection/dataset as a way to provide the environment
  • Discussion
    • TT: DC relates to the document we need to create for the DVN SWORD API
      • http://guides.dataverse.org/en/latest/api/sword.html#create-a-dataset-with-an-atom-entry
    • KK: They support adding dataset as zip – zip is unpacked, zipped-zip is published as a zipfile
    • CW: RDF is for internal structure which gets translated to each publisher?
      • TT: may not have a 1:1 mapping
    • CW:
      • Publish metadata to DV/DataONE in their terms/format (i.e., their metadata) for discovery, etc.
        • Facilitates discovery – maximizes the value of integration with their system
      • Publishing “tale.rdf” or “tale.yaml” allows us to know what we’re dealing with – serialize/deserialization export/import
      • Spectrum of “tales” that might be in Dataverse/DataONE
        • Created outside of WT
          • Data + Code and no environment
          • Dockerfile + Data + code
          • Binder-like structure
          • We would use the DataONE/DV metadata during tale import
        • Tale – tale.rdf/etc
          • Can tales be run elsewhere?
          • What we have that is different is data handling and mounting
    • Q. What does DataONE publish format look like today?
      • Data package has 4? objects:
        • EML document: high-level overview of what’s in the package, used by Metacat UI
        • Datafile(s): mydata.txt
        • ORE/RDF: describes the relationships between files. Package doesn’t exist without this file.
        • Metadata document for every file
      • Discussion of DataONE API:
        • https://releases.dataone.org/online/api-documentation-v2.0/apis/index.html
    • Q. Why don’t we publish a zipfile to DataONE?
      • TT to follow-up – quality/stigma, etc.
    • Q. Publish to Zenodo via GitHub is always a zipfile?
      • Yes
    • Q. Do we need to be able to handle zipped repos?
      • Yes, if we want to be able to run Binders, Gigantums
  • Taking for granted that we publish a “tale.yml/rdf” file in the package to each repo
    • Primary purpose is for us to read it back in.
    • Q. Why go through the process of RDF-izing it if we’re the only ones using
      • Why deserialize RDF to put things into Girder in JSON?
      • Cost/value?
        • RDF is expensive to create – making sure that things are mapping
        • URI space + resolution – hosting a service to handle concepts
        • YAML is cheap and easy but non-standard
        • schema.org
        • DC is non-controversial / low hanging
          • Mapping “tale.yaml” into DC/Schema.org space is super easy
  • Provenance
    • Q. How is provenance stored in DataONE?
      • Always added to ORE document
        • RDF/XML
      • Q. Is provenance related to the file or all files in the dataset
        • Provenance defined between two objects in a dataset
      • https://www.dataone.org/best-practices/provenance
    • Q. In Dataverse?
      • Provenance file or text-free prov description?
        • http://guides.dataverse.org/en/latest/api/native-api.html?highlight=provenance#provenance
      • JSON format
        • http://guides.dataverse.org/en/latest/user/dataset-management.html#data-provenance
    • Q. Do we really need provenance information included in the “tale.yml”
    • Q. What does WT do about provenance info?
      • Use case: Diffing prov between runs
  • Gist:
    • Dublin Core/schema.org
    • Provenance
    • External data