Introducing DBL Metadata 2.0

Synopsis

This article provides a summary of recent changes to the metadata format for entries archived in the DBL.

Background

The metadata format for DBL began as a description of bundles which were originally text only, and uploaded via Paratext only. The metadata was a structured dump of information pertinent to Paratext projects.

With the addition of the audio entry type, the metadata was elevated from a Paratext artifact to an independent structure. Audio entries were assumed to be closely related to an existing text entry, so the audio metadata represented a subset of what was available in the text metadata, with some additions specific to audio data.

As DBL evolved, it became apparent that it would need to accommodate more media types than text and audio. This created the need to unify the text and audio metadata formats, and to make the unified format as generic as possible in order to minimize disruption when introducing new entry types.

At the same time, it became obvious that there were problems with the description of bundle contents. The actual data represented by the metadata followed an implied structure created as a side effect of exporting the data from Paratext. For example, different formats of scripture (Protestant/Catholic/Orthodox) were held in numbered directories, with files named by convention, one per book of the Bible.

The implied structure doesn’t suit non-text entry types, and even for text entry types it doesn’t support all valid forms of publishable material (e.g., a study Bible with exegetical material). Forms of publishable material have in the past been called ‘canons’, but because they can be both more and less than commonly accepted canons, they are referred to in this document (and in the metadata 2.0 schema) as ‘publications’.

Metadata 2.0

The main thrust of the changes made in the metadata 2.0 format is to address the two main issues described above:

Providing a generic schema for all entry types, and
Providing a full and precise definition of publishable forms.

No implied relationship

Entries will now carry all of their metadata with them. Audio (video, etc.) entries will not be required to have a relationship to an existing text entry in order to be archived. This is achieved by giving every entry type its own <language>, <country>, and <identification> element groups. <relationship> elements will still be supported, but for informational purposes only.

Full description of bundle contents

Metadata 2.0 now contains a manifest of all of the data files that comprise the bundle. The manifest is a description of the data payload only, and implies nothing about publications.

An example of the manifest

<manifest>
  <container uri="release">
    <container uri="USX_1">
      <resource uri="GEN.usx" size="311684" 
        checksum="f4c99d742eb544112d6198d72a0455c0" mimeType="application/xml"/>
    </container>
  </container>
</manifest>

In this example we see the description of a single file (GEN.usx) held within the folders release/USX_1/. The file comes with its MIME type, its size in bytes, and an MD5 checksum. This information allows consumers of the bundles to verify that they have the complete and correct set of data that they are concerned with. If a download is incomplete or the data is corrupted, the received data can be compared with the manifest to determine what is missing or damaged.
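The check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of DBL: the function name check_resource and its return strings are hypothetical, and it assumes the size attribute is a byte count and the checksum attribute is a hex-encoded MD5 digest, as stated above.

```python
import hashlib
import os
import xml.etree.ElementTree as ET


def check_resource(resource, path):
    """Compare the file at `path` with one manifest <resource> element.

    Hypothetical helper: returns "ok", "missing", "size mismatch"
    or "checksum mismatch".
    """
    if not os.path.isfile(path):
        return "missing"
    # The size comparison is cheap, so do it before hashing the file.
    if os.path.getsize(path) != int(resource.get("size")):
        return "size mismatch"
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Hash in chunks so large media files are not read into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    if md5.hexdigest() != resource.get("checksum").lower():
        return "checksum mismatch"
    return "ok"


resource = ET.fromstring(
    '<resource uri="GEN.usx" size="311684" '
    'checksum="f4c99d742eb544112d6198d72a0455c0" mimeType="application/xml"/>'
)
# The local path is an assumption about where the bundle was unpacked.
print(check_resource(resource, os.path.join("release", "USX_1", "GEN.usx")))
```

A consumer would run such a check for every resource in the manifest and download again any file that does not come back as "ok".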

The <container> elements are a convenience. The same manifest presented above could be stated as:

<manifest>
  <resource uri="release/USX_1/GEN.usx" size="311684" 
    checksum="f4c99d742eb544112d6198d72a0455c0" mimeType="application/xml"/>
</manifest>
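
To illustrate that the two forms are equivalent, the following Python sketch flattens a manifest into full resource paths. It assumes that nested container uri values are joined with ‘/’, as the example above implies. The function name resource_paths is hypothetical.

```python
import xml.etree.ElementTree as ET


def resource_paths(manifest_xml):
    """Return the sorted full path of every <resource> in a <manifest>."""
    def walk(element, prefix):
        for child in element:
            if child.tag == "container":
                # Each container contributes one path segment to its contents.
                yield from walk(child, prefix + child.get("uri") + "/")
            elif child.tag == "resource":
                yield prefix + child.get("uri")
    return sorted(walk(ET.fromstring(manifest_xml), ""))


NESTED = """
<manifest>
  <container uri="release">
    <container uri="USX_1">
      <resource uri="GEN.usx" size="311684"
        checksum="f4c99d742eb544112d6198d72a0455c0" mimeType="application/xml"/>
    </container>
  </container>
</manifest>
"""

FLAT = """
<manifest>
  <resource uri="release/USX_1/GEN.usx" size="311684"
    checksum="f4c99d742eb544112d6198d72a0455c0" mimeType="application/xml"/>
</manifest>
"""

# Both forms resolve to the same list of paths.
print(resource_paths(NESTED) == resource_paths(FLAT))
```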

Full description of publications

Publications are more fully described in Metadata 2.0 as well. An example publication might look like this:

<names>
  <name id="book-mat">
    <short>Matthew</short>
    <long>The Gospel According to Matthew</long>
    <abbr>Mat</abbr>
  </name>
  <name id="book-mrk">
    <short>Mark</short>
    <long>The Gospel According to Mark</long>
    <abbr>Mrk</abbr>
  </name>
  <name id="book-luk">
    <short>Luke</short>
    <long>The Gospel According to Luke</long>
    <abbr>Luk</abbr>
  </name>
  <name id="book-jhn">
    <short>John</short>
    <long>The Gospel According to John</long>
    <abbr>Jhn</abbr>
  </name>
</names>
<publications>
  <publication id="1">
    <name>Gospels</name>
    <nameLocal>Gospels</nameLocal>
    <abbreviation></abbreviation>
    <abbreviationLocal></abbreviationLocal>
    <description>The four books of the Gospels only</description>
    <descriptionLocal></descriptionLocal>
    <canonicalContent>
      <book code="MAT"/>
      <book code="MRK"/>
      <book code="LUK"/>
      <book code="JHN"/>
    </canonicalContent>
    <structure>
      <content name="book-mat" src="release/USX_1/MAT.usx"/>
      <content name="book-mrk" src="release/USX_1/MRK.usx"/>
      <content name="book-luk" src="release/USX_1/LUK.usx"/>
      <content name="book-jhn" src="release/USX_1/JHN.usx"/>
    </structure>
  </publication>
</publications>

The <publications> element works in tandem with the <names> element to describe the publication and structure.

A <publication> has descriptive elements (‘name’, and so on) plus <canonicalContent> and <structure> elements. The canonicalContent element is an unordered set that lists the books of scripture included in the publication, using the three-letter code convention. The structure element, on the other hand, is an ordered set of ‘content’ elements. Each one identifies the file (listed in the manifest) that holds the data of the book via its ‘src’ attribute, and points to a <name> element via its ‘name’ attribute.
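
As a sketch of how a consumer might combine the two elements, the Python below resolves the ordered <structure> of a publication into (short name, file) pairs. The function name publication_contents and the <metadata> wrapper element in the sample are hypothetical and only serve to make the snippet self-contained. The element and attribute names come from the example above.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <metadata> wrapper is a placeholder root element.
SAMPLE = """
<metadata>
  <names>
    <name id="book-mat"><short>Matthew</short></name>
    <name id="book-mrk"><short>Mark</short></name>
  </names>
  <publications>
    <publication id="1">
      <name>Gospels</name>
      <structure>
        <content name="book-mat" src="release/USX_1/MAT.usx"/>
        <content name="book-mrk" src="release/USX_1/MRK.usx"/>
      </structure>
    </publication>
  </publications>
</metadata>
"""


def publication_contents(root, publication_id):
    """Return [(short name, src), ...] in the order given by <structure>."""
    # Index only the <name> elements inside <names>; a <publication> also has
    # a <name> child, which must not be confused with these.
    short_names = {
        name.get("id"): name.findtext("short")
        for names in root.iter("names")
        for name in names.findall("name")
    }
    for publication in root.iter("publication"):
        if publication.get("id") == publication_id:
            return [
                (short_names.get(content.get("name")), content.get("src"))
                for content in publication.findall("structure/content")
            ]
    raise KeyError(publication_id)


for short_name, src in publication_contents(ET.fromstring(SAMPLE), "1"):
    print(short_name, src)
```

Reading file locations from ‘src’ rather than deriving them from the publication id keeps such code working when a bundle does not follow the historical folder layout.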

An astute reader will notice that the ‘src’ attributes point to files laid out according to the historical convention (publication ‘1’ has its data files in folder ‘USX_1’). We intend to follow this convention for the sake of our publishing partners, but consumers should be aware that the schema allows for file layouts that do not follow the convention. Consumers are encouraged to modify their workflows accordingly as soon as possible.

You may also realize that the structure element can now also describe content that does not correspond to a canonical book.

Support for Legacy Formats

We acknowledge that the changes introduced with the metadata 2.0 schema may cause some disruption in downstream workflows. DBL therefore provides support for converting metadata 2.0 documents to the preceding versions for both text and audio entry types.

This support is introduced as a generic mechanism, but in practice can only be used to convert metadata 2.0 documents to text/1.5 or audio/1.2. It can also migrate entry metadata in text/1.5 or audio/1.2 to metadata 2.0 format.

Through the DBL API, a call to GET /api/entries/<entry_id>/revision/latest/metadata.xml returns the metadata.xml document for the entry specified by <entry_id>. The undocumented modifier typeversion=1.5 transforms the existing metadata.xml to a schema-compliant 1.5 form.
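
The request URL could be assembled as in the Python sketch below. Several details are assumptions rather than documented behaviour: that the API is served from the same host as the schema URL given at the end of this article, that the typeversion modifier is passed as a query-string parameter, and the function name metadata_url itself. Any authentication the API requires is not covered here.

```python
from urllib.parse import quote, urlencode

# Assumption: the same host that serves the published schemas.
BASE_URL = "http://api.thedigitalbiblelibrary.org"


def metadata_url(entry_id, typeversion=None, base_url=BASE_URL):
    """Build the URL of the latest metadata.xml for an entry.

    `typeversion` (for example "1.5") is assumed to be a query parameter.
    """
    url = "{}/api/entries/{}/revision/latest/metadata.xml".format(
        base_url, quote(entry_id, safe="")
    )
    if typeversion is not None:
        url += "?" + urlencode({"typeversion": typeversion})
    return url


# "abc123" is a placeholder, not a real entry id.
print(metadata_url("abc123", typeversion="1.5"))
```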

Caveat: The 2.0 metadata is more feature-rich than previous forms. Metadata converted from previous schemas may lack information that 2.0 requires, which will cause it to fail strict validation (all elements will be included, but some required values will fail). Further, a round trip of a 2.0 document to (e.g.) a text/1.5 schema and back again is a lossy operation.

It is strongly recommended that downstream partners upgrade their workflows as soon as possible to allow for metadata 2.0. Where this is not convenient, the auto-conversion feature offers an interim solution, but DBL will not continue to provide this feature indefinitely.

Conclusion

In this article we have described the main changes to the metadata schema, especially from the perspective of consumers of bundle data.

Schemas for validation of metadata are available (in RELAX NG format) at the canonical URL http://api.thedigitalbiblelibrary.org/static/text/2.0/metadata.rng (‘text’ here could be ‘audio’, etc.).

The documentation for DBL bundles will also be updated to reflect these changes (pending).