Notes on Deposit from Applications

It may be useful to accept data files from outside applications for deposit in Dryad. The motivating use case is the Open Tree of Life web application, but it is easy to see how that case might generalize.

Requirements
An application here means just about anything:

 * a desktop application that wants to upload to Dryad
 * a web application on some other site, perhaps on an intranet
 * a publisher's data management system
 * another digital repository

While these are all similar in that data moves from a program that is not a web browser into Dryad, we can distinguish two main cases: assisted manual deposit to Dryad, and fully automated deposit.

Assisted manual deposit
A user running an outside application (web or installed) has prepared one or more data files that are associated with an article that is already published or will be published soon. Rather than have the user download the data files to their computer, start a Dryad submission, and then upload the files to Dryad, it would be useful for the outside application to be able to transfer the files directly to Dryad. Not only does this streamline file transfer and make it more reliable, but it also permits direct transmission of any metadata the application may have.

A user would select "upload to Dryad" from the application, and then proceed with an ordinary Dryad deposit, filling out the metadata forms and providing payment as usual. The only difference from deposit initiated from the Dryad web site would be that the Dryad deposit dialog would be pre-populated with data and metadata coming from the application.

Fully automated deposit
There are situations where taking the user to the Dryad site to complete the deposit is impossible or unnecessary. These include bulk transfers (many deposits, as in the repository-to-repository case) and situations where an application wants to take charge of all data and metadata entry for branding or workflow reasons (as when data is shepherded by a journal or laboratory data management system).

Questions

 * 1) What protocol to use: plain old HTTP POST or PUT, or SWORD v2? We will focus on SWORD, since Dryad already has a commitment to it; see SWORD Submission.
 * 2) How to package the payload: multiple HTTP requests, MIME multipart, zip, tgz? Use BagIt? We will use BagIt, for compatibility with other systems (including DataONE).
 * 3) How to communicate any metadata: atom+xml, bag-info.txt, some kind of Dublin Core Application Profile, OAI-ORE? We will use a Dublin Core Application Profile, with the same schema we use for export to TreeBASE.
 * 4) What to do about authentication? (See the Authentication section below.)

Protocol
SWORD has some features that make it attractive for this application:
 * It is designed for this purpose
 * It's well known among people who know about this kind of thing
 * There might be other repositories besides Dryad that this would work with (I have no evidence of this)
 * There is Java library support for it
 * If BagIt is used, the Packaging: header can declare that this is the case
 * The support for MIME multipart is good
 * The "deposit receipt" feature could be useful
 * The depositing application might be able to make use of the service document, which provides a list of collections (although, if I understand correctly, Dryad has only one collection)

The Dryad / TreeBASE handshake uses a simple HTTP POST, not SWORD, but this leads in effect to an ad hoc protocol that imposes a documentation and support burden and does not generalize well.

SWORD is very flexible, almost too flexible, so it would be nice to know how other people use it; unfortunately my web searches turned up very few uses. There is DSpace support, and perhaps this is the main application of it.
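As a rough sketch of what the headers for a SWORD v2 binary deposit of a zipped bag might look like (the Packaging URI for BagIt is an assumption, not a value from the SWORD profile, and the credentials are placeholders):

```python
import base64
import hashlib

def sword_deposit_headers(package_bytes, filename, username, password):
    """Build HTTP headers for a SWORD v2 binary deposit of a zipped bag.

    The Packaging header declares the payload format and Content-MD5 lets
    the server verify the transfer. The BagIt packaging URI below is an
    invented placeholder, not an official SWORD packaging identifier.
    """
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(package_bytes).hexdigest(),
        "Packaging": "http://purl.org/net/sword/package/BagIt",  # assumed URI
        "Authorization": f"Basic {credentials}",
    }
```

The body of the request would be the zipped bag itself; whether Dryad's SWORD endpoint accepts exactly this shape is something the SWORD Submission work would settle.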

The BagIt packaging format (see below) provides a sort of extension to the protocol. One can post a minimal package consisting of only the BagIt overhead files and a fetch.txt file that lists files to be retrieved by Dryad from URLs provided by the application. (Obviously this requires that the application have an HTTP server.) You would want to use this feature if the combined size of the data files was large, and if it were helpful to move the responsibility for sequencing the transfers from the application to Dryad. That is, instead of the application doing multiple PUTs or POSTs for the separate files (as suggested by the SWORD protocol), it would POST the BagIt package, and then Dryad would GET the files from the application.
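Each line of fetch.txt gives a URL, a length in bytes ("-" if unknown), and the path the file should occupy in the bag. A small sketch that builds such a file (the Open Tree URLs in the example are invented):

```python
def make_fetch_txt(entries):
    """Build the contents of a BagIt fetch.txt.

    entries is a list of (url, length, path) tuples; length may be None
    when the size is not known in advance, in which case '-' is written.
    """
    lines = []
    for url, length, path in entries:
        lines.append(f"{url} {length if length is not None else '-'} {path}")
    return "\n".join(lines) + "\n"

# Example with made-up Open Tree URLs:
fetch = make_fetch_txt([
    ("http://example.opentreeoflife.org/study/123/tree.nexml", 48213, "data/tree.nexml"),
    ("http://example.opentreeoflife.org/study/123/matrix.nex", None, "data/matrix.nex"),
])
```

A "holey" bag of this kind would contain little more than bagit.txt, the manifest, and this fetch.txt; Dryad would then GET the listed URLs to complete the bag.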

Using SWORD v2 from Dryad will probably have to wait until Dryad merges in a recent version of DSpace. This is work in progress.

See more details on SWORD Submission.

Packaging
Usually there will be more than one data file to transfer. Together the data files form a 'package'. For Open Tree there will be a NeXML file and possibly matrices and other files.

Metadata of various kinds can live either inside the package, as one of its files, or outside it, as a parallel data stream. SWORD accommodates either mode.

Packaging options include
 * Multiple HTTP requests - these could be POST, PUT, or even GET requests (initiated by Dryad). For precedent see arXiv.org SWORD/APP Deposit.  This option is supported by SWORD v2.
 * MIME multipart - SWORD supports this for atom metadata + package, but does SWORD support multipart for multiple media entry files? It also interacts awkwardly with compression: you'd have to use HTTP compression or base64 encoding
 * ZIP format - this has good library support in Java
 * gzip-compressed tar format (tgz) - I think this has good library support in Java

The likely choice here is ZIP, since it is mentioned in many of the examples in the SWORD documentation. I haven't checked the SWORD library, but I expect it to have good support for ZIP packaging.
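For illustration, assembling a set of in-memory files into a ZIP package is only a few lines with standard library support (the archive paths here are invented; a real package would follow whatever layout the bag requires):

```python
import io
import zipfile

def zip_package(files):
    """Package a dict of {archive_path: bytes} as a ZIP, returned as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, data in files.items():
            zf.writestr(path, data)
    return buf.getvalue()

# Example: one NeXML payload file under a bag-style data/ directory.
pkg = zip_package({"data/tree.nexml": b"<nexml/>"})
```

Java's java.util.zip offers the same operations, so the choice of ZIP does not constrain the implementation language.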

BagIt
BagIt is a convention where the package (whatever format it's in) is given a bit of structure and a few management files are added to it. Whether use of BagIt is a good thing is orthogonal to the issue of how multiple files are packaged for delivery. This is because BagIt doesn't specify any particular packaging format. If BagIt is used then some kind of packaging is necessary, since BagIt involves associating multiple overhead files with the data payload file(s). N.b. BagIt requires precomputation of hash values (md5 or sha1) for all content files, a sort of annoying overhead.
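That precomputation amounts to producing a manifest-md5.txt with one checksum line per payload file. A sketch, with file contents held in memory for simplicity (a real bag would stream files from disk):

```python
import hashlib

def make_manifest_md5(payload):
    """Build the contents of a BagIt manifest-md5.txt.

    payload maps bag-relative paths (under data/) to file contents.
    Each line pairs an MD5 hex digest with the path it covers.
    """
    lines = []
    for path, data in sorted(payload.items()):
        lines.append(f"{hashlib.md5(data).hexdigest()}  {path}")
    return "\n".join(lines) + "\n"
```

A manifest-sha1.txt would be identical in shape, substituting hashlib.sha1.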

Metadata format
There are two kinds of metadata: that associated with the mechanics of transmission (such as media types and hash values), and that associated with the content (such as the DOI of the associated article). Some applications, such as Open Tree, will have very little of the latter, probably only the DOI for the article and maybe a focal clade for the phylogenetic study (which is something Dryad wants to know about).

The likely choice here is a simple XML file implementing a Dublin Core Application Profile, something similar to Dryad's DC application profile. Dryad's application profile is designed for export, not import, and requires some information that may not be available to all applications. It could either be revised to make most fields optional, or we could create a second profile for import.
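To make the idea concrete, a minimal import record for the Open Tree case might carry little beyond the article DOI. The element names under the dcterms namespace below are illustrative guesses, not the actual fields of Dryad's application profile:

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/terms/"

def minimal_import_metadata(article_doi, title=None):
    """Build a minimal Dublin-Core-style metadata record for import.

    The choice of elements (dcterms:relation for the article DOI,
    dcterms:title) is illustrative; the real Dryad profile defines
    its own required and optional fields.
    """
    ET.register_namespace("dcterms", DC)
    root = ET.Element("metadata")
    rel = ET.SubElement(root, f"{{{DC}}}relation")
    rel.text = f"doi:{article_doi}"
    if title is not None:
        t = ET.SubElement(root, f"{{{DC}}}title")
        t.text = title
    return ET.tostring(root, encoding="unicode")
```

An import profile with mostly-optional fields would let a record this sparse validate, while the export profile could keep its stricter requirements.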

Here are some metadata format choices we considered and rejected:
 * Metadata is out of scope of SWORD, but SWORD (actually AtomPub) allows placement of arbitrary property/value metadata in an entry (or media entry) file, which is transmitted outside of the package as media type application/atom+xml
 * BagIt allows placement of property/value metadata in the bag-info.txt file, which is part of the package (for precedent see here)
 * METS from Library of Congress, related to MARC
 * OAI-ORE has a theory of metadata that is supported natively by DSpace

Authentication
The assisted manual deposit and fully automated deposit cases are very different. For a fully automated deposit the application will need to authenticate itself to Dryad. But manual deposit is different.

One might think that the application would need to authenticate itself to Dryad before being allowed to upload files. As precedent, use of the arXiv deposit API (which uses SWORD) requires that the user be authenticated for deposit API calls. But this need not be the case. Dryad could allow submission to a holding area, and provide a receipt that the user could use, after logging in to the Dryad site, to recover the materials from the holding area. This arrangement obviates the need for Dryad to trust the client, and the need for the client software to authenticate with Dryad. I think it should be fairly secure and easy to use.

The setup relies on the fact that there is a subsequent Dryad login step. For fully automatic deposits it wouldn't work. But that's not the use case at hand.
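A minimal sketch of the receipt idea: the unauthenticated deposit lands in a holding area keyed by an unguessable token, and the URL in the receipt is the only way to claim it. The URL scheme and in-memory holding area here are invented stand-ins for whatever Dryad would actually use:

```python
import secrets

HOLDING_AREA = {}  # token -> deposited package; stands in for real storage

def accept_unauthenticated_deposit(package_bytes):
    """Store an unauthenticated deposit and return a claim URL (the receipt).

    The token is unguessable, so only the depositor, who holds the URL,
    can later claim the files after logging in to Dryad. The URL format
    is invented for illustration.
    """
    token = secrets.token_urlsafe(32)
    HOLDING_AREA[token] = package_bytes
    return f"https://datadryad.example/holding/{token}"

def claim_deposit(token):
    """Remove and return a held deposit; called after the user logs in."""
    return HOLDING_AREA.pop(token)
```

Nothing in the holding area affects any user-visible state until the token is presented, which is the property the threat analysis below relies on.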

Threat analysis: One kind of threat would be that the application might alter the Dryad web site, or a user's account, in a way that should be disallowed. But if a deposit merely moves files to a holding space on the Dryad site, and provides the application with a Dryad URL for the content, then any state visible to any user would be affected only as a result of using the URL, which is voluntary. Of course using the URL would require that the user be authenticated to Dryad, but the application would not need to be. The URL would take the user to Dryad to log in and begin the Dryad phase of the deposit based on the uploaded files.

Another kind of threat is denial of service. One could imagine a rogue or malicious application uploading huge files, consuming Dryad's bandwidth and filling its disks. But this would be possible through an authenticated upload interface as well; and if someone wanted to mount a DoS attack, surely there are simpler ways.

A desktop application runs with the full authority of the user, so could if desired be given the ability to handle Dryad credentials. A web application such as Open Tree could in principle manage user Dryad credentials, but this should be avoided if at all possible for security reasons - a breach of Open Tree's security shouldn't put Dryad at risk. A web application could be given an application key if this was deemed necessary, but the case for requiring this is not clear.