Journal Metadata Processing Technology

Overview
The Dryad journal-submit module allows the import of metadata that integrated journals email to Dryad. Each email is parsed for article publication metadata, and that metadata is used to populate the form that the submitter sees when he or she comes to Dryad and enters an accepted manuscript number.

Workflow

 * 1) The journal sends an email with article metadata to a NESCent email address and to the article author. The email follows the format specified in Journal Metadata.
 * 2) The email is addressed to journal-submit@datadryad.org. GoDaddy is configured to forward this email to journal-submit@nescent.org.
 * 3) journal-submit@nescent.org is configured to route the email to the dryad.journal.submit Gmail address and also to pipe the content to a `curl` command that POSTs it to the journal-submit webapp on the production server. Likewise, email sent to journal-submit-dev is piped to a script (journalWebAppDev.sh), which pipes the content to a `curl` command that POSTs it to the journal-submit webapp running on dev.datadryad.org.
 * 4) The journal-submit webapp reads the byte stream with the email content and detects the journal's name.
 * 5) The webapp then looks up the journal name in the journal-submit configuration file, DryadJournalSubmission.properties, to learn which `parsingScheme` to use. The value of this parameter is matched against the parsing classes in the journal-submit codebase.
 * 6) When a particular parser is determined to be the correct one to use, the webapp uses that to parse the remainder of the email, putting metadata values into a ParsingResult object
 * 7) The ParsingResult class is then used to export the metadata into an XML file; the location of this file is determined by the `metadataDir` value from the DryadJournalSubmission.properties configuration file.
 * 8) Once this file is written (using the manuscript number as the name of the file), the journal-submit webapp is done with its process and control moves to the standard DSpace/Dryad submission process, which uses the written XML file to populate the submission form when the author comes to Dryad to complete the submission process.
 * 9) Later, a submitter will use the Submission System to retrieve the parsed metadata and initiate a new data submission.
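
Steps 4 through 8 can be sketched roughly as follows. This is a Python illustration of the flow only, not the module's Java code. The `Tag: value` line format, the `Journal Name` and `MS Reference Number` tags, the `<submission>` root element, and the `.xml` file extension are all assumptions made for the example; the real format is the one specified in Journal Metadata.

```python
from pathlib import Path
from xml.sax.saxutils import escape


def detect_journal(lines):
    """Step 4: find the journal name (assumes a 'Journal Name:' line)."""
    for line in lines:
        if line.startswith("Journal Name:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no journal name found in email")


def parse_tagged_lines(lines):
    """Hypothetical parser: every 'Tag: value' line becomes one field."""
    fields = {}
    for line in lines:
        tag, sep, value = line.partition(":")
        if sep and tag.strip():
            fields[tag.strip()] = value.strip()
    return fields


# Step 5: parsingScheme value -> parser (the real module matches Java classes).
PARSERS = {"amNat": parse_tagged_lines}


def process_email(text, journals, parsers=PARSERS):
    lines = text.splitlines()
    entry = journals[detect_journal(lines)]          # config lookup
    fields = parsers[entry["parsingScheme"]](lines)  # step 6: parse
    body = "".join(
        "<{0}>{1}</{0}>".format(tag.replace(" ", "_"), escape(value))
        for tag, value in fields.items()
    )
    # Steps 7 and 8: write the XML into metadataDir, named by manuscript number.
    out = Path(entry["metadataDir"]) / (fields["MS Reference Number"] + ".xml")
    out.write_text("<submission>" + body + "</submission>", encoding="utf-8")
    return out
```

The `journals` argument stands in for the entries of DryadJournalSubmission.properties, keyed here by the journal's full name.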

Configuration
Below is a sample configuration from the journal-submit webapp's configuration file, DryadJournalSubmission.properties. Each journal handled by the submission system needs an entry in this file, even if it is not an integrated journal (i.e., it doesn't send Dryad the article metadata by email).

# all the journals configured in this file (using their parsing scheme code)
journal.order=amNat, BJLS, bmcEvoBio, bmjOpen, ecoApp, ecoMono, ecology, evolution, EvolApp, ecoFrontiers, intCompBio, EvolBiol, heredity, jhered, jpaleo, mbe, MolEcol, MolEcolRes, mpe, paleobio, sysBio

# American Naturalist
journal.amNat.fullname = The American Naturalist
# directory in which the resulting XML metadata file is stored
journal.amNat.metadataDir = /opt/dryad/submission/journalMetadata/amNat
# the parsing scheme (used to match against parsing class)
journal.amNat.parsingScheme = amNat
# whether we can receive article metadata emails from the journal
journal.amNat.integrated=true
# who to notify when a submission is reviewed
journal.amNat.notifyOnReview=ryantestAmNatReview@scherle.org
# who to notify when a submission is archived
journal.amNat.notifyOnArchive=ryantestAmNat@scherle.org
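
To make the key layout concrete, here is a minimal Python sketch that groups `journal.<code>.<setting>` keys by journal code. It is not how the webapp reads its configuration (that is Java), and it only handles simple `key = value` lines and `#` comments, not the full Java properties syntax. The function name `load_journals` is hypothetical.

```python
SAMPLE = """\
# trimmed copy of the sample configuration
journal.order=amNat, BJLS
journal.amNat.fullname = The American Naturalist
journal.amNat.metadataDir = /opt/dryad/submission/journalMetadata/amNat
journal.amNat.parsingScheme = amNat
journal.amNat.integrated=true
"""


def load_journals(text):
    """Return (order, journals) where journals maps code -> {setting: value}."""
    order = []
    journals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "journal.order":
            order = [code.strip() for code in value.split(",")]
            continue
        parts = key.split(".")
        if len(parts) == 3 and parts[0] == "journal":
            journals.setdefault(parts[1], {})[parts[2]] = value
    return order, journals


order, journals = load_journals(SAMPLE)
# Codes listed in journal.order that have no entry of their own.
missing = [code for code in order if code not in journals]
```

With the trimmed sample, `missing` contains `BJLS`, which is the kind of gap the "each journal needs an entry" rule is meant to prevent.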

The location of the DryadJournalSubmission.properties file is also configurable via Maven profiles. The default location should be ${DRYAD_HOME}/config/DryadJournalSubmission.properties; if a different location is desired, change the following value in the Maven profile used to build the project:

# the location of the configuration file for the journal-submit webapp
/opt/dryad/config/DryadJournalSubmission.properties

One might want to do this to use a different set of notification email addresses, so that non-project staff don't receive emails from the development instance of Dryad.

Testing
The `curl` command-line tool can be used to test the journal-submit module on a development machine.

curl --data-binary @message.test http://localhost:9999/journal-submit

Use the `--data-binary` parameter to send the data in binary form, and pass a file name with the @ symbol. In the example, message.test refers to a file on the local file system; the path should be relative to the directory from which the command is run, or be an absolute file system path. The module will output either the XML that it generates or a stack trace indicating the problem it found while parsing the submitted data.
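
For scripted tests, the same POST can be sketched in Python with the standard library. The URL is the one from the `curl` example; the function names `build_request` and `post_message` are hypothetical. The explicit Content-Type mirrors what `curl` sends by default with `--data-binary`.

```python
import urllib.request

DEFAULT_URL = "http://localhost:9999/journal-submit"


def build_request(message_bytes, url=DEFAULT_URL):
    """Build a POST whose body is the raw bytes of the test email."""
    return urllib.request.Request(
        url,
        data=message_bytes,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def post_message(path, url=DEFAULT_URL):
    """Send the file at `path` unmodified and return the response text."""
    with open(path, "rb") as handle:  # binary mode keeps line breaks intact
        request = build_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `post_message("message.test")` needs a running webapp. Note that `urlopen` raises `urllib.error.HTTPError` for error status codes, so a failing parse may surface as an exception rather than as returned text.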

A good place to start debugging: the journal-submit webapp was originally written to concatenate textual values into an XML document, rather than to use an XML-aware library. More recently, an XML library was added to check that the concatenated string is well-formed XML. If problems appear in the future, this check is a good place to start looking, since it imposes restrictions that individual parsers may or may not have handled correctly. The check was added to the EmailParser class, an abstract class that all parsers should extend.
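
The kind of failure this check catches can be reproduced in a few lines. The sketch below uses Python's standard library rather than the module's Java XML library, and the `<title>` element is only an example: a value containing `&` is concatenated unescaped, fails the well-formedness check, and passes once escaped.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


title = "Birds & bees"

# Plain concatenation leaves the bare '&' in place, which is not valid XML.
naive = "<title>" + title + "</title>"

# Escaping the value first turns '&' into '&amp;'.
safe = "<title>" + escape(title) + "</title>"

naive_ok = is_well_formed(naive)
safe_ok = is_well_formed(safe)
```

Characters such as `&` and `<` in titles, abstracts, or author names are the usual suspects when a parser that worked on earlier emails starts failing this check.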

Relation to DSpace
The journal-submit webapp is a separate DSpace module, but it is related to the standard DSpace submission process (which Dryad has heavily modified).

Email Parsers
The following is a list of the current set of email parsers.


 * EmailParserForAmNat - This parser and the one for ManuscriptCentral have a similar structure but different email fields. AmNat has a larger set of field tags. The tags map directly to XML element names.
   * Authors' names are: first last, degree/title (e.g. Dr., Prof.). Names are separated by semicolons.
   * Classification terms are: Major: minor. Terms are separated by semicolons.
 * EmailParserForBmcEvoBio - This parser is similar to the ManuscriptCentral parser. It differs in that it ignores line breaks in the abstract field (line breaks may be used to separate sections in the abstract). It also accepts author lists separated by commas with the final author joined by 'and', rather than separated by semicolons.
   * Authors' names are: first last. Names are separated by commas, final author joined by 'and' (no comma).
   * Keywords (Classification terms) are: Major: minor. Terms are separated by line breaks.
 * EmailParserForEcoApp - Restructured parser that attempts to better separate the parsing and XML generation stages. Email tags are mapped to Java classes (in the xml child package), which are subclasses of the XOM Element class. XOM is not used for the final output: after each element is constructed, it is serialized and appended to a field in the returned ParsingResult object.
   * Authors' names are: first last. Names are separated by commas, final author joined by 'and' (with preceding comma).
   * How Classification terms are parsed is currently unknown (they do not appear in the example messages).
 * EmailParserForManuscriptCentral - Similar to AmNat, but with a smaller set of tags.
   * Authors' names are: last, first. Names are separated by semicolons.
   * Keywords (Classification terms) are comma-separated.
   * Also handles fields specific to Genomic Resources Notes Technology (e.g. MS Citation Title, MS Citation Authors).
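
The author-list conventions above differ only in their separators and name order. The following Python sketch illustrates the splitting rules as described in the list; it is not taken from the Java parsers, and the function names and sample names are made up.

```python
import re


def split_semicolons(text):
    """AmNat and ManuscriptCentral style: names separated by semicolons."""
    return [name.strip() for name in text.split(";") if name.strip()]


def split_commas_and(text):
    """BmcEvoBio and EcoApp style: commas, final author joined by 'and'.

    Handles both 'A, B and C' and 'A, B, and C'.
    """
    text = re.sub(r",?\s+and\s+", ", ", text)
    return [name.strip() for name in text.split(",") if name.strip()]


def last_first_to_first_last(name):
    """ManuscriptCentral style: turn 'last, first' into 'first last'."""
    last, _, first = name.partition(",")
    return (first.strip() + " " + last.strip()).strip()


amnat = split_semicolons("Jane Doe, Dr.; John Smith, Prof.")
bmc = split_commas_and("Jane Doe, John Smith and Ann Lee")
ecoapp = split_commas_and("Jane Doe, John Smith, and Ann Lee")
mscentral = [last_first_to_first_last(n) for n in split_semicolons("Doe, Jane; Smith, John")]
```

Note that AmNat entries keep their trailing degree/title after the split, so a caller would still need to separate the title from the name.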

Managing Content in the Review Workflow
The "in review" workflow stage is a holding ground for data submissions associated with manuscripts in review.

To approve or reject items that are in the review stage, the following command can be used:

./{dspace.dir}/bin/dspace review-item {-i workflow_id|-m manuscript_number} -a {true|false}

This command requires two parameters. The first parameter indicates the item, and can take one of two forms:
 * -i the id of the workflow item (the workflow_item_id, not the item_id)
 * -m the manuscript number associated with the item

The second parameter indicates the status of the item:
 * -a whether or not the item has been approved
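
When the command is driven from a script, the argument rules above can be captured in a small helper. This is a hypothetical Python sketch: the function name is made up, and the DSpace directory is passed in by the caller because the `{dspace.dir}` placeholder depends on the installation.

```python
def review_item_args(dspace_dir, approved, workflow_id=None, manuscript_number=None):
    """Build the argument list for the review-item command.

    Exactly one of workflow_id (-i) or manuscript_number (-m) must be given.
    """
    if (workflow_id is None) == (manuscript_number is None):
        raise ValueError("give exactly one of workflow_id or manuscript_number")
    args = [dspace_dir + "/bin/dspace", "review-item"]
    if workflow_id is not None:
        args += ["-i", str(workflow_id)]
    else:
        args += ["-m", manuscript_number]
    args += ["-a", "true" if approved else "false"]
    return args
```

The returned list can be passed to `subprocess.run` on a machine where DSpace is installed; for example, `review_item_args("/opt/dryad", False, workflow_id=42)` builds a rejection for workflow item 42, with `/opt/dryad` as an assumed install path.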