Curator Report Technology

Methods for generating various reports that are needed for Dryad presentations and documents.

DSpace Curation Reports
Many of the reports needed by curators can be run through the DSpace curation tasks system.

Summary Reports
Distribution of bitstream formats across all items in the Data Files collection: /opt/dryad/bin/dspace curate -v -t profileformats -i 10255/2 -r - >~/temp/profileFormats.txt

Check that data files/packages have all of the required metadata: /opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/2 -r - >~/temp/requiredFileMetadata.txt /opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/3 -r - >~/temp/requiredFileMetadata.txt

Identifiers

 * datacitechecker
 * datacitesynchronizer

Data Files
Aggregate statistics about data files, including embargo status: /opt/dryad/bin/dspace curate -v -t filestats -i 10255/2 -r - >~/temp/fileStats.txt

Locate embargoed files that do not yet have a designated end date for the embargo:
 * embargoedwithoutpubdate

Complete statistics about individual data files. This includes: handle, fileDOI, journal, journalAllowsEmbargo, journalAllowsReview, fileSize, embargoType, embargoDate, downloads, views, dateAccessioned. /opt/dryad/bin/dspace curate -v -t datafilestats -i 10255/2 -r - >~/temp/dataFileStats.txt

Data Packages
Aggregate statistics about data packages, including counts of integrated submissions: /opt/dryad/bin/dspace curate -v -t packagestats -i 10255/3 -r - >~/temp/packageStats.txt

Simple/quick statistics about individual data packages. This includes: handle, doi, journal, dateAccessioned. /opt/dryad/bin/dspace curate -v -t datapackageinfo -i 10255/3 -r - >~/temp/packageInfo.txt

Complete statistics about individual data packages. This includes: handle, doi, journal, numKeywords, numKeywordsJournal, numberOfFiles, packageSize, embargoType, numberOfDownloads, manuscriptNum, numReadmes, wentThroughReview, dateAccessioned. /opt/dryad/bin/dspace curate -v -t datapackagestats -i 10255/3 -r - >~/temp/dataPackageStats.txt

Journals
Dates for the first archived item associated with each journal in the repository: /opt/dryad/bin/dspace curate -v -t journaldates -i 10255/3 -r - >~/temp/journalDates.txt

Number of data packages for each journal in the repository: /opt/dryad/bin/dspace curate -v -t datapackagesperjournal -i 10255/3 -r - >~/temp/journalPackages.txt

List of all journals associated with archived items: \copy (select distinct text_value from metadatavalue where metadata_field_id =77 and item_id IN (select item_id from item where in_archive=TRUE and item_id IN (select item_id from collection2item where collection_id=2))) to '/home/dryad/currentJournals.txt' with CSV;

List of all journals associated with items in publication blackout: \copy (select text_value from metadatavalue where metadata_field_id =77 and item_id in (select item_id from metadatavalue where text_value like '%Entered publication blackout%' and item_id not in (select item_id from metadatavalue where text_value like '%Approved for entry into archive%'))) to '/home/dryad/allJournalsBlackout.txt' with csv;

Graphing Dryad submissions over time
First, get all of the deposit dates. On the production server:

postgres-client.sh -c "select text_value from metadatavalue where metadata_field_id=12 and item_id in (select item_id from item where owning_collection=2 and in_archive='t');" > dryadSubmitDates.txt

(collection 2 is "Data Packages" and metadata field 12 is "Date Available")

Edit the file to remove the timestamps (leave only the date portion).

Import the dryadSubmitDates.txt into Excel.

Sort the dates.

Create a column beside the dates that starts at 1 and counts up (this represents the total number of submissions present on each day).

Create a graph of these two columns.

Items in publication blackout
Items in blackout are owned by "Dryad Queue". The queue's ePerson ID is 949.

select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949);

To get metadata about items in the blackout, you must select the metadata directly from the database, one field at a time. Below is a query to select metadata field 17 (DOI). When performing these types of queries, always order by item_id to keep the ordering of the metadata values consistent across queries. select text_value from metadatavalue where metadata_field_id=17 and item_id IN (select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949)) ORDER BY item_id;

Submissions from a particular instituiton
This query will identify archived data packages that are associated with a particular email suffix. Note that this undercounts the number of deposits from an institution, since around 25% of Dryad user create accounts with a generic email address (e.g., GMail). select item_id from item where submitter_id IN (select eperson_id from eperson where email like '%TARGET_INSTITUTION_EMAIL_SUFFIX') and in_archive='t' and item_id IN (select item_id from collection2item where collection_id=2) order by last_modified;