Data Essay Criteria – Nineteenth-Century Data Collective

(~1500 words not counting bibliography)

Dataset Title

Brief Project Description

Briefly describe your data and the research question that led you to create this dataset. This section should explain the value of this dataset, particularly its relevance to nineteenth-century studies (or another field). This section pertains to the data itself and possible or plausible analysis, not necessarily an analysis you have already undertaken. Cite sources to characterize the scholarly landscape (digital or traditional) that could benefit from this dataset.

How is the data relevant to nineteenth-century scholarship? Who might it be useful for? What could it be used for? Please suggest at least three specific uses.

For what purpose did you create the dataset? Was there a gap that needed to be filled? Has the data been used already? Do similar or overlapping data exist publicly? If so, please describe.

Creator(s) Names, Institutions, and Contact Information

Funder(s)

Date of Creation & Date(s) of updates

Language(s)

Collection & Creation Methodology (a summary of your data standards and capture procedures)

This section should describe the choices that structured the creation of the dataset, explain any categorical variables (if you have them), and discuss the labor and technology that went into data collection and creation. Please avoid passive voice.

How did you acquire or create the data? If you acquired it, were there licenses and MOUs, institutional subscriptions or purchase agreements? Who paid for it or facilitated the transactions? If you created it, how? What mechanisms or procedures did you use to collect it (e.g. hardware apparatus, human curation, software, API)?

If the data was hand-curated, what organizational heuristic was adopted, and why? What aspects of the data are products of the researcher’s judgment or interpretation, and which aspects were inherited? What are the implications of these decisions?

Who was involved in the data collection process (e.g. students, crowdworkers, contractors) and how were they compensated? How long did it take to collect the data?

Did you hand-clean the data (e.g. removal of instances, processing of missing values) or use OpenRefine or another tool? Do you have a saved copy of the “raw” data in addition to the cleaned (e.g. to support unanticipated future uses)?

Provide sufficient detail such that readers understand how the dataset was created and would within reason be able to recreate it.

Data Structure: if you have multiple files, describe the relationships between them

This section should explain the parameters and categories of your dataset (what are we looking at, and how much of it?).

What does the data describe? Are all instances included or a selection? If selected, what principles were used to justify inclusions and exclusions?

Is any information missing? If so, please provide a description, explaining why this information is missing (e.g. because it was unavailable). Are there any errors, sources of noise, or redundancies? If so, please describe.
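One way to ground the answers to these questions is to count empty cells and repeated rows before writing the description. The following is a minimal sketch, assuming the data is a CSV file; the sample rows and the column names (title, author, year) are hypothetical and stand in for a real dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real dataset file; the column
# names and values are illustrative assumptions only.
SAMPLE_CSV = """title,author,year
Middlemarch,George Eliot,1871
Villette,,1853
Middlemarch,George Eliot,1871
Cranford,Elizabeth Gaskell,
"""


def summarize_gaps(csv_text):
    """Count empty cells per column and exact duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Tally cells that are absent or contain only whitespace.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1

    # A row counts as a duplicate each time it repeats an earlier row exactly.
    row_counts = Counter(tuple(row.items()) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values() if count > 1)

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicate_rows": duplicates,
    }


summary = summarize_gaps(SAMPLE_CSV)
print(summary)
```

A summary like this only detects empty cells and exact repeats. Gaps that stem from the source material itself (for example, records that were never catalogued) still need to be explained in prose.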

What is the file type and size of the data?
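File type and size can be reported directly from the file rather than estimated. The sketch below is one possible approach using only the Python standard library; the function name describe_file and the throwaway demo.csv file are assumptions made for illustration.

```python
import os
import tempfile


def describe_file(path):
    """Return the extension, size in bytes, and line count of a file."""
    extension = os.path.splitext(path)[1].lower() or "(none)"
    size_bytes = os.path.getsize(path)
    # Read in binary mode so the count does not depend on text encoding.
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    return {"extension": extension, "size_bytes": size_bytes, "lines": line_count}


# Demonstration on a throwaway file; replace demo_path with a real data file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as handle:
        handle.write("title,year\nVillette,1853\n")
    report = describe_file(demo_path)

print(report)
```

For a CSV file, the line count includes the header row, so the number of records is typically one less than the reported value.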

Describe any variable or non-standard features of your data: If your dataset uses categorical variables or other labels or fields that you have created, explain how they were constructed. Should the user be aware of any categories or fields that condense or erase information?

Ethics

Were there any possible negative impacts or harms that resulted from collecting or curating this data?

What possible negative impacts or harms might result from the publication of your data?

Does the dataset contain data that might be considered confidential (e.g. data that includes the content of individuals’ non-public communications)? If so, please describe.

Does the dataset contain data that might be considered sensitive (e.g. data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please describe.

Were any ethical review processes conducted (e.g. by an institutional review board)? If so, please describe these review processes, including the outcomes, as well as a link or other access point to supporting documentation.

Can you anticipate any way that this data could be misused?

Format

Though the Collective aims to maximize interoperability, we recognize that data is messy. Please describe whether your format requires altering any existing standards. Post45 requires that data oriented around book titles use columns that match those used by HathiTrust; 19thC prefers this as well, but we are open to reasons why this cannot be the case. If it is a new category of data, the Collective will work with submissions (and perhaps across Collectives) toward creating exemplary standards.
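Checking a dataset's columns against a reference list can make any departures from an existing standard explicit. The following is a rough sketch only: the names in EXPECTED_COLUMNS are hypothetical placeholders, not the actual HathiTrust field names, and should be replaced with the column list of whichever standard applies.

```python
# Hypothetical placeholder names; substitute the real column list of the
# standard you are matching. These are NOT the actual HathiTrust fields.
EXPECTED_COLUMNS = ["record_id", "title", "author", "pub_date"]


def compare_columns(actual, expected):
    """Report which expected columns are absent and which columns are extra."""
    actual_set, expected_set = set(actual), set(expected)
    return {
        "missing": sorted(expected_set - actual_set),
        "extra": sorted(actual_set - expected_set),
    }


# Hypothetical header row from a submitted dataset.
dataset_columns = ["title", "author", "pub_date", "genre_label"]
difference = compare_columns(dataset_columns, EXPECTED_COLUMNS)
print(difference)
```

Each entry under "missing" or "extra" is a candidate for a sentence in this section explaining why the dataset departs from the standard.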

For further help, consider the following resources:

Format your data from the UK Data Service
Sustainability of Digital Formats from the Library of Congress

Statement of Collaboration

Is there documentation (on Github or elsewhere) of the collaborative labor that went into making (and/or maintaining) this dataset? Briefly describe and provide links.

Versioning

Will the data be updated (e.g. to correct errors, add new instances, delete instances)? If so, please describe how often and by whom.

Bibliography

Provide a list of sources consulted or drawn from to produce the dataset.

Licensing & Rights: Intellectual Property or Licensing rights for the data

If applicable, the data must be deposited under an open license that permits unrestricted access (e.g. CC0, CC-BY).

Data Citation

Is there any reason why the citation should not conform to the 19thC Data Collective's standard citation practice?

The language for these criteria was drawn from the Post45 website, which cites Katherine Bode, Jennifer Doty, Lauren F. Klein, Melanie Walsh, Cultural Analytics, Journal of Open Humanities Data, and “Datasheets for Datasets” by Timnit Gebru et al. Additional language and criteria come from the Center for Digital Humanities at Princeton University, particularly the criteria devised by Grant Wythoff: https://cdh.princeton.edu/research/data-curation/.