Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2019.12.11 edition

Death row, environmental treaties, spelling fixes, book ratings, and yearbook photos.

Death row. “Forty-three years after the Supreme Court reversed course and reinstated the death penalty, reliable data on the individuals sent to death row is maddeningly difficult to obtain.” So reporters at The Intercept “set out to compile a comprehensive dataset on everyone sentenced to die in active death penalty jurisdictions since 1976.” The resulting database contains more than 7,300 entries; for each person, it records demographics, sentencing information, whether the person is still on death row, whether they’ve been exonerated, and more. Previously: Death sentences (DIP 2018.08.01), executions (DIP 2019.05.15), and last words (DIP 2019.03.06).

Environmental treaties. The International Environmental Agreements Database Project describes more than 3,700 “international environmental treaties, conventions, and other agreements with links to text, membership, performance data, secretariat, and summary statistics.” The database, hosted at the University of Oregon, includes agreements from the 1850s to the present and can be queried online. It also includes detailed pages for each treaty, such as this one for the UN’s Paris Agreement. [h/t Erik Gahner Larsen]

Spelling fixes. The GitHub Typo Corpus contains structured data on misspellings, bad grammar, and the ways they’ve been corrected. To build the dataset, Masato Hagiwara and Masato Mita analyzed the “commits” — sets of changes to files, typically accompanied by short summaries — made to tens of thousands of projects on the code-sharing platform GitHub. With “more than 350k edits and 65M characters in more than 15 languages,” the authors say it’s “the largest dataset of misspellings to date.” [h/t u/Loves_Portisheads]

Book ratings. In 2004, computer scientist Cai-Nicolas Ziegler scraped (with permission) 433,000 numeric ratings of 186,000 books by 78,000 users on the book-tracking website BookCrossing. For most users, the data includes their stated city and age. [h/t Ningshan Zhang, Kyle Schmaus, and Patrick O. Perry]

Yearbook photos. Over at The Pudding, Elle O’Brien and Jan Diehm chart the rise and fall of “big hair”. To do it, they combed through a public dataset of 37,921 portraits, culled from the yearbooks of 115 American high schools in 26 states between 1905 and 2013. [h/t Sophie Warnes]