Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2023.09.13 edition

Political contributions, historical newspaper articles, Prigozhin audio messages, protein “knownness” scores, and license plate designs.

Political contributions, enhanced. Political scientist Adam Bonica’s Database on Ideology, Money in Politics, and Elections (DIME) gathers “500 million itemized political contributions made by individuals and organizations to local, state, and federal elections covering from 1979 to 2020.” The project, which received a major update last month, “is intended to make data on campaign finance and elections (1) more centralized and accessible, (2) easier to work with, and (3) more versatile […].” It assigns each contributor a unique identifier, geocodes their stated addresses, quantifies their ideological orientation, and more. The raw data come from several sources, including the Federal Election Commission and OpenSecrets. Related: MoneyInPolitics.wtf, a collaborative project that aims to be “America’s most comprehensive dictionary of campaign finance jargon.” [h/t Isadora Borges Monroy]

Historical newspaper articles. Melissa Dell et al.’s American Stories dataset contains the text of ~400 million newspaper articles, extracted from ~20 million public-domain scans in the Library of Congress’s Chronicling America project (DIP 2017.08.16). To construct the dataset, the authors built “a novel deep learning pipeline that incorporates layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes.” For each article, the dataset provides the newspaper name, edition number, date of publication (largely in the 1800s–1920s), page number, headline, byline, and article text. Previously: The LOC’s Newspaper Navigator dataset (DIP 2020.10.07), which extracts visual content from the Chronicling America scans. [h/t Derek M. Jones]

Prigozhin audio messages. Giorgio Comai has collected and auto-transcribed hundreds of audio messages from Russian mercenary leader Yevgeny Prigozhin, which were posted “on his official Telegram channel - the press service of his holding company” from late 2022 through June 2023. For each transcribed segment within each message, the resulting datasets (one in Russian and one auto-translated into English) include the message ID, time posted, segment timestamp, and segment text. [h/t Federico Caruso]
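Because each row is one segment rather than one message, rebuilding a full transcript means grouping segments by message ID and ordering them by segment timestamp. A minimal sketch of that step follows. The field names, the timestamp unit, and the sample rows are all hypothetical placeholders, not the dataset's actual columns or contents.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical field names; the published column names may differ.
    message_id: int
    posted_at: str        # time the message was posted (ISO 8601 string here)
    segment_start: float  # offset into the audio, assumed to be in seconds
    text: str


def join_messages(segments):
    """Group segments by message ID and join their text in timestamp order."""
    by_message = defaultdict(list)
    for seg in segments:
        by_message[seg.message_id].append(seg)
    return {
        mid: " ".join(s.text for s in sorted(segs, key=lambda s: s.segment_start))
        for mid, segs in by_message.items()
    }


# Placeholder rows for illustration only; not real transcript content.
sample = [
    Segment(2, "2023-01-05T10:00:00", 0.0, "Second message."),
    Segment(1, "2022-12-30T08:15:00", 4.2, "part two."),
    Segment(1, "2022-12-30T08:15:00", 0.0, "First message,"),
]
transcripts = join_messages(sample)
```

The same grouping would apply to either the Russian or the English file, since both are described as sharing the per-segment layout.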

The “Unknome.” The human genome has been sequenced, but what do all those genes do? João J. Rocha et al.’s Unknome database assigns a “knownness” score to “all protein clusters that contain at least 1 protein from humans or any of 11 model organisms.” The score is based on the density of annotations in the Gene Ontology knowledgebase, which bills itself as “the world’s largest source of information on the functions of genes.” The clustering comes from another downloadable database, PANTHER, which contains “comprehensive information about the evolution of protein-coding gene families.”

License plate designs. Beautiful Public Data’s Jon Keegan scraped the motor vehicle agency websites of every US state and DC to assemble a dataset of 8,291 license plate designs. The dataset provides each design’s name, state, and image. Read more: Keegan’s exploration of the data, which includes a searchable table. Previously: Vanity plates requested in California (DIP 2020.01.29) and New York (DIP 2015.10.21).