Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2021.06.23 edition

2020 election emails, civilian harm in Yemen, medical abbreviations, NYC constituent inquiries, and Forrest Gump-ology.

2020 election emails. To build the Princeton Corpus of Political Emails, researchers auto-subscribed to thousands of mailing lists run by candidates, political parties, and other groups participating in the 2020 US election cycle. They’ve received 400,000+ messages so far. Since October, you’ve been able to search the corpus online; as of last month, you can request access to v1.0 of its bulk dataset, which contains 300,000+ emails received through Election Day. For each, it provides the subject and body text, sender, office sought, and more. Previously: Congressional e-newsletters via DCinbox (DIP 2021.03.03), and political emails gathered by The Markup and by FiveThirtyEight (DIP 2020.03.04). [h/t Samantha Guss]

Civilian harm in Yemen. The UN-affiliated Civilian Impact Monitoring Project conducts “real-time collection, analysis and dissemination of open source data on the civilian impact from armed violence in Yemen.” Its public datasets include monthly incident and casualty counts, as well as per-region counts of incidents that damaged various types of civilian infrastructure. Related: ACAPS, a humanitarian analysis group, is aggregating data on a range of “key drivers” and outcomes of the crisis (such as fuel prices, malnutrition, and internal displacement) in each district and governorate. Previously: The Yemen Data Project (DIP 2019.04.03). [h/t Sadam Al-Adwar]

Medical abbreviations. Lisa Grossman Liu et al. have developed the Medical Abbreviation and Acronym Meta-Inventory, a database that maps 104,000+ medical abbreviations and acronyms to 170,000+ different meanings. To build it, the authors standardized data from eight sources, including the Unified Medical Language System, Wikipedia, and ADAM: Another Database of Abbreviations in MEDLINE. Related: “At our urban academic medical center, acronyms constituted 30–50% of the words in a typical medicine admission note.”

NYC constituent inquiries. Many members of the New York City Council use CouncilStat to track their constituents’ requests, complaints, and other inquiries. The tool’s public dataset contains 260,000+ anonymized entries going back to 2015. It identifies each inquiry’s topic (e.g., tax preparation, citizenship, affordable housing, street resurfacing), district, and dates opened and closed. As seen in: A pre-election analysis of the requests by The City’s Ann Choi.

Forrest Gump-ology. StudyForrest is “a one-of-a-kind resource for studying high-level cognition in the human brain under complex, natural stimulation.” Specifically: while watching Forrest Gump. In addition to fMRI scans and eye-tracking measurements, the project’s datasets include extensive annotations of the film itself, such as the location and timing of each shot.