Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2021.03.17 edition

Disasters, nursing-home staffing, Brazil’s vaccination registry, probable news quotations, and British millionaires of the late 1920s.

Disasters. The Emergency Events Database (EM-DAT) contains “essential core data on the occurrence and effects of more than 21,000 [natural and technological] disasters in the world, from 1900 to present.” It focuses on disasters that have caused 10+ human deaths, affected 100+ people, sparked a state of emergency, and/or prompted a request for international assistance. Where known, the public dataset (registration required) indicates each disaster’s location, type, start/end dates, and estimated damages; the number of people killed, injured, made homeless, or otherwise affected; and more. Related: The Geocoded Disasters (GDIS) Dataset (registration required) provides spatial coordinates for the natural disasters in EM-DAT from 1960 to 2018.
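The "and/or" in EM-DAT's focus means an event qualifies if it meets at least one of the four conditions. A minimal sketch of that rule in Python, assuming a hypothetical DisasterReport record whose field names are invented for illustration and are not EM-DAT's actual column names:

```python
from dataclasses import dataclass


@dataclass
class DisasterReport:
    # Hypothetical fields for illustration; not EM-DAT's real schema.
    deaths: int = 0
    people_affected: int = 0
    state_of_emergency: bool = False
    international_aid_requested: bool = False


def meets_inclusion_criteria(report: DisasterReport) -> bool:
    """Return True if the report satisfies at least one of the four criteria."""
    return (
        report.deaths >= 10
        or report.people_affected >= 100
        or report.state_of_emergency
        or report.international_aid_requested
    )


# A small event with a declared state of emergency still qualifies.
small_but_declared = DisasterReport(deaths=2, people_affected=40, state_of_emergency=True)
# An event below every threshold, with no declaration or aid request, does not.
below_thresholds = DisasterReport(deaths=3, people_affected=50)
```

Here meets_inclusion_criteria(small_but_declared) is True and meets_inclusion_criteria(below_thresholds) is False. The thresholds come from the description above; how EM-DAT applies them in practice may differ.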

Nursing-home staffing. The US Centers for Medicare & Medicaid Services requires nursing homes to submit detailed payroll information, which the agency converts into public-use files that summarize daily staffing levels at each facility. The files count employee and contract hours for dozens of types of staff, ranging from administrators to respiratory therapists. Related: “Maggots, Rape and Yet Five Stars: How U.S. Ratings of Nursing Homes Mislead the Public,” a recent New York Times investigation that uses the data in several ways.

Brazil’s vaccination registry. Brazil’s health ministry is publishing granular data from its COVID-19 vaccination registry, covering more than 12 million doses administered so far. For each dose, it indicates the patient’s date of birth, sex, race, location, and eligibility group; the vaccine name, manufacturer, and lot; the date and location of vaccination; and more. [h/t Olivier Lejeune]

Probable news quotations. Quotebank is “an open corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020.” Its authors used machine learning to identify the quotes and speakers, “correctly [attributing] 86.9% of quotations in our experiments.” The five most-quoted people: Barack Obama, Donald Trump, Mitt Romney, Hillary Clinton, and Narendra Modi. [h/t Lynn Cherny]

Early British millionaires. Business historian Peter Scott has reconstructed a dataset of Britain’s “inter-war super-rich.” Scott started with an official (but mostly nameless) list from 1928/29 of 438 “millionaires,” defined by tax authorities as people with annual incomes above £50,000, and then cross-referenced the entries with other sources to identify 291 men and 28 women who fit the criteria. [h/t Jain Family Institute]