Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2020.02.05 edition

Credibly accused clergy, coronavirus cases, sign-stealing, hotel bookings, and the Smiths (… and the Johnsons … and the Williamses …).

Credibly accused clergy. Reporters at ProPublica have assembled the first-ever nationwide database of US Catholic clergy “credibly accused” of sexual abuse, based on nearly 180 official lists released by dioceses and religious orders. (The majority of the lists were published during the past year and a half, following a landmark grand jury report in Pennsylvania.) The database contains more than 6,700 names so far, plus details available from some of the lists, including birth year, ordination year, assignments, and status.

Coronavirus cases. As the Wuhan coronavirus outbreak intensifies, a team at Johns Hopkins has been mapping the number of confirmed cases, deaths, and recoveries. The project aggregates data from several sources, including the WHO, US CDC, European CDC, and DXY, a website that reports case counts from China’s CDC and National Health Commission. The dataset powering the Johns Hopkins map is available as a spreadsheet. [h/t Fionn Delahunty]

Baseball sign-stealing. “I’m an Astros fan. They cheated during the 2017 regular season — the evidence is clear. In an attempt to understand the scope of the cheating and the players involved, I decided to listen to every pitch from the Astros’ 2017 home games and log any banging noise I could detect.” That’s from Tony Adams, who analyzed audio spectrograms corresponding to more than 8,200 pitches. Last week, he published a website documenting his findings, including a spreadsheet of the data. Related: Adams’s list of stories and analyses that have used the data. [h/t Dan Brady]

Hotel bookings. In 2018, a trio of researchers at the Instituto Universitário de Lisboa published a dataset detailing nearly 120,000 (anonymized) bookings at two (unnamed) hotels in Portugal between July 2015 and August 2017. The bookings, extracted from the hotels’ property management systems, are described in detail: the number of adults, children, and babies for the reservation; country of origin; customer, room, and deposit types; whether the guests were repeat visitors; the number of special requests made; and more.

Last (but not least) names. From the US Census Bureau, you can download a dataset of all surnames that belonged to at least 100 people in 2010, and the same for 2000. Those datasets indicate the total number of people with each name and the distribution of those people by race/ethnicity. A similar list, but based on a sample population and without demographic information, is also available for 1990. Pop quiz: Try to guess the five most popular surnames that are also colors … in order. [h/t Lynn Cherny]