Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2024.05.15 edition

Unaccompanied migrant children, greenhouse gas giants, 1 million ChatGPT conversations, England sewage discharge, and LA street trees.

Unaccompanied migrant children. The New York Times’ Hannah Dreier was awarded a Pulitzer Prize last week for a “series of stories revealing the stunning reach of migrant child labor across the United States—and the corporate and governmental failures that perpetuate it.” That reporting, Dreier has noted, was partly driven by data she obtained from the Department of Health and Human Services via a FOIA request and lawsuit. The dataset, published in December along with visualizations, describes the placement of 550,000+ unaccompanied migrant children with local sponsors between January 2015 and May 2023: each child’s country of origin, gender, date of entry, and date of release to a sponsor, plus the sponsor’s ZIP code and relationship to the child. Read more: “The data pointed to spots I never would have thought of: Flandreau, South Dakota; Parksley, Virginia; Bozeman, Montana,” Dreier says.

Greenhouse gas giants. Carbon Majors, run by the UK-based InfluenceMap, “is a database of historical production data from 122 of the world’s largest oil, gas, coal, and cement producers.” It attributes to these producers 1,421 metric gigatons of CO2-equivalent emissions from 1854 through 2022. Launched last month, the database provides downloads at several levels of granularity. The least granular version indicates the emissions calculated for each entity-year combination. The most granular version breaks those emissions down by commodity produced, quantity of commodity produced, emitting activity, and reporting entity.

1 million ChatGPT conversations. The WildChat Dataset, constructed by Wenting Zhao et al., “is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts.” The researchers, primarily affiliated with Cornell and the Allen Institute for AI, built it “by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection.” Each of the 1 million rows in the dataset represents a conversation and provides its text, main language, timestamp of its conclusion, underlying model used, moderation results, inferred country, and more. [h/t Data Machina]

England sewage discharge. The UK’s Environment Agency collects data from utility companies regarding sewage-discharging storm overflows in England. The records, available for 2020–2023, list every reported overflow event, its timing and location, number of detected discharges, discharge points, and much more. Related: Wales overflow data are available from other sources. As seen in: Maps from The Rivers Trust, Surfers Against Sewage, and The Guardian. [h/t Giuseppe Sollazzo + James Cheshire + Hugh Graham]

Los Angeles street trees. Journalist Matt Stiles has been using public records requests and official portals to compile data on 1.6 million street trees in 40+ Los Angeles County municipalities. The information varies by city but generally includes the tree’s coordinates and species, often also with measurements such as height and trunk diameter. Previously: Street trees in DIP 2022.09.07, DIP 2020.11.18, DIP 2018.08.08, and DIP 2016.11.16.