Data Is Plural

Data Is Plural — 2025.04.30 edition

2025-04-30T08:00:00-04:00

Refugee and asylum policies, tens of millions of flights, US sewer overflow sites, previously unmapped waterways, and canoe marathons.

Refugee and asylum policies. The Dataset of World Refugee and Asylum Policies “offers a complete dataset of de jure asylum and refugee policies” across 190+ countries and 70+ years, from 1951 to 2022. The project, developed by Christopher W. Blair et al. and updated in collaboration with the Joint Data Center on Forced Displacement, evaluates 54 aspects of each policy across five dimensions: access, services, the ability to earn a livelihood, freedom of movement, and political inclusion. Each aspect is scored on a 0-1-2-3 scale. The results are available to download and to analyze online. [h/t Annika Younge]

Tens of millions of flights. Sebastiaan Menger has developed a series of quarterly datasets “featuring global, high-level flight schedules extracted from worldwide aircraft ADS-B position transmissions,” going back to early 2024. Each quarterly extract, derived from the ADSB.lol flight-tracking initiative’s open data, features 10–13 million flights. Each flight’s entry indicates the aircraft’s registration number, type, call sign, airline (when applicable), approximate liftoff/touchdown times, and origin/destination airports.

US sewer overflow sites. “There are approximately 700 communities in the United States that have combined sewer systems and experience combined sewer overflow (CSO) discharges,” according to the EPA, whose National Combined Sewer Overflow Inventory lists 8,600+ outfalls across those communities. The downloadable inventory, last updated in September 2023, provides each outfall’s location and relevant information from the National Pollutant Discharge Elimination System’s permit database. As seen in: “Minority communities twice as likely to have sewage polluting nearby river or creek, CBS News analysis shows”. Previously: Sewer overflows in England (DIP 2024.05.15).

Previously unmapped waterways. WaterNet Global Waterways is “a new global dataset that predicts the locations of waterways around the world” using an AI model trained on satellite imagery and elevation data. A collaboration between Bridges to Prosperity and the Better Planet Laboratory, the dataset — available as raster files, vector files, and an interactive map — “triples the known extent of mapped waterways globally, adding 124 million kilometers to the previously mapped 54 million kilometers.” [h/t Cameron Kruse]
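A quick check of the quoted figures, as a minimal sketch: adding the 124 million newly predicted kilometers to the 54 million previously mapped gives roughly 3.3 times the earlier extent, consistent with "triples." The variable names are illustrative only.

```python
# Figures quoted in the dataset's description, in millions of kilometers.
previously_mapped_km = 54
newly_added_km = 124

# Total extent after adding the newly predicted waterways.
total_km = previously_mapped_km + newly_added_km

# How many times larger the total is than the previously mapped extent.
growth_factor = total_km / previously_mapped_km

print(f"Total mapped extent: {total_km} million km")
print(f"Growth factor: {growth_factor:.2f}x")
```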

Canoe marathons. Paddle UK’s Marathon Racing Committee promotes endurance canoe and kayak competitions that range “from a couple of miles or kilometres to the ultimate challenge of the 125-mile Devizes to Westminster Canoe Race.” The organization publishes race results online, which data scientist Andrew Collier has collected into structured data files that indicate each competition’s date, name, region, and category, as well as each paddler’s name, club, division, class, finishing time, position, and points.

Dataset suggestions? Criticism? Praise? Paddle your feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2025.02.26 edition

2025-02-26T08:00:00-05:00

Presidential schedules, water availability, 18 million deceased veterans, Argentine treaties, and chord progressions.

Presidential schedules. Among its various White House–related undertakings, Roll Call Factba.se provides event-by-event structured data representing the public presidential calendars for Donald Trump and Joe Biden since the latter’s inauguration in January 2021. The schedules, available to download in bulk, provide each event’s day and time, location, a brief description, and other details. They contain 9,400+ entries from Biden’s four years in office plus another 300+ from Trump’s second term so far. The events include those from the official presidential schedule, those derived from pool reports, and press briefings. As seen in: POTUS Tracker. [h/t Dan Brady]

Water availability. The US Geological Survey last month released its National Water Availability Assessment, “a pioneering scientific overview of water availability that offers first-of-its-kind insights into the balance between water supply and demand across the conterminous United States.” Alongside the report, USGS launched a “data companion” providing “regularly updated, model-based estimates” of monthly water usage within each of the country’s hydrologic units. Estimates for water availability and water supply are “coming soon,” while those for water quality and aquatic ecosystems are “coming later.”

18 million deceased veterans. BIRLS.org, a new website from Reclaim The Records, provides “an index to basic biographical information on more than 18 million deceased American veterans who received some sort of veterans benefits in their lifetime”. Those records, obtained through a FOIA lawsuit, represent a substantial chunk of the Department of Veterans Affairs’ Beneficiary Identification Records Locator Subsystem. The site also helps you file follow-up requests for any individual’s “full VA claims file, which may contain hundreds of pages of never-before-seen biographical and historical material about the veteran, their military service, and their interactions with the VA.” Note: The “database is not a comprehensive database of all American veterans, but rather a partial and incomplete index of veterans who were eligible for VA benefits or whose heirs had some kind of contact with the VA regarding benefits.”

Argentine treaties. Javier I. Santander, a career diplomat, has built a dataset of 8,200+ bilateral treaties signed by Argentina between 1810 and 2023. It lists each treaty’s title, status, date signed, and counterpart country. The dataset is based on the government’s Digital Library of Treaties, where you can find copies of the treaties themselves. The most common counterparts have been neighboring countries (Chile, Brazil, Bolivia, Paraguay, and Uruguay), followed by Germany, the US, and Italy.

Chord progressions. Spyridon Kantarelis et al. have created CHORDONOMICON, a dataset identifying the progressions of 51 million chords in 667,000+ songs. The dataset is based on tablatures from the website Ultimate Guitar and then “annotated with structural parts, genre, and release date”. Most entries also include the song’s and artist’s IDs in Spotify’s system. [h/t Dale Debber]

Dataset suggestions? Criticism? Praise? Send euphonic feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2025.01.29 edition

2025-01-29T08:00:00-05:00

Hyperlocal Trump/Harris results, private schools, hurricane landfalls, grocery ingredients, and a royal regatta.

Hyperlocal Trump/Harris results. Earlier this month, colleagues at The New York Times published “An Extremely Detailed Map of the 2024 Election” and made the underlying data available to download. The effort “currently includes results for more than 110,000 precincts, or 73 percent of all votes, and will be updated as more data is collected.” The dataset lists each precinct’s state, county FIPS code, votes received by Kamala Harris, votes received by Donald Trump, and total votes (including third parties and write-ins). It also provides each precinct’s geographical boundaries, derived from a mix of official sources and estimations. Previously: “An Extremely Detailed Map of the 2020 Election” and the data behind it (DIP 2021.02.10). See also: Precinct-level election results for 2020, 2018, 2016, and 2012 from the Voting and Election Science Team.

Private schools. The National Center for Education Statistics’s Private School Universe Survey has been gathering data about private elementary and secondary schools every two years since the 1989–90 school year. It collects information on “religious orientation; level of school; size of school; length of school year, length of school day; total enrollment (K-12); number of high school graduates, whether a school is single-sexed or coeducational and enrollment by sex; number of teachers employed; program emphasis” and more. In the latest data, covering the 2021–22 school year, “there were 29,727 private schools, enrolling 4,731,303 students and employing 482,571 full-time teachers”. As seen in: ProPublica’s Private School Demographics lookup tool (webinar scheduled for January 31) and its reporting on “segregation academies”.

Hurricane landfalls. NOAA’s Hurricane Research Division maintains a table of hurricanes that have made landfall on the continental US since the 1850s. It records the year and month of landfall, designated name, states affected, the highest Saffir-Simpson category, central pressure at landfall, and maximum sustained wind speed. The division publishes another table containing more details — such as the full date, latitude, and longitude of landfall — but with a gap in the late 1970s to early 1980s. [h/t Michael Ferragamo + Dale Debber]

Grocery ingredients. To compile GroceryDB, Babak Ravandi et al. scraped data about 50,000+ food products available on the websites of Walmart, Target, and Whole Foods. For each product, they extracted the nutritional information and ingredient list, which they provide as structured data and use for estimating each product’s degree of processing. Related: TrueFood, a website the research team built with the findings.

A royal regatta. The Henley Royal Regatta, a multi-day rowing competition, has been held on the River Thames nearly every year since 1839. Dominic Goymour has scraped the event’s online results into a dataset covering 7,500+ outcomes since 1999. It includes each race’s date, starting time, stage, boat class, cup, winning crew/club, losing crew/club, winning time, and more.

Dataset suggestions? Criticism? Praise? Send hydrodynamic feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2025.01.08 edition

2025-01-08T08:00:00-05:00

Overdose demographics, opioid settlement spending, AI governance documents, NEA writing fellowships, and ISS telemetry.

Overdose demographics. Since mid-2024, reporters at the Baltimore Banner have been publishing a series examining the city’s overdose crisis — reporting supported by The New York Times’ Local Investigations Fellowship and Stanford’s Big Local News. Last month the team partnered with The Upshot and a range of local news organizations to examine a stark phenomenon: In dozens of US counties, “Black men born between 1951 and 1970 have died of overdose at exceptionally high rates for decades.” They’ve published the supporting data, which list overdose death counts and rates by year, county, race/ethnicity, sex, and age group. The data, based on restricted-use records from the CDC, cover the years 1989 to 2022 for “the 408 U.S. counties that had 200 or more overdose deaths between 2018 and 2022”. [h/t Cheryl Phillips + Kimi Yoshino]

Opioid settlement spending. KFF Health News, working with researchers at Johns Hopkins and Shatterproof, has published “a first-of-its kind database” tracking how states and local governments are using the billions of dollars received via opioid settlements in recent years. The database, drawing from “dozens of interviews, thousands of pages of documents, an array of public records requests, and outreach to all 50 states,” represents “the most comprehensive resource to date tracking some of the largest public health settlements in American history.” For each state, it indicates the total funds received in 2022-23, amount committed or spent in various categories (e.g., prevention, treatment, recovery), amount set aside, and amount “untrackable via public reports.” It also catalogs 7,000+ specific spending decisions: funder, destination, purpose, and amount. Previously: Opioid settlement payouts (DIP 2024.04.10). [h/t Aneri Pattani]

AI governance documents. The Emerging Technology Observatory’s AGORA “is a living collection of AI-relevant laws, regulations, standards, and other governance documents from the United States and around the world.” The dataset, available to download and explore online, provides the full text, metadata (e.g., jurisdiction, title, relevant dates), summaries, and thematic tags for 600+ documents. The project currently “skews toward U.S. law and policy” but is aiming “to broaden coverage of U.S. state documents […] and to broaden coverage of Chinese central government documents and major corporate commitments.”

NEA writing fellowships. A team led by English professor Alexander Manshel has compiled a dataset of every recipient of the National Endowment for the Arts’ fellowship for creative writing, “from the organization’s founding in 1965 to 2024, including information about those writers’ demographics, education, and geography.” The dataset, which lists 3,700+ recipients, is based on the NEA’s own directory and a 2006 report, as well as “author biographies and websites, institutional websites, interviews, encyclopedias, literary criticism, and literary journalism.” [h/t Melanie Walsh + Derek Willis]

ISS telemetry. The International Space Station beams home a wide range of measurements: cabin temperature, solar array angles, spacesuit power supply, wastewater tank capacity, oxygen production rate, and much more. NASA, in collaboration with Lightstreamer, provides a feed of these measurements. A team developing a live 3D model of the station has also published a couple of dashboards of the realtime data, historical data going back to 2018, and a data dictionary. [h/t ajdud + AIorNot]

Dataset suggestions? Criticism? Praise? Send orbiting feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2024.12.11 edition

2024-12-11T08:00:00-05:00

Quits and layoffs, food safety alerts, crop rotations, Serbian political party funds, and a long-running ultramarathon.

🔄 Last month, PEN America released its latest database of school book bans (DIP 2022.07.06), now covering the 2023–2024 school year. [h/t Florina Sutanto via Luke Hall]

Quits and layoffs. Minneapolis Fed–affiliated economists Kathrin Ellieroth and Amanda Michaud have constructed a new dataset on monthly quits and layoffs. Using Current Population Survey (CPS) microdata going back to 1978, the dataset estimates the proportions of employees who, after quitting or being laid off, transition to unemployment versus exiting the labor market. In a recent article, Ellieroth and Michaud note that “CPS data offer a perspective not seen in the most-often-used series on quits and layoffs, the Job Openings and Labor Turnover Survey (JOLTS),” featured in DIP 2022.09.21. “Whereas the JOLTS tracks what happens to a job, the CPS tracks what happens to people.” Analyzing it, they found “that increases in unemployment are typically not due to increases in layoffs; rather, they happen because laid-off workers are less likely to quickly find a new job, more likely to stay in the labor force, and thus more likely to join a growing pool of unemployed people hunting for work.” [h/t Alex Albright]

Food safety alerts. Data journalist Adrian Nesta is building an automated pipeline to collect and standardize data on food safety recalls and alerts from two US federal agencies: the FDA and the USDA. For each alert, the standardized dataset indicates the notice’s title, ID, URL, and time posted, as well as the product description, company name, brand name, recall type, recall reason, impacted states, risk level, and more.

Crop rotations. The Department of Agriculture’s Crop Sequence Boundaries initiative algorithmically analyzes satellite imagery to create “estimates of field boundaries, crop acreage, and crop rotations across the contiguous United States.” The results are available via an interactive map and downloads for eight-year time frames. The underlying code is open-source and can be used to generate datasets for custom time frames. Previously: The USDA’s CropScape tool and Cropland Data Layer (DIP 2019.03.06). [h/t Forest Gregg]

Serbian political party funds. The Center for Investigative Journalism of Serbia’s Party Funds database “tracks all reported incomes and expenses of 40 political parties and citizens’ groups in Serbia over the past nine years.” The records, based on financial disclosure reports, can be browsed online, searched, and downloaded. They indicate revenues, overhead costs, ad spending, salary expenditures, and more. The data specifies each line item’s year, amount, purpose, and other context-dependent details. [h/t Teodora Ćurčić]

A long-running ultramarathon. The Comrades Marathon, first run in 1921, is considered “the oldest and largest ultramarathon in the world.” The route stretches 80+ kilometers between Durban and Pietermaritzburg, flipping annually between “up” and “down” directions. In 2019, Kyle Stratton scraped the official website to construct a dataset of all 445,000+ finishers (year, name, country, club, category, finishing time, medal received) through that year. Related: The Association of Road Racing Statisticians’ lists of longest-running marathons and ultramarathons, last updated in 2017. As seen in: Antony Unwin’s Getting (more out of) Graphics.

Dataset suggestions? Criticism? Praise? Send inexhaustible feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2024.12.04 edition

2024-12-04T08:00:00-05:00

100 million places, education policies around the world, China leaders’ foreign visits, nanosatellites, and Pixar films.

100 million places. Foursquare has released an open dataset describing more than 100 million points of interest across 200+ countries. For each place, the dataset includes its name, address, latitude/longitude, date entered, date updated, date marked closed, telephone number, website, email address, and relevant categories. Among the many possible labels: casino, comedy club, 300+ kinds of restaurants (e.g., deli, diner, Korean BBQ, “mac and cheese joint”), and 100+ types of retailers (e.g., candy store, used car dealership, shopping mall). Learn more: Some initial explorations from Tim Wallace and from Simon Willison. Previously: The Overture Maps Foundation’s datasets (DIP 2023.08.09), including information about 53 million places. [h/t Derek M. Jones + Sharon Machlis + Giuseppe Sollazzo]

Education policies around the world. Adrián del Río et al. “introduce a global dataset on education policies and systems across modern history,” with “measures on compulsory education, ideological guidance and content of education, governmental intervention and level of education centralization, and teacher training.” The dataset covers 157 countries annually from 1789 to 2020. The questions answered by the team’s evaluators include, for example, “How many years of schooling are required by compulsory education?”, “Are there any national laws in place that ban specific subjects or topics in school?”, and “Which entities operate secondary schools?”

China leaders’ foreign visits. Yu Wang and Randall W. Stone’s China Visits dataset records 400+ visits by China’s presidents and premiers to 100+ countries between 1998 and early 2020. To compile it, the authors consulted official reports, web search results, and relevant Wikipedia pages. For each visit, the dataset indicates its starting and ending date, Chinese leader, foreign country, broader meeting (e.g., those of the Shanghai Cooperation Organisation), and source URL.

Nanosatellites. Space systems engineer Erik Kulu’s Nanosats Database tracks 4,000+ nanosatellites that have been launched into space, are planned for future launch, or have had their launches cancelled. The data for each satellite include its mission name and description, launching organization and country, mass/unit size, launch date, and status. Additional tables provide lists of CubeSat companies, launch providers, costs, and more. [h/t Ahmad Assem]

Pixar films. Software engineer Eric Leung built and maintains a dataset and R package providing structured information about every Pixar film — from 1995’s Toy Story to 2024’s Inside Out 2. It lists each film’s creators (storywriters, screenwriters, directors, composers, and producers), budget, box-office earnings, aggregate critic ratings, Oscar nominations and wins, and more. [h/t Josh Laurito]

Dataset suggestions? Criticism? Praise? Send animated feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2024.11.13 edition

2024-11-13T08:00:00-05:00

Tariffs, climate summit attendees, substance abuse treatment, waves, and NYC marathon finishers.

👋 Job alert: I’m hiring a statistician to join our data journalism team at The New York Times. Follow that link for details and feel free to email me with any questions.

Tariffs. The United States International Trade Commission maintains annual datasets of US import tariffs going back to 1997. The datasets include each impacted product’s eight-digit Harmonized Tariff Schedule code, a brief description, the duty rate, rate type, effective and ending dates, and more. The commission also publishes a tariff search tool and data on upcoming tariff rates. More globally, the World Trade Organization provides tools to query and download data about its members’ tariffs, as well as databases of regional trade agreements and preferential trade agreements. Previously: Trade policy intervention data from Global Trade Alert (DIP 2022.01.19).
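For readers joining these files to other trade data, here is a minimal sketch of how an eight-digit code nests: the first two digits give the chapter, the first four the heading, the first six the internationally harmonized subheading, and all eight the US tariff line. The function name, the dotted input format, and the sample value are illustrative assumptions, not taken from the commission's files.

```python
def split_hts_code(code: str) -> dict:
    """Split an eight-digit HTS code into its nested classification levels."""
    # Codes are sometimes written with dots (e.g. "0101.21.00"); drop them.
    digits = code.replace(".", "")
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"expected eight digits, got {code!r}")
    return {
        "chapter": digits[:2],      # two-digit chapter
        "heading": digits[:4],      # four-digit heading
        "subheading": digits[:6],   # six-digit harmonized subheading
        "tariff_line": digits,      # full eight-digit US tariff line
    }


# Arbitrary sample value, used only to show the output shape.
parts = split_hts_code("0101.21.00")
print(parts)
```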

Climate summit attendees. Daria Blinova et al. have built a dataset of 310,000+ attendees of United Nations climate summits. The data, largely compiled from PDFs of attendance rosters, include each attendee’s year and meeting attended, name, job title, affiliation, delegation, delegation type (party, observer state, intergovernmental organization, NGO), gender, and more. In all, the attendees span 27,000+ delegations across three decades of COP and predecessor summits. Read more: “This Is 29 Years of International Climate Summits, Visualized,” by The New York Times’ Mira Rojanasakul.

Substance abuse treatment. The Substance Abuse and Mental Health Services Administration’s Treatment Episode Data Set records admissions to, and discharges from, substance abuse centers in the US. The public-use datasets, which span several decades, are based on records collected by state agencies. They include each patient’s demographic information, state, metro/micro area, referral source, treatment type, substances used, frequency of use, age at first use, number of previous treatment episodes, among other details. Related: The administration’s National Survey of Substance Abuse Treatment Services, “an annual census of treatment facilities.” [h/t Conor Lennon et al.]

Waves. The Coastal Data Information Program, launched in the 1970s by a research group at the Scripps Institution of Oceanography, “is an extensive network for monitoring waves and beaches along the coastlines of the United States.” The program provides a map of its stations, a table of recent observations, a catalog of real-time and historical wave measurements, and an extreme wave tracker. As seen in: Dion Häfner et al.’s “FOWD: A Free Ocean Wave Dataset for Data Mining and Machine Learning.”

NYC marathon finishers. New York Road Runners publishes a searchable database of all races it has organized since 1970 — the year of NYC’s first marathon — and all finishers of those races. Data Is Plural reader Joe Hovde has scraped the results of the 2024 marathon into a downloadable spreadsheet. Each row represents one of the 55,000+ finishers and provides their name, bib number, age, gender, city, state, country, time ran, and place finished. Read more: “Marcelo & Karolina, the Fastest Names in the NYC Marathon,” by Hovde.

Dataset suggestions? Criticism? Praise? Send five-borough feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2024.10.23 edition

2024-10-23T08:00:00-04:00

SBA disaster loans, US buildings, the cost of sustenance, German election results, and synthesizers.

SBA disaster loans. “Following a declared disaster,” the US Small Business Administration offers “disaster assistance in the form of low-interest, long-term disaster loans for damages not covered by insurance or other recoveries to businesses of all sizes, private nonprofit organizations, as well as homeowners and renters.” The SBA publishes anonymized data about each such loan in fiscal years 2000 to 2022, drawn directly from its Disaster Credit Management System. The records provide the relevant disaster declaration IDs, property type, ZIP code, city, county, state, verified losses (in real estate and in “content”), and approved loan amounts. Previously: SBA datasets on the Paycheck Protection Program (DIP 2020.07.08) and the administration’s 7(a) and 504 loan programs (DIP 2023.01.11). [h/t Benjamin L. Collier et al.]

US buildings. “Leveraging high performance computing, remote sensing, geographic data science, machine learning, and computer vision,” Hsiuhan Lexie Yang et al., researchers at Oak Ridge National Laboratory, have “partnered with Federal Emergency Management Agency (FEMA) to build a baseline structure inventory covering the US and its territories to support disaster preparedness, response, and recovery.” The dataset and interactive map trace the outlines of 125 million buildings and, in many cases, contain the building’s address, occupancy class, usage type, height, elevation, and other attributes. They also provide information about the imagery used to identify the structure.

The cost of sustenance. The UN World Food Programme’s Fill the Nutrient Gap initiative conducted a series of analyses from 2015 through 2021 to “calculate the costs of energy-sufficient and nutrient-adequate diets and the percentage of households that were unable to afford each diet.” In a recent paper, Zuzanna Turowska et al. describe the analyses’ methodology and share their results as a dataset. For each of the 37 countries analyzed, the dataset contains one row per geographic unit, timeframe, and type of household member; each row provides the cost and unaffordability estimates for that category.

German election results. GERDA, a new project by Vincent Heddesheimer et al., “provides a comprehensive dataset of local, state, and federal election results in Germany.” The results go back to 1953 for federal elections, to 1990 for local elections, and to 1996 for state elections. The files indicate each geographic unit’s number of eligible voters, actual voters, valid votes, invalid votes, and vote shares by party. The authors have also created “geographically harmonized datasets that account for changes in municipal boundaries and mail-in voting districts.”

Synths. Iftah Gabbai is building a dataset of “hardware synthesizers, samplers, and drum machines” produced since 1896, “compiled through a mix of automated and manual processes, combined with extensive research.” For each of the 2,300+ devices identified, the dataset indicates its name, brand, release year, years in production, device type (synth, sampler, et cetera), form factor, architecture, synth engine used, number of keys, key type, oscillator count, and more. Learn more: Gabbai’s introductory video. [h/t Stefan Bohacek]

Dataset suggestions? Criticism? Praise? Send oscillating feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2024.10.09 edition

2024-10-09T08:00:00-04:00

Disability claims processing, evapotranspiration, fuel forecasts, trash balloons, and Anglo-Saxons on record.

👋 Job alert: I’m hiring three data reporters at The New York Times to collaborate across the newsroom. Follow that link for details and feel free to email me with any questions.

Disability claims processing. The Social Security Administration publishes monthly and annual datasets tracking each state agency’s progress processing disability claims. The datasets, which go back to October 2000, count the number of initial claims received, pending, determined, and approved by each agency during each period, as well as similar breakdowns for denial reconsiderations and continuing disability reviews. The administration’s extensive catalog of public datasets also includes several that measure the waiting involved, such as monthly average initial claim processing times, average wait times for reconsiderations, and wait times for administrative hearings. As seen in: “Wait times for Social Security disability benefit decisions reach new high” (USAFacts).

Evapotranspiration. With the goal of “filling the biggest data gap in water management,” the OpenET project uses satellite imagery, weather data, and other sources to estimate the volume of evapotranspiration — “the process by which water is transferred from the land to the atmosphere” — at a 30-meter resolution across 17 states in the western US. The results are available to explore via an online map of annual cumulative evapotranspiration (2019–2024), as monthly datasets via Earth Engine, and through an API. [h/t Mira Rojanasakul]

Fuel forecasts. The US Energy Information Administration’s monthly Short-Term Energy Outlook provides forecasts and recent trends of energy supply, consumption, prices, and inventory. It covers a range of commodities and electricity sources, such as crude oil, coal, natural gas, gasoline, renewables, and nuclear. Starting with its September 2024 report, the outlooks have also begun to include more detailed data on biofuels, available in Table 4d of its structured datasets.

Trash balloons. Since May of this year, North Korea has floated thousands of trash-carrying balloons into South Korea. A team from the Center for Strategic and International Studies’ Beyond Parallel project has mapped 160+ known balloon landing locations, based on public sources. The map’s downloadable data indicates each landing’s date, associated “wave,” coordinates, location name, and province. As seen in: Reuters’ visually immersive article on the topic. [h/t Soph Warnes]

Anglo-Saxons on record. The Prosopography of Anglo-Saxon England project “aims to provide structured information relating to all the recorded inhabitants of England from the late sixth to the late eleventh century.” Built over (relatively less) time by several teams at UK universities, PASE is “based on a systematic examination of the available written sources for the period, including chronicles, saints’ Lives, charters, libri vitae, inscriptions, Domesday Book and coins.” The Domesday-focused portion of the project features a downloadable table of 17,000+ landholders from that manuscript, listing their name (where known), gender, description, value of holdings, and linking to details about those holdings. [h/t Derek M. Jones]

Dataset suggestions? Criticism? Praise? Send prosopographical feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.

Data Is Plural — 2024.09.25 edition

2024-09-25T08:00:00-04:00

NYC evictions, open-source weather APIs, tree canopies, Canadian mines, and James Beard honorees.

NYC evictions. New York City’s government publishes a dataset listing evictions “pending, scheduled and executed” since 2017, updated daily. The data are “compiled from the majority of New York City Marshals,” who are mayor-appointed officers tasked with enforcing civil court cases. Each of the 97,000+ rows indicates the eviction court case number, address, property type, eviction type, execution date, and marshal. Related: The city also publishes data on marshals’ annual eviction revenues. Also related: nycdb points to, and helps download, a range of NYC housing–related datasets. As seen in: “Spiking Evictions Renew Calls to Reform NYC Marshals System,” by Patrick Spauster for City Limits.

Open-source weather APIs. Open-Meteo, an open-source project built on data from national weather services, offers a range of weather and climate APIs that are free for non-commercial use. They include weather forecasts (temperature, humidity, precipitation, wind speed, etc.), daily historical weather since 1940, climate change model outputs, marine wave forecasts, air quality assessments, and more. The project also provides bulk downloads of the underlying data and self-hosting instructions. As seen in: Jan Kühn’s Historical Meteo Graphs. [h/t Giuseppe Sollazzo]
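For a sense of what a request might look like, here is a minimal sketch that only builds a forecast URL and makes no network call. The endpoint and the parameter and variable names (`latitude`, `longitude`, `hourly`, `temperature_2m`, `precipitation`) are assumptions based on the project's public documentation, not details from this newsletter, so check them against the current docs before use. The helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed forecast endpoint; verify against Open-Meteo's documentation.
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude: float, longitude: float, hourly: list) -> str:
    """Return a forecast request URL for one location and a list of hourly variables."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        # Multiple variables are passed as one comma-separated value.
        "hourly": ",".join(hourly),
    }
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{FORECAST_URL}?{urlencode(params, safe=',')}"


# Coordinates roughly corresponding to New York City, chosen for illustration.
url = build_forecast_url(40.71, -74.01, ["temperature_2m", "precipitation"])
print(url)
```

The resulting URL can be passed to any HTTP client. The response format is not shown here, since it is not described above.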

Tree canopies. The High Resolution Canopy Height Maps dataset, released in April by Meta and the World Resources Institute, estimates “tree canopy height at a 1-meter resolution, allowing the detection of single trees at a global scale.” It is available to explore online and download, and was constructed by applying machine learning techniques to satellite imagery and LiDAR data. The estimates use satellite imagery mostly from 2018–2020, and “when newer imagery is available, the publicly shared model can be used to detect change in canopy heights.” [h/t Ben Hur Pintor]

Canadian mines. Economist Clara Dallaire-Fortier has compiled a dataset of “mine-level estimates for the Canadian mining industry with a persistent annual coverage between 1950 and 2022,” based partly on historical government maps. For each of the 947 mines identified, the dataset indicates its name, location, mining companies, dates open/closed, and commodities produced. Previously: Australian mine production, 1799–2021 (DIP 2023.07.12).

James Beard honorees. Cody Winchester has constructed a dataset of James Beard Foundation Award semifinalists, nominees, and winners since 1991, sourced from the foundation’s award-search page. For each honoree, the dataset provides their name, year, category, subcategory, and award status, plus additional category-specific variables (such as publisher for the book awards, and location for restaurant and chef awards).

Dataset suggestions? Criticism? Praise? Send gourmet feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.