Four decades of U.S. air quality. The Environmental Protection Agency collects air quality samples from thousands of monitoring stations across the country. The resulting datasets, which go back to the 1980s, are available as daily files, annual files, and via an API. The monitored pollutants include ozone, carbon monoxide, sulfur dioxide, nitrogen dioxide, particulate matter, volatile organic compounds, and more. You can also download daily Air Quality Index ratings and information about each monitoring station. Previously: Global air pollution datasets from Berkeley Earth (DIP 2017.03.22) and from the World Health Organization (DIP 2016.06.15). [h/t Swier Heeres]
Chest x-rays. Last week, the National Institutes of Health released a dataset containing more than 100,000 anonymized chest x-rays, from 30,000 patients, “including many with advanced lung disease.” For each image, the associated metadata includes the patient’s age, gender, and diagnosis labels. (The dataset’s authors used natural language processing to extract those labels from radiological reports; they estimate that fewer than 10% of the labels are incorrect.) Related: Andrew L. Beam’s list of medical datasets for machine learning. [h/t Chris Hamby]
Media coverage. Media Cloud, a collaboration between researchers based at MIT and Harvard, describes itself as “an open-source platform for studying media ecosystems.” The project lets you track topics and keywords across thousands of sources, including mainstream news publications in the U.S. and many other countries, at both the story and sentence level. You can access Media Cloud’s data via its dashboard or its API. Both require (free) registration. Related: “The Media Really Has Neglected Puerto Rico,” by Dhrumil Mehta at FiveThirtyEight; the analysis uses data from Media Cloud, the TV News Archive, and Google Trends. Also related: The geometry of hurricane coverage, as told through the front pages of The New York Times and The Washington Post.
Privately owned public spaces. In certain cities, private developers can earn zoning concessions by converting sections of their properties into plazas, atriums, mini-parks, and other open-to-the-public spaces. You can download datasets of these “privately owned public spaces” in San Francisco, Seattle, New York City, and, thanks to a recent collaboration between Guardian Cities and a local community group, London. Related: A guide to NYC’s POPS. [h/t Reddit user seeriktus + Ed Vine]
These American Voices. For a new interactive essay at The Pudding, Ash Ngu analyzed the gender composition of This American Life episodes. To support the findings, Ngu has published the underlying data, extracted from the show’s transcripts. Among the data extracted: the number of words spoken by each person in each act of each episode.