Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2022.11.16 edition

Big emitters, disease outbreaks, permissively-licensed code, impact craters, and tinned fish.

Big emitters. Climate TRACE, a nonprofit coalition launched in 2020, uses satellite imagery, sector-specific datasets, and other sources to estimate greenhouse gas emissions in detail. Its most recent inventory, released last week, highlights 70,000+ individual sites that “represent the top known sources of emissions in the power sector, oil and gas production and refining, shipping, aviation, mining, waste, agriculture, road transportation, and the production of steel, cement, and aluminum.” You can download the data, explore sector- and country-level estimates, and browse a map of the sites. Read more: Coverage in the New York Times. [h/t Ian Johnson]

Disease outbreaks. Juan Armando Torres Munguía et al. have built a dataset of infectious disease outbreaks, based on information extracted from the World Health Organization’s Disease Outbreak News alerts (DIP 2022.03.30) and its coronavirus dashboard. The authors have clustered the outbreaks by disease (classified by ICD-10 and ICD-11 codes), country, and year. Excluding the COVID-19 pandemic, this leads to 1,500+ total combinations between January 1996 and March 2022, spanning 60+ diseases and 200+ countries/territories. [h/t Konstantin M. Wacker]
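
For readers who want to picture the clustering step, here is a minimal sketch of collapsing individual alerts into unique disease/country/year combinations. The records, the placeholder names, and the `cluster_outbreaks` helper are all hypothetical; this is not the authors' code, and the real dataset's layout may differ.

```python
from collections import Counter

# Hypothetical alert records as (disease, country, year) tuples.
# Placeholder names only; these are not rows from the actual dataset.
alerts = [
    ("Disease A", "Country X", 2010),
    ("Disease A", "Country X", 2010),  # second alert about the same outbreak
    ("Disease A", "Country X", 2011),
    ("Disease B", "Country Y", 2014),
    ("Disease C", "Country Z", 2019),
]


def cluster_outbreaks(records):
    """Count alerts per unique (disease, country, year) combination."""
    return Counter(records)


clusters = cluster_outbreaks(alerts)

# Summary counts analogous to the "combinations / diseases / countries" totals.
n_combinations = len(clusters)
n_diseases = len({disease for disease, _, _ in clusters})
n_countries = len({country for _, country, _ in clusters})

print(n_combinations, n_diseases, n_countries)
```

Several alerts about the same disease in the same country and year count as one combination, which is why the number of combinations can be smaller than the number of alerts.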

Permissively-licensed code. The Stack, a new dataset from the BigCode project, “contains over 3TB of permissively-licensed source code files covering 30 programming languages crawled from GitHub.” Those terabytes hold more than 300 million files extracted from repositories whose licenses place “minimal restrictions on how the software can be copied, modified, and redistributed.” The dataset provides the contents of each file along with its repository name, path, size, programming language, detected licenses, and several high-level metrics. Read more: An introductory Twitter thread and preprint paper. [h/t Karsten Johansson]
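
As a rough illustration of the per-file record described above, the sketch below models one entry and filters on its detected licenses. The field names, the `SourceFile` class, the license allowlist, and the "every detected license must be on the list" rule are assumptions made for this example; they are not The Stack's actual schema or the BigCode project's filtering logic.

```python
from dataclasses import dataclass, field


@dataclass
class SourceFile:
    """Hypothetical record mirroring the fields listed in the blurb."""
    content: str
    repo_name: str
    path: str
    size: int  # bytes
    language: str
    licenses: list[str] = field(default_factory=list)


# Illustrative allowlist (an assumption, not the project's real list).
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause"}


def is_permissive(f: SourceFile) -> bool:
    """True if at least one license was detected and all are on the allowlist."""
    return bool(f.licenses) and all(lic.lower() in PERMISSIVE for lic in f.licenses)


# Made-up example files.
files = [
    SourceFile("print('hi')\n", "example/alpha", "main.py", 12, "Python", ["MIT"]),
    SourceFile("fn main() {}\n", "example/beta", "src/main.rs", 13, "Rust", ["GPL-3.0"]),
    SourceFile("package main\n", "example/gamma", "main.go", 13, "Go", []),
]

kept = [f.repo_name for f in files if is_permissive(f)]
print(kept)
```

Requiring every detected license to be on the allowlist is the stricter of two plausible readings; a looser filter could accept a file if any one license qualifies.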

Impact craters. The Earth Impact Database, maintained by the University of New Brunswick’s Planetary and Space Science Centre, catalogs nearly 200 impact craters caused by meteorites that have crashed into the planet. It presents the name, location, diameter, estimated age, geology, and other features of the craters, as well as photographs and bibliographies. Related: Cody Winchester has scraped the crater characteristics into CSV and GeoJSON files.

Tinned fish. Rainbow Tomatoes Garden is a farm in East Greenville, Pennsylvania, that also happens to run an online store selling “the largest selection of tinned seafood in the world.” Curator-owner Dan Waber publishes a spreadsheet of the store’s 630+ offerings, listing each product’s name, type of seafood, brand, country of origin, tin size, and price; whether it’s organic, certified kosher, smoked, boneless, and/or skinless; and more. [h/t George Ho]