Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2024.05.01 edition

Prison commissary prices, historical markers, biological numbers, openly-licensed video transcripts, and Agatha Christie’s bibliography.

Prison commissary prices. Through a series of public records requests, reporters at The Appeal have constructed the “first national database of prison commissary lists,” based on documents provided by 46 states. The database contains three tables. The first links to, and provides metadata about, each list. The second summarizes each state’s prices for two dozen types of products — such as ramen, toothpaste, and rosary beads — across three categories: food, personal care/hygiene, and religious items. The third table provides 2,200+ commissary-specific prices for those products. Read more: “Locked In, Priced Out: How Prison Commissary Price-Gouging Preys on the Incarcerated,” by reporters Elizabeth Weill-Greenberg and Ethan Corey. [h/t JQ Whitcomb]
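For readers who want to picture how the three tables relate, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the column names (`list_id`, `state`, `product`, `price_cents`), the placeholder state codes, and the prices are invented, and the min/max/mean summary is only one plausible way a per-state summary could be derived from commissary-level prices. It is not The Appeal's actual schema or method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stand-in for the first table: one row per commissary list.
lists = [
    {"list_id": 1, "state": "AA"},
    {"list_id": 2, "state": "AA"},
    {"list_id": 3, "state": "BB"},
]

# Hypothetical stand-in for the third table: commissary-specific prices.
# Prices are invented and stored as integer cents to avoid float rounding.
prices = [
    {"list_id": 1, "product": "ramen", "price_cents": 50},
    {"list_id": 2, "product": "ramen", "price_cents": 70},
    {"list_id": 3, "product": "ramen", "price_cents": 40},
    {"list_id": 3, "product": "toothpaste", "price_cents": 200},
]


def summarize_by_state(lists, prices):
    """Roll commissary-level prices up to one summary per (state, product)."""
    state_of = {row["list_id"]: row["state"] for row in lists}
    grouped = defaultdict(list)
    for row in prices:
        key = (state_of[row["list_id"]], row["product"])
        grouped[key].append(row["price_cents"])
    return {
        key: {"min": min(vals), "max": max(vals), "mean": mean(vals)}
        for key, vals in grouped.items()
    }


summary = summarize_by_state(lists, prices)
```

With the real data, the same join-then-group pattern would apply once the invented column names are replaced with the ones used in the published tables.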

Historical markers. Launched in 2006, the Historical Marker Database “is an illustrated searchable online catalog of historical information viewed through the filter of […] permanent outdoor markers, monuments, and plaques.” The crowdsourced project has documented 195,000+ markers in the US, plus thousands more in Canada, Mexico, the UK, and elsewhere. You can browse them by location and by topic, and download data corresponding to each collection. You can also search by person, keyword, historical date, and other attributes. As seen in: “Historical markers are everywhere in America. Some get history wrong,” by NPR’s Laura Sullivan and Nick McMillan, with associated data analysis. [h/t Walt Hickey]

Biological numbers. BioNumbers wants “you to find in one minute any useful molecular biology number that can be important for your research.” As its creators Ron Milo et al. described in 2010, those numbers “range from cell sizes to metabolite concentrations, from reaction rates to generation times, from genome sizes to the number of mitochondria in a cell.” You can search, browse, and download more than 14,000 entries. Each includes a number and/or range, units and method of measurement, relevant organism, and source. For instance: The diameter of an E. coli cell is 1-1.1 micrometers, the lifespan of a human red blood cell is 70-140 days, and a chicken’s genome has 1.05 billion base pairs.

Openly-licensed video transcripts. The YouTube-Commons dataset, built by a French startup, contains 15 million original and auto-translated audio transcripts from 2 million Creative Commons–licensed YouTube videos, sourced from 400,000+ channels. The dataset indicates each video’s YouTube ID, title, channel, and date, as well as each transcript’s original language, translated language, word count, and character count. Translations are available primarily in Dutch, English, French, German, Italian, Russian, and Spanish. [h/t Data Machina]
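As a rough illustration of the fields listed above, the following Python sketch models one transcript row and tallies word counts per language. The field names, the sample rows, and the rule that an original transcript has matching original and translated languages are all assumptions made for this example. They are not taken from the dataset's documentation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Transcript:
    # Field names are hypothetical; they mirror the attributes described above.
    video_id: str
    title: str
    channel: str
    date: str
    original_language: str
    translated_language: str
    word_count: int
    character_count: int


# Invented sample rows, not real records from the dataset.
rows = [
    Transcript("vid001", "Example A", "channel-x", "2020-01-01", "fr", "fr", 1200, 7100),
    Transcript("vid001", "Example A", "channel-x", "2020-01-01", "fr", "en", 1150, 6800),
    Transcript("vid002", "Example B", "channel-y", "2021-06-15", "en", "en", 300, 1700),
]


def words_by_language(rows):
    """Total word count per transcript language."""
    totals = Counter()
    for row in rows:
        totals[row.translated_language] += row.word_count
    return totals


def originals_only(rows):
    """Keep rows whose transcript language matches the video's original language.

    Assumption: an untranslated transcript is marked this way.
    """
    return [r for r in rows if r.original_language == r.translated_language]
```

Calling `words_by_language(rows)` on the sample rows groups the two English transcripts together, while `originals_only(rows)` drops the auto-translated one.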

Agatha Christie’s bibliography. Nicole Mark has compiled a dataset of Agatha Christie’s published stories, covering 75 novels, 154 short stories, and 22 short story collections. The spreadsheets provide each work’s title and the character-based series to which it belongs (e.g., Hercule Poirot or Miss Marple). The novel and collection entries also indicate their year of initial publication, while the short-story entries list the collections that included them.