Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2021.12.01 edition

The General Index, personal income, banking-crisis interventions, Europe’s primary forests, and Michael Scott misquotes.

Language from 100 million academic articles. The General Index, a new project from Carl Malamud’s Public Resource, contains detailed linguistic data derived from 107,233,728 academic journal articles. The index’s main table contains all one-to-five-word sequences found in each article, and their frequencies — more than 355 billion “n-grams” in total. A second table identifies nearly 20 billion keywords auto-extracted from the corpus, and a third table lists the authors, title, publication date, and DOI associated with each article. Read more: “The plan to mine the world’s research papers” (Nature, July 2019) and “Giant, free index to world’s research papers released online” (Nature, October 2021). [h/t webmaven]
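To make the main table's shape concrete, here is a minimal sketch of counting every one-to-five-word sequence in a single piece of text. The function name `ngram_counts`, the lowercasing, and the whitespace tokenization are illustrative assumptions; the General Index's actual extraction pipeline is not described here and may differ.

```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count every 1- to max_n-word sequence in `text`.

    Assumption: words are lowercased and split on whitespace,
    which is a simplification, not the General Index's method.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


counts = ngram_counts("the early bird gets the worm")
print(counts["the"])             # frequency of a 1-gram
print(counts["the early bird"])  # frequency of a 3-gram
```

Applied per article, a table like this grows quickly, since a text of w words yields up to 5w n-gram occurrences.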

Personal income. The US Bureau of Economic Analysis has released its latest county-level estimates of personal income, which now cover 1969–2020. The per-capita numbers are also available by metropolitan statistical area, as well as disaggregated by income “component” (wages/salaries, income from assets, etc.) and industry. You can download the data and also explore it through interactive tables and maps. As seen in: “From Wealthy Enclaves to Asset Deserts,” a map and report by the Economic Innovation Group.

Banking-crisis interventions. Economists Andrew Metrick and Paul Schmelzing have compiled a database of 800+ banking crises spanning the years 1257 to 2019, plus 1,800+ government attempts to mitigate them. For each crisis, the database provides the starting year, relevant country or region, a brief description, and more. It also describes the interventions, lists their dates, classifies them into 20 categories (asset guarantees, market liquidity assistance, etc.), and links them to sources and prior literature.

Europe’s primary forests. Francesco Maria Sabatini et al. have combined information from dozens of sources to develop “the most comprehensive dataset” of Europe’s “primary forests” — those “where the signs of human impacts, if any, are strongly blurred due to decades without forest management.” The dataset, which includes 18,411 patches across 33 countries, describes their names, locations, level of “naturalness,” dominant tree species, and more. Previously: EU-Forest (DIP 2017.01.25).

“Early worm gets the worm.” Designer and “diehard Office superfan” Will Chase rewatched the US version of The Office with one goal in mind: to document every misquote, malapropism, mispronunciation, and other verbal flub by Steve Carell’s character Michael Scott. He found more than 200.