Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2015.11.11 edition

Naughty companies, side effects that may include nausea, Albuquerque’s open-data initiative, lots of books, and a tiny dataset with an outsized legacy.

Naughty companies. Good Jobs First’s Violation Tracker calls itself “the first national search engine on corporate misconduct.” The new database currently contains nearly 100,000 penalties for environmental, health, and safety violations — sourced from 13 U.S. regulatory agencies — since 2010. Search results can be downloaded as CSV files, which contain a few additional fields. (Tip: Search for “*” to get all cases.) The largest single fine? The Department of Justice’s $20.8 billion penalty this year against BP. [h/t Samuel Rubenfeld]

The 139,756 side effects of 1,430 medical drugs. The Side Effect Resource, a.k.a. SIDER, takes all the fine print from drug labels and aggregates the information about side effects into a searchable, downloadable database. SIDER got a major upgrade last month, and now contains 40% more drug-effect pairs than before. The website incorporates both generic and brand names, so that searches for “Prozac” and “fluoxetine” bring you to the same page.

Albuquerque’s impressive open-data program. The New Mexico city publishes dozens of regularly updated, well-documented datasets. Among them: government employee earnings, the number of daily visitors to the city’s swimming pools, real-time bus locations, the geography of police beats, and the city’s complete vendor checkbook. [h/t Tom Johnson, who emailed Data Is Plural to praise how Albuquerque is sharing its data: “I have not found any other city in the world doing so in such detail.”]

1.8 billion pages of books (and booklike things). Earlier this year, the HathiTrust Research Center released a massive dataset extracted from 4.8 million digitized volumes. For each of its 1.8 billion pages, the dataset includes word frequencies, languages used, and sentence counts, among other features.

Deadly Prussian horses. For his 1898 book, The Law of Small Numbers, statistician Ladislaus Bortkiewicz tabulated the number of Prussian cavalrymen killed by horse kicks each year between 1875 and 1894. (In total, 196 suffered that tragic fate.) The dataset is tiny, but boasts an outsized legacy: Bortkiewicz’s lethal horse kicks allegedly helped to popularize the then-obscure Poisson distribution. [h/t Noah Veltman]
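
For the curious, here is a rough sketch of why the horse-kick numbers became a textbook Poisson example. It is not Bortkiewicz's own calculation. It assumes the commonly cited layout of 14 cavalry corps tracked over the 20 years, and the observed tallies are the ones usually reproduced in statistics textbooks rather than anything stated in this newsletter. The names `poisson_pmf` and `expected_counts` are made up for illustration.

```python
import math

# From the blurb above: 196 deaths between 1875 and 1894 (20 years inclusive).
TOTAL_DEATHS = 196
YEARS = 20

# Assumption: the commonly cited version of the dataset covers 14 corps,
# giving 14 * 20 = 280 corps-year observations.
CORPS = 14

# Assumption: corps-years with 0, 1, 2, 3, 4 deaths, as usually
# reproduced in textbooks (not given in the newsletter).
OBSERVED = [144, 91, 32, 11, 2]


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_counts(total_deaths, n_obs, max_k=4):
    """Estimate the rate as deaths per observation, then scale the
    Poisson probabilities up to the number of observations."""
    lam = total_deaths / n_obs
    return lam, [n_obs * poisson_pmf(k, lam) for k in range(max_k + 1)]


lam, expected = expected_counts(TOTAL_DEATHS, YEARS * CORPS)
print(f"estimated rate: {lam:.2f} deaths per corps per year")
for k, (obs, exp) in enumerate(zip(OBSERVED, expected)):
    print(f"{k} deaths: observed {obs:3d}, Poisson expects {exp:6.1f}")
```

Under those assumptions the rate works out to 0.7 deaths per corps per year, and the observed tallies track the Poisson expectations fairly closely, which is the usual explanation for the dataset's long afterlife.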


Updates and corrections: