Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2019.06.19 edition

Drug prices, discographies, plant extinctions, monarchs, and sarcasm.

Drug prices. The Centers for Medicare and Medicaid Services’ National Average Drug Acquisition Cost dataset indicates how much U.S. pharmacies have to pay, on average, to obtain thousands of prescription and over-the-counter drugs. The dataset contains millions of rows (one for each National Drug Code in the survey, for each week since 2013), but you can also download smaller, weekly slices. The agency publishes a separate dataset of changes in these average costs. Previously: Total and average costs for Medicare Part B and Part D prescriptions (DIP 2016.12.14). [h/t data.world]

Discographies. Discogs, a user-contributed music database and marketplace, publishes “monthly data dumps” listing the millions of artists, labels, and releases in its system. Additional types of data (e.g., user reviews) are available through Discogs’ API. [h/t Jan Willem Tulp]

Plant extinctions. “Most people can name a mammal or bird that has become extinct in recent centuries, but few can name a recently extinct plant.” That’s from a new academic paper that presents “a comprehensive, global analysis of modern extinction in plants.” The paper itself is paywalled, but the dataset — of 571 extinct seed plants, plus other species that have been rediscovered or reclassified — is available to download. Related: World’s largest plant survey reveals alarming extinction rate, a summary of the findings. [h/t Joseph Stirt]

European monarchs. Developer Michael Zemel has built an interactive timeline of 282 European kings, queens, emperors, and other monarchs. For each, the data includes his or her name, religion, period of reign, reason for losing power, wars involved in, relationships, and notable events. Zemel has also published a detailed writeup about his inspiration and process, plus the underlying data and code. [h/t Giuseppe Sollazzo + Sophie Warnes]

An obviously perfect dataset. MUStARD is a corpus of 690 text and video clips “for research in automated sarcasm discovery.” The examples, half involving sarcasm and half not, come from Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. Related: Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper), the researchers’ introduction to the dataset.