Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2021.01.06 edition

Megascale coronavirus surveys, commodity-transportation costs, more college sports financing, ten million Jupyter notebooks, and 2020 in haiku.

Megascale coronavirus surveys. Carnegie Mellon University’s Delphi epidemiological forecasting group and Facebook have partnered to field a large-scale coronavirus survey in the US; they’ve collected more than 15 million responses since April 2020. The University of Maryland has formed a similar partnership for an international survey, in which “a representative sample of Facebook users is invited on a daily basis to report on symptoms, social distancing behavior, mental health issues, and financial constraints”; millions have participated in that survey as well. Geographically aggregated results of the US survey can be downloaded via an online interface or Delphi’s API; the international results are also available via API. Practical example: An analysis of state-by-state mask usage, with code. [h/t Alex Reinhart]

Commodity-transportation costs. The UN and the World Bank have launched a new interactive map and dataset that quantify the transportation costs for international trade, country by country and broken down by mode of transportation (sea, air, rail, road), trading partner, and commodity. The numbers, based on both directly reported figures and statistical modelling, include costs overall, per unit, and per unit per kilometer. The project currently covers only 2016, but has plans to expand. [h/t Jan Hoffmann]
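
To make the three cost measures concrete, here is a minimal sketch of how they relate to one another. The shipment figures are hypothetical and not drawn from the dataset, and treating the "unit" as a tonne is an assumption made only for illustration.

```python
# Hypothetical shipment figures, for illustration only (not from the dataset).
total_cost_usd = 12_000.0   # overall transport cost of the shipment
quantity_tonnes = 40.0      # assumed unit: tonnes of the commodity shipped
distance_km = 6_000.0       # distance between the two trading partners


def unit_costs(total_cost, quantity, distance):
    """Return (cost per unit, cost per unit per kilometer)."""
    per_unit = total_cost / quantity
    per_unit_per_km = per_unit / distance
    return per_unit, per_unit_per_km


per_unit, per_unit_per_km = unit_costs(total_cost_usd, quantity_tonnes, distance_km)
print(f"per tonne: ${per_unit:.2f}; per tonne-km: ${per_unit_per_km:.4f}")
```

Dividing by distance is what lets costs be compared across trade routes of very different lengths.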

More college sports financing. The College Athletics Financial Information Database, run by the privately funded Knight Commission on Intercollegiate Athletics, details the annual sources of revenue (such as ticket sales) and expenses (such as coaches’ compensation) for hundreds of schools, based on information self-reported to the NCAA and the federal government. Many of the records were obtained via freedom-of-information requests by USA Today and Syracuse University students. [h/t Craig Garthwaite et al.]

Millions of computational notebooks. In 2017, a team of researchers downloaded and analyzed 1.25 million publicly available Jupyter notebooks, documents that weave together code, output, and text. They also published the notebooks and their related metadata. Inspired by that project, a team at JetBrains recently did a follow-up scan, analyzing and publishing data on nearly 10 million notebooks.
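
For a sense of what analyzing a notebook involves: a Jupyter notebook file is a JSON document with a list of cells, each marked as code, markdown, or raw. The sketch below summarizes one notebook using only the standard library. It assumes the nbformat 4 layout, the sample notebook is a hypothetical stand-in, and `summarize_notebook` is an illustrative helper, not the method either research team used.

```python
import json
from collections import Counter

# A tiny stand-in notebook in the nbformat 4 JSON layout (hypothetical content).
SAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Some text."]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": "print(1 + 1)",
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["2\n"]}]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "source": [], "outputs": []},
    ],
})


def summarize_notebook(raw_json):
    """Count cells by type, lines of code, and code cells with saved output."""
    notebook = json.loads(raw_json)
    cells = notebook.get("cells", [])
    counts = Counter(cell.get("cell_type", "unknown") for cell in cells)

    code_lines = 0
    for cell in cells:
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        code_lines += len(source.splitlines())

    # Only code cells carry an "outputs" list; it is empty if nothing was saved.
    cells_with_output = sum(1 for cell in cells if cell.get("outputs"))
    return {
        "cell_counts": dict(counts),
        "code_lines": code_lines,
        "cells_with_output": cells_with_output,
    }


summary = summarize_notebook(SAMPLE_NOTEBOOK)
print(summary)
```

To try it on a real file, read the contents of an `.ipynb` file as text and pass that to `summarize_notebook`.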

2020 in haiku. Over the course of 2020, Eli Holder paid workers on Mechanical Turk to turn news headlines into 5/7/5-syllable poems. The result: 2,760 “Doom Haikus,” which you can browse on a timeline or download in bulk. For each poem, the dataset also includes the original article URL, date processed, headline, and SEO snippet. [h/t Karsten Johansson]