Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2017.06.07 edition

Millions of scientists, Trump’s pre-presidency flights, severe workplace injuries, annotated Reddit conversations, and E. coli at Ocean Beach.

Millions of scientists, and their migrations. ORCID is a nonprofit organization that provides unique identifiers for researchers — mostly scientists so far — to make it easier to distinguish between them. It has issued more than 3 million IDs so far, and provides annual bulk downloads of all researchers’ public profiles. In many cases, the researchers have supplied their education and employment histories. That enabled Science magazine to analyze the migrations of more than 110,000 researchers who’ve listed multiple countries in these public CVs. (The data and code underlying the analysis are also available to download.) [h/t Shaun Coffey]
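For readers who want to try a similar analysis, here is a minimal sketch of how country-to-country moves might be counted from affiliation histories. It is not Science's code. The profile layout (an ID mapped to a list of country and start-year pairs), the sample records, and the function names are all hypothetical; the actual ORCID bulk files are structured differently and would need parsing first.

```python
from collections import Counter

# Hypothetical, simplified profiles: ID -> list of (country code, start year).
# The real ORCID bulk download uses a different, richer format.
profiles = {
    "0000-0001": [("US", 2005), ("DE", 2010), ("DE", 2014)],
    "0000-0002": [("FR", 2008)],
    "0000-0003": [("IN", 2001), ("GB", 2006), ("IN", 2012)],
}


def country_moves(history):
    """Return (origin, destination) pairs for each change of country, in year order."""
    ordered = [country for country, _ in sorted(history, key=lambda item: item[1])]
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if a != b]


def migrants(profiles):
    """Return the IDs of researchers who list more than one country."""
    return sorted(
        rid for rid, history in profiles.items()
        if len({country for country, _ in history}) > 1
    )


def migration_counts(profiles):
    """Count how often each (origin, destination) move appears across all profiles."""
    moves = Counter()
    for history in profiles.values():
        moves.update(country_moves(history))
    return moves


if __name__ == "__main__":
    print(migrants(profiles))
    print(migration_counts(profiles))
```

Sorting by start year is a simplifying choice; overlapping or undated affiliations would need a more careful rule.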

Trump’s pre-presidency flights. Before Donald Trump began flying on Air Force One, he rode a fleet of private aircraft. Reporters at Bloomberg used the Freedom of Information Act to obtain flight records for three major components of that fleet: a “Boeing 757 with gold-plated seatbelt buckles, known as Trump Force One during the campaign; a Cessna 750 Citation X jet; and a Sikorsky helicopter.” For each of the more than 1,500 flights taken between August 2010 and November 2016, the dataset contains the date, time, and airport of both the departure and arrival. Trump wasn’t necessarily aboard each of those flights; the dataset does not contain passenger information. Related: Bloomberg’s analysis/maps of the data. Also related: The Washington Post used the data to estimate the flights’ CO2 emissions.

Severe workplace injuries. In January 2015, the Occupational Safety and Health Administration began requiring U.S. employers to report “all severe work-related injuries, defined as an amputation, in-patient hospitalization, or loss of an eye.” You can download a spreadsheet of these injuries, some 20,000 in 2015 and 2016 combined. It contains the injury dates, descriptions, and outcomes, as well as the employers’ names and locations. Previously: OSHA’s more detailed (but slightly more cumbersome) inspection data and API (DIP 2016.07.13).

Annotated Reddit conversations. Researchers at Google took a semi-random sample of 9,473 Reddit threads, containing 116,347 comments in total. Then, they paid people to categorize each comment by its “discourse act” — e.g., whether it was a question, answer, announcement, agreement, humor, et cetera. The result is Coarse Discourse, “a dataset for understanding online discussions.” [h/t Roberto Bayardo]
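A first pass at annotations like these might be a simple tally of discourse acts. The sketch below assumes each comment carries a list of labels from several annotators and keeps only labels chosen by a strict majority. The field names ("comments", "labels"), the sample threads, and the majority rule are illustrative assumptions, not the dataset's documented schema or the researchers' method.

```python
from collections import Counter

# Hypothetical toy data; the field names "comments" and "labels" are
# illustrative and may not match the published dataset's schema.
threads = [
    {"comments": [
        {"labels": ["question", "question", "question"]},
        {"labels": ["answer", "answer", "humor"]},
        {"labels": ["agreement", "humor", "answer"]},  # annotators disagree
    ]},
    {"comments": [
        {"labels": ["announcement", "announcement", "question"]},
        {"labels": ["answer", "answer", "answer"]},
    ]},
]


def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def tally_discourse_acts(threads):
    """Count comments per majority label; comments without one go under 'no_majority'."""
    tally = Counter()
    for thread in threads:
        for comment in thread["comments"]:
            label = majority_label(comment["labels"])
            tally[label if label is not None else "no_majority"] += 1
    return tally


if __name__ == "__main__":
    for act, count in tally_discourse_acts(threads).most_common():
        print(act, count)
```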

E. coli at Ocean Beach. The San Francisco Public Utilities Commission’s Beach Water Quality Monitoring Program measures bacteria levels at fifteen locations on the city’s shoreline. You can download the measurements by clicking the “raw data” link below this map. The data powers the (unsurprisingly) unofficial @BeachPooBot account on Twitter. [h/t Reddit user cavedave]