Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2017.04.12 edition

Plum presidential appointments, fuel efficiency, pirated papers, aid for maternal health, and one million comic book panels.

Plum presidential appointments. Every four years, Congress publishes United States Government Policy and Supporting Positions, better known as the Plum Book. The 2016 version, which is available as both PDF and Excel files, identifies more than 8,000 executive and legislative branch jobs subject to “noncompetitive appointment.” Those positions include 1,710 presidential appointments, which are as wide-ranging as the ambassadorship to Afghanistan and the directorship of the Occupational Safety and Health Administration’s Whistleblower Protection Program. Related: For positions requiring its confirmation, the Senate publishes XML files of pending, confirmed, and withdrawn nominees.

Miles per gallon. The Environmental Protection Agency publishes fuel efficiency data on all the car models it has tested, going back to the 1980s… minus all the Volkswagen, Audi, and Porsche diesels caught cheating. The data typically includes three estimates: for city driving, highway driving, and a city-highway combination.

Pirated papers. Sci-Hub, which describes itself as “the first pirate website in the world to provide mass and public access to tens of millions of research papers,” recently released a list of the 62,835,101 academic papers it has collected. That dataset identifies each paper only by its DOI (a short, unique ID). Helpfully, graduate student Bastian Greshake has extracted the journal name, publisher, and publication year from those DOIs. Greshake has also combined that data with six months of Sci-Hub download data (previously featured in DIP 2016.05.04), and analyzed the datasets together. Among his findings: Both are “largely made up of recently published articles, with users disproportionately favoring newer articles and 35% of downloaded articles being published after 2013.”

International aid for maternal and child health. Researchers at the World Health Organization have assembled a dataset of international aid — both from official government assistance and private grants — devoted to reproductive, maternal, newborn, and child health from 2003 to 2013. The dataset, which the researchers described in a recent academic article, draws on 2.1 million records, and is based largely on the OECD’s Creditor Reporting System. Related: Earlier this month, the U.S. State Department cut all its funding for the UN’s family planning agency; it was the agency’s third-largest donor.

One million comic book panels. Comic books make use of white space — or gutters — to propel the story forward, relying on readers’ intuitive ability to fill in the gaps between panels. To see whether computers could learn to make the same inferences, a group of computer scientists built a giant corpus of public-domain comics and tried training a series of neural networks on it. (Spoiler: Humans are much better at this.) The underlying dataset contains 1.2 million panels from nearly 200,000 scanned pages of nearly 4,000 books in the Digital Comic Museum, all published during the 1938–1954 “Golden Age” of American comics. It also contains 2.5 million chunks of text extracted from the comics’ speech balloons, thought bubbles, and narration boxes. [h/t Robin Sloan]