Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2020.06.17 edition

Confederate monuments, more protest data, the Green Books, sentences for measuring bias, and a few hundred penguins.

Confederate monuments. For several years, the Southern Poverty Law Center’s “Whose Heritage?” project has been gathering and mapping information on “public symbols of the Confederacy,” such as monuments, place names, official holidays, commemorative license plates, and municipal seals. For each of the 1,800+ entries, the project’s dataset indicates the type of monument/symbol, its location, sponsor, year dedicated, and (if applicable) year removed. [h/t Gita Jackson + Dan Brady]

More protest data. Count Love has been using news reports to quantify protest events in the US since 2017. Its downloadable dataset currently contains more than 27,000 events and provides each event’s date, location, and approximate number of attendees. The events are also tagged with one or more topics, such as civil rights, healthcare, for racial justice, and against regulation. The site’s search/mapping tool lets you filter by those tags, and also for “curated protest data for a more compassionate country.” Related: Tommy Leung and Nathan Perkins describe how they built the project. [h/t Audra Burch et al.]

The Green Books. The New York Public Library has digitized more than 20 volumes of the Green Book, a series of travelers’ guides published from the 1930s to 1960s by US postal worker Victor Green. The guides listed hotels, restaurants, gas stations, and other establishments where Black visitors would be safe and welcome. The library has converted its digital copies into semi-structured text and turned the 1947 edition into a fully structured dataset. The University of South Carolina has built a dataset of the 1,500+ listings in the 1956 edition. In 2015, NYPL Labs combined both years’ datasets into a tool to map the locations and plan routes with them.

Sentences for measuring bias. StereoSet aims to measure biases toward stereotypes — as they relate to profession, gender, race, and religion — in statistical language models, using a dataset containing thousands of sentences, each with several variations. Through the project’s online data explorer, you can examine the sentences and see how some popular language models perform. [h/t Michael McLaughlin]

A few hundred penguins. Data educator Allison Horst earlier this month released a dataset describing the physical characteristics of 344 Antarctic penguins, derived from data collected by marine biologist Kristen Gorman and the Palmer Station. With palmerpenguins, Horst aims to provide a data-exploration alternative to the ubiquitous iris dataset, which was first published, in 1936, in the Annals of Eugenics. Previously: Thousands of penguins (DIP 2020.03.11). [h/t Alex Cookson]