Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2020.08.26 edition

The Mass Mobilization Project, ruling elites, government publications, national park trails, and bad words.

Protests around the world. The Mass Mobilization Project is “an effort to understand citizen movements against governments, what citizens want when they demonstrate against governments, and how governments respond to citizens.” The project’s dataset covers more than 14,000 protests in more than 160 countries between 1990 and early 2017. For each protest, it indicates the location, dates, estimated number of participants, protesters’ demands, the state’s response, and more. The project, led by political scientists David H. Clark and Patrick M. Regan, is indirectly funded by the CIA through the government-sponsored Political Instability Task Force. [h/t Erik Gahner Larsen]

Ruling elites. The Authoritarian Ruling Elites Database, compiled by political scientist Austin S. Matthews, is “a collection of biographical and professional information on the individuals who constitute the top elite of authoritarian regimes.” Each of the project’s 18 datasets focuses on a particular regime, such as the military dictatorship that ruled Chile from 1973 to 1990. The biographical data-points include gender, occupation, dates of birth and death, tenure among the elite, and more.

Government publications. The US Government Publishing Office provides online access to a wide range of official federal publications, including bulk downloads of congressional bills, the Federal Register, the Code of Federal Regulations, and more. It also provides sitemaps “to crawl and harvest content” from many of its other collections. [h/t Christine Stefano]

National park trails. The US National Park Service publishes a dataset of 28,000+ lines describing “formal and informal trails as well as routes within and across” the park system. The dataset provides the trail names, types, surfaces, allowed uses, and more. [h/t u/torrijasycafe]

Bad words. “With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don’t show up in places they shouldn’t.” The company’s dataset of Dirty, Naughty, Obscene, and Otherwise Bad Words contains the block-lists for their autocompletion and recommendation features, covering 2,600+ words and phrases in 28 languages. [h/t Katie McCulloch]
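
For readers curious how a block-list like this might be applied to autocomplete suggestions, here is a minimal sketch. It assumes the list is plain text with one word or phrase per line, and it matches on whole words and phrases. Both are assumptions for illustration, not a description of Shutterstock's actual pipeline. The function names and the mild sample entries are hypothetical.

```python
def load_blocklist(lines):
    """Turn one-entry-per-line text into a set of lowercase entries.

    Assumption: the block-list is plain text, one word or phrase per line.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def filter_suggestions(suggestions, blocklist):
    """Keep only suggestions that contain no blocked word or phrase.

    Matching is on whole words/phrases, so a blocked "heck" does not
    remove "heckle". This is an illustrative choice, not a known rule.
    """
    kept = []
    for suggestion in suggestions:
        # Lowercase, collapse whitespace, and pad so that word
        # boundaries can be checked with a simple substring test.
        padded = " " + " ".join(suggestion.lower().split()) + " "
        if any(" " + entry + " " in padded for entry in blocklist):
            continue
        kept.append(suggestion)
    return kept


# Hypothetical, deliberately mild sample data.
sample_blocklist = load_blocklist(["darn", "  Heck ", "", "dang it"])
sample = ["sunset beach", "darn socks", "Dang  It all", "heckle", "mountain trail"]
print(filter_suggestions(sample, sample_blocklist))
```

Whole-word matching avoids flagging harmless words that merely contain a blocked string, at the cost of missing deliberate misspellings.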