Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2021.04.21 edition

Domestic terrorism, web crawls, COVID-19 in incarceration facilities, New York City languages, and CEO dismissals.

Domestic terrorism. The Center for Strategic and International Studies has analyzed data on 980 domestic terrorism plots and attacks in the US from 1994 through January 2021, categorizing them as “violent far-right,” “violent far-left,” “religious,” “ethnonationalist,” or “other.” (The project’s methodology provides further details.) The researchers haven’t published the full dataset, but have allowed the Washington Post to publish a large slice of it — a dozen columns for each incident, with dates, locations, type of target, and more. Based on its own research, the Post added eight more columns about the far-right attacks, which it used for a data-driven article on the topic. Previously: The Global Terrorism Database (DIP 2018.02.07).

The web, crawled. The nonprofit Common Crawl is “dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.” Over the past decade, it has gathered and shared petabytes of data from its roughly monthly web crawls. The most recent, completed in early March, contains 2.7 billion pages. Related: The University of Mannheim’s Web Data Commons generates structured data from the crawls, including nearly 300 million rows of information (on products, events, museums, and more) extracted from websites’ schema.org markup. [h/t Andrea Volpini]
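
For readers curious what "extracted from websites’ schema.org markup" can look like in practice, here is a minimal sketch. It is not Web Data Commons' actual pipeline. It assumes the markup is embedded as JSON-LD inside script tags, which is only one of the syntaxes sites use for schema.org data. The names JsonLdExtractor, extract_schema_org, and SAMPLE_HTML are hypothetical, and the sample page is invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # An attribute without a value arrives as None, so fall back to "".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks; buffer until the tag closes.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed markup rather than failing the page
            # A block may hold a single object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_schema_org(html):
    """Return a list of JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Invented sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Museum", "name": "Example Museum"}
</script>
</head><body><p>Hello</p></body></html>
"""

rows = extract_schema_org(SAMPLE_HTML)
for row in rows:
    print(row.get("@type"), "|", row.get("name"))
```

Each extracted object could then become one row in a table keyed by its "@type", which is roughly the shape of data the paragraph above describes. Pages that use other schema.org syntaxes would need a different parser.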

COVID-19 in incarceration, by facility. For more than a year, the New York Times collected data on “coronavirus infections, deaths and testing for state and federal prisons; immigration detention centers; juvenile detention facilities; local, regional and reservation jails; and those in the custody of the U.S. Marshals Service.” On Friday, it published the case and death tolls for 2,000+ of these facilities, as of March 2021. Read more: The Times’ reporting and graphics based on the data. Previously: Weekly COVID-19 numbers for each state prison system (DIP 2020.05.06), collected by the Marshall Project and Associated Press. [h/t Libby Seline]

New York City languages. The Endangered Language Alliance’s Languages of New York City map highlights nearly 700 languages and dialects spoken in NYC and nearby counties. For each language, the map marks a number of significant sites where that language is or has been spoken. The project’s downloadable dataset lists each site’s status and neighborhood or city, plus the language’s linguistic family, countries of origin, and estimated number of global speakers. [h/t Ross Perlin]

CEO dismissals. Richard J. Gentry et al. have overseen the collection of data on 1,400+ CEO dismissals and thousands of other CEO departures from S&P 1500 companies between 1992 and 2018. Related: Claudio Fernandez-Araoz et al. have compiled data on CEO and CFO turnover between 2014 and 2018. [h/t Steve Boivie]