Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2017.04.26 edition

National park visitors, word frequencies, women’s empowerment, family growth, and collaborative art.

National park visitors. The U.S. National Park Service publishes a ton of data about visitors to its parks, historic sites, memorials, preserves, and more. Among them: Visitors per park (annually since 1904, and monthly since 1979), overnight stays by type of lodging (tents, RVs, backcountry, etc.), and traffic. Related: “The National Parks Have Never Been More Popular” (FiveThirtyEight, 2016). [h/t Jack King]

Word frequencies. You’re probably familiar with the Google Books Ngram Viewer, which lets you chart word and phrase frequencies over time. Google publishes the underlying data, but those files can (depending on your tools and goals) be cumbersomely large. Here’s an alternative: DIP reader (and former colleague) Chris Wilson has condensed the overall frequencies for 87,000 words, those found in the CMU Pronouncing Dictionary, into a svelte, four-megabyte file. Related: BYU’s advanced interface to the Google Books data. Also related: “The Pitfalls of Using Google Ngram to Study Language” (Wired, 2015). And also: “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them” (The Atlantic, 2017).

Women’s empowerment in India. For each of India’s 36 states and Union Territories, the country’s latest National Family Health Survey includes 114 metrics, such as the percentages of “households using iodized salt” and “men who have comprehensive knowledge of HIV/AIDS.” Unfortunately, the government publishes the reports only as PDFs. But the Hindustan Times has extracted the data for the survey’s eight “women’s empowerment and gender based violence” metrics, including the percentages of “ever-married women who have ever experienced spousal violence” and “women having a bank or savings account that they themselves use.” They’ve published that data as a spreadsheet and used it to construct an interactive Women Empowerment Index. [h/t Gurman Bhatia]

Marriage and divorce, pregnancy and infertility in the U.S. The CDC has been running its National Survey of Family Growth since 1973. For the first three decades, it surveyed only women ages 15-44. Starting in 2002, it began also surveying men. The latest survey was conducted in 2013-15, when it collected data from 10,205 residents about sexual activity and contraception, pregnancy and infertility, marriage and divorce, adoption, parenting, and more. [h/t Allen B. Downey]

This must be the r/place. For April Fools’ Day, Reddit launched a million-pixel canvas called “r/place.” Users could place a single-pixel tile, in one of 16 colors, anywhere on the canvas, but only once every five minutes. By the end of r/place’s 72-hour lifetime, Redditors had placed 16.5 million tiles on the canvas, likely making it “the largest collaborative art project in history.” Last week, Reddit published the entire history of the canvas as structured data. [h/t Felipe Hoffa]