Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2023.09.27 edition

Linked historical censuses, fracking fluids, Europopulism, GitHub metrics, and ancient places.

Censuses, linked. The Census Tree, developed by Kasey Buckles et al., “is the largest-ever database of record links among the historical U.S. censuses, with over 700 million links for people living in the United States between 1850 and 1940.” By the team’s estimates, the dataset “includes over 70% of the possible links that could be made for men, and over 60% of possible links for women.” To build it, the researchers began with genealogy records from FamilySearch.org, developed a machine learning algorithm to identify additional links, and incorporated data from other census-linking projects. Each Census Tree row connects an individual’s IPUMS identifier in one decade’s census to their identifier in another. Per the project’s instructions, you’ll need to download the Census data itself from IPUMS.
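For readers who want to picture how a row of links gets used, here is a minimal sketch in Python. It assumes a crosswalk of identifier pairs plus two separately downloaded census extracts. The column names (`id_1900`, `id_1910`), the identifier values, and the `follow_links` helper are all hypothetical stand-ins, not the Census Tree's or IPUMS's actual schema.

```python
# Toy illustration only: column names, identifiers, and records are
# hypothetical, not the Census Tree's or IPUMS's real schema or data.

# Crosswalk rows: a person's identifier in an earlier census paired with
# the same person's identifier in a later census.
links_1900_1910 = [
    {"id_1900": "A1", "id_1910": "B7"},
    {"id_1900": "A2", "id_1910": "B3"},
    {"id_1900": "A9", "id_1910": "B8"},  # A9 is absent from the 1900 extract below
]

# Census extracts (obtained separately), keyed by identifier.
census_1900 = {
    "A1": {"age": 12, "state": "Ohio"},
    "A2": {"age": 30, "state": "Iowa"},
}
census_1910 = {
    "B7": {"age": 22, "state": "Illinois"},
    "B3": {"age": 40, "state": "Iowa"},
    "B8": {"age": 55, "state": "Texas"},
}


def follow_links(links, earlier, later, earlier_key, later_key):
    """Join two census extracts through a crosswalk.

    Links whose identifiers are missing from either extract are skipped.
    """
    joined = []
    for link in links:
        before = earlier.get(link[earlier_key])
        after = later.get(link[later_key])
        if before is None or after is None:
            continue
        joined.append({"before": before, "after": after})
    return joined


pairs = follow_links(links_1900_1910, census_1900, census_1910, "id_1900", "id_1910")
for pair in pairs:
    print(pair["before"]["state"], "->", pair["after"]["state"])
```

With real files, the same join would more likely be done with a dataframe library or a database, but the shape of the operation is the same: the crosswalk sits between two census extracts.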

Fracking fluids. Since launching in 2011, FracFocus has become the largest registry of hydraulic fracturing chemical disclosures in the US. The database, available to explore online and download in bulk, contains 210,000+ such disclosures from fracking operators; it details the location, timing, and water volume of each fracking job, plus the names and amounts of chemicals used. The project is managed by the Ground Water Protection Council, “a nonprofit 501(c)6 organization whose members consist of state ground water regulatory agencies.” As seen in: The latest installment of the New York Times’ Uncharted Water series.

Europopulism. The PopuList, constructed by Matthijs Rooduijn et al., “offers academics and journalists an overview of populist, far-left and far-right parties in Europe from 1989 until 2022.” Version 3.0 of the dataset, released last month, lists each party’s country, local/English names, presence in parliament, and identifiers in the Party Facts (DIP 2019.01.16) and ParlGov (DIP 2018.09.19) databases. It also indicates whether the project’s comparativists and country experts classified the party (outright or “borderline”) as populist, far-right, far-left, and/or euroskeptic, and for which time periods.

GitHub metrics. GitHub’s new Innovation Graph datasets present a range of quarterly metrics on the code-sharing site, aggregated by “economy” — a concept similar to “country” but slightly broader. (Antarctica is an “economy” in the data, for example.) The datasets count the number of developers based in each economy, their repositories and code pushes, most-used programming languages, and more. As noted in the project’s datasheet, the locations are based on IP addresses, so VPN usage may distort the results. Previously: More-granular GitHub activity via the GH Archive (DIP 2018.02.21). [h/t Kevin Xu]

Ancient places. Pleiades, “a community-built gazetteer and graph of ancient places,” has collected data on 40,000+ settlements, roads, rivers, monuments, and many other types of landmarks. It also describes the relationships between them — linking, for instance, the Parthenon to the Acropolis and the Acropolis to Athens. Related: ORBIS: The Stanford Geospatial Network Model of the Roman World. Previously: Roman amphitheaters (DIP 2022.06.08) and the Digital Atlas of Roman and Medieval Civilizations (DIP 2020.06.24). [h/t Avi Levin]
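To illustrate the "graph" half of a gazetteer-and-graph, here is a small Python sketch that follows links from one place to the places it connects to. The three place names come from the example above; the dictionary layout, the direction of the links, and the `reachable` helper are assumptions made for illustration, not Pleiades' actual data model or export format.

```python
from collections import deque

# Toy graph: each place maps to the places it is linked to. The structure
# and link direction are hypothetical, chosen only to mirror the example
# of the Parthenon, the Acropolis, and Athens.
linked_to = {
    "Parthenon": ["Acropolis"],
    "Acropolis": ["Athens"],
    "Athens": [],
}


def reachable(graph, start):
    """Return places reachable from `start` by following links, breadth-first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        place = queue.popleft()
        for neighbor in graph.get(place, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


chain = reachable(linked_to, "Parthenon")
print(" -> ".join(["Parthenon"] + chain))
```

The `seen` set keeps the walk from looping if two places happen to link to each other.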