Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2023.07.26 edition

Natural resources revenue, gender and diplomacy, nearly 8 million PDFs, Iberian orcas, and Taskmaster tasks.

Natural resources revenue. To extract minerals, coal, oil, or gas from US federal land, Native American land, or the Outer Continental Shelf, companies must pay various royalties, rents, and bonuses to the Office of Natural Resources Revenue, which then distributes those payments — amounting to billions of dollars per year — to the federal government, local governments, tribes, and individuals. Through its data portal, the agency provides annual and monthly breakdowns of revenue, disbursements, and production, which you can download, query, and visually explore. Related: The Bureau of Land Management also provides “data that include the numbers of BLM-administered oil and gas leases, applications for permit to drill, and oil and gas wells” on federal land. As seen in: “How much oil and gas comes from federal territory?” (USAFacts).

Gender and diplomacy. Birgitta Niklasson and Ann Towns’s GenDip dataset “maps the extent to which states appoint men, women and other diplomats to different kinds of bilateral ambassador postings.” The data cover 200+ countries and 10 specific years between 1968 and 2021. For each diplomat and year, the dataset indicates their sending country, receiving country, type of diplomatic title (e.g., ambassador or minister), and gender, which is “based on titles (e.g. Mr/Mrs, prince/princess, baron/baroness, etc.), pronouns used when referring to the diplomat, or the recognition of names as either female or male.” The team has also produced a series of visualizations based on the data.

Millions of PDFs. As part of its SafeDocs project, DARPA has compiled a corpus of “nearly 8 million PDFs gathered from across the web in July/August of 2021.” To create it, the authors began with the URLs of PDF files identified by Common Crawl (DIP 2021.04.21), fetched their complete contents, and recorded metadata about each file and where it was found. “At the time of its creation, this is the largest single corpus of real-world (extant) PDFs that is publicly available,” they write.

Iberian orcas. The website orcas.pt publishes monthly, downloadable maps indicating the date, time, and location of orca sightings and attacks off the coasts of Portugal and Spain. Run by Rui Alves as a personal project, the site gathers its data through a network of local sailors. Related: The Cruising Association, in collaboration with Grupo de Trabajo Orca Atlántica, publishes maps and detailed reports of orca interactions, including “uneventful passages.” [h/t Soph Warnes]

Comedians, challenged. At TaskMaster.Info, Karl Craven is “obsessively documenting the international Taskmaster franchise,” which began as a British game show on which comedians compete to win challenges such as watermelon speed-eating and high-fiving strangers. Reddit user Alohamori has used the site and other sources to create a “ridiculously comprehensive” database of that information, enabling queries such as the fastest-completed tasks, tasks awarding zero points, and episodes ending in ties. Bonus link: Taskmaster’s official YouTube channel.