Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2022.01.19 edition

Political emails, foreign commerce interventions, commuting zones, tech support scams, and Wordle words.

More political emails. Derek Willis, a journalism lecturer with expertise in political data, has published a searchable, downloadable database of 100,000+ email messages received in recent years by an address he created for this purpose, and “which I routinely plug into forms I find on candidate and committee sites.” The database, which Willis plans to update weekly, lists each message’s timestamp, sender, subject line, and body. Previously: The Princeton Corpus of Political Emails (DIP 2021.06.23), the Markup’s collection of 5,000+ campaign emails (DIP 2020.03.04), and DCInbox’s congressional e-newsletter collection (DIP 2021.03.03). [h/t jcberk]

Foreign commerce interventions. The Global Trade Alert, launched in 2009, “provides timely information on state interventions taken since November 2008 that are likely to affect foreign commerce,” such as new subsidies, export quotas, import tariffs, or anti-dumping laws. Its downloadable dataset describes 33,000+ interventions, listing their types, implementing jurisdictions, affected jurisdictions, affected products, and more. The project, affiliated with the University of St. Gallen, has also started tracking interventions that affect digital commerce. [h/t Simon Evenett and Johannes Fritz]

Commuting zones. Decades ago, the USDA Economic Research Service developed a methodology to group the nation’s counties into hundreds of “commuting zones,” based on the Census’s journey-to-work data. Those groupings are available from the agency (for 1980, 1990, and 2000), and from researchers at Penn State (for those years plus 2010). More recently, Facebook/Meta has developed its own methodology for estimating commuting zones, using location data collected from its users. The project’s public dataset spans the world and specifies the zones not as sets of counties but as detailed, custom boundaries.

Tech support scams. From 2018 to 2021, the now-shuttered PopupDB Project collected information about tech support scams and their deceptive browser popups. Its maintainers have since released two final downloads: a “light” dataset that lists the URLs and web hosts of 11,000+ such popups, and a “full” database that includes screenshots and source code. [h/t NeeP]

Wordle words. You might have heard of Wordle. The game’s 2,315 possible answers and 12,972 permitted guesses are not an enormous secret, being embedded in its viewable source code. The Riddler’s Zach Wissner-Gross, for instance, has extracted those word lists into two spreadsheets.