📬 Reader Erin Wilson has built a range of visualizations and analyses based on the Netflix viewership data featured in DIP 2023.12.20 … and a video walking you through it. (Have you published something with data you found via Data Is Plural? Click here to let me know.)
👋 DIP+ members got a sneak peek of this edition’s datasets on Monday. Support the newsletter and your career by becoming a member.
Institutional investments. If you’re an institutional investor with US operations and managing at least $100 million in publicly traded securities, the Securities and Exchange Commission requires you to file Form 13F each quarter. (The biggest filers — such as Vanguard, BlackRock, and State Street — have trillions of dollars invested.) These filings, available to download going back to mid-2013, detail each investor’s long positions for each security: their number of shares, market value, security type, issuer name, CUSIP code, and more. As seen in: Michigan teenager Anonyo Noor’s wallstreetlocal.com, which aggregates the data, matches it to additional information, and provides a search interface.
Human development, indexed. Perhaps the best-known metric of its kind, the United Nations’ Human Development Index combines statistics on life expectancy, income per capita, and years of schooling into a single number for each country-year. The UN provides downloads and an API for all annual HDI ratings and sub-components for 1990 to 2022. Those resources also feature data from related indices, such as the Inequality-adjusted Human Development Index, Gender Development Index, and Gender Inequality Index. [h/t Michael A. Rice]
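For readers wondering how three kinds of statistics become one number: under the methodology the UN has used since 2010, the HDI is the geometric mean of three normalized dimension indices (health, education, income). The sketch below illustrates that calculation. The goalpost values are written from memory of the UN's technical notes and should be treated as assumptions to check against the current notes; the function and constant names are made up for illustration.

```python
import math

# Goalposts (min, max) as recalled from the UN's technical notes.
# Treat these as assumptions and verify them before relying on the output.
LIFE_EXPECTANCY = (20.0, 85.0)      # years
EXPECTED_SCHOOLING = (0.0, 18.0)    # years
MEAN_SCHOOLING = (0.0, 15.0)        # years
GNI_PER_CAPITA = (100.0, 75_000.0)  # PPP dollars, compared on a log scale


def scale(value, bounds):
    """Rescale a value to the 0-1 range between its goalposts, clamping at both ends."""
    lo, hi = bounds
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)


def hdi(life_expectancy, expected_schooling, mean_schooling, gni_per_capita):
    """Geometric mean of the health, education, and income indices."""
    health = scale(life_expectancy, LIFE_EXPECTANCY)
    # The education index averages the two schooling sub-indices.
    education = (scale(expected_schooling, EXPECTED_SCHOOLING)
                 + scale(mean_schooling, MEAN_SCHOOLING)) / 2
    lo, hi = GNI_PER_CAPITA
    income = scale(math.log(gni_per_capita), (math.log(lo), math.log(hi)))
    return (health * education * income) ** (1 / 3)


# Illustrative inputs only, not figures for any real country.
example = hdi(life_expectancy=72.5, expected_schooling=13.0,
              mean_schooling=8.5, gni_per_capita=12_000)
```

Because the mean is geometric rather than arithmetic, a very low score on one dimension drags the overall index down more than a simple average would.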
Extrajudicial killings in Bangladesh. Between 2009 and 2022, “Bangladesh’s security forces killed at least 2,597 people in apparent extrajudicial executions, custodial torture, and by firing bullets at protesters,” according to Nazmul Ahasan’s analysis for Netra News, building on data “compiled by Bangladeshi human rights defenders and collated by the Australia-based Capital Punishment Justice Project.” Ahasan and colleagues “independently verified more than 98% of the cases in the dataset using press reports and subsequently updated any incomplete data.” The records are available as a table in the article and as a JSON file. Each entry includes the victim’s name (if known), incident date, description, location, agencies involved, purported justification, and news source. As seen in: The 2024 Sigma Awards.
Agri-environmental policies. David Wuepper et al. have constructed a dataset of 6,000+ policies between 1960 and 2022 “at the intersection of agriculture and the environment, implemented not only by national entities but also by subnational and supranational entities, covering different instruments (for example, regulations, frameworks, payment programmes) and topics,” such as the US Safe Drinking Water Act, the Bavarian Forestry Law, and Tanzania’s 2009 Wildlife Conservation Act. Each entry lists the policy’s country, title, type, keywords, year implemented, description, and other details.
Rolling Stone’s album rankings. A new visual essay from The Pudding compares Rolling Stone’s “500 Greatest Albums of All Time” lists from 2003, 2012, and 2020. A methodology note says the project began with a spreadsheet by Chris Eckert and eventually led the authors to develop a dataset of their own. Theirs lists every album in the rankings — its name, genre, release year, 2003/2012/2020 rank, the artist’s name, birth year, gender, and more — plus each year’s voters. [h/t Jason Kottke]
Dataset suggestions? Criticism? Praise? Send your greatest feedback of all time to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions. // Support the newsletter and your career by becoming a DIP+ member.
📬 Reader Aditya Dahiya has built a showcase of his Data Is Plural–inspired visualizations. “Here, you’ll find open-source code detailing every step, from data acquisition to final rendering,” Aditya writes. (Have you published something with data you found via Data Is Plural? Click here to let me know.)
Human trafficking. The Counter-Trafficking Data Collaborative’s Global Synthetic Dataset uses differential privacy techniques to represent “over 206,000 victims and survivors of trafficking identified across 190 countries and territories from 2002 to 2022.” The approach, developed in partnership with Microsoft Research, converts anonymized case records into “a new dataset in which records do not correspond to actual individuals, but which preserves the structure and statistics (i.e., utility) of the original data.” Each row indicates a (synthetic) individual’s gender, age group, citizenship, country of exploitation, duration of reported trafficking, traffickers’ means of control, types of exploitation, and the year the collaborative’s partners registered the case. Related: The collaborative’s Global Victim-Perpetrator Synthetic Dataset, which takes a similar approach to relationships between victims and perpetrators. [h/t Mariana Moreira + Lorraine Wong]
Real-world vehicle emissions. On Monday, the European Commission published its first report analyzing the real-world CO2 emissions of cars and vans, based on fuel consumption monitoring devices that the EU now requires. The report uses data received from 600,000+ vehicles. That sample is available to download, along with metrics aggregated by manufacturer and fuel type: average fuel consumption, emissions, and comparisons to standardized test results. Related: Data on millions of EU car registrations (and van registrations), including each vehicle’s fuel economy and emissions ratings. Previously: FuelEconomy.gov (DIP 2017.04.12), with data on decades of car models. [h/t Jan Willem Tulp + Xan Gregg]
State legislators. Nicholas Carnes and Eric Hansen’s 2023-4 State Legislators Dataset features “biographical information about state lawmakers who held office in 2023 and 2024 compiled from legislative and campaign websites and other online sources.” The dataset spans all 50 states and includes 7,300+ lawmakers. “The project’s principal aim was to record the current or most recent main occupation (outside of elected office) held by each member,” the authors write, “but the dataset also includes information about a wide range of characteristics including race, gender, and education.” A version for 2021–22 is also available. Previously: State legislator financial disclosures (DIP 2017.12.13) and ideology scores (DIP 2020.01.01). [h/t Derek Willis]
Meta Oversight Board decisions. Meta’s independent Oversight Board reviews a selection of the company’s content-moderation decisions and has the power to overturn them. The board publishes its rulings online, as does Meta itself; neither, however, provides a download link. But Information Is Beautiful has compiled a spreadsheet of the board’s 80+ decisions through early February, supporting a visualization of the cases’ topics and outcomes over time. [h/t Data Science Community Newsletter]
Aviation waypoints. For his recent exploration of the FAA’s aviation maps, Beautiful Public Data’s Jon Keegan has turned the agency’s list of 67,000+ navigation waypoints into a downloadable dataset. “Often these waypoint names will reflect the culture, food or sports teams of the city they are near,” Keegan writes. “Off the coast of New England, there is LBSTA and WHALE. Boston’s sports legacy gave us BOSOX, BRUWN, CELTS, PATSS, FENWY, ORRRR and BORQE. Salem has WITCH, and Plymouth has PLGRM.”
Dataset suggestions? Criticism? Praise? Send five-lettered feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions. // Support the newsletter and your career by becoming a DIP+ member.
📬 Reader Jack T. has built an interactive map of the data on “educational gag orders” compiled by PEN America, featured in DIP 2022.08.24. (Have you published something with data you found via Data Is Plural? Click here to let me know.)
👋 DIP+ members got a sneak peek of this edition’s datasets yesterday, received dispatches from last week’s NICAR conference, and get ongoing access to a semantic-search-bot for the DIP archives. Support the newsletter and your career by signing up.
Humanitarian emergency mapping. The United Nations Satellite Centre (UNOSAT) provides a range of services to UN agencies and the general public, including downloadable maps, data, and analyses produced “in response to humanitarian emergencies related to disasters, complex emergencies and conflict situations.” Those currently available include assessments of flood impacts in Libya, landslides in the Republic of the Congo, and building damage in the Gaza Strip. The last of these identifies structures that satellite imagery suggests have been damaged; the data indicate each building’s location and damage level, plus an assessment confidence and notes. Assessments for prior humanitarian emergencies can be found by adjusting the listing page’s filters, and also via UNOSAT’s contributions to the Humanitarian Data Exchange. [h/t Allison Martell]
Materials and their properties. The Materials Project, led by scientists at Lawrence Berkeley National Lab, “is a multi-institution, multi-national effort to compute the properties of all inorganic materials” with the “ultimate goal” being “to drastically reduce the time needed to invent new materials.” Its online explorer and API currently provide information about 150,000+ materials. You can search by component elements, formula, thermodynamics, structural properties, magnetism, elasticity, and many other characteristics.
EU infringements. The European Commission publishes a searchable and downloadable database of all its decisions regarding national infringements of EU regulations, decisions, and directives. It currently contains 58,000+ decisions in 24,000+ cases, going back to the late 1980s. (To download the full database, conduct a blank search and then click the “Export to Excel” link.) Each entry lists a decision type and date, case identifier, country, policy area, and more. Recent examples include the Commission’s decisions to refer Ireland to court for failing to protect its peat bogs and Italy for noncompliance with a wastewater treatment directive. [h/t Maximilian Haag et al.]
NYC council members. Maximum New York has published a biographical dataset of people elected to the New York City Council. For each member since 1998 (plus some before that), it lists their name, district, borough, political party, date of birth, undergraduate/graduate universities and fields of study, whether they ever served on a community board, prior employer, and more. Related: DataMade’s Chicago Councilmatic lists all members, bills, votes, and meetings, and is also available as structured data. [h/t Vikram Oberoi + Forest Gregg]
Counting fish. The University of Washington’s Columbia Basin Research provides (among other data) daily, species-level counts of adult salmon and trout passing through more than a dozen sites in the Pacific Northwest. CalFish publishes fish counts and population estimates for the Upper Sacramento River Basin, which “contains much of California’s salmon and steelhead populations.” Similar resources include those available from Alaska, Oregon, and the Yakama Nation. [h/t Dan Brady]
Dataset suggestions? Criticism? Praise? Send upstream feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions. // Support the newsletter and your career by becoming a DIP+ member.
📬 Reader Nick Hillman’s research team at the University of Wisconsin uses data from the federal government’s Daily Treasury Statements (DIP 2023.04.05) to track the estimated volume of student loan payments over time. Although Nick learned about that data via non-DIP sources, the project is exactly the type I like to feature in these notes. (Have you published something with data you found via DIP? Click here to let me know.)
👋 DIP+ members got a sneak peek of this week’s datasets on Monday, and get ongoing access to a semantic-search-bot for the DIP archives. I’ve also posted a roundup of downloadable data published by the Sigma Awards’ shortlisted projects and pointers to upcoming data-related events. Support the newsletter and your career by signing up; memberships are just $7/month or $60/year.
Global military spending. How much money has each country spent, each year, on its military? Different datasets have different answers, cover different timeframes, and use different methodologies. Miriam Barnum et al.’s Global Military Spending Dataset attempts to bring them together. By uniting “76 variables from 9 dataset collection projects,” the authors write, “we provide the most comprehensive and complete set of published datasets on military spending ever assembled.” Each of the variables represents one source/methodology, and each observation is a country-year. “Disagreement on the actual expenditure value for a given country-year is common, even between datasets produced by the same project,” they find. Previously: The Stockholm International Peace Research Institute’s Military Expenditure Database (DIP 2017.03.29), one of the sources.
Fatal police pursuits. Reporters at the San Francisco Chronicle have compiled a national dataset of 3,300+ deaths in police car chases in 2017–2022. To build it, they used information from the federal government’s Fatality Analysis Reporting System (DIP 2016.08.31), research organizations, news reports, lawsuits, and public records requests. For each death, the dataset indicates the person’s name, age, gender, race, and connection to the pursuit (driver, passenger, bystander, officer). It also includes the incident’s date, location, reason given for the pursuit, and main law enforcement agency involved. Read more: “Fast and Fatal,” the Chronicle’s investigation based on the dataset. [h/t Susie Neilson]
Price-fixing cartels. Industrial economist John M. Connor has constructed the Private International Cartels dataset, “which the author believes to be the largest collection of legal-economic information on contemporary price-fixing cartels.” It spans three decades (1990–2019) and covers 1,500+ suspected or convicted cartels, including 1,100+ that “have been deemed guilty of price fixing by one or more antitrust authority.” It also links those cartels to tens of thousands of companies and to 2,000+ individuals indicted or punished for their involvement. The dataset’s variables include information about cartel geography, industry, market share, overcharges, penalties, and much more.
Real-time airport disruptions. The Federal Aviation Administration’s National Airspace System Status dashboard provides real-time listings of delays and closures at US airports. For each disruption, it indicates the type of problem, reason, current average delay times, and more. A minimal API linked from the site provides the information as an XML-formatted file. Read more: Ruihai Youngblood describes his experience helping to redesign the dashboard. [h/t Jason Scott]
Pinball machines. The Open Pinball Database provides a searchable inventory and API of ~2,000 pinball machines and 120+ manufacturers. Details include each machine’s name, manufacture date, mechanism type, display type, player count, and more. Related: Pinball Map’s crowdsourced global map and API of the locations of installed pinball machines. [h/t Jeremy Herrman + technophiliac]
Dataset suggestions? Criticism? Praise? Send high scores and feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions. // Support the newsletter and your career by becoming a DIP+ member.
Data Is Plural sends its heartfelt condolences to the friends and family of Fazil Khan, a young and talented data journalist whose work was recently featured here. Fazil died in New York City on Friday, when his apartment building caught fire. The Columbia Journalism School is hosting an online and in-person memorial service today at 6pm EST.
Climate funding. To develop its “Global Landscape of Climate Finance 2023” report, the Climate Policy Initiative gathered data on grants, loans, equity and other types of “primary financing in real economy sectors that reduce greenhouse gas emissions and build climate resilience.” The report, published in November, identified ~$1.3 trillion in such financing globally in 2021–2022. A spreadsheet provided alongside it indicates the estimated amount by year, region, sector, focus on mitigation vs. adaptation, type of financial instrument, funder sector (public vs. private), and funder type (development bank, corporation, institutional investors, etc.). Reports from earlier years include downloadable data for 2019–2020 and 2017–2018. Previously: National climate funds (DIP 2022.02.09) and climate finance projects (DIP 2023.06.14).
Federal laws since 1789. Political scientist and “recovering lawyer” Brian Libgober has compiled a “comprehensive dataset of U.S. federal laws,” covering 49,000+ legislative enactments from 1789 to 2022. Such a dataset has been elusive, Libgober notes, in part “because such laws have been enacted over hundreds of years, resulting in a complicated patchwork of documents published in numerous and inconsistent formats.” His solution combines three key sources: “the oldest meta data for the U.S. Statutes at Large disseminated via HeinOnline, similar and more recent meta data through the Governmental Printing Office, and finally the last six years of law-making as described by the National Archives’ website.” The dataset lists each law’s title, legislative session, date of passage, source/citation identifiers, and more.
TSA complaint counts. The Transportation Security Administration publishes semi-regular reports on the complaints it receives, aggregated by month, airport, category, and subcategory. Unfortunately, the agency only publishes those reports as PDFs, rather than structured data files. So volunteers and I at the Data Liberation Project have built a data pipeline to convert those PDFs into tidy CSV files, currently covering complaints to TSA at 440+ airports (and additional complaints not specifying any airport) from January 2015 to January 2024. Read more: The Data Liberation Project’s latest newsletter dispatch.
High-school financial education. In a recent paper, economists Allison Oldham Luedtke and Carly Urban introduce a dataset of 19,000+ high-school classes that teach financial literacy, manually collected from thousands of online course catalogs. Each row provides details about the school (e.g., name, location, enrollment) and course, including its title, description, duration, requirement status, and whether financial literacy was the main focus or a smaller component. An auxiliary dataset indicates, annually for 1970–2024, which states required such coursework for high school graduation.
Trash interceptors. Mr. Trash Wheel is one of four “semi-autonomous trash interceptors” pulling garbage out of the Baltimore Harbor. The partnership behind the effort publishes spreadsheets of each contraption’s collection history. For each dumpster filled since May 2014, they list the date, weight and volume of trash, and estimated number of plastic bottles, cigarette butts, and other types of items extracted. [h/t Cody Winchester]
Dataset suggestions? Criticism? Praise? Send semi-autonomous feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions. // Support the newsletter and your career by becoming a DIP+ member.
📬 Reader Aleksandra Chirkina has explored the relationship between self-claimed expertise and coffee preferences in the data from James Hoffmann’s Great American Coffee Taste Test (featured in DIP 2023.11.15). Have you published something with data you found via DIP? Click here to let me know.
👋 DIP+ members got a sneak peek of this week’s datasets on Monday, links to additional datasets too niche for the newsletter, and a mini-analysis of Hobbes’ appearances in the Visual Language Research Corpus. We also chatted about data-intensive programs that run entirely browser-side and changes to Census surveys’ questions about gender. Come join us! Memberships are just $7/month or $60/year.
Self-driving stats. Companies wanting to test or deploy self-driving cars on California’s public roads must receive a permit from the state DMV’s Autonomous Vehicles Program. As part of the requirements, these companies must “submit annual reports to share how often their vehicles disengaged from autonomous mode during tests (whether because of technology failure or situations requiring the test driver/operator to take manual control of the vehicle to operate safely).” These disengagement reports indicate each vehicle’s company, permit number, VIN, monthly miles driven, and annual disengagements — including vehicles with none. And for each disengagement, the reports list the vehicle, date, disengagement initiator (vehicle, test driver, remote operator, or passenger), type of location, and a brief summary. [h/t Chartr]
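One way a reader might summarize reports like these is as disengagements per 1,000 miles driven, by company. The sketch below assumes rows shaped like the fields described above; the field names and sample values are hypothetical, not taken from the DMV's files.

```python
from collections import defaultdict

# Hypothetical rows, one per vehicle. These are not actual DMV data.
vehicles = [
    {"company": "Company A", "vin": "VIN001",
     "monthly_miles": [100.0] * 12, "annual_disengagements": 3},
    {"company": "Company A", "vin": "VIN002",
     "monthly_miles": [200.0] * 4, "annual_disengagements": 0},
    {"company": "Company B", "vin": "VIN003",
     "monthly_miles": [0.0] * 12, "annual_disengagements": 0},
]


def disengagements_per_1000_miles(rows):
    """Aggregate miles and disengagements by company, then compute the rate."""
    miles = defaultdict(float)
    events = defaultdict(int)
    for row in rows:
        miles[row["company"]] += sum(row["monthly_miles"])
        events[row["company"]] += row["annual_disengagements"]
    # A company with no recorded miles has no meaningful rate, so return None.
    return {
        company: (1000 * events[company] / miles[company]) if miles[company] else None
        for company in miles
    }


rates = disengagements_per_1000_miles(vehicles)
```

Vehicles with zero disengagements still contribute their miles to the denominator, which is why it matters that the reports include them.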
US wetlands. The National Wetlands Inventory, maintained by the US Fish and Wildlife Service, provides interactive maps and bulk data containing “geospatially referenced information on the status, extent, characteristics and functions of wetland, riparian, deepwater, and related aquatic habitats.” With contributions from 160+ organizations, coordinated through a dedicated national standard, the inventory represents “more than 35 million wetland and deepwater features,” and continues to receive updates. As seen in: DIP reader Simon Greenhill and coauthors’ paper, dataset, explainer video, and map using machine learning to predict wetland and river protections under the Clean Water Act.
SF overdose deaths. Last year, 813 people died from accidental drug overdoses in San Francisco, according to the latest figures from the city’s Office of the Chief Medical Examiner. That’s the highest annual total on record, per the San Francisco Chronicle’s overdose data tracker, which combines data from the medical examiner’s reports with information from other city agencies, such as overdose reversals by EMS responders administering naloxone and calls handled by the Street Overdose Response Team. As seen in: The Chronicle data team’s latest newsletter.
Spain’s political primaries. Oscar Barberà et al. have compiled a dataset of the outcomes of 361 primary contests in Spain since 1991, “based on information provided by political parties at the time of the event through their websites, press releases, and journalistic reports.” It includes primaries to determine candidates for regional, national, and EU elections, as well as for party leadership. For each contest, it lists the party, territory, year, type of post, number of competitors, turnout, incumbent’s outcome, winner’s percentage, and more.
Scott’s SMS spam. In late September 2022, Scott Lee Chua began preserving every SMS he received, motivated by the recent passage of the Philippine SIM Registration Act. By early February 2024, he’d amassed 3,324 messages, which he has grouped into five categories and charted: spam (13% of all messages), one-time passwords (12%), marketing (10%), government notices (1%), and “messages I both expect and welcome” (63%). He’s also published a partially-redacted dataset of each message’s time received, time read, sender, text, and category.
Dataset suggestions? Criticism? Praise? Send 13%-spam feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.
👋 Over its eight years and 360+ editions, Data Is Plural’s newsletter and archives have been freely accessible, just like the datasets DIP features. I have no plans to change that. Occasionally, however, I’ve wondered what a “premium” membership might look like. What could DIP offer to subscribers who wanted to dive deeper, to engage beyond the tidy boundaries of a newsletter? To that end: Introducing DIP+, Data Is Plural’s new membership program. For $60/year you’ll get access to a Discord-hosted community, featuring bonus material, lively discussions with fellow data-spelunkers, job boards, and more. There’s a 14-day free trial, so you can test it out first. Click here for details and to register. And feel free to reply to this email with any questions or feedback.
State trust lands. “State trust lands just might be one of the best-kept public secrets in America,” according to Grist, which published a data-driven investigation on the topic last week. These lands — expropriated from Indigenous nations and managed by state governments for profit — “exist in 21 Western and Midwestern states, totaling more than 500 million surface and subsurface acres.” Grist’s inquiry focuses on those benefiting land-grant universities, the subject of a related High Country News investigation (DIP 2020.06.24). Using state property records and Forest Service data digitized from historical cession maps, Grist identified “more than 8.2 million acres of state trust parcels taken from 123 tribes, bands, and communities” that fund 14 such institutions. Their public data describe these 41,000+ pieces of land: each parcel’s size and location, rights type, land use, benefiting university, associated tribes, and cession details. [h/t Rachel Glickhouse]
Household composition. Juan Galeano et al.’s CORESIDENCE database provides 146 indicators of household arrangements in 156 countries and ~4,000 regions, spanning the years 1964 to 2021. The indicators range in specificity, from average household size to, e.g., the average number of non-relatives in 3-person households. They also include gender breakdowns, such as the proportion of 5-person households among female-headed households. The metrics are calculated using “global-scale individual microdata from four main repositories and national household surveys, encompassing over 150 million individual records representing more than 98% of the world’s population.”
Comics, deconstructed. The Tilburg University–based Visual Language Lab, led by comics scholar Neil Cohn, studies “all aspects of visual language, from the structure of individual drawings, emoji, or cartoons, to how we make meaning out of sequences of images like in comics.” The lab’s Visual Language Research Corpus provides detailed annotations of tens of thousands of panels in 300+ comic books and graphic novels (plus every Calvin & Hobbes strip). The dataset’s sources include material from multiple continents, time periods, and genres. The annotations examine “attentional framing structure and filmic shot scale, the situational changes across panels, page layouts, multimodality, visual morphology, and path structure,” among other characteristics. [h/t Cameron Yick]
Nursing home prices. Between December 2020 and March 2022, SeniorLiving.org researchers “attempted to contact 7,221 US senior facilities by telephone to obtain availability and service pricing.” The team got at least some pricing data for 3,000+ of the facilities. The results, available from SeniorLiving’s data portal, provide information about each provider, the team’s call attempts, and the average monthly price for five types of housing and care: skilled nursing with a private room, skilled nursing with a shared room, assisted living, independent living, and care tailored for people with Alzheimer’s disease and dementia. Related: The Centers for Medicare & Medicaid Services’ Skilled Nursing Facility Cost Report datasets, which contain annual metrics on finances, staffing, and care provided. [h/t Corie Wagner]
Twisty roads. Curvature uses OpenStreetMap data to map the world’s curviest roads. Built by motorcyclist and software developer Adam Franco, the open-source project “works by looking at the geometry of every road segment and adding up how much length of the road is sharp corners, broad sweeping curves, and straight areas.” Franco also provides a more detailed explanation, as well as data files scoring each road and curve segment. [h/t Giuseppe Sollazzo]
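The quoted description suggests an approach along the following lines. This is a rough sketch, not Franco's code: it assumes road points are already in planar coordinates measured in meters, estimates sharpness from the radius of the circle through each three consecutive points, and uses made-up radius thresholds and function names.

```python
import math

# Hypothetical radius thresholds in meters; the real project's values may differ.
SHARP_MAX_RADIUS = 50.0
BROAD_MAX_RADIUS = 175.0


def circumradius(a, b, c):
    """Radius of the circle through three points; infinite if they are collinear."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    # The cross product's magnitude is twice the triangle's area.
    cross = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    if cross == 0:
        return math.inf
    return (ab * bc * ca) / (2 * cross)


def classify(radius):
    """Bucket a curve radius into one of three classes."""
    if radius <= SHARP_MAX_RADIUS:
        return "sharp"
    if radius <= BROAD_MAX_RADIUS:
        return "broad"
    return "straight"


def length_by_class(points):
    """Add up how much of a road's length falls into each curvature class."""
    totals = {"sharp": 0.0, "broad": 0.0, "straight": 0.0}
    if len(points) == 2:
        # Two points cannot curve, so the single segment counts as straight.
        totals["straight"] += math.dist(points[0], points[1])
        return totals
    for a, b, c in zip(points, points[1:], points[2:]):
        # Attribute the first segment of each triple to that triple's radius.
        totals[classify(circumradius(a, b, c))] += math.dist(a, b)
    if len(points) >= 3:
        # The final segment reuses the radius of the last triple.
        last_radius = circumradius(*points[-3:])
        totals[classify(last_radius)] += math.dist(points[-2], points[-1])
    return totals


# Four points on a straight east-west line, 100 meters apart.
straight_road = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0), (300.0, 0.0)]
# Four points along a circle of radius 30 meters.
tight_bend = [(30 * math.cos(t), 30 * math.sin(t)) for t in (0.0, 0.5, 1.0, 1.5)]
```

Calling `length_by_class(straight_road)` puts the whole length in the straight bucket, while `length_by_class(tight_bend)` puts it all in the sharp bucket. A per-road score could then weight the sharp and broad lengths, though the weighting here would be another assumption.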
Dataset suggestions? Criticism? Praise? Send zigzagging feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.
Local election results. Justin de Benedictis-Kessner et al. have compiled a dataset of 77,000+ (distinct) candidates across 57,000+ US elections for mayor, city council, school board, county executive, county legislature, sheriff, and prosecutor. It is, they write, “the most comprehensive publicly-available source of information on local elections across the entire country.” It includes “most medium and large cities and counties” and spans 1989 to 2021. The authors combined data from pre-existing databases, state election websites, and newspaper archives. They also “worked with a team of research assistants who coded results from thousands of local elections based on city and county websites.” For each candidacy, the core table indicates the election’s jurisdiction, office, and timing, plus the candidate’s name, incumbency status, party, estimated demographics, votes won, and more.
Mid-century anti-Black killings. The Burnham-Nobles Archive at the Northeastern University School of Law is “dedicated to identifying, classifying, and providing factual information and documentation about anti-Black killings in the mid-century South.” The current version focuses on 11 southern states during 1930–1954. Developed by political scientist Melissa Nobles and law professor Margaret Burnham, the long-running project has gathered 12,000+ news articles, death certificates, federal agency records, and other sources. It documents 900+ incidents and 950+ victims, as well as alleged perpetrators and judicial outcomes. The archive provides an interactive map, record search (including by person, incident, and document), and downloads. Previously: Beck/Tolnay and Seguin/Rigby’s data on lynching victims (DIP 2021.06.02). [h/t Jasmine Mithani]
US travel behavior. Fielded every five to eight years since 1968, the Federal Highway Administration’s National Household Travel Survey “is the authoritative source on the travel behavior of the American public.” The questionnaire asks respondents to inventory all of their household’s trips taken during a 24-hour period. The most recent survey, for 2022, includes ~31,000 trips by ~17,000 people in ~8,000 households. It indicates each trip’s duration, vehicle details, motivation, parking costs, traveler demographics, and much more. The FHWA provides downloads of the anonymized data, as well as user guides and technical notes. As seen in: “The school bus is disappearing. Welcome to the era of the school pickup line,” by the Washington Post’s Andrew Van Dam.
Space X-rays. eROSITA is a wide-field X-ray telescope “capable of delivering deep, sharp images over very large areas of the sky,” helping researchers “to study the large-scale structure of the universe.” The telescope launched into orbit in 2019 as a collaboration between Russia and Germany, with data rights split between the countries. In February 2022, following Russia’s invasion of Ukraine, Germany’s Max Planck Institute suspended eROSITA’s operations. Data processing continued, however, and last week the institute published its first data release. It contains the first six months of results for the hemisphere assigned to Germany, and is “the largest X-ray catalogue ever published.”
Amateur archaeological finds. The British Museum’s Portable Antiquities Scheme “records archaeological finds discovered by the public,” assisted by a network of national and local partners. Its database contains 1.1 million records describing 1.7 million objects. “All have been found by everyday people by chance, most through metal detecting.” The most common finds: coins (~500,000 records), buckles (~58,000), and brooches (~52,000). Search results are available in JSON, XML, and other structured formats. [h/t Maev Kennedy + Walt Hickey]
Dataset suggestions? Criticism? Praise? Send antiquarian feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.
]]>
📬 Reader Alexandra F has mapped and charted the locations and characteristics of “bog bodies,” using data from Roy van Beek et al. (featured in DIP 2023.02.15) and Overture Maps (featured in DIP 2023.08.09). Have you published something with data you found via DIP? Click here to let me know.
Groundwater levels. To assess global trends in groundwater levels, Scott Jasechko et al. have analyzed data from ~170,000 monitoring wells in 40+ countries. The researchers’ public datasets include annual groundwater levels “in all cases for which we have received permission from a database manager to post data,” amounting to more than 4 million measurements (59% of the total). The published records also include the boundaries of 1,600+ aquifer systems, “manually delineated […] using maps and descriptions from 1,236 local and regional studies.” The study’s supplementary documentation describes each data source and method of access. Read more: Coverage in the New York Times, with detailed maps of Iran and Spain, and in Wired. [h/t Patrick Tanguay]
Military surplus. Through its Excess Defense Articles program, the US military offers free and reduced-price equipment to foreign governments. The Department of Defense publishes a spreadsheet of the program’s authorizations and transfers, last updated in mid-2020 and going back to 2010. Each of the 4,100+ entries lists the foreign country, item description, transfer status, status date, whether it was a sale or grant, quantities (requested, allocated, accepted, rejected, delivered), “current” value, and acquisition value. Items range in size and significance, from vinyl tape (55 rolls authorized for transfer to Iraq in 2016) to Abrams tanks (including 178 provided to Morocco in 2018). [h/t David Vine]
Study abroad. Open Doors, an Institute of International Education initiative funded by the State Department, “is the only long-standing, comprehensive information resource on international students and scholars in the United States and on U.S. students studying abroad for academic credit.” Its annual reports provide aggregate statistics by country, field of study, and/or institution. For example: NYU hosted more international students (~25,000) than any other US university in the 2022–23 school year, followed by Northeastern (~21,000) and Columbia (~19,000); Italy, the UK, and Spain were the most popular destinations for US students in 2021–22. Related: The State Department’s statistics on non-immigrant visas, including student visas. Previously: Data on Europe’s Erasmus exchanges (DIP 2022.02.09). [h/t Kate Miller]
Dot-gov metadata. Father-son duo Luke and Elias Fretwell are grading government websites’ metadata. Starting with a dataset of 1,300+ federal domains maintained by the Cybersecurity and Infrastructure Security Agency, the Fretwells have tested whether each homepage’s source code includes a <title> tag, certain Open Graph markers, and other key HTML metadata, which “can have a significant impact on how citizens experience government digital services.” The results, presented online in the form of report cards, can also be downloaded.
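For readers who want to try a similar check on a page of their own, here is a minimal sketch in Python. It is not the Fretwells’ code or grading rubric: the function name `check_metadata` and the three Open Graph properties it looks for (`og:title`, `og:description`, `og:image`) are assumptions chosen for illustration, and it works on an HTML string you have already fetched.

```python
from html.parser import HTMLParser


class MetadataChecker(HTMLParser):
    """Records the page's <title> text and any Open Graph <meta> properties."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.og_tags = set()

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            prop = attr_map.get("property") or ""
            # Count an Open Graph tag only if it carries non-empty content.
            if prop.startswith("og:") and attr_map.get("content"):
                self.og_tags.add(prop)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_metadata(html, required_og=("og:title", "og:description", "og:image")):
    """Return whether the page has a non-empty title and which assumed OG tags are missing."""
    checker = MetadataChecker()
    checker.feed(html)
    checker.close()
    return {
        "has_title": bool(checker.title.strip()),
        "missing_og": [tag for tag in required_og if tag not in checker.og_tags],
    }
```

Calling `check_metadata(page_source)` returns a small dictionary that could feed a pass/fail report card. A fuller grader would presumably test more fields than this sketch does.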
Taylor’s colors. Last year, Reddit user swiftdata1989 published a spreadsheet and visualization of the colors mentioned in Taylor Swift’s albums. Each spreadsheet row lists the album, song, specific color (e.g., “scarlet”), closest generic color (e.g., red), number of times mentioned, and clarifying notes. Previously: Colors from the World Color Survey (DIP 2017.08.23), Werner’s Nomenclature of Colours (DIP 2021.11.17), and Bob Ross paintings (DIP 2020.12.09).
Dataset suggestions? Criticism? Praise? Send all hues of feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.
]]>
📬 Reader Aditya Dahiya has used data on movie theaters in eight Indian cities (featured in DIP 2017.11.08) and Netflix viewership data (featured in DIP 2023.12.20) to demonstrate custom color scales and legends with ggplot2. (Have you published something with data you found via DIP? Click here to let me know.)
Police data metadata. The Police Data Accessibility Project is compiling a meta-dataset of police records: where to find them online, what time period they cover, how often they’re updated, and other characteristics. The searchable, downloadable dataset includes links to 1,700+ resources, such as traffic stop datasets, crime maps, use-of-force reports, contract and policy listings, and many other types of records, across hundreds of agencies. The team also maintains a dataset of 23,000+ criminal legal agencies. Related: The Vera Institute of Justice’s Police Data Transparency Index, which scored 90+ local police agencies across 10 categories of data transparency; its methodology page links to a more detailed, downloadable spreadsheet.
Historical fishing intensity. Yannick Rousseau et al. have generated a series of datasets estimating annual fishing effort from 1950 to 2017 by country, year, gear type, vessel length, sector (industrial, artisanal motorized, and artisanal unmotorized), and category of species targeted. The datasets provide “information on number of vessels, engine power, gross tonnage, and nominal effort,” a metric that multiplies the engine power by the number of days at sea. The estimates draw on “a range of publicly available sources, governmental reports, and grey literature.” Related: Co-author Reg A. Watson’s Global Fisheries Landings dataset, which estimates “commercial, small-scale, illegal and unreported fisheries catch,” also since 1950. Previously: Global Fishing Watch’s fishing effort datasets, based on vessel tracking signals (DIP 2021.01.13).
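To make the nominal-effort metric concrete, here is a small Python sketch of the multiplication described above, plus a roll-up by country and year. The field names (`country`, `year`, `engine_power_kw`, `days_at_sea`) and the kilowatt unit are assumptions for illustration and may not match the columns or units in the published files.

```python
from collections import defaultdict


def nominal_effort(engine_power_kw, days_at_sea):
    """Nominal effort: engine power multiplied by days at sea (kW-days, assumed unit)."""
    if engine_power_kw < 0 or days_at_sea < 0:
        raise ValueError("engine power and days at sea must be non-negative")
    return engine_power_kw * days_at_sea


def effort_by_country_year(records):
    """Sum nominal effort over records, grouped by (country, year)."""
    totals = defaultdict(float)
    for record in records:
        key = (record["country"], record["year"])
        totals[key] += nominal_effort(record["engine_power_kw"], record["days_at_sea"])
    return dict(totals)
```

Under these assumptions, a fleet segment with 250 kW of engine power that spends 120 days at sea contributes 30,000 kW-days of nominal effort.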
Medicaid offices, geocoded. Paul R. Shafer et al. have created a dataset of 3,000+ Medicaid offices in the US, identified via state and county government websites. The team of Boston University researchers, who focused on “public-facing Medicaid offices providing enrollment support,” have provided each office’s agency name, state, city, address, and latitude/longitude coordinates (primarily sourced via the US Census Bureau’s geocoder).
Seine water quality. Ahead of the 2024 Summer Olympics, Paris has been trying to decontaminate the Seine river to swimmable levels. But the city’s efforts appear to be falling short, according to government water testing data obtained, published, and analyzed by Mathieu Lehot-Couette, a reporter at Franceinfo.fr. The records include results from periodic samples taken at 14 points along the river, which Lehot-Couette has standardized into a spreadsheet of 1,400+ measurements of E. coli and enterococci between 2015 and 2023.
Human heights. Economic historians Jörg Baten and Matthias Blum have assembled a dataset on average male heights by decade and country. The estimates, derived from hundreds of scholarly and statistical sources, stretch back several centuries and span 140+ countries. A related resource page also provides individual-level data compiled by Baten and others, such as the heights of 1,000+ 19th-century Bavarian military conscripts. [h/t Karsten Johansson]
Dataset suggestions? Criticism? Praise? Send head-to-toe feedback to jsvine@gmail.com, or just reply to this email. // Looking for past datasets? This spreadsheet contains them all. // Visit data-is-plural.com to subscribe and to browse past editions.
]]>