Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2024.02.21 edition

Self-driving stats, US wetlands, SF overdose deaths, Spain’s political primaries, and Scott’s SMS spam.

Self-driving stats. Companies wanting to test or deploy self-driving cars on California’s public roads must receive a permit from the state DMV’s Autonomous Vehicles Program. As part of the requirements, these companies must “submit annual reports to share how often their vehicles disengaged from autonomous mode during tests (whether because of technology failure or situations requiring the test driver/operator to take manual control of the vehicle to operate safely).” These disengagement reports indicate each vehicle’s company, permit number, VIN, monthly miles driven, and annual disengagements — including vehicles with none. And for each disengagement, the reports list the vehicle, date, disengagement initiator (vehicle, test driver, remote operator, or passenger), type of location, and a brief summary. [h/t Chartr]

US wetlands. The National Wetlands Inventory, maintained by the US Fish and Wildlife Service, provides interactive maps and bulk data containing “geospatially referenced information on the status, extent, characteristics and functions of wetland, riparian, deepwater, and related aquatic habitats.” With contributions from 160+ organizations, coordinated through a dedicated national standard, the inventory represents “more than 35 million wetland and deepwater features,” and continues to receive updates. As seen in: DIP reader Simon Greenhill and coauthors’ paper, dataset, explainer video, and map using machine learning to predict wetland and river protections under the Clean Water Act.

SF overdose deaths. Last year, 813 people died from accidental drug overdoses in San Francisco, according to the latest figures from the city’s Office of the Chief Medical Examiner. That’s the highest annual total on record, per the San Francisco Chronicle’s overdose data tracker, which combines data from the medical examiner’s reports with information from other city agencies, such as overdose reversals by EMS responders administering naloxone and calls handled by the Street Overdose Response Team. As seen in: The Chronicle data team’s latest newsletter.

Spain’s political primaries. Oscar Barberà et al. have compiled a dataset of the outcomes of 361 primary contests in Spain since 1991, “based on information provided by political parties at the time of the event through their websites, press releases, and journalistic reports.” It includes primaries to determine candidates for regional, national, and EU elections, as well as for party leadership. For each contest, it lists the party, territory, year, type of post, number of competitors, turnout, incumbent’s outcome, winner’s percentage, and more.

Scott’s SMS spam. In late September 2022, Scott Lee Chua began preserving every SMS he received, motivated by the recent passage of the Philippine SIM Registration Act. By early February 2024, he’d amassed 3,324 messages, which he has grouped into five categories and charted: spam (13% of all messages), one-time passwords (12%), marketing (10%), government notices (1%), and “messages I both expect and welcome” (63%). He’s also published a partially redacted dataset of each message’s time received, time read, sender, text, and category.