The border fence. There are about 700 miles of official fencing between the U.S. and Mexico, covering about one-third of the full border. The Department of Homeland Security doesn’t provide structured spatial data about the fence’s path. But, thanks to a Texas law professor’s FOIA request and some serious elbow grease, reporters at Reveal have created “the most detailed border fence map publicly available.” For each segment of fence, Reveal’s dataset includes the fence type (i.e., pedestrian, vehicle, or unknown), the government’s name for the segment, and the project through which the segment was built.
Insurance premiums and payouts. Last month, ProPublica and Consumer Reports published an analysis of car insurance costs in four states, finding that “some major insurers charge minority neighborhoods as much as 30 percent more than other areas with similar accident costs.” The reporters also published a detailed methodology and dataset supporting their findings. The dataset contains company-by-company insurance premiums for a (hypothetical) college-educated, excellent-credit, accident-free 30-year-old woman in each of 6,261 ZIP codes in the four states — California, Texas, Missouri, and Illinois. The dataset also includes several years of average (per-car) insurance payouts for each ZIP code, which the reporters obtained from state insurance commissioners. Related: The insurance industry’s rebuttal and ProPublica’s counter-rebuttal.
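To make the premium-versus-payout comparison concrete, here is a minimal Python sketch. It is not ProPublica’s methodology. The `ZipRecord` fields, the function names, and every number are hypothetical stand-ins, and the real dataset’s column names may differ.

```python
from dataclasses import dataclass


@dataclass
class ZipRecord:
    """One ZIP code's figures. All field names are hypothetical."""
    zip_code: str
    premium: float     # assumed: one insurer's annual premium quote
    avg_payout: float  # assumed: average per-car payout in the ZIP


def premium_to_payout_ratio(rec: ZipRecord) -> float:
    """Premium charged per dollar of average payout."""
    if rec.avg_payout <= 0:
        raise ValueError("avg_payout must be positive")
    return rec.premium / rec.avg_payout


def percent_higher(a: float, b: float) -> float:
    """How much higher a is than b, in percent."""
    if b == 0:
        raise ValueError("b must be non-zero")
    return 100 * (a - b) / b


# Made-up numbers: two ZIP codes with identical average payouts
# but different premiums.
zip_a = ZipRecord("ZIP_A", premium=1300.0, avg_payout=250.0)
zip_b = ZipRecord("ZIP_B", premium=1000.0, avg_payout=250.0)

gap = percent_higher(premium_to_payout_ratio(zip_a),
                     premium_to_payout_ratio(zip_b))
print(f"ZIP_A pays {gap:.0f}% more per dollar of payout than ZIP_B")
```

Comparing ratios rather than raw premiums is one simple way to hold accident costs roughly constant. The published methodology is the place to look for how the reporters actually controlled for risk.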
Three million grocery orders. Groceries-on-demand startup Instacart has released a dataset containing 3 million orders from 200,000 (anonymized) users. “For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order,” the company’s head of data science writes. “We also provide the week and hour of day the order was placed, and a relative measure of time between orders.” Here’s the data dictionary.
What do you do with a PhD in science? The National Science Foundation’s Survey of Doctorate Recipients “is a longitudinal biennial survey conducted since 1973 that provides demographic and career history information about individuals with a research doctoral degree in a science, engineering, or health (SEH) field from a U.S. academic institution.” You can download aggregated data and detailed survey responses going back to 1993. The next release is scheduled for this month. Related: The NSF has published an interactive graphic of the data. [h/t Peter Aldhous]
*Such* an important dataset. Grad students in Princeton’s computer science department have published a dataset they call the Self-Annotated Reddit Corpus, or “SARC” for short. “The corpus has 1.3 million sarcastic statements — 10 times more than any previous dataset,” the authors write. The corpus takes advantage of Reddit users’ habit of tagging sarcastic comments with an “/s”. Related: A dataset of sarcastic Amazon reviews. [h/t Carlos Somohano + Reddit user cavedave]
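For a rough sense of how the “/s” convention can serve as a label, here is a small Python sketch. It is not the SARC authors’ pipeline. The function name and the rule (a trailing “/s” preceded by whitespace or the start of the comment, matched case-insensitively) are assumptions made for illustration.

```python
import re
from typing import Tuple

# Assumption: a comment is self-annotated as sarcastic when it ends
# with "/s" as its own token (start of text or whitespace before it).
_SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$", re.IGNORECASE)


def split_sarcasm_tag(comment: str) -> Tuple[str, bool]:
    """Return (comment text without the tag, whether a tag was found)."""
    match = _SARCASM_TAG.search(comment)
    if match is None:
        return comment.strip(), False
    return comment[:match.start()].strip(), True


examples = [
    "Oh great, another meeting /s",
    "The docs live at example.com/s",  # URL-like ending, not a tag
    "Totally agree with this.",
]
for text in examples:
    print(split_sarcasm_tag(text))
```

Requiring whitespace before the tag keeps URL-like endings from being counted as sarcasm. A real corpus would likely need further filtering for noisy or missing tags.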