Indian court cases. The Development Data Lab has gathered data on 81.2 million court cases in India’s lower judiciary between 2010 and 2018, drawn from the country’s e-Courts platform. It’s “the largest open-access dataset on judicial proceedings in the world,” says project coauthor Aditi Bhowmick. The public dataset contains each case’s state, district, court, case type, filing and decision dates, defendant and petitioner genders, legal codes, and more. It “has been fully anonymized to prevent the identification of individual judges or litigants,” but researchers can apply for more extensive access. [h/t Shruti Rajagopalan]
Machine learning papers, code, and datasets. Papers with Code cross-references machine learning papers with their datasets, code, and results. The project has connected 56,000+ of those papers to specific code repositories, and assembled a meta-dataset of 3,000+ relevant datasets. The records can be downloaded in bulk or fetched via API. Related: A recent study examines 133 facial-recognition datasets created since 1976, with further coverage in MIT Technology Review. Also related: Exposing.ai, which lets you “check if your Flickr photos were used to build face recognition.” [h/t Karsten Johansson]
Foreign ministers. Hanna Bäck et al. have built a dataset of 1,000+ foreign ministers: all officials holding such a post between 1789 and the mid-2010s in “the world’s 13 former and current great powers.” It describes their stints in office and biographical details, including marital status, occupational experience, education level, military service, and more. [h/t Alex Quiroz Flores]
Spanish pardons. To build El Indultómetro/The Pardonometer, the Civio Foundation has “collected, scraped and classified all the information contained in the [Official State Gazette] on pardons granted in Spain since 1996.” You can browse the 10,000+ pardons and commutations online or download the dataset, which describes the initial charges, form of relief, relevant dates, and more. Related: Civio’s (Spanish-language) methodology and reporting on the topic. [h/t Olaya Argüeso Pérez]
O’Reilly animals. More than 1,000 illustrated animals have graced the covers of O’Reilly Media’s technical books. The publisher hosts an online “menagerie” where you can browse the pairings; it doesn’t provide downloads, but brian d foy, author of several O’Reilly books, has written a Perl tutorial on how to scrape it and has shared the results.