Data Is Plural

... is a weekly newsletter of useful/curious datasets.

2017.12.06 edition

Workplace sexual harassment complaints, financial consumer complaints, students’ lives, Brazilian news outlets, and one family’s spending.

Two decades of workplace sexual harassment complaints. My colleague Lam Thuy Vo obtained an anonymized dataset listing all 170,000+ sexual harassment claims submitted to the U.S. Equal Employment Opportunity Commission between October 1995 and September 2016. For each claim, the dataset indicates the date the complaint was filed, the complainant’s gender, and the general category of employer. Additional fields — available for most claims, but not all — indicate the complainant’s birthdate, race, and national origin, as well as the employer’s industry and approximate number of workers. Related: Lam’s story and interactive graphics, which place the data in context.

Financial consumer complaints. The Consumer Financial Protection Bureau’s consumer complaint database can be searched online, accessed via an API, and downloaded in bulk. The 915,000+ complaints the Bureau has received have been categorized into 18 financial product groups (e.g., mortgages, debt collection, student loans, cryptocurrency) and more than 160 kinds of issues (e.g., billing disputes, communication tactics, privacy). The agency says they “don’t verify all the facts alleged in these complaints,” but that they “take steps to confirm a commercial relationship between the consumer and the company.” [h/t Dan Brady]

The StudentLife Study. Back in 2013, four dozen Dartmouth College students agreed to let a custom smartphone app surveil them for the StudentLife Study. During the 10 weeks of the spring academic term, the app collected data on the students’ physical activity, GPS coordinates, eating schedule, sleep habits, phone usage, and more. The study combined all that information with a slew of other data, including the students’ class deadlines, academic performance, and their responses to surveys about stress, depression, personality, and sleep quality. The study’s public (and anonymized) dataset clocks in at 53 gigabytes. Related: “Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities,” a recently released academic paper that uses the StudentLife dataset. [h/t Konrad Kording]

5,000+ Brazilian news outlets. Atlas da Notícia is a Brazilian project that aims to collect data on all local and regional news outlets in the country. Last month, the project released its first batch of data, which identified 5,354 newspapers and online publications in a total of 1,125 municipalities. The raw dataset is currently only available in Portuguese, but the aggregate tables have been translated into English. [h/t Sérgio Spagnuolo]

One family’s spending. An anonymous married couple has decided “to be completely open about [their] finances so that people can see what an actual family’s budget looks like.” In addition to blogging about their financial habits, they’ve also published a spreadsheet of “(almost) every dollar” they spent between December 2015 and November 2017. For each transaction, the dataset provides the date, dollar amount, category (e.g., “Groceries”), and meta-category (e.g., “Food”).
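
For readers who want to explore a transaction list shaped like this one, here is a minimal sketch in Python that totals spending by meta-category. The column names (date, amount, category, meta_category) and the sample rows are assumptions invented for illustration; they are not taken from the couple's spreadsheet, whose actual headers may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows, invented for illustration only.
# The real spreadsheet's column names and values may differ.
SAMPLE_CSV = """date,amount,category,meta_category
2017-01-03,54.10,Groceries,Food
2017-01-05,23.50,Restaurants,Food
2017-01-09,40.00,Gas,Transportation
"""


def totals_by_meta_category(csv_text):
    """Sum the dollar amounts for each meta-category in the CSV text."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["meta_category"]] += float(row["amount"])
    # Round to cents to hide floating-point noise in the sums.
    return {name: round(total, 2) for name, total in totals.items()}


totals = totals_by_meta_category(SAMPLE_CSV)
for name, total in sorted(totals.items()):
    print(f"{name}: ${total:.2f}")
```

Swapping "meta_category" for "category" in the function would give the finer-grained breakdown instead.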