A Murky World: Navigating Ethical Considerations in Data-Driven Research

Laurie Robinson - January 8, 2024

INFO researchers developed a tool to help data scientists wade through ethical issues

A 3D compass needle points to the word “Integrity”

In 2017, researchers from Stanford University created a deep neural network, the quintessential power player in the field of artificial intelligence (AI), that outperformed humans at a rather controversial task: discerning a person’s sexual orientation merely by analyzing facial imagery. Trained on vast datasets of images, such models scrutinize nuanced patterns and features within the human face, subtleties often imperceptible to the human eye.

Katie Shilton and Jessica Vitak, both associate professors at the University of Maryland (UMD) College of Information Studies (INFO) and co-principal investigators (Co-PIs) of PERVADE: Pervasive Data Ethics for Computational Research, wonder about the ethics of this study. 

“These types of studies are often motivated by a sense of ‘we want to prove it can be done’ without any consideration for if it should be done,” says Vitak. 

According to Vitak and Shilton, this is not a good reason to conduct research. Inferring sexual orientation from public data draws on pseudoscience associating physiology with binary sexual identities and risks downstream harm. “What they didn’t consider is that in some parts of the world being determined to be gay is illegal, and you can go to jail or worse,” says Vitak. Further, they say, the public availability of data doesn’t equate to consent for all types of research use.

PERVADE, a six-campus research project that ran from 2016 to 2023, addressed ethical issues arising from the expansive use of personal data in computational research. The project brought together researchers from fields including computer science, sociology, legal studies, and information science to study how pervasive data (information about individuals collected as they interact with digital technologies) affects individuals and society.

The PERVADE team examined how different stakeholders, including data subjects, researchers, and regulatory bodies, perceive and are impacted by the collection and analysis of pervasive data. With these insights, the project developed guidelines, educational materials, and tools to promote ethical practices in working with pervasive data. 

One of the educational materials the researchers created is the Data Ethics Decision Support Tool. “We went in knowing there were some big, unanswered questions. We did a survey of social media researchers and no one agreed on research ethics in this space,” says Shilton. 

Through studies involving social media users, Institutional Review Board (IRB) members, and data scientists, they concluded that ethical considerations hinge on factors specific to each project, such as data sources, research questions, user expectations about data reuse, and power dynamics between researchers and subjects. To help researchers navigate those factors, they developed a scalable resource that guides researchers through the ethical decision-making process, providing relevant materials to address their unique concerns.

How the Tool Works

The Data Ethics Decision Support Tool is an online quiz that walks researchers through a series of questions, providing per-question guidance and a personalized summary score. The per-question feedback offers insight into the field’s norms and highlights controversial topics that lack definitive answers. It prompts researchers to consider multiple aspects of their work and encourages them to think through and articulate the ethical implications of their research methodologies, even if they do not change their practices. For example, a researcher might confront the ethical debate around scraping data against a platform’s terms of service; the tool suggests that they weigh the risks, justify their decision, and be aware of potential legal and reputational consequences.

Once a respondent has gone through all the questions, the summary score categorizes their project into a “mode”: easy, moderate, or difficult. The purpose of this score is not to discourage the project but to indicate how much effort is needed to thoroughly address and communicate the ethical decisions made. The goal is to prompt researchers to be transparent and deliberate about their choices.
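PERVADE has not published the tool’s internal logic, but the flow described above (per-question feedback plus a summary score that maps to a mode) can be sketched in a few lines of Python. Everything in the sketch below, from the questions and weights to the mode thresholds, is a hypothetical stand-in rather than the tool’s actual content.

```python
# Minimal sketch of a questionnaire with per-question feedback and a
# weighted summary score mapped to a "mode" (easy / moderate / difficult).
# All prompts, weights, thresholds, and feedback strings are hypothetical
# illustrations, not content from the actual Data Ethics Decision Support Tool.

from dataclasses import dataclass

@dataclass
class Question:
    prompt: str    # yes/no question posed to the researcher
    weight: int    # ethical complexity a "yes" answer adds to the score
    feedback: str  # per-question guidance shown after answering

QUESTIONS = [
    Question("Does your collection conflict with a platform's terms of service?",
             2, "Scraping against terms of service is contested; weigh legal and reputational risk."),
    Question("Could your analysis infer sensitive traits about data subjects?",
             3, "Inferred sensitive traits can cause downstream harm; justify the need."),
    Question("Would data subjects be surprised by this reuse of their posts?",
             2, "Public availability is not consent; consider subjects' expectations."),
]

def summary_mode(answers: list[bool]) -> str:
    """Map yes/no answers to a summary mode; thresholds are illustrative."""
    score = sum(q.weight for q, yes in zip(QUESTIONS, answers) if yes)
    if score <= 2:
        return "easy"
    if score <= 4:
        return "moderate"
    return "difficult"

if __name__ == "__main__":
    answers = [True, False, True]  # example responses to the three questions
    for q, yes in zip(QUESTIONS, answers):
        if yes:
            print(f"Guidance: {q.feedback}")
    print("Summary mode:", summary_mode(answers))  # -> "moderate" (score 4)
```

In the real tool, the per-question guidance is qualitative, pointing to field norms and open debates, rather than a simple numeric weight; the sketch only illustrates how answers could roll up into an easy, moderate, or difficult mode.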

The tool challenges the tendency in data science to draw complex social conclusions from simple datasets, such as photographs, that may not reflect the true spectrum of social identities, including the non-binary nature of human sexuality and gender. The tool prompts data scientists to think beyond binary categories and integrate additional knowledge that can’t be obtained from data scraping. 

It also addresses ethical concerns regarding the use of public data. Researchers are urged to reflect on the intent and expectations of individuals sharing information online, recognizing that such data usage may not align with the participants’ own reasons for posting.

Challenges and Future Considerations 

Last year, the Data Ethics Decision Support Tool was demonstrated at several conferences, gaining significant support from communities that recognize the importance of discussing its themes. However, convincing other groups poses a challenge. 

When Vitak demoed the tool at a computational conference last year, she encountered some attendees who were “just kind of scratching their heads” as she went through the tool’s features. She adds, “These are conversations they’ve never had around their research. There’s going to be a much harder time convincing them that this stuff is important.” 

These communities—where norms around discussing ethics at all stages of research have yet to be established—may benefit greatly from the tool but are harder to persuade. More effort is needed to engage them and conduct user testing, which the PERVADE researchers are seeking funding for.

The team is considering comparing the tool with more resource-intensive, expert-consultation models like Stanford University’s Ethics and Society Review (ESR) process for AI projects. 

“The challenge is how do we get the information that those experts have in their heads to these folks that have the questions,” says Shilton. Research is ongoing to explore how best to disseminate expert knowledge through the tool to those needing guidance, and to evaluate whether educational resources can be as effective as direct expert consultation. 

While the tool supports research and knowledge discovery using big data, it needs continual updating and adaptation. For example, the team would like to add components addressing the use of data for generating new content (generative AI). The team hopes to create versions of the tool tailored to specific communities, such as one for AI development covering prediction, modeling, and generation, and another designed for health researchers. Vitak (along with two other PERVADE Co-PIs) is also pursuing funding to develop a data ethics curriculum for undergraduate and graduate computer science students, including expanding the tool into training environments for future data scientists.

“We think that this tool has so much potential, but we’ve got to find ways to sustain and adapt it over time,” Shilton says. “We’re excited to work with broad data science communities to figure out that sustainability.”