Cracking the Privacy Code: Navigating the Condor Dataset while Safeguarding User Identities

Laurie Robinson - April 28, 2023

A Q&A with INFO Assistant Professor Cody Buntain

As privacy expectations evolve and regulations tighten, data providers strive to balance access to large-scale datasets with privacy guarantees for individuals. Enter differential privacy techniques, which inject random noise into datasets in order to keep personal identities hidden. But this method poses a challenge for researchers who use these datasets: analyses can be biased or incorrect if differential privacy isn’t taken into account.

A prime example of a differential privacy dataset can be found in the collaboration between Facebook and the Social Science One consortium, which created the “Condor” dataset. This colossal collection boasts 63.5 million links shared on Facebook and their differential-privacy-protected engagement data. A new paper from College of Information Studies (INFO) Assistant Professor Cody Buntain and others offers guidance on using this dataset to calculate ideological positions of a given hyperlink or web domain’s audiences based on protected engagement data. Unlike previous studies that relied on highly sensitive data, this paper’s metric offers a more privacy-preserving approach.

We sat down with Buntain to learn more about his research.

What inspired your research?

This research stems from two main points. First, a lot of work has studied ideology measures in spaces like Twitter, Facebook, and Reddit, but data sparsity issues often mean we can’t measure ideological lean for a given URL. Most people aren’t sharing a particular article on the NYTimes or Fox News. Instead, we end up making these measures at the domain/outlet level, which can be too coarse for some cases. For example, making a claim about YouTube’s overall political lean is useful, but channel- and video-level measures are often more so. The Condor dataset, being the largest release of engagement and link-sharing data ever, provides a solution to this data-sparsity problem.

So that’s the first point. The second inspiration for this work is that we really didn’t know how much of an effect differential privacy and the noise injected into the Condor dataset would have. A major complaint among researchers with access to Condor was that the noise and its implications weren’t well understood, so much of the data could be useless. Since there’s a lot of work on ideology in other contexts, we have a strong foundation for comparison and could get some insight into how much data is actually useful in Condor.

Could you elaborate on the simple metric you used to measure ideological positions, and how it is designed to be robust against privacy-preserving noise?

Absolutely. The metric is actually deceptively simple. Condor provides estimates on the number of individuals with a particular ideological lean who have engaged with a URL. These estimates cover five bins, from -2 for far left to +2 for far right. Given these estimates, we simply estimate the average ideological bin of the audience engaging with a link or domain. If most people sharing a link come from the far-right bin, or +2, the average ideological score should be close to 2.
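As a rough illustration of the averaging described above, the following Python sketch computes a count-weighted mean over the five bins. It is not the paper's code: the function name `mean_ideology`, the `BINS` constant, and the example counts are assumptions made for this illustration, and the handling of a non-positive total is one possible choice for noisy data.

```python
from typing import Sequence

# Bin labels as described in the interview: -2 (far left) to +2 (far right).
BINS = (-2, -1, 0, 1, 2)


def mean_ideology(counts: Sequence[float]) -> float:
    """Return the count-weighted average bin for one link or domain.

    `counts` holds the (possibly noisy) engagement estimate for each bin,
    ordered from far left to far right. This is a hypothetical helper.
    """
    if len(counts) != len(BINS):
        raise ValueError("expected one count per ideological bin")
    total = sum(counts)
    if total <= 0:
        # Noise can push estimates to zero or below; the average is then undefined.
        raise ValueError("total engagement must be positive")
    return sum(b * c for b, c in zip(BINS, counts)) / total


# Made-up counts for illustration: an audience concentrated in the far-right bin.
mostly_far_right = [1, 2, 3, 10, 84]
print(mean_ideology(mostly_far_right))
```

An audience drawn entirely from the far-right bin scores exactly 2, and an audience spread evenly across all five bins scores 0.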

To ensure this measure is robust against noise, we need to put some constraints on how much engagement we need to observe for the engagement signal to overwhelm the privacy-protecting noise added to the data. We estimate this minimum value through simulation and have developed a model to predict it. Using this model, researchers can check, given the Condor noise settings, how popular a link needs to be for us to have confidence the measure will be well-behaved.
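The idea of checking robustness through simulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model from the paper: it assumes zero-mean Gaussian noise added independently to each bin count, and the noise scale `sigma`, the audience counts, and the function names are all hypothetical rather than Condor's actual settings.

```python
import random

BINS = (-2, -1, 0, 1, 2)


def score(counts):
    """Count-weighted average bin; assumes the total is positive."""
    total = sum(counts)
    return sum(b * c for b, c in zip(BINS, counts)) / total


def mean_abs_error(true_counts, sigma, trials=2000, seed=0):
    """Average absolute gap between the noisy and the true score.

    Assumption: independent Gaussian noise with standard deviation
    `sigma` is added to every bin count. Trials whose noisy total is
    not positive are skipped because the score is undefined there.
    """
    rng = random.Random(seed)
    truth = score(true_counts)
    errors = []
    for _ in range(trials):
        noisy = [c + rng.gauss(0.0, sigma) for c in true_counts]
        if sum(noisy) <= 0:
            continue
        errors.append(abs(score(noisy) - truth))
    return sum(errors) / len(errors) if errors else float("inf")


# The same audience shape at two popularity levels (made-up numbers).
small = [5, 10, 15, 30, 40]            # 100 engagements in total
large = [500, 1000, 1500, 3000, 4000]  # 10,000 engagements in total

for name, counts in (("small", small), ("large", large)):
    print(name, mean_abs_error(counts, sigma=10.0))
```

Under these assumptions, the more popular link shows a much smaller error at the same noise level. Repeating the run over a range of totals would indicate roughly where the score becomes stable for a given noise scale.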

When applying this metric to individual links from popular news domains, what were some interesting patterns and observations related to audience distributions?

First, we find that the level of noise Facebook injects is quite high, which makes it difficult to produce reliable estimates for a lot of the content.

Second, for the domains where we can apply this metric, we find strong correlation with existing studies of domain-level ideological lean, which makes for good confirmation that our method is working as expected.

Third, the real value this dataset has comes from the link-level measures we can draw. In that context, we measure links from several popular domains and show the audience distributions at the link level. The interesting thing here is that, while the NYTimes has a generally left-leaning audience, some of their articles get traction among the political right. Likewise for YouTube, we find that the majority of YouTube videos shared see engagement across the ideological spectrum, somewhat countering the argument that YouTube is a bastion of far-right content.

Finally, we also show that share-based metrics correspond highly to view-based metrics, which is important for us researchers, as we generally only have access to sharing behavior.

How would you foresee this work benefiting contemporary discussions around audience ideology, social media, and privacy?

First, this work gives us better insight into how differential privacy impacts our ability to use the data Facebook makes available. Our finding that the vast majority of the dataset might have its signal swamped by the noise is important evidence that Facebook’s privacy protections need to be revisited to better balance the needs of the research community.

Second, the link-level estimates we *can* get from this work can give us useful insight that we wouldn’t be able to get at the aggregate level. For instance, given news outlets, we can use this approach to identify articles that gain traction on both sides of the political spectrum or characterize what content is receiving a lot of engagement just from the political opposition. These insights are needed as we work to address polarization in modern information spaces.

What is your overall vision for the future of audience ideology measurement and web engagement research?

This effort gives us insight into how ideological audiences engage with domains and links, which is useful, but we need to extend this kind of work to other forms of media. As visually oriented platforms like YouTube, TikTok, and Instagram come to dominate the information space, we need similar tools to help understand what kinds of visual content, like images and video, gain traction among which audiences. I and others are already working on methods to help us understand these questions.