Privacy in Machine Learning is a complicated subject. Let’s look at why.
Back in 2012, US retail chain Target created an algorithm that could work out if customers were pregnant based on their past purchasing behaviour. After analysing historical data, they used the results for a direct marketing campaign, snail-mailing coupons for baby products to the home addresses of likely new-mums-to-be.
It was an impressive feat of predictive analytics — but also problematic. One teenage customer hadn’t yet told her father about the pregnancy. He found out when he saw the baby coupons in the family’s morning post.
Before they shut it down in 2010, Netflix’s annual prize for the algo that most improved their film recommendation system raised similar concerns. Two researchers at the University of Texas at Austin were able to identify individual users just by matching their anonymised Netflix ratings with public ratings on the Internet Movie Database (IMDB).
While it generated some headlines at the time, Netflix only cancelled the planned follow-up competition after a class-action privacy lawsuit and an inquiry from the US Federal Trade Commission.
Accidental leaks and unintended consequences
With everything that’s happened in machine learning since then, our power to pinpoint individuals based on their past purchases, opinions, likes, shares, posts, and physical movements has only grown.
In 2018, the Strava fitness app found itself at the centre of a PR shitstorm after releasing a searchable map of publicly logged runs and bike rides. The app uses mobile phone GPS tracking to map routes and calculate speeds, distances, and calories burned. Unfortunately, that data also made it possible for researchers to locate sensitive military sites in Afghanistan and Syria.
It turns out many soldiers go for runs just inside the perimeter fence of their compounds, creating route maps with identifiable square shapes, curiously located in remote areas without a town or other human habitation.
Politicians and regulators have taken notice and taken action.
I don’t need to tell anyone here about GDPR, HIPAA, or the CCPA. Now that facial recognition tech has taken off — with mixed results and concerns about racial bias in the data sets — cities all across the US have restricted or banned its use. From Oakland and San Francisco on the west coast to Massachusetts cities like Brookline, Cambridge, and Somerville on the east.
California, Oregon, and New Hampshire have all implemented state-wide bans on facial recognition in police body cameras.
Wait to be regulated, or get ahead of the game?
So you can see the problem. To an outsider, ML’s power looks like a double-edged sword, capable of achieving great things but also loaded with potential for abuse.
Experiments in deep learning have shown that, even when sensitive or personally identifiable information (PII) is removed from a data set, ML algorithms can re-infer it. Unless specific privacy-preserving techniques are used, trained models will capture information that can reveal identities.
In one study, University of Wisconsin researchers could pull genomic information for individuals from an ML model trained to predict medicine dosages. Scientists in a Carnegie Mellon-backed research project were able to reconstruct headshot images from a model trained to perform facial recognition.
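The mechanics behind attacks like these are easier to see in a toy setting. The sketch below is purely illustrative (it reproduces neither study’s setup): an unregularised logistic regression is trained until it memorises random data, after which training points sit at conspicuously lower loss than unseen points. That gap is exactly the signal a membership-inference attack exploits to decide whether a given record was in the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: more features than training points, so the model can memorise.
n_train, n_test, d = 20, 20, 50
X_train = rng.normal(size=(n_train, d))
y_train = rng.integers(0, 2, size=n_train).astype(float)
X_test = rng.normal(size=(n_test, d))
y_test = rng.integers(0, 2, size=n_test).astype(float)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -30, 30)))

# Plain logistic regression, no regularisation, trained to (over)fit.
w = np.zeros(d)
for _ in range(2000):
    p = sigmoid(X_train @ w)
    w -= 0.5 * X_train.T @ (p - y_train) / n_train

def per_example_loss(X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

train_loss = per_example_loss(X_train, y_train)
test_loss = per_example_loss(X_test, y_test)

# Members of the training set have much lower loss than non-members —
# an attacker with model access can threshold on this to infer membership.
print(train_loss.mean(), test_loss.mean())
```

Real attacks are more sophisticated, but the underlying leak is the same: the model behaves measurably differently on data it has seen.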
Worrying, but is focusing on these sorts of deep learning applications missing the point? Machine learning might amplify concerns about data privacy but maybe holds the key to solving them too.
Privacy Enhancing Technologies (PETs) could enable a new kind of privacy-preserving machine learning by protecting data throughout its life cycle.
A lot has been written about technologies like homomorphic encryption, which allows computation to run directly on encrypted data — an ML model can evaluate inputs and return results without ever seeing either in the clear — arguably eliminating privacy concerns.
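To make the idea concrete, here is a minimal sketch of an additively homomorphic scheme — a textbook Paillier construction with deliberately tiny, insecure primes, chosen here purely for illustration. Two values are encrypted, their ciphertexts are combined, and decryption yields their sum, computed without the inputs ever being decrypted:

```python
import math
import random

# Toy Paillier cryptosystem. The primes are far too small for real use;
# they only keep the arithmetic readable.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)

def L(x):
    return (x - 1) // n

# Decryption constant: modular inverse (requires Python 3.8+).
mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts.
c1, c2 = encrypt(12), encrypt(30)
total = decrypt((c1 * c2) % n2)
print(total)  # → 42, recovered without decrypting c1 or c2 individually
```

Paillier only supports addition (and multiplication by plaintext constants); fully homomorphic schemes that support arbitrary computation exist but remain orders of magnitude slower, which is the heart of the practicality question below.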
But is it practical? Regulation is already throwing up administrative barriers that slow down ML projects and complicate collaboration. The cure for privacy concerns shouldn’t be worse than the disease.
We spoke to Jean-François Rajotte, a researcher and resident data scientist at the University of British Columbia, about his experiences dealing with healthcare data privacy.
‘We noticed very quickly that there was this big need for data exchange within research teams, but it was such a difficult thing to do. We had to jump through all the normal hoops of securing approval and getting signatures on NDAs. But whenever new collaborators joined, they had to have a special device connected to their laptop or only use a computer that was on-premise if they wanted to access data. So that can be very limiting.’
He says they’re looking at the use of synthetic data as a way to get over those hurdles.
‘We’ve started experimenting with emerging technologies like generative adversarial networks (GANs) as a way to generate synthetic data. Those sets can be shared more easily.’
He notes, however, that even synthetic data generation needs a level of privacy control.
‘When you create synthetic data, you might leak some information or perhaps leave in some PII data when you train your generator.’
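One crude way to catch the kind of leak Rajotte describes is to check whether any synthetic record is a near-copy of a real training record. The helper below is a hypothetical sketch — the function name, threshold, and data are ours, not from the interview — using a nearest-neighbour distance check:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for a real training set and a generator's synthetic output.
train = rng.normal(size=(100, 5))
synthetic = rng.normal(size=(50, 5))
synthetic[0] = train[42]  # simulate a memorised (leaked) training record

def min_distance_to_train(synthetic, train):
    # For each synthetic row, the Euclidean distance to its nearest training row.
    d = np.linalg.norm(synthetic[:, None, :] - train[None, :, :], axis=2)
    return d.min(axis=1)

dists = min_distance_to_train(synthetic, train)
leaked = np.where(dists < 1e-6)[0]
print(leaked)  # → [0]: the copied record is flagged
```

Checks like this only catch verbatim memorisation; subtler leakage (e.g. a generator reproducing a rare combination of attributes) needs formal guarantees such as differentially private training.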
Can more technology solve a technology-driven problem? Time will tell. But if the approach works in practice, it could free up teams to share data and collaborate without breaching regulatory rules or threatening individual privacy.