Some of the ways Big Tech companies are feeding your personal data to AI feel like a privacy violation — or even theft
Your email is just the start. Meta, owner of Facebook, took a billion Instagram posts from public accounts to train an AI, and didn’t ask permission. Microsoft uses your chats with Bing to coach the AI bot to better answer questions, and you can’t stop it.
Increasingly, tech companies are taking your conversations, photos and documents to teach their AI how to write, paint and pretend to be human. You might be accustomed to them using your data to target you with ads. But now they’re using it to create lucrative new technologies that could upend the economy — and make Big Tech even bigger.
We don’t yet understand the risks this behavior poses to your privacy, reputation or work. And there’s not much you can do about it.
Sometimes the companies handle your data with care. Yet often, their behavior is out of sync with common expectations for what happens with your information, including stuff you thought was supposed to be private.
If you’re using pretty much any of Big Tech’s buzzy new AI products, you’ve likely been compelled to agree to help make their AI smarter through a “data donation.” (That’s Google’s actual term for it.)
Lost in the data grab: Most people have no way to make truly informed decisions about how their data is being used. That can feel like a privacy violation — or just like theft.
“AI represents a once-in-a-generation leap forward,” says Nicholas Piachaud, a director at the open source nonprofit Mozilla Foundation. “This is an appropriate moment to step back and think: What’s at stake here? Are we willing just to give away our right to privacy, our personal data to these big companies? Or should privacy be the default?”
It isn’t new for tech companies to use your data to train AI products. Netflix uses what you watch and rate to generate recommendations. Facebook uses everything you like and comment on to train its AI how to order your news feed and show you ads.
Yet generative AI is different. Today’s AI arms race needs lots and lots of data. Elon Musk, owner of Twitter and chief executive of Tesla, recently bragged to his biographer that he had access to 160 billion video frames per day, shot from the cameras built into people’s Teslas, to fuel his AI ambitions.
“Everybody is sort of acting as if there is this manifest destiny of technological tools built with people’s data,” says Ben Winters, a senior counsel at the Electronic Privacy Information Center, who has been studying the harms of generative AI. “With the increasing use of AI tools comes this skewed incentive to collect as much data as you can upfront.”
All of this brings some unique privacy risks. Training an AI to learn everything about the world means it also ends up learning intimate things about individuals.
Some tech companies even acknowledge that in their fine print. When you use Google’s new AI writing coach for Docs, it warns: “Do not include personal, confidential or sensitive information.”
The actual process of training AI can be a bit creepy. Sometimes it involves having other people look at the data. Humans are reviewing our back-and-forth with Google’s new search engine and its Bard chatbot, to name just two products.
Even worse for your privacy, generative AI sometimes leaks data back out. These systems are notoriously hard to control, and they can regurgitate personal info in response to new, sometimes unforeseen prompts.
It even happened to a tech company. Samsung employees were reportedly using ChatGPT and discovered on three different occasions that the chatbot had spit company secrets back out. The company then banned the use of AI chatbots at work. Apple, Spotify, Verizon and many banks have done the same.
The Big Tech companies told me they take pains to prevent leaks. Microsoft says it de-identifies user data entered in Bing chat. Google says it automatically removes personally identifiable information from training data. Meta said it will train generative AI not to reveal private information — so it might share the birthday of a celebrity, but not regular people.
Okay, but how effective are these measures? That’s among the questions the companies won’t give straight answers to. “While our filters are at the cutting edge in the industry, we’re continuing to improve them,” says Google. And how often do they leak? “We believe it’s very limited,” it says.
So it’s reassuring to know Google’s AI only sometimes leaks our information. “It’s really difficult for them to say, with a straight face, ‘we don’t have any sensitive data,’” says Winters.
Perhaps privacy isn’t even the right word for this mess. It’s also about control. Who’d ever have imagined a vacation photo they posted in 2009 would be used by a megacorporation in 2023 to teach an AI to make art, put a photographer out of a job, or identify someone’s face to police?
There’s a thin line between “making products better” and theft, and tech companies think they get to draw it.
Which data of ours is and isn’t off limits? Much of the answer is wrapped up in lawsuits, investigations and hopefully some new laws. But meanwhile, Big Tech is making up its own rules.
I asked Google, Meta and Microsoft to tell me exactly when they take user data from products that are core to modern life to make their new generative AI products smarter. Getting answers was like chasing a squirrel through a funhouse.
They told me they hadn’t used nonpublic user information in their largest AI models without permission. But those very carefully chosen words leave a lot of occasions when they are, in fact, building their lucrative AI business with our digital lives.
Not all AI uses of data are the same, or even problematic. But as users, we’d practically need a degree in computer science to understand what’s going on.
Google is a great example. It tells me its “foundational” AI models — the software behind things like Bard, its answer-anything chatbot — come primarily from “publicly available data from the internet.” Our private Gmail didn’t contribute to that, the company says.
However, Google does still use Gmail to train other AI products, like its Gmail writing-helper Smart Compose (which finishes sentences for you) and new creative coach Duet AI. That’s fundamentally different, Google argues, because it’s taking data from a product to improve that product.
Perhaps there’s no way to create something like Smart Compose without looking at your email. But that doesn’t mean Google should just switch it on by default. In Europe, where there are better data laws, Smart Compose is off by default. Nor should your data be a requirement to use its latest and greatest products, even if Google calls them “experiments” like Bard and Duet AI.
Facebook’s owner Meta also told me it didn’t train its biggest AI model, called Llama 2, on user data. But it has trained other AI, like an image-identification system called SEER, on people’s public Instagrams.
And Meta wouldn’t tell me how it’s using our personal data to train generative AI products. After I pushed back, the company said it would “not train our generative AI models on people’s messages with their friends and families.” At least it agreed to draw some kind of red line.
Microsoft updated its service agreement this summer with broad language about user data, and it didn’t make any assurances to me about limiting the use of our data to train its AI products in consumer-facing programs like Outlook and Word. Mozilla has even launched a campaign calling on the software giant to come clean. “If nine experts in privacy can’t understand what Microsoft does with your data, what chance does the average person have?” Mozilla says.
It doesn’t have to be this way. Microsoft has lots of assurances for lucrative corporate customers, including those chatting with the enterprise version of Bing, about keeping their data private. “Data always remains within the customer’s tenant and is never used for other purposes,” says a spokesman.
Why do companies have more of a right to privacy than all of us?