How AI Made Data Privacy Everybody's Business
A guest post by Pete Pachal. Plus, special AI course offer, and a new podcast drop!
Welcome to The Upgrade
Welcome to my weekly newsletter, which focuses on the intersection of AI, media, and storytelling. A special welcome to my new readers! Drop me a note here, and let’s get acquainted. 😊
Over the next few weeks, we’re piloting a new initiative: a series of guest essays by prominent voices in AI and media! We’ll be back to our regular programming in August. Today, a fantastic piece on the threats to data privacy in the AI era by a long-time tech journalist and editor:
✍🏻 Guest Post: How AI Made Data Privacy Everybody's Business by Peter Pachal, CEO of The Media Copilot
🎓Learn AI with MindStudio Academy! 💻
Ready to learn the fastest way to build no-code AI-powered apps and automation? The Upgrade is partnering with MindStudio to lead the MindStudio Academy! ⚡️
The next cohort takes place on Saturday, July 27th. Hope to see you there!
SAVE 20% with code: THEUPGRADE20
How AI Made Data Privacy Everybody's Business by Pete Pachal
AI models have an insatiable need for more data. How to safeguard your information and understand the new reality.
Hey all, Pete Pachal here filling in for Peter. I'm the Founder of The Media Copilot, which publishes a daily newsletter about how AI is changing media, journalism, and content creation. We also offer AI training for busy people and consult with newsrooms and other orgs about how to build AI into their operations.
That wasn't just a plug — it's the context that allows me to say this: If there's a single topic within AI that everybody has concerns about, it's data privacy. It comes up in every class I teach, every casual conversation about how AI is affecting the media business, and almost all the interviews I have with decision-makers at media companies. The fear of losing control of your data, and the outrage toward tech companies who act as if they are entitled to take it, is palpable.
A recent article in The New York Times has renewed fears about data privacy in the age of AI, pointing to a set of recent changes in the Terms of Service for various software products, including those from Google, Snap and Meta. In each case, the company altered language to ensure they included provisions for leveraging user data to help power or train AI systems.
While most of the companies who've done this would no doubt prefer the changes were simply treated as incidental, users have not responded with nonchalance. Customers of Adobe, for instance, openly revolted when the company quietly altered its terms of service a few months ago, and executives had to do multiple rounds of damage control. AI, it seems, has everyone on edge.
Tech companies have given them good reason to fear. Another New York Times report from earlier this year laid out how both OpenAI and Google sought to deliberately ignore YouTube's terms of service by training their AI models on YouTube videos (yes, Google owns YouTube, greatly complicating the matter). Even before that, a detailed report from IEEE Spectrum proved that popular AI image generator Midjourney was trained on copyrighted content, including images from Marvel movies.
Generally, AI companies have hoovered up the majority of public data on the internet to power their models, without clarity on whether that was in any sense OK. Several lawsuits are now before the courts that may help chart a path to a definitive answer to that question.
As the recent Times piece points out, many of the big tech companies, Meta and Google in particular, don't just host public data; they're also sitting on mountains of private data: information that users don't share publicly. With virtually no more public data left to train on, the builders of these AIs would find this private data immensely valuable.
Are these alterations to terms of service a precursor to some kind of retroactive harvesting of that private data to train AI? There isn't evidence of that, but given Silicon Valley's record, you can see why people might be concerned.
If you are, what should you do? How can you adjust your approach to AI in a way that maximizes your data privacy?
Breaking Down the Privacy Problem
To answer that question, it's helpful to unpack why people find the practice of data harvesting so objectionable in the first place. By doing that, we can better find ways to address specific concerns. For many aspects of this, there aren't easy solutions. But there are ways to adjust thinking and approach to put yourself in the best possible position.
As I see it, concerns about data privacy tend to fall into two buckets:
Exploitation: "You're taking my data and leveraging or monetizing it without giving me anything in return."
Control: "By granting access to my data, I no longer have control of it."
Let's address each of these in turn.
Exploitation
When ChatGPT exploded into existence in late 2022, we were all so blown away by what it could do that few at the time stopped to think about the training data needed to create that experience. The gigantic data sets from Common Crawl et al. had been used, essentially for free, by search engines for years, and this seemed like a logical extension of that norm.
But as time has gone on and we have clarity on how "answer engines" like Perplexity and Google AI Overviews work, public attitudes have shifted. There's now a general consensus that information sources should be compensated for the information they provide — a recent poll from a think tank called the Artificial Intelligence Policy Institute showed that 74% of respondents said, "AI companies should compensate creators for using their data."
We're seeing this shift play out in the business world as OpenAI and others have begun to sign deals with publishers like News Corp and Axel Springer as well as platforms like Reddit to give LLMs access to their content. In the meantime, various challenges are slowly making their way through the courts in the hope of getting a final ruling from a legal perspective.
Today, any site that wants to guard against tech companies harvesting training data can use its robots.txt file to tell AI crawlers to stay out, at least the crawlers that honor it. There are also ways to tag your content at the article level, giving you more control over how those articles are crawled and used. Intaglio, co-created by Media Copilot co-founder John Biggs, is such a solution.
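For readers who manage a site, here is a rough sketch of what a robots.txt opt-out might look like and how to sanity-check it with Python's standard library. The rules below are an illustration, not a recommendation: GPTBot and CCBot are the crawler names published by OpenAI and Common Crawl, but the list of AI crawlers changes over time, and the `is_allowed` helper and example.com URL are made up for this example. Keep in mind that robots.txt is a voluntary convention, so it only stops crawlers that choose to respect it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block two AI-related crawlers, allow everyone else.
# The set of user agents worth listing is an assumption and will vary by site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"  # hypothetical URL
    for agent in ("GPTBot", "CCBot", "SomeSearchBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Running the check before publishing the file is a cheap way to confirm that the AI crawlers you named are blocked while ordinary search crawlers can still reach your pages.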
Adding insult to injury, AI systems typically don't just train on content for profit — their output also acts as a replacement for the content for many users. While people often have a visceral reaction to this reality, it mostly adds a dimension of urgency to resolving the situation in the legal and regulatory realms.
Control
When you interact with a chatbot like ChatGPT or Meta.ai, the requests, documents, and other data you feed into it will generally be used as training data. What that means is there's a chance that, at some point in the future, another user might be able to coax some or all of that information from the chatbot just by asking.
This obviously means you should not feed sensitive or non-public information into a chatbot. If you want to use an LLM privately, you should use the APIs that AI companies provide, which don't keep the data for training, according to those same companies.
However, given the track record of Big Tech and the strong tendencies of AI builders to hoover up data however and whenever they can, there are some companies that forbid the use of commercial AI completely — even through an API. That's pretty extreme, but it doesn't mean they have to cut themselves off from LLMs: You can still run AI locally, on a server or private cloud.
The Silver Lining
The lesson in all this for everyone, even individuals, is awareness. It'd be unrealistic for most digital citizens to simply "opt out" of using AI or digital platforms that want to harvest data. But you can better understand what you're agreeing to when you do via an app called Tosless. Feed it any Terms of Service agreement, and it'll tell you which parts are most concerning. It may not defuse any privacy land mines, but at least it'll let you know when you're about to step on them.
There's a silver lining in all this consternation about privacy: Given AI's insatiable need for more data, the value of everyone's information has effectively gone up. But for a fair economy to arise around that reality, the owners of that information need to be aware of that value. Over the last decade, Big Tech has done its best to convince us that exchanging free services for limitless data harvesting was a fair bargain. Ironically, it could be their most innovative creation — AI — that gets us to finally push back on that idea.
Pete Pachal is the Founder of The Media Copilot, a leading resource dedicated to understanding how AI is transforming media, journalism, and content creation. With a rich background in journalism and media strategy, Pete has held leadership positions at renowned companies, including CoinDesk, Mashable, and NBC Universal.
At The Media Copilot, he offers training and consulting services to journalists, marketers, and PR professionals, helping them leverage generative AI for content creation while maintaining integrity. Pete's expertise in AI, Big Tech, and cryptocurrency is frequently showcased on national television, including the Today Show, CNN, and Fox Business. His current work focuses on cutting through the hype and backlash surrounding AI to provide clear, actionable insights for media professionals.
Don’t be shy—hit reply if you have thoughts or feedback. I’d love to connect with you!
Until next week,