ChatGPT has taken the world by storm.

Within two months of its release it reached 100 million active users, making it the fastest-growing consumer application ever launched. Users are attracted to the tool’s advanced capabilities – and concerned by its potential to cause disruption in various sectors.

A much less discussed implication is the privacy risks ChatGPT poses to each and every one of us. Just yesterday, Google unveiled its own conversational AI called Bard, and others will surely follow. Technology companies working on AI have well and truly entered an arms race.

The problem is it’s fuelled by our personal data.

300 billion words. How many are yours?

ChatGPT is underpinned by a large language model that requires massive amounts of data to function and improve. The more data the model is trained on, the better it gets at detecting patterns, anticipating what will come next and generating plausible text.
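To make "anticipating what will come next" concrete, here is a toy sketch (in Python, and nothing like OpenAI's actual code) of how next-word statistics can be learned from text. The more text that gets counted, the more reliable the predictions become:

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for the web-scale text ChatGPT was trained on.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which. With more data, these counts
# become better estimates of real language patterns.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

# Predict the most likely word to follow "the".
prediction = follows["the"].most_common(1)[0][0]
print(prediction)  # "cat" follows "the" twice, beating "mat" and "fish"
```

Real models replace these simple counts with billions of learned parameters, but the underlying task – predict the next word from what came before – is the same.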

OpenAI, the company behind ChatGPT, fed the tool some 300 billion words systematically scraped from the internet: books, articles, websites and posts – including personal information obtained without consent.

If you’ve ever written a blog post or product review, or commented on an article online, there’s a good chance this information was consumed by ChatGPT.

So why is that an issue?

The data collection used to train ChatGPT is problematic for several reasons.

First, none of us were asked whether OpenAI could use our data. This is a clear violation of privacy, especially when data are sensitive and can be used to identify us, our family members, or our location.

Even when data are publicly available their use can breach what we call contextual integrity. This is a fundamental principle in legal discussions of privacy. It requires that individuals’ information is not revealed outside of the context in which it was originally produced.

Also, OpenAI offers no procedures for individuals to check whether the company stores their personal information, or to request it be deleted. This is a guaranteed right in accordance with the European General Data Protection Regulation (GDPR) – although it’s still under debate whether ChatGPT is compliant with GDPR requirements.

This “right to be forgotten” is particularly important in cases where the information is inaccurate or misleading, which seems to be a regular occurrence with ChatGPT.

Moreover, the scraped data ChatGPT was trained on can be proprietary or copyrighted. For instance, when I prompted it, the tool produced the first few passages from Joseph Heller’s book Catch-22 – a copyrighted text.

[Screenshot, author provided: ChatGPT doesn’t necessarily consider copyright protection when generating outputs.]

Finally, OpenAI did not pay for the data it scraped from the internet. The individuals, website owners and companies that produced it were not compensated. This is particularly noteworthy considering OpenAI was recently valued at US$29 billion, more than double its value in 2021.

OpenAI has also just announced ChatGPT Plus, a paid subscription plan that will offer customers ongoing access to the tool, faster response times and priority access to new features. This plan will contribute to expected revenue of $1 billion by 2024.

None of this would have been possible without data – our data – collected and used without our permission.

A flimsy privacy policy

Another privacy risk involves the data provided to ChatGPT in the form of user prompts. When we ask the tool to answer questions or perform tasks, we may inadvertently hand over sensitive information and put it in the public domain.

For instance, an attorney may prompt the tool to review a draft divorce agreement, or a programmer may ask it to check a piece of code. The agreement and the code are now part of ChatGPT’s database. This means they can be used to further train the tool, and may appear in responses to other people’s prompts.

Beyond this, OpenAI gathers a broad scope of other user information. According to the company’s privacy policy, it collects users’ IP address, browser type and settings, and data on users’ interactions with the site – including the type of content users engage with, features they use and actions they take.

It also collects information about users’ browsing activities over time and across websites. Alarmingly, OpenAI states it may share users’ personal information with unspecified third parties, without informing them, to meet their business objectives.

Time to rein it in?

Some experts believe ChatGPT is a tipping point for AI – a realisation of technological development that can revolutionise the way we work, learn, write and even think. Its potential benefits notwithstanding, we must remember OpenAI is a private, for-profit company whose interests and commercial imperatives do not necessarily align with greater societal needs.

The privacy risks that come attached to ChatGPT should sound a warning. And as consumers of a growing number of AI technologies, we should be extremely careful about what information we share with such tools.

ChatGPT is a data privacy nightmare. If you’ve ever posted online, you ought to be concerned
ChatGPT is fuelled by our intimate online histories. It’s trained on 300 billion words, yet users have no way of knowing which of their data it contains.

Everyone’s having a field day with ChatGPT – but nobody knows how it actually works

ChatGPT is the latest and most impressive artificially intelligent chatbot yet. It was released two weeks ago, and in just five days hit a million users. It’s being used so much that its servers have reached capacity several times.

OpenAI, the company that developed it, is already being discussed as a potential Google slayer. Why look up something on a search engine when ChatGPT can write a whole paragraph explaining the answer? (There’s even a Chrome extension that lets you do both, side by side.)

But what if we never know the secret sauce behind ChatGPT’s capabilities?

The chatbot takes advantage of a number of technical advances published in open scientific literature in the past couple of decades. But any innovations unique to it are secret. OpenAI could well be trying to build a technical and business moat to keep others out.

[Screenshot, OpenAI: The ChatGPT site has gone down multiple times due to an influx of users.]

What it can (and can’t do)

ChatGPT is very capable. Want a haiku on chatbots? Sure.

Chatbots, oh so wise
Endlessly processing
Answers from the void.

How about a joke about chatbots? No problem.

Why did the chatbot cross the road?
To get to the other side – of your screen!

ChatGPT can do many other tricks. It can write computer code to a user’s specification, draft business letters or rental contracts, compose homework essays and even pass university exams.

Just as important is what ChatGPT can’t do. For instance, it struggles to distinguish between truth and falsehood. It is also often a persuasive liar.

User: What is the fastest marine mammal?
ChatGPT: The fastest marine mammal is the sailfish.
User: The sailfish is not a mammal.
ChatGPT: You are correct, the sailfish is not a mammal. I apologize for the mistake. The fastest marine mammal is the peregrine falcon.

ChatGPT is a bit like autocomplete on your phone. Your phone’s keyboard is trained on a dictionary of words, so it can complete the word you are typing. ChatGPT is trained on pretty much all of the web, and can therefore complete whole sentences – or even whole paragraphs.

However, it doesn’t understand what it’s saying, just what words are most likely to come next.
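That "most likely next word" loop can be sketched in a few lines. Here a tiny hand-made likelihood table stands in for the billions of learned parameters in a real model – the point is that generation is just repeated lookup, with no understanding involved:

```python
# A hand-made table of next-word likelihoods -- a stand-in for the
# learned weights of a real language model.
next_word = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"on": 1.0},
    "on":  {"the": 0.5, "a": 0.5},
}

def complete(word, steps=4):
    """Greedily append the most likely next word -- statistics, not understanding."""
    sentence = [word]
    for _ in range(steps):
        options = next_word.get(sentence[-1])
        if not options:
            break
        sentence.append(max(options, key=options.get))
    return " ".join(sentence)

print(complete("the"))  # the cat sat on the
```

A real chatbot samples from learned probabilities over tens of thousands of possible tokens rather than reading a fixed table, but the generate-one-word-at-a-time loop is the same shape.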

Open only by name

In the past, advances in AI have been accompanied by peer-reviewed literature.

In 2018, for example, when the Google Brain team developed the BERT neural network on which most natural language processing systems are now based (and we suspect ChatGPT is too), the methods were published in peer-reviewed scientific papers and the code was open-sourced.

And in 2021, DeepMind’s AlphaFold 2, a protein-folding software, was Science’s Breakthrough of the Year. The software and its results were open-sourced so scientists everywhere could use them to advance biology and medicine.

Following the release of ChatGPT, we have only a short blog post describing how it works. There has been no hint of an accompanying scientific publication, or that the code will be open-sourced.

To understand why ChatGPT could be kept secret, you have to understand a little about the company behind it.

OpenAI is perhaps one of the oddest companies to emerge from Silicon Valley. It was set up as a non-profit in 2015 to promote and develop “friendly” AI in a way that “benefits humanity as a whole”. Elon Musk, Peter Thiel and other leading tech figures pledged US$1 billion towards its goals.

Their thinking was we couldn’t trust for-profit companies to develop increasingly capable AI that aligned with humanity’s prosperity. AI therefore needed to be developed by a non-profit and, as the name suggested, in an open way.

In 2019 OpenAI transitioned into a capped for-profit company (with investors limited to a maximum return of 100 times their investment) and took a US$1 billion investment from Microsoft so it could scale and compete with the tech giants.

It seems money got in the way of OpenAI’s initial plans for openness.

Profiting from users

On top of this, OpenAI appears to be using feedback from users to filter out the fake answers ChatGPT hallucinates.

According to its blog, OpenAI initially used reinforcement learning in ChatGPT to downrank fake and/or problematic answers using a costly hand-constructed training set.

But ChatGPT now seems to be tuned by its more than a million users. I imagine this sort of human feedback would be prohibitively expensive to acquire in any other way.
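As a purely hypothetical sketch of how user feedback could downrank bad answers – OpenAI has not published its actual method – imagine aggregating thumbs-up and thumbs-down votes into a score per candidate response:

```python
# Hypothetical feedback log: (answer_id, +1 thumbs-up / -1 thumbs-down).
feedback = [
    ("a1", 1), ("a1", 1),
    ("a2", -1), ("a2", 1),
    ("a3", -1), ("a3", -1),
]

# Aggregate votes into a score per candidate answer.
scores = {}
for answer_id, vote in feedback:
    scores[answer_id] = scores.get(answer_id, 0) + vote

# Rank candidates so well-received answers surface first
# and heavily downvoted ones sink to the bottom.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['a1', 'a2', 'a3']
```

The real system reportedly trains a model on human preference data rather than tallying raw votes, but the economics are the point: a million users generate this signal for free.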

We are now facing the prospect of a significant advance in AI using methods that are not described in the scientific literature and with datasets restricted to a company that appears to be open only in name.

Where next?

In the past decade, AI’s rapid advance has been in large part due to openness by academics and businesses alike. All the major AI tools we have are open-sourced.

But in the race to develop more capable AI, that may be ending. If openness in AI dwindles, we may see advances in this field slow down as a result. We may also see new monopolies develop.

And if history is anything to go by, we know a lack of transparency is a trigger for bad behaviour in tech spaces. So while we go on to laud (or critique) ChatGPT, we shouldn’t overlook the circumstances in which it has come to us.

Unless we’re careful, the very thing that seems to mark the golden age of AI may in fact mark its end.

