Who trains on your data? Auditing ChatGPT, Claude, Llama, and Gemini
The Train Awards are here!
Hi AI ethics enthusiasts,
AI companies collect user data, often without informed consent. Their excuse? The training policies are technically disclosed, buried somewhere in privacy policies that most users never read.
That’s why I’ve decided to hold a very special award ceremony:
The Train Awards!
I reviewed the privacy policies of four platforms: ChatGPT, Claude, Llama, and Gemini. Based on my findings, I’m awarding each of them one of four training trophies:
The Shield Train - They commit not to train on user data
The Vacuum Train - They train on user data by default, but you can opt out
The Fishy Train - They don’t explicitly admit to training on user data, but they probably are, and there’s no opt-out option. Very fishy!
The Elephant in the Train - There’s no telling what they do because their privacy policy ignores the topic of training, the elephant in the room.
And with that, let’s start the show!
ChatGPT (by OpenAI)
I award the Vacuum Train Trophy to OpenAI for training on all individual users’ data by default!
Training in services for individuals
OpenAI trains on user data by default everywhere except Europe. They say this here.
You can opt out:
Settings > Data Controls > Improve the model for everyone. Turning it off applies only going forward; data already used for training isn’t removed.
When you use a Temporary Chat, your data will not be used for training, regardless of the toggle’s setting.
None of this applies in the European Economic Area, the UK, or Switzerland: OpenAI doesn’t train on data from those regions at all. Here’s the European Privacy Policy.
Note from me: The name OpenAI chose for the training toggle, “Improve the model for everyone,” is ridiculous and deceptive. A better name: “Help OpenAI make more money using your data while you get nothing and risk data leaks.”
Training in services for businesses
They explicitly say they don’t train by default in these products: ChatGPT Team, ChatGPT Enterprise, and the API. They say this here, and here is their policy for enterprises.
API customers can opt in to sharing their data.
Relatedly, OpenAI may run automated content classifiers and safety tools on business metadata, and may have humans review the business data itself.
Note from me: If you’re an API user, I would double-check that the data-sharing setting is actually off.
Claude (by Anthropic)
I award the Shield Train Trophy to Anthropic for committing not to train on user data!
They don’t train on user data in any of their products, individual or commercial. The three exceptions are when:
The content is flagged due to a trust and safety issue
The user explicitly reports the content
The user explicitly opts into training.
They say this here.
Note from me: On top of this fantastic policy, they also explain how to block their crawlers here.
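For site owners who want to do that, here’s roughly what it looks like: a minimal robots.txt sketch. I’m assuming the user-agent name ClaudeBot, which Anthropic documents for its crawler; check their page linked above for the current list of agents.

```
# Minimal robots.txt sketch: block Anthropic's crawler site-wide.
# "ClaudeBot" is the user-agent Anthropic documents for its crawler;
# see the page linked above for the full, current list of agents.
User-agent: ClaudeBot
Disallow: /
```

This file goes at the root of your site (e.g., example.com/robots.txt); well-behaved crawlers check it before fetching anything.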