There are millions of examples of personal data in a large AI training data set

by SkillAiNest

William Agnew, a researcher in AI ethics at Carnegie Mellon University and one of the paper's coauthors, puts it this way: "Anything you put online can [be] and probably has been scraped."

The researchers found thousands of instances of validated identity documents, including images of credit cards, driver's licenses, passports, and birth certificates, as well as more than 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches to be associated with real people. (In many cases, the researchers did not have time to validate the documents or were unable to do so because of issues such as image clarity.)

Many of the résumés disclosed sensitive information, including disability status, the results of background checks, the birth dates and birthplaces of dependents, and race. When résumés could be linked to people with an online presence, the researchers also found contact information, government identifiers, sociodemographic information, facial photographs, home addresses, and the contact details of other people (such as references).

""
Examples of documents related to identification in a small -scale datastate of the Common Pool, which show credit cards, social security numbers, and the driver’s license. Each sample looks above UR, URL site type, icon in the middle, and title in the references below. All personal information has been changed, and the text has been described to avoid direct references. Photos have been shaped to show the presence of faces without identifying individuals.

Courtesy researchers

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest publicly available data set of image-text pairs, which are commonly used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use.

CommonPool was built as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scrapes collected between 2014 and 2022 by the nonprofit Common Crawl.
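To make the scale of the problem concrete, a simple screening pass over scraped metadata can surface captions that look like personal data. The sketch below is not the researchers' actual pipeline; it assumes a locally downloaded metadata shard in Parquet format with hypothetical "url" and "text" columns, and the file name and regex patterns are illustrative.

```python
# Minimal sketch (not the researchers' pipeline): scan one hypothetical
# metadata shard for captions that match simple PII-like patterns.
# Assumes a local Parquet file with "url" and "text" columns.
import pandas as pd

PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",                 # e.g. 123-45-6789
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
    "phone": r"\b\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b",
}

def flag_pii(shard_path: str) -> pd.DataFrame:
    """Return rows whose caption text matches any PII-like pattern."""
    df = pd.read_parquet(shard_path, columns=["url", "text"])
    hits = []
    for name, pattern in PII_PATTERNS.items():
        matched = df[df["text"].str.contains(pattern, regex=True, na=False)]
        hits.append(matched.assign(pii_type=name))
    return pd.concat(hits, ignore_index=True)

if __name__ == "__main__":
    flagged = flag_pii("commonpool_shard_0000.parquet")  # hypothetical file name
    print(f"{len(flagged)} captions matched a PII-like pattern")
    print(flagged.head())
```

Even a crude pass like this tends to surface matches in web-scraped captions, which is why the paper's authors argue that filtering after collection cannot be relied on to catch everything.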

While commercial models often do not disclose which data sets they are trained on, the shared data source of DataComp CommonPool and LAION-5B means that the data sets are similar, and that the same personally identifiable information is likely to appear in both. The CommonPool researchers did not respond to emailed questions.

And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that many downstream models have all been trained on this exact data set, says Rachel Hong, a PhD student in computer science at the University of Washington and the paper's lead author. Those models would inherit similar privacy risks.

Good intentions are not enough

"You can assume that any large-scale web-scraped data always contains content that shouldn't be there," says Abeba Birhane, who leads Trinity College Dublin's AI Accountability Lab, whether that is personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane's own research has found in LAION-5B).
