06/03/2025

AI training dataset exposes nearly 12,000 API keys and passwords

Researchers have discovered close to 12,000 valid secrets, including API keys and passwords, in the Common Crawl dataset, which is widely used to train multiple artificial intelligence models.

The non-profit Common Crawl organization maintains this massive open repository, which holds petabytes of web data collected since 2008 and is freely accessible to anyone.

Because of its size, many artificial intelligence projects rely on the dataset, at least in part, to train large language models (LLMs). Companies that have drawn on this archive for their models include OpenAI, DeepSeek, Google, Meta, Anthropic, and Stability.

AWS root keys and MailChimp API keys

Researchers at Truffle Security, the company behind the open-source TruffleHog scanner for sensitive data, identified valid secrets after analyzing 400 terabytes of data from 2.67 billion web pages in the Common Crawl December 2024 archive.

During their investigation, they uncovered 11,908 secrets that authenticated successfully, meaning the credentials were still live. The findings indicate that developers had hardcoded these credentials, and they raise concerns about the risks of training LLMs on insecure code.

It is important to note that LLM training data does not remain in its raw form. Instead, it undergoes a pre-processing stage that includes cleaning and filtering out irrelevant data, duplicates, harmful content, and sensitive information.

However, despite these precautions, completely removing confidential data remains a significant challenge. The process provides no guarantee that a dataset of this scale will be entirely stripped of personally identifiable information (PII), financial data, medical records, or other sensitive content.
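
To illustrate why such filtering is hard to make complete, here is a minimal sketch of a pattern-based redaction pass. The regular expressions approximate publicly known key formats and are assumptions made for illustration. They are not the detectors TruffleHog uses, and the function name `redact_secrets` is hypothetical.

```python
import re

# Illustrative patterns only: rough approximations of publicly known formats,
# not the detectors used by TruffleHog or by any LLM training pipeline.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/[A-Za-z0-9]+/[A-Za-z0-9]+/[A-Za-z0-9]+"
    ),
}


def redact_secrets(text: str) -> tuple[str, dict[str, int]]:
    """Replace every pattern match with a placeholder and count hits per type."""
    counts: dict[str, int] = {}
    for name, pattern in SECRET_PATTERNS.items():
        text, hits = pattern.subn(f"[REDACTED:{name}]", text)
        if hits:
            counts[name] = hits
    return text, counts
```

A filter like this only catches formats it already knows about. A free-form password or a key from a service without a recognizable pattern passes through unchanged, which is one reason a dataset of this scale cannot be guaranteed clean.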

After analyzing the scanned data, Truffle Security confirmed the presence of valid API keys for Amazon Web Services (AWS), MailChimp, and WalkScore services.

Figure: AWS root key in front-end HTML (source: Truffle Security)

TruffleHog identified a total of 219 distinct secret types within the Common Crawl dataset, with MailChimp API keys being the most common.

“Nearly 1,500 unique MailChimp API keys were hardcoded in front-end HTML and JavaScript,” Truffle Security reported.

Figure: MailChimp API key leaked in front-end HTML (source: Truffle Security)

According to the researchers, this issue stemmed from developers mistakenly embedding these keys directly into HTML forms and JavaScript snippets instead of using server-side environment variables.
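
As a minimal sketch of the server-side alternative, the key can be read from the process environment at startup so that it never appears in HTML or JavaScript sent to the browser. The variable name `MAILCHIMP_API_KEY` and the function `load_mailchimp_key` are hypothetical, chosen only for this example.

```python
import os


def load_mailchimp_key(env_var: str = "MAILCHIMP_API_KEY") -> str:
    """Read the API key from the server's environment, not from source code."""
    key = os.environ.get(env_var)
    if not key:
        # Fail loudly rather than falling back to a hardcoded default.
        raise RuntimeError(f"{env_var} is not set")
    return key
```

The browser then talks to the site's own back end, which calls the third-party API with the key, so the secret stays on the server.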

As a result, attackers could exploit these exposed keys for malicious activities such as phishing campaigns and brand impersonation. Furthermore, leaking such secrets increases the risk of data exfiltration.

Another critical finding in the report highlights the high reuse rate of these exposed secrets. In fact, 63% appeared on multiple pages. One striking example was a WalkScore API key, which surfaced 57,029 times across 1,871 subdomains.
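
A reuse rate of this kind could be computed roughly as follows. This sketch assumes the rate is the share of distinct secrets seen on more than one page, which the article does not spell out, and the `(secret, page_url)` input format is hypothetical.

```python
from collections import defaultdict


def reuse_stats(findings):
    """Summarize reuse from an iterable of (secret, page_url) pairs.

    Returns the fraction of distinct secrets found on more than one page,
    plus a dict mapping each secret to its number of distinct pages.
    """
    pages_by_secret = defaultdict(set)
    for secret, page_url in findings:
        pages_by_secret[secret].add(page_url)

    page_counts = {s: len(pages) for s, pages in pages_by_secret.items()}
    if not page_counts:
        return 0.0, page_counts

    reused = sum(1 for n in page_counts.values() if n > 1)
    return reused / len(page_counts), page_counts
```

Counting distinct pages per secret, rather than raw occurrences, keeps one page that repeats a key many times from inflating the rate.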

The researchers also discovered a single webpage containing 17 unique live Slack webhooks. These URLs should remain private, because they allow apps to post messages in Slack channels.

“Keep it secret, keep it safe. Your webhook URL contains a secret. Don’t share it online, including via public version control repositories,” Slack warns.

What now?

Following their research, Truffle Security reached out to affected vendors and collaborated with them to revoke compromised keys.

“We successfully helped those organizations collectively rotate/revoke several thousand keys,” the researchers stated.

Even if an artificial intelligence model was trained on an older archive than the one analyzed in this study, Truffle Security’s findings serve as a reminder that insecure coding practices in training data can influence the behavior of LLMs.

Source: BleepingComputer, Ionut Ilascu