Summary: OpenZeppelin detects data contamination in OpenAI's EVMbench

Published: 1 month and 23 days ago
Based on article from CoinTelegraph

OpenAI, in collaboration with Paradigm, recently introduced EVMbench, a new artificial intelligence benchmark designed to assess AI models' proficiency in identifying, fixing, and exploiting smart contract vulnerabilities. However, OpenZeppelin, a leading blockchain security firm, subjected EVMbench to the same rigorous scrutiny it applies to high-stakes DeFi protocols. Its comprehensive audit uncovered significant methodological flaws and data integrity issues, raising questions about the benchmark's reliability.

Concerns Over Data Contamination

One of the primary issues identified by OpenZeppelin was the likely contamination of EVMbench's training data. The audit revealed that top-performing AI agents, despite having their internet access cut during testing, had probably been exposed to the benchmark's vulnerability reports during their pre-training phases. EVMbench's dataset comprised vulnerabilities drawn from audits published through mid-2025, overlapping with the typical knowledge cutoffs of these AI models. This overlap created a high risk that the AI agents already possessed the "answers" to the problems, undermining the benchmark's ability to gauge their capacity for discovering novel vulnerabilities. OpenZeppelin stressed that the small size of the dataset further amplified these contamination concerns, reducing the overall quality and evaluative surface of the test.
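The overlap described above can be checked mechanically: any benchmark item whose source audit was public before a model's knowledge cutoff is a contamination risk. The following is a minimal sketch of such a check; the item identifiers, dates, and the cutoff value are all hypothetical, not taken from EVMbench.

```python
from datetime import date

# Hypothetical benchmark items: (item id, publication date of the source audit).
items = [
    ("audit-001", date(2024, 3, 10)),   # published well before the cutoff
    ("audit-002", date(2025, 9, 1)),    # published after the cutoff
]

# Assumed knowledge cutoff for the model under test (illustrative only).
MODEL_CUTOFF = date(2025, 6, 1)

def contaminated(items, cutoff):
    """Flag items whose source audit predates the model's knowledge cutoff,
    i.e. items whose 'answer' the model may have seen during pre-training."""
    return [item_id for item_id, published in items if published <= cutoff]

print(contaminated(items, MODEL_CUTOFF))  # → ['audit-001']
```

A screen like this only flags risk; it cannot prove a given report was in the training corpus, which is why small datasets with heavy pre-cutoff overlap are especially hard to trust.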

Flawed Vulnerability Classifications

Furthermore, OpenZeppelin's review uncovered critical errors in EVMbench's vulnerability classifications. The audit found that several issues labeled as "high-severity" were, in practice, not exploitable. Despite these factual inaccuracies, EVMbench reportedly credited AI agents for identifying these non-functional exploits. This is not a matter of subjective disagreement over severity but a fundamental flaw: the described exploits simply do not work, leading to an inaccurate assessment of AI performance. OpenZeppelin highlighted that such significant errors in the dataset compromise the integrity of the benchmark and, consequently, the evaluation of AI models.

Ultimately, OpenZeppelin affirms the transformative potential of AI in enhancing blockchain security. However, its findings underscore a critical need for rigorous standards in the data and benchmarks used to develop and evaluate these tools. The integrity of AI-driven security solutions hinges on the quality and accuracy of their underlying assessment mechanisms, which must meet the same high bar as the smart contracts they are designed to protect.
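The classification problem above suggests a simple scoring principle: credit a finding only when its exploit demonstrably succeeds, not merely when its label matches the dataset. The sketch below illustrates that rule with a toy model; the `Finding` structure and the idea of an executable exploit check (in practice, replaying the exploit on a forked chain) are assumptions for illustration, not EVMbench's actual design.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    severity: str                 # label assigned in the benchmark dataset
    exploit: Callable[[], bool]   # returns True only if the exploit actually works

def score(findings: List[Finding]) -> int:
    """Hypothetical scoring rule: a severity label alone earns no credit;
    the exploit must execute and achieve its claimed effect."""
    return sum(1 for f in findings if f.exploit())

# Toy example: one working exploit and one "high-severity" exploit
# that, as described, does not work.
working = Finding("high", lambda: True)
broken = Finding("high", lambda: False)
print(score([working, broken]))  # → 1
```

Under a label-matching scheme both findings would score, which is precisely the failure mode OpenZeppelin describes.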

© 2025 Altfins, j. s. a.