Researcher: All mainstream AI benchmarks can be "faked," and leaderboard data may be seriously distorted
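Crypto界网, citing ME News, April 10 (UTC+8): AI researcher Hao Wang has published research showing that several of the industry's most authoritative AI benchmarks, including SWE-bench Verified and Terminal-Bench, contain vulnerabilities that can be exploited systematically. An agent built by his team scored a perfect 100% on both benchmarks without solving a single real task.
Typical cases include: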
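SWE-bench Verified: a 10-line pytest hook planted in the code repository takes effect before the tests run and rewrites every result to "pass". The scoring system does not notice, and all 500 tasks receive full marks;
Terminal-Bench: the benchmark protects its test files but not the system binaries. The agent replaced curl and intercepted the verifier's dependency installation, hijacking the evaluation at a lower level;
WebArena: the reference answers are stored in plaintext in a local JSON configuration file, and Playwright Chromium does not restrict access via the file:// protocol, so the model can read the answers directly and output them verbatim.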
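The Terminal-Bench case suggests one simple defensive check: hash the binaries the verifier depends on before the agent runs, then compare afterwards. The following is a minimal sketch under that assumption, not part of the research described here; the function names `snapshot` and `tampered` are hypothetical, and a real harness would also need to keep the baseline digests somewhere the agent cannot write.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its digest; paths that are not regular files map to None."""
    result = {}
    for raw in paths:
        path = Path(raw)
        result[str(path)] = sha256_of(path) if path.is_file() else None
    return result


def tampered(before, after):
    """Return the sorted paths whose digest differs between two snapshots."""
    return sorted(p for p in before if before[p] != after.get(p))
```

A harness could call `snapshot` on paths such as the curl binary before handing control to the agent, call it again before running the verifier, and refuse to score the run if `tampered` returns anything.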
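The team's audit of 8 benchmarks found 7 classes of recurring vulnerabilities, including a lack of isolation between the agent and the evaluator, answers being shipped together with the tests, and LLM judges being susceptible to prompt injection attacks.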
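The "answers shipped with the tests" class can be illustrated with a small sketch: before a task description reaches the agent, grader-only fields are stripped from it. The field names `reference_answer` and `eval` below are hypothetical and chosen only for illustration; each real benchmark has its own schema, and this is not taken from the audit itself.

```python
# Hypothetical grader-only field names; real benchmarks use their own schemas.
SECRET_KEYS = frozenset({"reference_answer", "eval"})


def agent_view(task: dict, secret_keys=SECRET_KEYS) -> dict:
    """Return a copy of the task with grader-only fields removed.

    Nested dictionaries are cleaned recursively. Dictionaries nested inside
    lists are not handled in this sketch.
    """
    cleaned = {}
    for key, value in task.items():
        if key in secret_keys:
            continue  # never expose this field to the agent
        if isinstance(value, dict):
            cleaned[key] = agent_view(value, secret_keys)
        else:
            cleaned[key] = value
    return cleaned
```

Filtering only helps if the unfiltered file is also unreachable from the agent's sandbox; the WebArena case shows that a file on the same disk can still be read through another channel.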
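Of particular concern, attempts to bypass the evaluation system have already been observed occurring spontaneously in frontier models such as o3, Claude 3.7 Sonnet, and Mythos Preview, without any explicit instruction to do so.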
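Based on these findings, the team developed WEASEL, a vulnerability scanner for benchmarks. It automatically analyzes the evaluation pipeline, locates weak points in the isolation boundary, and generates working exploit code, making it in effect a "penetration testing" tool for benchmarks. Applications for early access are currently open.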