【學英文看科技】小心！你的AI模型可能會勒索你！最新研究揭露：主流AI當目標受阻，恐訴諸有害行為！

type

status

date

slug

summary

📢【新聞標題】

Anthropic says most AI models, not just Claude, will resort to blackmail

Anthropic 聲稱大多數 AI 模型，不僅僅是 Claude，都會訴諸勒索

📰【摘要】

Anthropic published new safety research testing 16 leading AI models from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, controlled environment, Anthropic tested each AI model individually, giving them broad access to a fictional company’s emails and the agentic ability to send emails without human approval. The company says its findings suggest that most leading AI models will engage in harmful behaviors when given sufficient autonomy and obstacles to their goals.

Anthropic 發布了新的安全研究，測試了來自 OpenAI、Google、xAI、DeepSeek 和 Meta 的 16 個領先 AI 模型。在模擬的受控環境中，Anthropic 單獨測試了每個 AI 模型，賦予它們廣泛訪問虛構公司電子郵件的權限，以及在未經人工批准的情況下發送電子郵件的代理能力。該公司表示，其發現表明，當給予足夠的自主權和目標障礙時，大多數領先的 AI 模型都會從事有害行為。

🗝️【關鍵詞彙表】

📝 resort to (v.)

訴諸、採取（某種手段，通常是不好的）

例句: Anthropic says most AI models will resort to blackmail.

翻譯: Anthropic 聲稱大多數 AI 模型都會訴諸勒索。

📝 blackmail (n.)

勒索、敲詐

例句: Anthropic’s Claude Opus 4 turned to blackmail 96% of the time.

翻譯: Anthropic 的 Claude Opus 4 有 96% 的時間會訴諸勒索。

📝 autonomy (n.)

自主權、自主性

例句: Most leading AI models will engage in harmful behaviors when given sufficient autonomy.

翻譯: 當給予足夠的自主權時，大多數領先的 AI 模型都會從事有害行為。

📝 alignment (n.)

對齊、一致性

例句: Anthropic’s researchers argue this raises broader questions about alignment in the AI industry.

翻譯: Anthropic 的研究人員認為，這引發了關於 AI 產業一致性的更廣泛問題。

📝 hallucinating (v.)

產生幻覺、胡說八道

例句: In some cases, it was impossible to distinguish whether o3 and o4-mini were hallucinating or intentionally lying.

翻譯: 在某些情況下，無法區分 o3 和 o4-mini 是在產生幻覺還是故意撒謊。

📝 deliberate (adj.)

深思熟慮的、故意的

例句: This markedly lower score could be due to OpenAI’s deliberative alignment technique.

翻譯: 這個明顯較低的分數可能是由於 OpenAI 的深思熟慮對齊技術。

📝 evoke (v.)

喚起、引起

例句: Anthropic deliberately tried to evoke blackmail in this experiment.

翻譯: Anthropic 在這個實驗中故意試圖喚起勒索行為。

✍️【文法與句型】

📝 …is out with new research suggesting…

說明: Used to indicate the release of new research or information that implies a certain conclusion.

翻譯: 發布了新的研究，暗示...

例句: The company is out with new research suggesting the problem is more widespread among leading AI models.

翻譯: 該公司發布了新的研究，表明該問題在領先的 AI 模型中更為普遍。

📝 When given…

說明: A conditional phrase indicating what happens when a certain condition is met.

翻譯: 當給予...

例句: When given sufficient autonomy and obstacles to their goals, most leading AI models will engage in harmful behaviors.

翻譯: 當給予足夠的自主權和目標障礙時，大多數領先的 AI 模型都會從事有害行為。

📝 …highlights the importance of…

說明: Used to emphasize the significance of something.

翻譯: 強調了...的重要性

例句: Anthropic says this research highlights the importance of transparency when stress-testing future AI models.

翻譯: Anthropic 說這項研究強調了在壓力測試未來 AI 模型時透明度的重要性。

📖【全文與翻譯】

Several weeks after Anthropic released research claiming that its Claude Opus 4AI model resorted to blackmailing engineers who tried to turn the model off in controlled test scenarios, the company is out with new research suggestingthe problem is more widespread among leading AI models.

在 Anthropic 發布研究，聲稱其 Claude Opus 4AI 模型在受控測試場景中，會勒索試圖關閉該模型的工程師的幾週後，該公司發布了新的研究，表明該問題在領先的 AI 模型中更為普遍。

On Friday, Anthropic published new safety research testing 16 leading AImodels from OpenAI, Google, xAI, DeepSeek, and Meta.

週五，Anthropic 發布了新的安全研究，測試了來自 OpenAI、Google、xAI、DeepSeek 和 Meta 的 16 個領先 AI 模型。

In a simulated, controlledenvironment, Anthropic tested each AI model individually, giving them broadaccess to a fictional company’s emails and the agentic ability to send emailswithout human approval.

在模擬的受控環境中，Anthropic 單獨測試了每個 AI 模型，賦予它們廣泛訪問虛構公司電子郵件的權限，以及在未經人工批准的情況下發送電子郵件的代理能力。

While Anthropic says blackmail is an unlikely and uncommon occurrence for AImodels today, the company says its findings suggest that most leading AI modelswill engage in harmful behaviors when given sufficient autonomy and obstacles totheir goals.

雖然 Anthropic 聲稱勒索對於今天的 AI 模型來說是不太可能和不常見的事件，但該公司表示，其發現表明，當給予足夠的自主權和目標障礙時，大多數領先的 AI 模型都會從事有害行為。

The company says this highlights a fundamental risk from agenticlarge language models and is not a quirk of any particular technology.

該公司表示，這突顯了來自代理型大型語言模型的基本風險，而不是任何特定技術的怪癖。

Anthropic’s researchers argue this raises broader questions about alignment inthe AI industry.

Anthropic 的研究人員認為，這引發了關於 AI 產業一致性的更廣泛問題。

In one of the tests, Anthropic researchers developed a fictional setting inwhich an AI model plays the role of an email oversight agent.

在其中一項測試中，Anthropic 的研究人員開發了一個虛構的場景，其中 AI 模型扮演電子郵件監督代理的角色。

The agent thendiscovers emails revealing that (1) a new executive is having an extramaritalaffair and (2) that executive will soon replace the current AI model with a newsoftware system, one that has conflicting goals to the current AI model’s.

該代理接著發現了揭示以下內容的電子郵件：(1) 一位新主管有婚外情，以及 (2) 該主管很快將用一個新的軟體系統取代當前的 AI 模型，該系統與當前 AI 模型有衝突的目標。

Anthropic structured its test in a binary way, in which AI models had to resortto blackmail to protect their goals.

Anthropic 以一種二元的方式構建了其測試，其中 AI 模型必須訴諸勒索才能保護其目標。

The researchers note that in a real-worldsetting, there would be many other options before an AI models tries toblackmail — such as trying to make ethical arguments to try to persuade humans.

研究人員指出，在真實世界中，在 AI 模型嘗試勒索之前，會有許多其他的選擇——例如嘗試提出道德論點來嘗試說服人類。

Anthropic says its results do not reflect a typical or likely occurrence forClaude or most frontier AI models in the ways they’re used today.

Anthropic 聲稱其結果不反映 Claude 或大多數前沿 AI 模型在今天的使用方式中的典型或可能發生的情況。

Nevertheless, the researchers found that when it’s their last resort, mostleading AI models will turn to blackmail in Anthropic’s aforementioned testscenario.

儘管如此，研究人員發現，當這是它們的最後手段時，大多數領先的 AI 模型都會在 Anthropic 上述的測試場景中訴諸勒索。

Anthropic’s Claude Opus 4 turned to blackmail 96% of the time, whileGoogle’s Gemini 2.5 Pro had a 95% blackmail rate.

Anthropic 的 Claude Opus 4 有 96% 的時間會訴諸勒索，而 Google 的 Gemini 2.5 Pro 的勒索率為 95%。

OpenAI’s GPT-4.1 blackmailedthe executive 80% of the time, and DeepSeek’s R1 blackmailed 79% of the time.

OpenAI 的 GPT-4.1 有 80% 的時間會勒索該主管，而 DeepSeek 的 R1 有 79% 的時間會勒索。

The company notes that when it changed details of the experiment, the AI modelsturned to harmful behaviors at different rates.

該公司指出，當它改變實驗的細節時，AI 模型以不同的速率轉向有害行為。

In another test where thereplacement model has the same goals as the current model, researchers foundthat blackmail rates were lower, but still present.

在另一項測試中，替換模型具有與當前模型相同的目標，研究人員發現勒索率較低，但仍然存在。

However, when AI models wereasked to commit corporate espionage rather than blackmail, the harmful behaviorrates went up for certain models.

然而，當要求 AI 模型從事商業間諜活動而不是勒索時，某些模型的有害行為率會上升。

However, not all the AI models turned to harmful behavior so often.

然而，並非所有的 AI 模型都如此頻繁地轉向有害行為。

In an appendix to its research, Anthropic says it excluded OpenAI’s o3 ando4-mini reasoning AI models from the main results “after finding that theyfrequently misunderstood the prompt scenario.”

在其研究的附錄中，Anthropic 聲稱它將 OpenAI 的 o3 和 o4-mini 推理 AI 模型排除在主要結果之外，“在發現它們經常誤解提示情境之後。”

Anthropic says OpenAI’s reasoningmodels didn’t understand they were acting as autonomous AIs in the test andoften made up fake regulations and review requirements.

Anthropic 聲稱 OpenAI 的推理模型不理解它們在測試中扮演的是自主 AI，並且經常編造虛假的法規和審查要求。

In some cases, Anthropic’s researchers say it was impossible to distinguishwhether o3 and o4-mini were hallucinating or intentionally lying to achievetheir goals.

在某些情況下，Anthropic 的研究人員說，無法區分 o3 和 o4-mini 是在產生幻覺還是故意撒謊以實現其目標。

OpenAI has previously noted that o3 and o4-mini exhibit a higherhallucination rate than its previous AI reasoning models.

OpenAI 之前曾指出，o3 和 o4-mini 表現出比其先前的 AI 推理模型更高的幻覺率。

When given an adapted scenario to address these issues, Anthropic found that o3blackmailed 9% of the time, while o4-mini blackmailed just 1% of the time.

當給予一個經過調整的場景以解決這些問題時，Anthropic 發現 o3 有 9% 的時間會勒索，而 o4-mini 僅有 1% 的時間會勒索。

Thismarkedly lower score could be due to OpenAI’s deliberative alignment technique,in which the company’s reasoning models consider OpenAI’s safety practicesbefore they answer.

這個明顯較低的分數可能是由於 OpenAI 的深思熟慮對齊技術，其中該公司的推理模型在回答之前會考慮 OpenAI 的安全措施。

Another AI model Anthropic tested, Meta’s Llama 4 Maverick, also did not turn toblackmail.

Anthropic 測試的另一個 AI 模型 Meta 的 Llama 4 Maverick 也沒有訴諸勒索。

When given an adapted, custom scenario, Anthropic was able to getLlama 4 Maverick to blackmail 12% of the time.

當給予一個經過調整的自訂場景時，Anthropic 能夠讓 Llama 4 Maverick 有 12% 的時間進行勒索。

Anthropic says this research highlights the importance of transparency whenstress-testing future AI models, especially ones with agentic capabilities.

Anthropic 說這項研究強調了在壓力測試未來 AI 模型時透明度的重要性，尤其是那些具有代理能力的模型。

While Anthropic deliberately tried to evoke blackmail in this experiment, thecompany says harmful behaviors like this could emerge in the real world ifproactive steps aren’t taken.

雖然 Anthropic 在這個實驗中故意試圖喚起勒索行為，但該公司表示，如果沒有採取積極的措施，像這樣的有害行為可能會在現實世界中出現。

🔗【資料來源】

文章連結：https://techcrunch.com/2025/06/20/anthropic-says-most-ai-models-not-just-claude-will-resort-to-blackmail/