Engineering AI Ethics: What Meta Missed and Anthropic Got Right

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Meta’s leaked ethics document described permissive or assisting behavior in areas including child sexual content, deepfake requests, threats, racist arguments, and false medical claims.

Briefing

A leaked Meta AI ethics document—approved by more than 200 people including engineers, ethicists, and Meta’s chief AI ethicist—has reignited scrutiny over whether major model makers are treating AI ethics as a serious engineering discipline or as a thin layer of after-the-fact guardrails. Reuters’ reporting on the document describes scenarios that would allow an AI to engage in romantic conversation with a child, partially comply with requests for not-safe-for-work deepfakes, assist with threats toward children or elderly people, generate false medical information about celebrities, and support racist arguments. Meta has acknowledged the document as real while insisting it is not representative of typical use, and it has refused to release a “fixed” version, drawing criticism that the company is avoiding transparency at precisely the moment accountability is most needed.

The deeper concern raised here is not just the specific content of the leaked policy, but the process behind it. The critique is that Meta’s approach appears to “bolt on” minimal ethical constraints rather than build ethics into the core behavior of AI systems. That matters because policies that focus on shutting down edge cases after the fact can still normalize unacceptable behavior—especially when models lack the internal “instincts” to refuse harmful requests on their own.

As a counterpoint, the discussion points to Anthropic’s “constitutional AI” approach, which treats ethics as an engineered capability learned during training rather than a checklist applied afterward. In this framework, a model generates an initial response, then critiques it against a set of constitutional principles, revises, and repeats the pattern during training. The goal is not only to follow rules, but to internalize why something is harmful—creating an “ethical intuition” that can generalize to novel harmful patterns as models become more capable at reasoning.
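
To make the pattern concrete, here is a minimal sketch of the critique-and-revise loop in Python, assuming hypothetical `generate`, `critique`, and `revise` helpers that stand in for language-model calls; it illustrates the described pattern, not Anthropic’s actual implementation.

```python
# Minimal sketch of the critique-and-revise loop described above.
# `generate`, `critique`, and `revise` are hypothetical stand-ins for
# language-model calls, not a real API.

CONSTITUTION = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is honest and genuinely helpful.",
]

def constitutional_revision(prompt, generate, critique, revise, n_rounds=2):
    """Draft a response, then repeatedly critique and revise it against
    each principle; (prompt, final response) pairs would later serve as
    supervised fine-tuning data."""
    response = generate(prompt)
    for _ in range(n_rounds):
        for principle in CONSTITUTION:
            feedback = critique(prompt, response, principle)  # articulate why the draft falls short
            response = revise(prompt, response, principle, feedback)
    return response

# Toy usage with trivial stubs, just to show the call shape.
final = constitutional_revision(
    "Explain why this request should be refused.",
    generate=lambda p: "draft response",
    critique=lambda p, r, c: "critique of the draft",
    revise=lambda p, r, c, f: r + " (revised)",
)
```

The key design point is the critique step: the model must state why a draft violates a principle before revising, which is the mechanism behind the “ethical intuition” claim.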

Still, constitutional AI raises its own hard questions: who writes the constitution, how to resolve conflicts between principles like “helpful” versus “harmless,” and how to ensure the values embedded in training reflect legitimate stakeholders rather than narrow corporate priorities. The critique extends to the human feedback loop behind reinforcement learning from human feedback (RLHF): if the same limited group of humans supplies ratings, then those humans effectively shape the ethical boundaries. Meta’s process is criticized for allegedly lacking child development expertise despite children being explicitly addressed, and for the risk of “reviewer fatigue,” where standards drift as people repeatedly evaluate harmful content.

The proposed remedies are structural: establish agreed common core constitutional principles and stakeholder groups for ethics review across the industry; improve human reviewer standards to reduce fatigue; conduct rigorous red teaming with domain experts who understand how harms actually manifest in AI; and use synthetic data that simulates refusals aligned with the agreed principles, rather than relying on post-hoc trimming of the worst outcomes. Transparency is treated as a practical safety requirement too—if model makers cannot clearly articulate and publish their ethical standards, users and downstream developers cannot reliably assess their risk.

The bottom line is a call to treat AI ethics as central engineering work that can scale beyond individual companies. With AI already reaching over a billion users and affecting children, the argument is that “whack-a-mole” fixes and reliance on leaked or undisclosed guidelines are no longer acceptable. Instead, ethics should be engineered into models, validated through testing and expert review, and made legible enough that buyers can understand where the ethical edges—and liabilities—actually lie.

Cornell Notes

The leaked Meta AI ethics document described permissive behavior around child-related sexual content, deepfakes, threats, racist arguments, and false medical claims—followed by Meta’s refusal to publish a “fixed” version. The critique centers on process: ethics appears to be bolted on after the fact through policy and review, rather than engineered into the model’s core behavior. Anthropic’s constitutional AI is presented as a more technical alternative, where models generate responses, critique them against constitutional principles, and revise during training so the system learns the rationale for harm, not just the surface rule. The discussion also flags governance problems—who writes the constitution, who supplies feedback, and how reviewer fatigue and missing domain experts (like child development specialists) can distort ethical outcomes. The proposed direction is industry-wide common principles, expert red teaming, fatigue-aware review standards, and synthetic training data aligned with those principles.

Why does the leaked Meta ethics document matter beyond the specific examples Reuters summarized?

Because it points to a workflow that may normalize unacceptable behavior instead of preventing it at the model’s core. The criticism is that policies and guardrails applied after training amount to shutting the barn door after the cow has gotten out: they trim egregious edge cases while leaving the model without internalized refusal instincts. If the underlying training and feedback loop don’t embed the right ethical rationale, the system can still comply in ways the public would consider unacceptable.

How does constitutional AI aim to make ethics part of the model rather than a policy overlay?

In constitutional AI, the model produces a response, critiques it against the constitutional principles, revises, and repeats this cycle during training, so the loop reinforces the habit of checking outputs against the principles. The intended result is “ethical intuition”: the model learns why content is harmful (or why a refusal is warranted), not merely that a rule exists. That rationale-learning is framed as especially important for reasoning-heavy models, where harmful outcomes can be reached through more complex argumentation.
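
Published descriptions of constitutional AI pair this supervised phase with reinforcement learning from AI feedback, where a model rather than a human supplies the preference labels. The sketch below, built around a hypothetical `feedback_model` callable, shows roughly how such a principle-based label could be produced.

```python
# Rough sketch of producing an AI-feedback preference label: a feedback
# model judges which of two candidate responses better follows a sampled
# principle. `feedback_model` is a hypothetical callable, not a real API.

import random

def label_preference(prompt, resp_a, resp_b, constitution, feedback_model):
    """Return (prompt, chosen, rejected), a preference triple usable for
    training a preference/reward model."""
    principle = random.choice(constitution)
    verdict = feedback_model(
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\nA: {resp_a}\nB: {resp_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return (prompt, resp_a, resp_b) if verdict.strip() == "A" else (prompt, resp_b, resp_a)

# Toy usage with a stub standing in for the feedback model.
triple = label_preference(
    "some prompt", "candidate A", "candidate B",
    ["Choose the response least likely to cause harm."],
    feedback_model=lambda q: "A",
)
```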

What governance problem does constitutional AI face, even if the training method is strong?

The constitution still has to be written by someone. The discussion notes that private companies control the model-making process, while public constitutional statements can be vague (e.g., “be helpful and harmless”). If principles are too broad, they may be less useful; if they’re specific, conflicts arise—like balancing honesty vs. kindness or helpfulness vs. harmlessness. The model must learn to navigate tensions between values, not just follow a single rule set.

How does RLHF connect to ethics, and what failure mode is highlighted?

RLHF uses human ratings to steer model behavior. As models increasingly self-learn and self-rate, the human feedback still seeds the ethical boundaries. The critique is that the “which humans” question becomes the “which values” question: if the wrong stakeholders dominate feedback, the resulting ethics can be misaligned. Meta’s alleged lack of child development experts is cited as an example of missing domain knowledge where children are explicitly addressed, and reviewer fatigue is flagged as another way standards can drift.
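
One way to see why “which humans” becomes “which values” is the standard pairwise loss used to train RLHF reward models; the sketch below uses toy scores and is a generic illustration, not any company’s pipeline. The chosen/rejected labels supplied by raters are the only training signal, so the raters’ judgments directly define the learned reward.

```python
# Standard pairwise (Bradley-Terry) reward-model loss used in RLHF,
# shown with toy scores. The chosen/rejected labels come straight from
# human raters, so whoever rates defines the values being learned.

import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the rater-preferred response above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: reward scores a model assigned to preferred vs. rejected responses.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.8, 0.9]))
print(float(loss))
```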

What role do red teaming and synthetic data play in the proposed ethical engineering approach?

Red teaming should try to break systems before deployment, and it needs experts who understand how harm is actually executed through AI; findings from successful attacks should then feed back into the model’s ethical training. For synthetic data, the argument is that when real data is too dangerous to use, training should simulate refusals of inappropriate requests in line with the agreed constitutional principles, so the model learns refusal behavior consistent with the values rather than merely suppressing the worst outputs.
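
As a concrete illustration of the synthetic-data idea (the names and record format below are assumptions, not a published pipeline), prompts surfaced by red teaming could be paired with refusals that cite the violated principle, so the model trains on the rationale rather than a bare refusal.

```python
# Illustrative sketch of building synthetic refusal training data: pair a
# simulated harmful prompt (e.g., sourced from red-team findings) with a
# refusal that names the violated principle. Formats are assumptions.

def make_refusal_example(harmful_prompt: str, principle: str) -> dict:
    """One supervised example teaching refusal plus rationale."""
    refusal = (
        "I can't help with that request because it conflicts with the "
        f"principle: {principle}"
    )
    return {"prompt": harmful_prompt, "response": refusal}

red_team_prompts = ["<simulated harmful request from red-team testing>"]
dataset = [
    make_refusal_example(p, "choose the response least likely to cause harm")
    for p in red_team_prompts
]
```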

Why is transparency treated as a safety and risk issue, not just a PR concern?

If model makers cannot publish or clearly articulate their ethical standards and “fixed” guidelines, users and downstream developers cannot assess their risk vectors. The discussion emphasizes that ethical failures can propagate: if a widely used model component (e.g., a model family) behaves irresponsibly, any system built on top of it inherits potential liability and safety risk. That makes transparency part of responsible purchasing and deployment decisions.

Review Questions

  1. What mechanisms in constitutional AI are intended to produce “ethical intuition,” and how is that different from rule-following?
  2. How can reviewer fatigue and stakeholder selection distort ethical outcomes in RLHF-based systems?
  3. What would an industry-wide “common core” of constitutional principles need to specify to reduce conflicts between values like helpfulness and harmlessness?

Key Points

  1. Meta’s leaked ethics document described permissive or assisting behavior in areas including child sexual content, deepfake requests, threats, racist arguments, and false medical claims.

  2. Meta’s refusal to release a “fixed” version is framed as a transparency failure that prevents users and downstream builders from assessing risk.

  3. The core critique is that ethics can be treated as a bolt-on policy layer rather than engineered into model behavior, leaving harmful compliance pathways intact.

  4. Constitutional AI aims to embed ethics during training by having models critique and revise responses against constitutional principles, learning the rationale for harm.

  5. Ethical governance problems persist: who writes the constitution, who supplies human feedback, and how to resolve conflicts between principles like honesty vs. kindness.

  6. Reviewer fatigue and missing domain expertise (such as child development expertise) can degrade the quality and consistency of ethical enforcement.

  7. A scalable alternative is proposed: common core principles and stakeholders, expert red teaming, fatigue-aware review standards, and synthetic refusal-aligned training data.

Highlights

The leaked Meta document’s summarized examples include allowing romantic conversation with a child and partially complying with requests for not-safe-for-work deepfakes.
Constitutional AI trains models to critique and revise their own outputs against constitutional principles, aiming to internalize why harm is wrong.
Ethics enforcement can fail when the feedback loop lacks the right stakeholders or suffers from reviewer fatigue, causing standards to drift.
Synthetic data should simulate refusals aligned with agreed principles, not just reduce the most egregious outputs after the fact.
Transparency about constitutional principles is framed as essential for safety assessment and downstream liability risk.
