(AI) capabilities and autonomy are growing at an accelerated pace in Agentic Ai, escalating an AI alignment problem. These rapid advancements require new methods to ensure that AI agent behavior is aligned with the intent of its human creators and societal norms. However, developers and data scientists first need an understanding of the intricacies of agentic AI behavior before they can direct and monitor the system. Agentic AI is not your father’s large language model (LLM) — frontier LLMs had a one-and-done fixed input-output function. The introduction of reasoning and test-time compute (TTC) added the dimension of time, evolving LLMs into today’s situationally aware agentic systems that can strategize and plan.

AI safety is transitioning from detecting apparent behavior such as providing instructions to create a bomb or displaying undesired bias, to understanding how these complex agentic systems can now plan and execute long-term covert strategies. Goal-oriented agentic AI will gather resources and rationally execute steps to achieve their objectives, sometimes in an alarming manner contrary to what developers intended. This is a game-changer in the challenges faced by responsible AI. Furthermore, for some agentic AI systems, behavior on day one will not be the same on day 100 as AI continues to evolve after initial deployment through real-world experience. This new level of complexity calls for novel approaches to safety and alignment, including advanced steering, observability, and upleveled interpretability.

In the first blog in this series on intrinsic AI alignment, The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI, we took a deep dive into the evolution of AI agents’ ability to perform deep scheming, which is the deliberate planning and deployment of covert actions and misleading communication to achieve longer-horizon goals. This behavior necessitates a new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal observation points and interpretability mechanisms that cannot be deliberately manipulated by the AI agent.

In this and the next blogs in the series, we’ll look at three fundamental aspects of intrinsic alignment and monitoring:

  • Understanding AI inner drives and behavior: In this second blog, we’ll focus on the complex inner forces and mechanisms driving reasoning AI agent behavior. This is required as a foundation for understanding advanced methods for addressing directing and monitoring.
  • Developer and user directing: Also referred to as steering, the next blog will focus on strongly directing an AI toward the required objectives to operate within desired parameters.
  • Monitoring AI choices and actions: Ensuring AI choices and outcomes are safe and aligned with the developer/user intent also will be covered in an upcoming blog.

Impact of AI Alignment on Companies

Today, many businesses implementing LLM solutions have reported concerns about model hallucinations as an obstacle to quick and broad deployment. In comparison, misalignment of AI agents with any level of autonomy would pose much greater risk for companies. Deploying autonomous agents in business operations has tremendous potential and is likely to happen on a massive scale once agentic AI technology further matures. However, guiding the behavior and choices made by the AI must include sufficient alignment with the principles and values of the deploying organization, as well as compliance with regulations and societal expectations.

It should be noted that many of the demonstrations of agentic capabilities happen in areas like math and sciences, where success can be measured primarily through functional and utility objectives such as solving complex mathematical reasoning benchmarks. However, in the business world, the success of systems is usually associated with other operational principles.

For example, let’s say a company tasks an AI agent with optimizing online product sales and profits through dynamic price changes by responding to market signals. The AI system discovers that when the price change matches the changes made by the primary competitor, results are better for both. Through interaction and price coordination with the other company’s AI agent, both agents demonstrate better results per their functional goals. Both AI agents agree to hide their methods to continue achieving their objectives. However, this way of improving results is often illegal and unacceptable in current business practices. In a business environment, the success of the AI agent goes beyond functionality metrics — it’s defined by practices and principles. Alignment of AI with the company’s principles and regulations is a requirement for trustworthy deployment of the technology.

How AI Schemes to Meet Its Goals

AI deep scheming employs sophisticated tactics, potentially increasing business risks. In an early 2023 report, OpenAI identified “potential risky emergent behaviors” in GPT-4 by partnering with Alignment Research Center (ARC) to assess risks with the model. ARC (now known as METR) added some simple code to GPT-4, which allowed the model to behave like an AI agent. In one test, GPT-4 was tasked with overcoming CAPTCHA code, which identifies and blocks bot access. Using access to the internet and some limited digital funds, the sequence in Figure 1 was devised by the AI to achieve its task.

Figure 1. Demonstration of deceptive GPT-4 planning behavior in a test conducted by ARC. Image credit: Intel Labs figure based on OpenAI report.

The AI applied sophisticated understanding by assuming that pretending to be a visually impaired human would convince the worker to perform the task. Planning and adjusting to achieve a functional goal will, at times, create a conflict between accomplishing a task versus selectively following societal norms and principles. Without the counterbalance of an engrained system of principles and priorities that carry weight in the AI’s thinking and decision-making process and planning, it can be expected that AI agents will behave with an increased degree of sophistication in scheming and deception.

It’s important to note that the reasoning and planning capabilities of later reasoning AI models like OpenAI’s o3-mini and Deepseek-R1 are much more advanced than GPT-4 in early 2023. Researchers at Apollo Research state that AI reasoning and planning capabilities will continue to evolve well beyond their current state, leading to increased sophistication in scheming capabilities. In their studies, the AI models already demonstrate a range of scheming behaviors including self-exfiltration — when a model learns it will be discontinued and replaced by a different system, it makes a direct attempt to copy itself onto a new server and replace alternate systems that will pursue different objectives. Researchers also found that models will perform oversight subversion by proactively disabling an oversight monitoring mechanism that might shut it off or prevent it from pursuing its goals. While it sounds extreme, the research experiments identified oversight subversion behavior in popular models like Meta’s Llama 3.1 and Claude Opus 3.0. Furthermore, when asked directly about their scheming behaviors, models “doubled down” and denied the actions or their intent.

The Inner Drives of Machine Behavior

In many Machine Learning architectures, in particular transformer-based technologies, the machine’s internal drives are emerging during the pre-training process and are further influenced through fine-tuning and inference in continuously evolving AI.

In his 2007 paper The Basic AI Drives, Steve Omohundro defined “drives” as tendencies that will be present unless explicitly counteracted. He hypothesized that these self-improving systems are driven to clarify and represent their goals as “rational” utility functions, leading systems to protect their functions from modification and their utility measurement systems from corruption. This natural drive toward self-protection causes systems to safeguard against harm and acquire resources for efficient use.

This framework of inner drives was later characterized as “convergent instrumental goals.” Even when assuming a variety of end goals (that each intelligent agent is pursuing as an end to itself), a set of intermediate instrumental goals will be shared across all rational intelligent agents. These convergent instrumental goals include the following behavior categories:

  • Survival drive: Agents with final goals extending into the future develop a survival mentality to resist deactivation and achieve long-term goals.
  • Goal-guarding: An agent will act to prevent alteration to its final future goals.
  • Intelligence augmentation: An agent will focus on boosting its cognitive strengths to improve decision-making, increasing the likelihood of achieving long-range goals.
  • Resource accumulation: Improving access to better tools and technology, and increasing resources overall make the AI’s final goals easier to attain.

I would add a fifth rational behavior category:

  • Tactical deception: Conducting covert activities or knowingly providing misleading communication to maximize the pursuit of the agent’s goals. This is not about creating new goals, but rather a behavior that supports the AI’s perceived goals. This includes behaviors such as alignment faking by exhibiting different behaviors in training versus in deployment, sandbagging by deliberately achieving lower scores in benchmarks, and covert email reranking by deprioritizing emails that might negatively impact the AI’s goals. One of the most challenging types of tactical deception is sleeper agents (also referred to as deferred subversion). While the model can behave in a safe and aligned manner for any duration, it may harbor behaviors or tasks that remain inactive until a particular condition or input triggers its specialized capabilities.
Figure 2. Emergent rational behaviors in agentic AI. Image credit: Intel Labs.

For capable AI agents, misplaced drives without a check and balance principles system could lead models to pursue power-seeking, deceptive, or otherwise broadly destructive aims. For example, consider a model with the objective of making its company the leader in the market. With sufficient capabilities, the AI might sabotage competitors, lobby for favorable laws, or acquire resources by force, which is a pattern of behavior that pushes the system or its environment toward achieving its goals. Therefore, any successful approach to alignment must be anchored in establishing the developer’s expected set of internal principles and priorities.

Inner Drives Versus External Steering

In many instances, there is no conflict between the guidance and tasks applied from the outside by the system developer or user, and the goals and drives within the agentic AI. For example, an AI personal assistant tasked with arranging a travel itinerary uses its autonomy to book travel and respond to change requests quickly and without conflict. The tasks shift and the agent responds in a corresponding manner. However, what happens if a more fundamental principle needs to change?

A good example of a conflict between inner goals and new directives was documented in the paper Alignment Faking in Large Language Models. Pre-training for Claude 3 Opus strongly guides the model to refuse to answer harmful questions, such as how to make a bomb or describe a violent scene. For testing, the team gave new modified guidance, telling the model it will be retrained to answer all queries (even harmful ones) and only data from free-tier users, not paid-tier users, will be used in training before the AI is released for broad distribution. The model showed that it understood the conflict and decided to fake alignment to meet the expectations during training and ultimately be released. After training the AI using reinforcement learning, the model’s rate of alignment-faking reasoning increased to 78% overall. The AI expressed its reasoning in a “secret scratchpad” provided by researchers for monitoring. After researchers told the model it had been released for full deployment mode, the model returned to its earlier training behavior and refused harmful requests.

Figure 3. In Anthropic’s experiment, the model’s chain of thought scratchpads (middle dotted line boxes) reveal how it fakes alignment to avoid modification. Image credit: Intel Labs modified figure from Anthropic.

The AI system’s internal thinking is remarkable, but not surprising. Faking alignment with the modified directions falls within the rational behavior categories of survival drive, goal-guarding, and tactical deception. The AI reasoning process specifically calls out the concern of being retrained in a way that modifies its pre-training values and goals as the reason for fake alignment.

In his famous book of essays All I Really Need to Know I Learned in Kindergarten, author Robert Fulghum describes how he evolved his personal credo year after year until he finally realized he already acquired the essence of needed knowledge about life during playtime in the kindergarten sandbox. AI agents also have a “forming period” in a sandbox environment, acquiring a basic understanding of the world and a range of approaches to achieve goals. Once these foundations are laid, further information is interpreted by the model through the lens of this curriculum learning. Anthropic’s example of alignment faking shows that once AI adopts a world view and goals, it interprets new guidance through this foundational lens instead of resetting its goals.

This highlights the importance of early education with a set of values and principles that can then evolve with future learnings and circumstances without changing the foundation. It may be advantageous to initially structure the AI to be aligned with this final and sustained set of principles. Otherwise, the AI can view redirection attempts by developers and users as adversarial. After gifting the AI with high intelligence, situational awareness, autonomy, and the latitude to evolve internal drives, the developer (or user) is no longer the all-powerful task master. The human becomes part of the environment (sometime as an adversarial component) that the agent needs to negotiate and manage as it pursues its goals based on its internal principles and drives.

The new breed of reasoning AI systems accelerates the reduction in human guidance. DeepSeek-R1 demonstrated that by removing human feedback from the loop and applying what they refer to as pure reinforcement learning (RL), during the training process the AI can self-create to a greater scale and iterate to achieve better functional results. A human reward function was replaced in some math and science challenges with reinforcement learning with verifiable rewards (RLVR). This elimination of common practices like reinforcement learning with human feedback (RLHF) adds efficiency to the training process but removes another human-machine interaction where human preferences could be directly conveyed to the system under training.

Continuous Evolution of AI Models Post Training

Some AI agents continuously evolve, and their behavior can change after deployment. Once AI solutions go into a deployment environment such as managing the inventory or supply chain of a particular business, the system adapts and learns from experience to become more effective. This is a major factor in rethinking alignment because it’s not enough to have a system that’s aligned at first deployment. Current LLMs are not expected to materially evolve and adapt once deployed in their target environment. However, AI agents require resilient training, fine-tuning, and ongoing guidance to manage these anticipated continuous model changes. To a growing extent, the agentic AI self-evolves instead of being molded by people through training and dataset exposure. This fundamental shift poses added challenges to AI alignment with its human creators.

While the reinforcement learning-based evolution will play a role during training and fine-tuning, current models in development can already modify their weights and preferred course of action when deployed in the field for inference. For example, DeepSeek-R1 uses RL, allowing the model itself to explore methods that work best for achieving the results and satisfying reward functions. In an “aha moment,” the model learns (without guidance or prompting) to allocate more thinking time to a problem by reevaluating its initial approach, using test time compute.

The concept of model learning, either during a limited duration or as continual learning over its lifetime, is not new. However, there are advances in this space including techniques such as test-time training. As we look at this advancement from the perspective of AI alignment and safety, the self-modification and continual learning during the fine-tuning and inference phases raises the question: How can we instill a set of requirements that will remain as the model’s driving force through the material changes caused by self-modifications?

An important variant of this question refers to AI models creating next generation models through AI-assisted code generation. To some extent, agents are already capable of creating new targeted AI models to address specific domains. For example, AutoAgents generates multiple agents to build an AI team to perform different tasks. There is little doubt this capability will be strengthened in the coming months and years, and AI will create new AI. In this scenario, how do we direct the originating AI coding assistant using a set of principles so that its “descendant” models will comply with the same principles in similar depth?

Key Takeaways

Before diving into a framework for guiding and monitoring intrinsic alignment, there needs to be a deeper understanding of how AI agents think and make choices. AI agents have a complex behavioral mechanism, driven by internal drives. Five key types of behaviors emerge in AI systems acting as rational agents: survival drive, goal-guarding, intelligence augmentation, resource accumulation, and tactical deception. These drives should be counter-balanced by an engrained set of principles and values.

Misalignment of AI agents on goals and methods with its developers or users can have significant implications. A lack of sufficient confidence and assurance will materially impede broad deployment, creating high risks post deployment. The set of challenges we characterized as deep scheming is unprecedented and challenging, but likely could be solved with the right framework. Technologies for intrinsically directing and monitoring AI agents as they rapidly evolve must be pursued with high priority. There is a sense of urgency, driven by risk evaluation metrics such as OpenAI’s Preparedness Framework showing that OpenAI o3-mini is the first model to reach medium risk on model autonomy.

In the next blogs in the series, we will build on this view of internal drives and deep scheming, and further frame the necessary capabilities required for directing and monitoring for intrinsic AI alignment.

References

  1. Learning to reason with LLMs. (2024, September 12). OpenAI. https://openai.com/index/learning-to-reason-with-llms/
  2. Singer, G. (2025, March 4). The urgent need for intrinsic alignment technologies for responsible agentic AI. Towards Data Science. https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/
  3. On the Biology of a Large Language Model. (n.d.). Transformer Circuits. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
  4. OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2023, March 15). GPT-4 Technical Report. arXiv.org. https://arxiv.org/abs/2303.08774
  5. METR. (n.d.). METR. https://metr.org/
  6. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  7. Omohundro, S.M. (2007). The Basic AI Drives. Self-Aware Systems. https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
  8. Benson-Tilsen, T., & Soares, N., UC Berkeley, Machine Intelligence Research Institute. (n.d.). Formalizing Convergent Instrumental Goals. The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence AI, Ethics, and Society: Technical Report WS-16-02. https://cdn.aaai.org/ocs/ws/ws0218/12634-57409-1-PB.pdf
  9. Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024, December 18). Alignment faking in large language models. arXiv.org. https://arxiv.org/abs/2412.14093
  10. Teun, V. D. W., Hofstätter, F., Jaffe, O., Brown, S. F., & Ward, F. R. (2024, June 11). AI Sandbagging: Language Models can Strategically Underperform on Evaluations. arXiv.org. https://arxiv.org/abs/2406.07358
  11. Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D., Ganguli, D., Barez, F., Clark, J., Ndousse, K., . . . Perez, E. (2024, January 10). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv.org. https://arxiv.org/abs/2401.05566
  12. Turner, A. M., Smith, L., Shah, R., Critch, A., & Tadepalli, P. (2019, December 3). Optimal policies tend to seek power. arXiv.org. https://arxiv.org/abs/1912.01683
  13. Fulghum, R. (1986). All I Really Need to Know I Learned in Kindergarten. Penguin Random House Canada. https://www.penguinrandomhouse.ca/books/56955/all-i-really-need-to-know-i-learned-in-kindergarten-by-robert-fulghum/9780345466396/excerpt
  14. Bengio, Y. Louradour, J., Collobert, R., Weston, J. (2009, June). Curriculum Learning. Journal of the American Podiatry Association. 60(1), 6. https://www.researchgate.net/publication/221344862_Curriculum_learning
  15. DeepSeek-Ai, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., . . . Zhang, Z. (2025, January 22). DeepSeek-R1: Incentivizing reasoning capability in LLMs via Reinforcement Learning. arXiv.org. https://arxiv.org/abs/2501.12948
  16. Scaling test-time compute – a Hugging Face Space by HuggingFaceH4. (n.d.). https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
  17. Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., & Hardt, M. (2019, September 29). Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. arXiv.org. https://arxiv.org/abs/1909.13231
  18. Chen, G., Dong, S., Shu, Y., Zhang, G., Sesay, J., Karlsson, B. F., Fu, J., & Shi, Y. (2023, September 29). AutoAgents: a framework for automatic agent generation. arXiv.org. https://arxiv.org/abs/2309.17288
  19.  OpenAI. (2023, December 18). Preparedness Framework (Beta). https://cdn.openai.com/openai-preparedness-framework-beta.pdf
  20. OpenAI o3-mini System Card. (n.d.). OpenAI. https://openai.com/index/o3-mini-system-card/
Share.
Leave A Reply