 

#  Preparing AI for Sequential Decision-Making — and Why It Matters 

 





June 18, 2026

 

 

Blog Series: [PUBLIC IMPACT ANALYTICS SCIENCE (PIAS)](/blog)

 ![AI](/sites/g/files/omnuum11646/files/2026-06/Picture1.png)

 

*From a few demonstrations to a near-optimal policy: fine-tuned LLM reasoning across time.*

Large language models (LLMs) have become remarkably good at learning from examples on the fly. Show a pretrained model a handful of demonstrations in its prompt, and it can often handle a new task without any retraining, a behavior known as *in-context learning*. Most of the excitement around this ability has centered on one-shot tasks like classification or translation. But many of the most consequential problems in the real world are not one-shot at all. They unfold *over time*, and what makes them challenging is not knowing what comes next. This is partly why I devoted a full chapter to the science of analytics for sequential decision-making in my recent book “[Insight-Driven Problem Solving](https://a.co/d/02ctTdDQ)” \[1\]. Not knowing what comes next often forces us to make suboptimal decisions in our daily lives. Yet, as I point out in the book, it is arguably also what makes life possible:

“The only thing that makes life possible is permanent, intolerable uncertainty: not knowing what comes next.” --Ursula K. Le Guin, *The Left Hand of Darkness*

Not knowing what comes next necessitates using intelligence (artificial or biological) for decision-making over time. In medicine, for example, a clinician does not make a single decision and walk away. They prescribe, observe the response through medical tests, adjust, and prescribe again, often over months or years, and often without ever seeing the full picture of a patient’s changing underlying condition. Intelligent decision-making over time is exactly the setting in which recent advancements in AI could be highly beneficial, and hence, the setting we set out to study in our recent paper \[2\]. Our question was simple to state: can we equip a pretrained LLM to make good sequential decisions from offline data, and can we understand *why* it works?

To this end, we framed sequential decision-making through three nested settings of increasing realism. The simplest is the Markov Decision Process (MDP), in which the agent observes the full state of the world at every step. More realistic is the Partially Observable MDP (POMDP), where the agent only receives noisy observations and must reason about a hidden state. Most realistic of all is the Ambiguous POMDP (APOMDP) \[3, 4\], where the agent does not even know which model of the world is correct—it must hedge across a set of plausible models.
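To make the step from MDP to POMDP concrete, here is a minimal sketch of the standard Bayes-filter belief update an agent can use to reason about a hidden state. The two-state transition and observation tables are invented for illustration and are not taken from the paper \[2\]. In an ambiguous POMDP, one would roughly expect a separate update of this kind for each candidate model in the ambiguity set.

```python
from typing import List

Matrix = List[List[float]]


def belief_update(belief: List[float], transition: Matrix,
                  obs_prob: Matrix, observation: int) -> List[float]:
    """One Bayes-filter step for a fixed action.

    transition[s][s2] = P(next state s2 | state s, chosen action)
    obs_prob[s2][o]   = P(observation o | next state s2, chosen action)
    """
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correct: reweight by how likely the observation is in each state.
    unnormalized = [obs_prob[s2][observation] * predicted[s2]
                    for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this model")
    return [u / total for u in unnormalized]


# Hypothetical two-state example (state 0 = "stable", state 1 = "worsening").
transition = [[0.9, 0.1], [0.2, 0.8]]
obs_prob = [[0.8, 0.2], [0.3, 0.7]]  # columns: observation 0 or 1
new_belief = belief_update([0.5, 0.5], transition, obs_prob, observation=0)
print([round(b, 3) for b in new_belief])
```

Seeing observation 0, which is more likely in the "stable" state, shifts the belief toward that state without ever revealing the state itself.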

This last setting matters enormously in practice. Standard approaches to learning effective dynamic policies from observational data in healthcare and many other applications lean on a strong assumption: that every relevant confounder is observed. That assumption is routinely violated in applications. Ambiguous Dynamic Treatment Regimes \[4\] were developed precisely to relax it, evaluating candidate dynamic policies against a whole family of data-generating models rather than betting everything on one. APOMDPs give that idea a formal home, and they are where partial observability and model ambiguity meet.

**Our approach: fine-tune, don’t rebuild**

Prior work on in-context decision-making using AI tends to train a transformer from scratch and to stay within fully observable MDPs (see, e.g., \[5\] for more details). We took a different route. Rather than building a specialized model from the ground up, we fine-tune an open-source pretrained LLM so that it can read a few demonstration trajectories and then act in a new, unseen task—with no further parameter updates at test time.

The recipe has two ingredients. First, we use offline trajectories paired with an optimal (or near-optimal) action oracle to synthesize high-quality demonstrations. This is a natural fit for domains like healthcare, where running live experiments is costly or unethical, but where logged observational data are plentiful and expert-derived supervision can be reconstructed. Second, we use parameter-efficient adapters (QLoRA) to fine-tune the model on serialized trajectories, keeping the base weights frozen. The result is a history-conditioned policy that adapts to new tasks from just a couple of in-context examples.
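As a rough illustration of what "serialized trajectories" could look like, the sketch below turns a few demonstration trajectories and the history of a new task into a single text prompt that ends where the model should supply the next action. The text format, the field names, and the helper names `serialize_trajectory` and `build_prompt` are hypothetical; the paper \[2\] may serialize trajectories differently.

```python
from typing import List, Tuple

# One step of a trajectory: (observation, action, reward). Hypothetical schema.
Step = Tuple[str, str, float]


def serialize_trajectory(steps: List[Step]) -> str:
    """Render a trajectory as one line of text per time step."""
    return "\n".join(
        f"t={t} obs={obs} action={act} reward={rew:.2f}"
        for t, (obs, act, rew) in enumerate(steps)
    )


def build_prompt(demos: List[List[Step]], history: List[Step],
                 current_obs: str) -> str:
    """Concatenate demonstrations, then the new task up to the pending action."""
    blocks = [
        f"### Demonstration {i + 1}\n{serialize_trajectory(demo)}"
        for i, demo in enumerate(demos)
    ]
    query_lines = ["### New task"]
    if history:
        query_lines.append(serialize_trajectory(history))
    # The prompt stops right before the action the model is asked to produce.
    query_lines.append(f"t={len(history)} obs={current_obs} action=")
    return "\n\n".join(blocks + ["\n".join(query_lines)])


# Hypothetical energy-management style example.
demos = [[("low", "charge", 1.0), ("high", "discharge", 2.0)]]
history = [("low", "charge", 1.0)]
prompt = build_prompt(demos, history, current_obs="high")
print(prompt)
```

During fine-tuning, the oracle's action at the pending step would serve as the supervised target; at test time the same prompt layout is used with no further parameter updates.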

Starting from a pretrained LLM rather than a blank transformer is a deliberate choice. It preserves the broad representational capacity gained during pretraining and opens a path toward incorporating human intuition and diverse inputs—the kind of human–AI “centaur” collaboration that is so valuable in high-stakes settings \[6\].

**Why it works: a look inside the attention layer**

We did not want to leave the “why” to intuition alone in our study \[2\]. For the case of linear MDPs, we analyze a single linear self-attention layer trained to predict optimal Q-values, building on recent theory connecting attention to in-context linear regression. The interpretation that emerges is appealing: the trained attention layer behaves like an implicit estimator of the optimal Q-function for each new task, reading the in-context examples and producing a covariance-corrected estimate.
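The "covariance-corrected estimate" can be illustrated with a small sketch of in-context linear regression. It assumes two-dimensional features and uses the empirical feature covariance of the context as the correction, which amounts to ordinary (optionally ridge-regularized) least squares. This is a simplification chosen for illustration; the correction learned by the trained attention layer in \[2\] need not match it, and the function name and data below are hypothetical.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def in_context_q_estimate(context: List[Tuple[Vec, float]], query: Vec,
                          ridge: float = 0.0) -> float:
    """Estimate Q at `query` from (feature, Q-value) pairs given in context.

    Computes query^T (Sigma_hat + ridge * I)^{-1} (1/n) sum_i phi_i * y_i,
    i.e. a feature-target correlation corrected by the inverse covariance.
    """
    n = len(context)
    # Empirical second-moment matrix of the context features (2x2, symmetric).
    s00 = sum(p[0] * p[0] for p, _ in context) / n + ridge
    s01 = sum(p[0] * p[1] for p, _ in context) / n
    s11 = sum(p[1] * p[1] for p, _ in context) / n + ridge
    # Feature-target correlation vector.
    c0 = sum(p[0] * y for p, y in context) / n
    c1 = sum(p[1] * y for p, y in context) / n
    det = s00 * s11 - s01 * s01
    if abs(det) < 1e-12:
        raise ValueError("context features are degenerate; use ridge > 0")
    # Apply the explicit 2x2 inverse to obtain the weight estimate.
    w0 = (s11 * c0 - s01 * c1) / det
    w1 = (s00 * c1 - s01 * c0) / det
    return query[0] * w0 + query[1] * w1


# Hypothetical task whose Q-values are exactly linear: Q = 2*f0 - 1*f1.
true_w = (2.0, -1.0)
features = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
context = [(f, true_w[0] * f[0] + true_w[1] * f[1]) for f in features]
estimate = in_context_q_estimate(context, query=(3.0, 2.0))
print(estimate)
```

With noiseless linear targets and non-degenerate features, the estimate recovers the true value at the query point; with noisy targets, the error would shrink as more context examples are supplied.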

From that prediction-level view we derive an end-to-end bound on how far the resulting policy can fall short of optimal. The bound separates cleanly into two pieces. One term is the in-context estimation error, which shrinks as you provide more support trajectories at test time. The other is a training-length bias, which shrinks as you train on more trajectories. In short: more demonstrations sharpen what the model infers in the moment, while more training data reduce a systematic bias baked in during fine-tuning. The two levers do different jobs.

**What we found in experiments**

We evaluated across all three settings using an energy-management task and its partially observed and ambiguous variants, measuring an optimality gap—how far the learned policy’s reward falls below that of the optimal policy. The headline result is consistent: fine-tuned LLMs achieve much smaller gaps than both random policies and in-context-only baselines, and the gains are largest exactly where they are hardest to get.
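A minimal sketch of the metric, assuming the optimality gap is computed as the average shortfall in episode return relative to the optimal policy on the same episodes. The exact definition and any normalization used in \[2\] may differ, and the numbers below are made up.

```python
from statistics import mean
from typing import Sequence


def optimality_gap(policy_returns: Sequence[float],
                   optimal_returns: Sequence[float]) -> float:
    """Average amount by which the policy's return falls below the optimum."""
    if not policy_returns or len(policy_returns) != len(optimal_returns):
        raise ValueError("need matching, non-empty lists of episode returns")
    shortfalls = [opt - got for opt, got in zip(optimal_returns, policy_returns)]
    return mean(shortfalls)


# Hypothetical returns over two evaluation episodes.
gap = optimality_gap(policy_returns=[8.0, 9.0], optimal_returns=[10.0, 10.0])
print(gap)
```

A gap of zero means the learned policy matches the optimal one on these episodes; smaller is better.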

In MDPs, fine-tuning roughly *halves* the optimality gap at longer horizons, where in-context learning alone tends to struggle. In POMDPs, additional fine-tuning data prove especially valuable as observations become noisier. In APOMDPs, the LLM learns to handle model ambiguity, with the benefit shaped by both the size of the ambiguity set and the decision-maker’s attitude toward it—optimistic, neutral, or pessimistic. Our approach also compares favorably to a strong decision-pretrained transformer baseline across all settings, and it transfers gracefully to out-of-distribution test conditions. On a sparse-reward gridworld with hidden goals, it reaches roughly 95% of oracle performance on held-out goals.
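To illustrate how an attitude toward ambiguity can change the chosen action, the sketch below scores each action by a weighted mix of its worst-case and best-case value across the candidate models. This weighting rule, the `pessimism` parameter, and the mapping of "neutral" to a weight of 0.5 are assumptions made for illustration, not necessarily the criterion used in \[2\]; the model names and values are invented.

```python
from typing import Dict, List


def ambiguity_value(values: List[float], pessimism: float) -> float:
    """Mix worst and best case: pessimism=1 is worst case, 0 is best case."""
    if not 0.0 <= pessimism <= 1.0:
        raise ValueError("pessimism must lie in [0, 1]")
    return pessimism * min(values) + (1.0 - pessimism) * max(values)


def choose_action(q_by_model: Dict[str, Dict[str, float]],
                  pessimism: float) -> str:
    """Pick the action with the best ambiguity-adjusted value."""
    actions = list(next(iter(q_by_model.values())).keys())
    return max(
        actions,
        key=lambda a: ambiguity_value(
            [q[a] for q in q_by_model.values()], pessimism
        ),
    )


# Hypothetical ambiguity set with two candidate models of the world.
q_by_model = {
    "model_a": {"charge": 5.0, "discharge": 2.0},
    "model_b": {"charge": 1.0, "discharge": 3.0},
}
pessimistic = choose_action(q_by_model, pessimism=1.0)
neutral = choose_action(q_by_model, pessimism=0.5)
optimistic = choose_action(q_by_model, pessimism=0.0)
print(pessimistic, neutral, optimistic)
```

In this toy example the pessimistic decision-maker prefers the action with the safer worst case, while the neutral and optimistic ones prefer the action with the higher upside.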

**Why this matters for healthcare**

The motivation behind this work is not abstract. Clinical care is full of sequential decisions made under deep uncertainty, and the data needed to learn from it are increasingly available in electronic health records and personal devices. Consider two examples from our own group’s prior work.

In bipolar disorder, prompt response to mood episodes is essential, yet symptoms often shift between routine appointments. In \[7, 8\], we showed that a personalized machine learning model trained entirely on passive Fitbit data could detect depressive and (hypo)manic symptomatology with strong accuracy, around 80% for depression and 89% for (hypo)mania, using methods designed for broad, real-world deployment rather than only ideal high-compliance patients. This is exactly the kind of noisy, longitudinal, partially observed signal that a sequential decision-making agent must learn to act on.

In organ transplantation, managing post-transplant hyperglycemia is a moving target. Our studies such as \[9\] characterized how hyperglycemia in kidney-transplant recipients remits and relapses over time, shaped by immunosuppressive drugs and individual risk factors. The first and recurrent episodes behave differently, which is precisely the sort of time-varying, model-ambiguous dynamics that motivated framing such problems as ambiguous POMDPs in the first place. A policy learner that hedges against the unknown true model, rather than trusting a single estimate, is far better suited to these realities \[4\].

Finally, beyond decision-making for various diseases and disorders, AI is increasingly used to optimize patient flows in hospitals (see, e.g., \[10\]). Optimizing hospital flows, much like optimizing traffic on roads, requires AI tools capable of handling time-varying, model-ambiguous dynamics.

Taken together, these examples sketch the destination. Abundant offline clinical data, expert-derived supervision, partial observability, and genuine model ambiguity are not edge cases in medicine; they are the norm. A framework that lets a capable pretrained AI model absorb that data and then adapt to a new patient or task from a few examples is a promising step toward decision support that is both data-efficient and robust.

**Where we go next**

Several directions excite us (and, of course, many other researchers). The most important is moving from synthetic environments to real multi-modal clinical data, using electronic health records to learn dynamic treatment recommendations under ambiguity. A second is reducing the reliance on an action oracle, which would broaden the approach to settings where optimal policies are hard to compute. A third is extending the theory beyond single-layer linear attention and the linear-MDP assumption toward deeper architectures and the partially observed, ambiguous settings we study empirically. Finally, we are intrigued by the prospect of using the natural-language abilities of pretrained models to fold human intuition in alongside numerical trajectories, building genuine human–AI “centaurs” \[6\] for complex, high-dimensional decisions.

In-context learning gave language models a striking ability to adapt. Our work suggests that, with the right fine-tuning on the right offline data, that adaptability can be channeled into sequential decision-making under uncertainty, and that the payoff may be greatest in domains like healthcare, where good decisions matter most.

**References**

\[1\] Saghafian, S. (2025). *Insight-Driven Problem Solving: Analytics Science to Improve the World*. Cambridge University Press.

\[2\] Zhang, M., Aghaei, S., & Saghafian, S. (2026). Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning. *arXiv* preprint arXiv:2605.09009.

\[3\] Saghafian, S. (2018). Ambiguous partially observable Markov decision processes: Structural results and applications. *Journal of Economic Theory*, 178, 1–35.

\[4\] Saghafian, S. (2024). Ambiguous dynamic treatment regimes: A reinforcement learning approach. *Management Science*, 70(9), 5667–5690.

\[5\] Lee, J., Xie, A., Pacchiano, A., Chandak, Y., Finn, C., Nachum, O., & Brunskill, E. (2023). Supervised pretraining can learn in-context reinforcement learning. *Advances in Neural Information Processing Systems*, 36, 43057–43083.

\[6\] Saghafian, S., & Idan, L. (2024). Effective generative AI: The human-algorithm centaur. *Harvard Data Science Review* (Special Issue 5).

\[7\] Lipschitz, J. M., Lin, S., Saghafian, S., Pike, C. K., & Burdick, K. E. (2025). Digital phenotyping in bipolar disorder: Using longitudinal Fitbit data and personalized machine learning to predict mood symptomatology. *Acta Psychiatrica Scandinavica*, 151(3), 434–447. [https://doi.org/10.1111/acps.13765](https://doi.org/10.1111/acps.13765)

\[8\] Lin, S., Saghafian, S., Lipschitz, J. M., & Burdick, K. E. (2025). A multiagent reinforcement learning algorithm for personalized recommendations in bipolar disorder. *PNAS Nexus*, 4(8), pgaf246.

\[9\] Boloori, A., Saghafian, S., Chakkera, H. A., & Cook, C. B. (2015). Characterization of remitting and relapsing hyperglycemia in post-renal-transplant recipients. *PLOS ONE*, 10(11), e0142363. [https://doi.org/10.1371/journal.pone.0142363](https://doi.org/10.1371/journal.pone.0142363)

\[10\] Hodgson, N. R., Saghafian, S., Martini, W. A., Feizi, A., & Orfanoudaki, A. (2025). Artificial intelligence-assisted emergency department vertical patient flow optimization. *Journal of Personalized Medicine*, 15(6), 219.



 

 

 



 

 
