I Introduction
Autonomous systems are becoming more capable, better accepted, and more commonplace. Many autonomous systems, including collaborative robots [9] and self-driving cars [18], operate in dynamic and interactive environments. As an example, a self-driving car may operate in traffic with multiple other cars, which form part of its environment. The environment is dynamic and interactive because the other cars not only operate concurrently with the ego car but also respond to its actions [14].
Decision making for an autonomous system operating in a dynamic and interactive environment needs to take the interactions between the system and its environment into account. For autonomous systems interacting with humans, it is especially important to account for these interactions to ensure safety. Despite much progress, this problem remains challenging and unsolved in many application scenarios [18, 8].
Game theory is a useful tool for modeling strategic interactions between intelligent agents [16]. Among various game theoretic frameworks, cognitive hierarchy theory (CHT) has drawn attention from game theorists and practitioners since the 1990s [17, 19] due to improved accuracy in predicting human behavior compared to equilibrium-based theories in many experimental studies [6, 5]. CHT describes human thought processes in strategic games by characterizing human behavior based on levels of iterated rationalizability. In particular, it assumes bounded rationality of decision makers in contrast to the assumption of unbounded/perfect rationality in many equilibrium-based theories. The assumption of bounded rationality can be more realistic than that of unbounded rationality in many practical situations because the reasoning capability of a decision maker is often limited by the complexity of the decision problem and the time available to make a decision [7].
In this paper, we describe a framework based on CHT for autonomous system/intelligent agent decision making when operating in a dynamic and interactive environment. The framework synergizes CHT, Bayesian inference, and receding-horizon optimal control to solve for decision strategies. In particular, hard constraints are probabilistically enforced over the planning horizon to incorporate safety requirements. The interactive decision making process is formulated as a constrained partially observable Markov decision process, and a recently developed algorithm [13, 11] is applied to solve it. We note that although CHT has been utilized for modeling multi-agent interactions in the literature [14, 22, 15, 10], most of the existing works exploit the “level-k” framework [17, 19], which assumes that a level-k decision maker treats the interactive environment as a level-(k−1) decision maker and responds to it accordingly. Unfortunately, level-k decision rules may lead to poor decisions if the level-(k−1) assumption about the interactive environment's cognitive level is incorrect [12].
In this paper, we consider a different framework called the “cognitive hierarchy” (CH) framework [4], where CH decisions are optimized to strategically respond to the interactive environment by modeling it using a mixture of level-k, k = 0, 1, …, decision-maker models. The cognitive hierarchy framework enables the autonomous agent to strategically interact with environments with different cognitive levels. When the autonomous agent has good knowledge about its operating environment, the mixing ratio of the level-k models could be pre-specified [1]. When operating in an uncertain environment, reasoning about the interactive environment's cognitive level could be incorporated in the decision making process, which is one of the key developments of this paper. Although heuristic techniques for estimating the environment's cognitive level, similar in spirit to the level reasoning algorithm in this paper, have been proposed in [10, 12, 21] for specific applications, the approach in this paper is based on Bayesian inference, hence having a firmer theoretical foundation, and has broader applicability.

This paper is organized as follows: In Section II, we formulate decision making in a dynamic and interactive environment as a dynamic game. In Section III, we review two related but distinct frameworks of cognitive hierarchy theory, the level-k framework and the cognitive hierarchy framework. In Section IV, we describe the decision making process based on cognitive hierarchy theory and its solution method based on a constrained partially observable Markov decision process formulation. In Section V, we apply the proposed decision making algorithm to autonomous vehicle control in three interactive traffic scenarios. Discussion and conclusions are given in Section VI.
II Problem formulation
In this paper, we consider a decision making process by an intelligent agent operating in a dynamic and interactive environment. The interactions between the ego agent and the environment are modeled as a two-player dynamic game represented as a 6-tuple,

$\Gamma = \left(\mathcal{P}, \mathcal{S}, \mathcal{A}, f, \mathcal{R}, \mathcal{C}\right),$  (1)

where $\mathcal{P} = \{1, 2\}$ represents the two players, with $1$ denoting the ego agent and $2$ denoting the environment; $\mathcal{S}$ is a finite set of states, with $s_t \in \mathcal{S}$ denoting the state of the agent–environment system at the discrete time instant $t$; $\mathcal{A} = \mathcal{A}^1 \times \mathcal{A}^2$ is a finite set of actions, with $\mathcal{A}^1$ denoting the action set of the ego agent and $\mathcal{A}^2$ denoting the action set of the environment; $f$ represents a transition of the state as a result of an action pair $(a^1_t, a^2_t) \in \mathcal{A}$, in particular, $f$ is defined by the following dynamic model,

$s_{t+1} = f\left(s_t, a^1_t, a^2_t\right);$  (2)

$\mathcal{R} = \{R^1, R^2\}$ are reward functions of the two players representing their decision objectives, in particular,

$r^i_t = R^i\left(s_t, a^1_t, a^2_t\right), \quad i = 1, 2,$  (3)

i.e., each player's reward at the time instant $t$ depends on the state $s_t$ and both players' actions $(a^1_t, a^2_t)$; and $\mathcal{C} \subseteq \mathcal{S}$, with $\mathcal{C}$ being a set of “safe” states, represents hard constraints for the decision making of the ego agent.
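To make the elements of the game tuple concrete, the following sketch encodes a finite state set, two action sets, a transition model of the form (2), and reward functions of the form (3). All specific choices here (the line-segment states, the move actions, the goal positions) are hypothetical toy values for illustration, not the paper's scenarios:

```python
# Illustrative sketch of the dynamic game tuple; all numbers are toy choices.
from itertools import product

STATES = list(range(5))          # S: positions on a short line segment
A1 = [-1, 0, 1]                  # A^1: ego moves left / stays / moves right
A2 = [-1, 0, 1]                  # A^2: environment moves
SAFE = {0, 1, 2, 3}              # C: "safe" states (hard constraint)

def f(s, a1, a2):
    """Dynamic model (2): next state from the state and the action pair."""
    return max(0, min(len(STATES) - 1, s + a1 + a2))

def R(i, s, a1, a2):
    """Reward (3): depends on the state and both players' actions."""
    target = 3 if i == 1 else 1          # each player prefers its own goal
    return -abs(f(s, a1, a2) - target)

# Enumerate one-step outcomes from state s = 2 for every joint action pair
outcomes = {(a1, a2): f(2, a1, a2) for a1, a2 in product(A1, A2)}
```

The finite, enumerable structure is what later allows the level-k policies and beliefs to be represented as tables.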
We let the ego agent make decisions based on a receding-horizon optimization of the form,

$\max_{\{a^1_{t+\tau|t}\}_{\tau=0}^{N-1}} \; \sum_{\tau=0}^{N-1} \lambda^{\tau}\, R^1\left(s_{t+\tau|t}, a^1_{t+\tau|t}, a^2_{t+\tau|t}\right)$  (4a)

$\text{subject to} \quad s_{t+\tau|t} \in \mathcal{C}, \quad \tau = 1, \dots, N,$  (4b)

where $a^i_{t+\tau|t}$ denotes a prediction of player $i$'s action at the time instant $t+\tau$ over a planning horizon of length $N$, with the prediction made at the current time instant $t$; $s_{t+\tau|t}$ denotes a prediction of the system state under the sequence of action pairs $\left(a^1_{t+\sigma|t}, a^2_{t+\sigma|t}\right)_{\sigma=0}^{\tau-1}$; and $\lambda \in (0, 1]$ is a factor to discount future rewards.
Clearly, the above optimization problem is not yet well-defined, as the uncontrolled variables $a^2_{t+\tau|t}$ are unknown. One way to proceed is to consider worst-case scenarios, i.e., to replace (4a) with

$\max_{\{a^1_{t+\tau|t}\}_{\tau=0}^{N-1}} \; \min_{\{a^2_{t+\tau|t}\}_{\tau=0}^{N-1}} \; \sum_{\tau=0}^{N-1} \lambda^{\tau}\, R^1\left(s_{t+\tau|t}, a^1_{t+\tau|t}, a^2_{t+\tau|t}\right).$  (5)

However, as (5) assumes an adversarial player 2, rather than a rational player that pursues its own objectives and is not necessarily against the ego agent, (5) may lead to overly conservative decisions for the ego agent. Therefore, we pursue an alternative solution, which is based on cognitive hierarchy theory (CHT) and is described in the next section.
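The worst-case alternative (5) amounts to a max–min recursion over the horizon. The sketch below illustrates it on the same toy line-segment dynamics used above; the action sets, dynamics, and reward are hypothetical and not the paper's scenarios:

```python
# Minimax receding-horizon value, illustrating (5) on toy dynamics.
def worst_case_value(s, horizon, lam=0.9):
    """Max over ego actions of the min over environment actions of the
    discounted reward accumulated over the remaining horizon."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a1 in (-1, 0, 1):                       # ego action set (toy)
        worst = float("inf")
        for a2 in (-1, 0, 1):                   # adversarial environment
            s_next = max(0, min(4, s + a1 + a2))
            r = -abs(s_next - 3)                # toy ego reward: reach state 3
            worst = min(worst, r + lam * worst_case_value(s_next, horizon - 1, lam))
        best = max(best, worst)
    return best
```

Because the inner minimization assumes the environment always picks the ego's worst outcome, the resulting values are pessimistic, which is exactly the conservatism the CHT-based alternative avoids.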
III Two cognitive hierarchy frameworks
Cognitive hierarchy theory (CHT) is concerned with behavioral models describing human thought processes in strategic games. It characterizes human behavior based on levels of iterated rationalizability. Two frameworks have been developed based on CHT: the level-k framework [17, 19] and the cognitive hierarchy framework [4], which are closely related but have distinct features. They are reviewed in what follows.
III-A The level-k framework
In the level-k framework, it is assumed that each player in a strategic game bases its decision on a finite depth of reasoning about the likely actions of the other players, which is referred to as its “cognitive level.” In particular, the reasoning hierarchy starts from some nonstrategic behavioral model, called level-0. Then, a level-k player, k ≥ 1, assumes that all of the other players are level-(k−1), based on which it predicts the actions of the other players and makes its own decision as the optimal response to these predicted actions. In short, a level-k decision optimally responds to level-(k−1) decisions. Note, however, that a level-k decision may turn out to be poor if the other players execute level-k′ decisions with k′ ≠ k−1 [12].
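The level-k recursion is easiest to see in a one-shot two-player game. In the sketch below, the payoff matrices and the level-0 anchor (always play action 0) are hypothetical illustrations, not models from the paper:

```python
# Hypothetical 2x2 payoffs: R1[a1][a2] for player 1, R2[a1][a2] for player 2.
R1 = [[3, 0],
      [5, 1]]
R2 = [[3, 5],
      [0, 1]]

def level_k_action(k, player):
    """Level-0 plays action 0 (a nonstrategic anchor); a level-k player
    best-responds to the other player acting at level k-1."""
    if k == 0:
        return 0
    other = level_k_action(k - 1, 2 if player == 1 else 1)
    if player == 1:
        return max((0, 1), key=lambda a: R1[a][other])
    return max((0, 1), key=lambda a: R2[other][a])
```

If the opponent actually plays some level other than k−1, the level-k action above is still computed against the wrong prediction, which is the failure mode noted at the end of the paragraph.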
III-B The cognitive hierarchy framework
The cognitive hierarchy (CH) framework is similar to the level-k framework in that it also characterizes each player's behavior by a bounded cognitive level. The unique feature of the CH framework is the hypothesis that a player can act under the assumption that some percentage of the population fits each archetype. More specifically, a CH player assumes that each of the other players is level-k′ for some level k′ lower than its own and optimizes its decision corresponding to its beliefs about the other players' levels. This feature makes a CH player “smarter” than a level-k player by enabling a CH player to optimally respond to level-k′ decisions for all such k′ as long as it has the correct beliefs about the other players' levels.
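A CH player's decision rule can be sketched as a best response to a belief-weighted mixture of level-k opponent models. The payoff matrix and the per-level opponent actions below are hypothetical illustrations:

```python
# Hypothetical ego payoffs R_ego[a_ego][a_env] and level-k opponent actions.
R_ego = [[4, 0],
         [2, 1]]
opponent_action = {0: 0, 1: 1, 2: 0}   # toy action of a level-k opponent

def ch_best_response(belief):
    """A CH ego models the environment as a belief-weighted mixture of
    level-k decision makers and best-responds to that mixture.
    belief: dict mapping level k to probability."""
    def expected(a):
        return sum(p * R_ego[a][opponent_action[k]] for k, p in belief.items())
    return max(range(len(R_ego)), key=expected)
```

With a point-mass belief the CH response reduces to the corresponding level-(k+1) best response; with a spread-out belief it hedges across the archetypes.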
The level-k framework has been exploited by various researchers for modeling human–human, human–machine, and machine–machine^1 interactions in automotive systems [14, 12, 21], aerospace systems [22, 15], as well as in cyber-physical security [10]. (^1 On the basis of the fact that many machine systems pursue human-like decision making, e.g., in [23].) It has been revealed in [12] that an ego agent using a decision strategy corresponding to a level-k model with some fixed k may behave poorly when interacting with another agent using a decision strategy of level k′ with k′ ≠ k−1. On the basis of this observation, algorithms that estimate the level of the other agent according to its historical behavior have been proposed in [10, 12, 21] so that the ego agent can adapt its decision strategy to the level estimate. In the next section, we present a rigorous formulation of a decision making process based on the CH framework, motivated, in part, by preliminary developments in [10, 12, 21]. We then recast the problem as a partially observable Markov decision process (POMDP). Compared to the level estimation algorithms in [10, 12, 21], which are rather heuristic although shown to be effective in specific applications, the decision making framework introduced in the following is based on Bayesian statistics and hence has a firmer theoretical foundation.
IV Decision making based on CHT and its POMDP-based solution method
IV-A Level-k models of the environment
A policy $\pi^i$, $i \in \{1, 2\}$, is a stochastic map from states $s \in \mathcal{S}$ to actions $a \in \mathcal{A}^i$. Specifically, $\pi^i$ is such that

$\pi^i(s, a) = \mathbb{P}\left(a^i_t = a \mid s_t = s\right)$  (6)

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}^i$, where $\mathbb{P}(\cdot \mid \cdot)$ denotes conditional probabilities.
To define the level-k models of the environment for arbitrary k ≥ 1, we start from defining a level-0 model of the ego agent, defined by a policy $\pi^1_0$, and a level-0 model of the environment, defined by a policy $\pi^2_0$. The level-k model of the environment, $\pi^2_k$, with k ≥ 1 is then constructed based on the “softmax decision rule” [20], which captures the suboptimality and variability in decision making [3, 2], as follows,

$\pi^2_k(s, a) = \dfrac{\exp\left(\theta\, Q^2_k(s, a)\right)}{\sum_{a' \in \mathcal{A}^2} \exp\left(\theta\, Q^2_k(s, a')\right)},$  (7)

in which $\theta > 0$ is a rationality parameter and the function $Q^2_k$ of state–action pairs is defined as

$Q^2_k(s, a) = \sum_{a^1 \in \mathcal{A}^1} \pi^1_{k-1}(s, a^1)\, R^2\left(s, a^1, a\right),$  (8)

where $\pi^1_{k-1}$ is the level-(k−1) model of the ego agent, which for k−1 ≥ 1 is defined as

$\pi^1_{k-1}(s, a) = \mathbb{1}\left[a = \operatorname*{arg\,max}_{a' \in \mathcal{A}^1} Q^1_{k-1}(s, a')\right],$  (9)

in which

$Q^1_{k-1}(s, a) = \sum_{a^2 \in \mathcal{A}^2} \pi^2_{k-1}(s, a^2)\, R^1\left(s, a, a^2\right).$  (10)

In short, the level-k model of the environment is constructed based on the level-(k−1) model of the ego agent, which is constructed based on the level-(k−1) model of the environment. Therefore, the level-k models of the environment as well as the level-k models of the ego agent are constructed recursively for k ≥ 1. We note that when constructing the level-k models of the ego agent, we drop the hard constraints (4b) to reduce computational complexity, but can promote their satisfaction through imposing penalties in the reward function $R^1$.
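The recursive construction above can be sketched as mutually recursive functions. The stage rewards, the uniform level-0 anchor, and the rationality parameter THETA are all hypothetical choices for illustration; a one-step Q is used here for brevity:

```python
# Sketch of the recursive level-k construction with a softmax environment
# model; rewards, the level-0 anchor, and THETA are toy assumptions.
import math

ACTIONS = [0, 1]
THETA = 2.0                                   # assumed softmax rationality

def reward(i, s, a1, a2):
    """Hypothetical stage rewards R^1, R^2 (a matching-pennies flavor)."""
    return float(a1 == a2) if i == 1 else float(a1 != a2)

def level0_policy(s, a):
    """Nonstrategic level-0 model: uniform over actions."""
    return 1.0 / len(ACTIONS)

def env_policy(k):
    """Level-k environment model: softmax over a Q-function computed
    against the level-(k-1) ego model."""
    ego = level0_policy if k == 1 else ego_policy(k - 1)
    def pi(s, a):
        def q(a2):
            return sum(ego(s, a1) * reward(2, s, a1, a2) for a1 in ACTIONS)
        z = sum(math.exp(THETA * q(a2)) for a2 in ACTIONS)
        return math.exp(THETA * q(a)) / z
    return pi

def ego_policy(k):
    """Level-k ego model: greedy against the level-k environment model."""
    env = env_policy(k)
    def pi(s, a):
        def q(a1):
            return sum(env(s, a2) * reward(1, s, a1, a2) for a2 in ACTIONS)
        best = max(ACTIONS, key=q)
        return 1.0 if a == best else 0.0
    return pi
```

Calling `env_policy(2)` unwinds the hierarchy down to the level-0 anchor, mirroring the recursion described in the text.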
IV-B Ego decision making based on the CH framework
After the level-k models of the environment for k = 1, 2, … have been constructed, we define the augmented state of the agent–environment system as $z_t = (s_t, k_t)$, where $k_t$ represents the actual cognitive level of the environment and is assumed to be unknown to the ego agent. Then, we consider the following augmented dynamic model of the agent–environment system,

$z_{t+1} = \begin{pmatrix} s_{t+1} \\ k_{t+1} \end{pmatrix} = \begin{pmatrix} f\left(s_t, a^1_t, a^2_t\right) \\ k_t \end{pmatrix},$  (11a)

$o_t = s_t,$  (11b)

where $o_t$ is referred to as an “observation.” On the basis of the level-k policies, the action of the environment, $a^2_t$ in (11a), can be viewed as a stochastic disturbance satisfying

$\mathbb{P}\left(a^2_t = a \mid s_t = s,\ k_t = k\right) = \pi^2_k(s, a)$  (12)

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}^2$.
It is assumed that the ego agent has a prior belief about $k_t$, expressed as a probability distribution $\mathbb{P}(k_0 = k)$ defined over the set of modeled levels. Let us collect all historical observations up to time $t$ and all previously executed actions by the ego agent up to time $t-1$ into a data vector,

$\xi_t = \left(o_0, \dots, o_t,\ a^1_0, \dots, a^1_{t-1}\right),$  (13)

which, roughly speaking, will be used by the ego agent as evidence to infer the actual cognitive level of the environment, i.e., to obtain the posterior belief about $k_t$, $\mathbb{P}(k_t = k \mid \xi_t)$, defined for all modeled levels $k$.
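The posterior over the environment's level can be sketched as a Bayes update on each observed environment action, with the level-k policies serving as likelihoods. The two level models below are hypothetical stubs with toy probabilities:

```python
# Hypothetical level-k action models pi2[k][s][a]: probability that a level-k
# environment takes action a in state s (toy numbers).
pi2 = {
    1: {0: [0.8, 0.2]},     # a cautious level-1 model
    2: {0: [0.3, 0.7]},     # a more aggressive level-2 model
}

def update_belief(belief, s, a):
    """Posterior P(k | evidence) after observing the environment take action
    a in state s, per Bayes' rule with the level-k policy as likelihood."""
    post = {k: belief[k] * pi2[k][s][a] for k in belief}
    z = sum(post.values())
    return {k: p / z for k, p in post.items()}

belief = {1: 0.5, 2: 0.5}              # uninformative prior
belief = update_belief(belief, 0, 1)   # observe an "aggressive" action
```

Each observed action multiplies the belief by the corresponding likelihood, so a few decisive observations quickly concentrate the posterior on the matching level.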
Then, we consider the following decision making process by the ego agent,

$\max_{\{a^1_{t+\tau|t}\}_{\tau=0}^{N-1}} \; \mathbb{E}\left[\sum_{\tau=0}^{N-1} \lambda^{\tau}\, R^1\left(s_{t+\tau|t}, a^1_{t+\tau|t}, a^2_{t+\tau|t}\right) \,\middle|\, \xi_t\right]$  (14a)

$\text{subject to} \quad \mathbb{P}\left(s_{t+\tau|t} \in \mathcal{C},\ \tau = 1, \dots, N \,\middle|\, \xi_t\right) \ge 1 - \epsilon,$  (14b)

where $1 - \epsilon$ defines a required level of confidence in constraint satisfaction.
Comparing the processes (4) and (14), we can observe two major differences: Firstly, the unknowns $a^2_{t+\tau|t}$ in (4) have been modeled as stochastic disturbances in (14), which is achieved in (12) by exploiting the level-k models of the environment $\pi^2_k$. Secondly, to account for the stochasticities, the objective has been changed from maximizing the value of a function in (4a) to maximizing the expected value of the function in (14a), and the hard constraint (4b) has been changed to a probabilistic requirement of satisfaction, i.e., the chance constraint (14b), with $\epsilon$ being a design parameter. We note that (14) is a well-defined optimization problem [13, 11]. We present a solution method for it in the following section.
IV-C POMDP-based solution method
Considering randomized decision rules, we transform the optimization problem (14), defined in the decision space, to the optimization problem (16), defined in the space of probabilities, as follows: Firstly, we define $d_{\tau}$, $\tau = 0, \dots, N-1$, as a probability distribution over the set $\mathcal{A}^1$, based on which the predicted action $a^1_{t+\tau|t}$ is chosen, i.e.,

$\mathbb{P}\left(a^1_{t+\tau|t} = a\right) = d_{\tau}(a), \quad a \in \mathcal{A}^1.$  (15)

Then, we reformulate (14) as the following optimization problem:

$\max_{\{d_{\tau}\}_{\tau=0}^{N-1}} \; \mathbb{E}\left[\sum_{\tau=0}^{N-1} \lambda^{\tau}\, R^1\left(s_{t+\tau|t}, a^1_{t+\tau|t}, a^2_{t+\tau|t}\right) \,\middle|\, \xi_t\right]$  (16a)

$\text{subject to} \quad \mathbb{P}\left(s_{t+\tau|t} \in \mathcal{C},\ \tau = 1, \dots, N \,\middle|\, \xi_t\right) \ge 1 - \epsilon, \quad d_{\tau} \in \Delta^{|\mathcal{A}^1|-1},$  (16b)

where $\Delta^{|\mathcal{A}^1|-1}$ is the $\left(|\mathcal{A}^1|-1\right)$-dimensional probability simplex.
The problem (14), along with its transformation (16), is referred to as a partially observable Markov decision process (POMDP) with a time-joint chance constraint, where the partial observability comes from the unobservability of the hidden level $k_t$. A solution method for general problems in the form of (16) has been described in [13, 11]. In particular, the following Propositions 1 and 2, which represent matrix-computational implementations of the corresponding mathematical expressions in [13, 11] applied to the specific problem setting of this paper, are used in solving (16).
Proposition 1: Suppose that the reward function can be written as $R^1\left(s_{t+\tau|t}, a^1_{t+\tau|t}, a^2_{t+\tau|t}\right) = \tilde{R}\left(s_{t+\tau+1|t}\right)$, where $s_{t+\tau+1|t}$ represents the next state transitioned from $s_{t+\tau|t}$ through the dynamic model (2). Then, for any given $\{d_{\sigma}\}_{\sigma=0}^{\tau}$, it holds that

$\mathbb{E}\left[R^1\left(s_{t+\tau|t}, a^1_{t+\tau|t}, a^2_{t+\tau|t}\right) \,\middle|\, \xi_t\right] = r^{\top} p_{\tau+1},$  (17)

where $r$ is a vector collecting the reward values associated with every augmented state, and $p_{\tau+1}$ is a vector representing the predicted distribution of the augmented state.

Proof: Omitted.
In particular, the reward vector $r$ is constructed offline, and the distribution vector $p_{\tau+1}$ is computed online using the recursive formula,

(18)

in which $I_n$ denotes the $n$-dimensional identity matrix, $\mathbf{1}_n$ denotes the $n$-dimensional all-ones vector, $\otimes$ represents the Kronecker product, and

(19)

with $T$ representing the transition kernel of the augmented state, constructed offline as

(20)

with $\mathbb{1}[\cdot]$ denoting the set-membership indicator function.
The recursive formula (18) starts with the initial term $p_0$, the posterior belief about the augmented state inferred according to the evidence $\xi_t$, which is updated at every decision step using the Bayesian inference formula,

$\mathbb{P}\left(z_t = z \mid \xi_t\right) = \dfrac{\mathbb{P}\left(o_t \mid z_t = z\right)\, \mathbb{P}\left(z_t = z \mid \xi_{t-1},\ a^1_{t-1}\right)}{\sum_{z'} \mathbb{P}\left(o_t \mid z_t = z'\right)\, \mathbb{P}\left(z_t = z' \mid \xi_{t-1},\ a^1_{t-1}\right)}.$  (21)
Proposition 2: For any given $\{d_{\tau}\}_{\tau=0}^{N-1}$, the left-hand side of the constraint (16b) can be evaluated using the following algorithm:

1) Initialize the recursion variables.

2) Update the first recursion variable, per (29) of [13].

3) Update the second recursion variable, per (28) of [13].

4) If the end of the planning horizon has been reached, go to Step 5); otherwise update per (30) of [13] and go to Step 2).

5) Output the value of the left-hand side of (16b).
Proof: The steps 2), 3) and 4) realize, respectively, the recursive formulas (29), (28), and (30) of Theorem 1 in [13]. Therefore, the proof follows from Theorem 1 of [13].
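The quantity that Proposition 2 evaluates is the probability that the whole predicted trajectory stays in the safe set. For a finite-state Markov chain, this time-joint probability can be computed by propagating only the probability mass that has remained safe so far, as in the sketch below. The three-state transition matrix is a toy assumption, and this is not the specific matrix recursion of [13]:

```python
# Toy 3-state chain; state 2 is unsafe/absorbing. T[i][j] = P(next=j | i).
T = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.0, 0.0, 1.0]]
SAFE = [0, 1]

def joint_safe_prob(p0, horizon):
    """P(state in SAFE at every step 1..horizon): after each transition,
    discard the mass that has left the safe set so it never propagates."""
    p = list(p0)
    for _ in range(horizon):
        p = [sum(p[i] * T[i][j] for i in range(3)) for j in range(3)]
        p = [p[j] if j in SAFE else 0.0 for j in range(3)]   # mask unsafe mass
    return sum(p)
```

Because the unsafe mass is zeroed out at every step rather than renormalized, the remaining total mass after the final step is exactly the joint probability of having stayed safe throughout, which is the left-hand side of a constraint like (16b).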
On the basis of Propositions 1 and 2, a standard nonlinear programming solver exploiting gradient and Hessian information of the cost and constraint functions of (16), which can be numerically estimated based on function evaluations for any $\{d_{\tau}\}_{\tau=0}^{N-1}$ of interest, can be used to solve for the optimal decision distributions.
V Application to autonomous driving in interactive traffic scenarios
In the near to medium term, autonomous vehicles will operate in traffic together with human-driven vehicles. Ensuring safety in the associated interactive traffic scenarios remains a challenging problem for autonomous vehicle control [18]. In this section, we apply the decision making framework based on cognitive hierarchy theory described in the previous sections to controlling an autonomous ego vehicle in various traffic scenarios where it needs to interact with a human-driven vehicle. The traffic scenarios we consider include a four-way intersection scenario, a highway overtaking scenario, and a highway forced merging scenario.
Decision making of the human-driven vehicle is modeled based on the level-k framework described in Section III-A. Experimental studies [6, 5] suggest that humans are most commonly level-1 and level-2 reasoners. Therefore, we consider level-1 and level-2 models in the form of (7). Note that different human drivers may have different cognitive levels, and the autonomous ego vehicle does not know in advance the specific level, k, of the human driver it is interacting with but has to infer it based on its observed information. Without any prior information at the initial step, we initialize the ego vehicle's beliefs in the level-1 and level-2 models of the human-driven vehicle uniformly.
In all three traffic scenarios discussed in this section, we use the following discrete-time model to represent vehicle kinematics in the longitudinal direction,

$s^{i,j}_{t+1} = s^{i,j}_t + \Delta T\, v^{i,j}_t, \qquad v^{i,j}_{t+1} = v^{i,j}_t + \Delta T\, a^{i,j}_t,$  (22)

where $s$ denotes position, $v$ denotes velocity, $a$ denotes acceleration, the subscript $t$ represents the discrete time, the first superscript $i$ distinguishes the autonomous ego vehicle from the human-driven vehicle, the second superscript $j$ denotes the $x$ or $y$ direction, and $\Delta T$ [s] is the sampling period. We model lane changes as instantaneous events, i.e., completed in one time step. The acceleration, taking values in a finite acceleration set, and the lane change command are the actions to be decided on.
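The kinematic update (22) in one direction is a two-line Euler step. In the sketch below, the sampling period value of 0.25 s is an assumed illustrative choice, not the paper's parameter:

```python
DT = 0.25  # sampling period Delta T in seconds (illustrative value)

def step(position, velocity, accel):
    """One step of the discrete-time kinematics (22) in one direction."""
    return position + DT * velocity, velocity + DT * accel

# Example: ego vehicle accelerating from rest at 2 m/s^2 for one second
p, v = 0.0, 0.0
for _ in range(4):
    p, v = step(p, v, 2.0)
```

Note that the position update uses the velocity from the start of the step, matching the forward-Euler form of (22).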
As described in Section III-A, in order to formulate the level-k models of the human-driven vehicle, we need to define the level-0 models of both vehicles. Following [12, 21], we let a level-0 vehicle select actions to maximize the same reward function as that for level-1 and level-2 vehicles but treat other vehicles on the road as stationary obstacles. Note that “as stationary obstacles” defines a way to predict the other vehicle's actions in the decision making process (4), so the ego vehicle's optimal actions, and hence the level-0 policy, can be determined.
V-A Intersection
As shown in Fig. 1, the autonomous ego vehicle (blue car) encounters a human-driven vehicle (red car) at an unsignalized four-way intersection. Both vehicles are driving straight through the intersection. Such an objective is represented by the following reward function,
(23) 
where $i = 1$ for the autonomous ego vehicle and $i = 2$ for the human-driven vehicle.
In the formulated receding-horizon optimization problem in the form of (4), we choose a planning horizon of length $N$.
Moreover, given the safety requirement to maintain the positions in the safe set,

$\mathcal{C} = \left\{ s : \left\| \left(s^{1,x}, s^{1,y}\right) - \left(s^{2,x}, s^{2,y}\right) \right\|_2 > l \right\},$  (24)

where $\|\cdot\|_2$ represents the Euclidean norm and $l$ [m] is the car length, we impose the following chance constraint over the planning horizon,

$\mathbb{P}\left(s_{t+\tau|t} \in \mathcal{C},\ \tau = 1, \dots, N \,\middle|\, \xi_t\right) \ge 1 - \epsilon.$  (25)
Fig. 1(a1) and (a2) show two subsequent steps in the simulation of the autonomous ego vehicle interacting with a level-1 human-driven vehicle, and Fig. 1(b1) and (b2) show those of interacting with a level-2 human-driven vehicle. When interacting with a level-1 human-driven vehicle, which, on the basis of our level-0 model introduced above, represents a cautious/conservative driver, the autonomous ego vehicle decides to drive through the intersection first. When interacting with a level-2 human-driven vehicle (aggressive, based on our specified level-0 model), the autonomous ego vehicle yields the right of way to the human-driven vehicle. The autonomous ego vehicle responds to the two different human drivers in different ways because it gains knowledge of the human driver's cognitive level by observing his/her actions for the first few steps, after which it can predict his/her future actions and respond optimally.
V-B Overtaking
The second scenario we consider is shown in Fig. 2, where the autonomous ego vehicle (blue car) is overtaking a human-driven vehicle (red car). A similar scenario has been considered in [13, 11] but not in a game-theoretic formulation.
We consider the following reward function,
(26) 
where the first term is used to encourage the autonomous ego vehicle to overtake the human-driven vehicle, and the second term is used to penalize the autonomous ego vehicle for driving in the left passing lane, so that it is encouraged to return to the right traveling lane as quickly as it can after the overtaking.
We choose a planning horizon of length $N$ and impose a chance constraint over the planning horizon in the form of (25), where the safe set is now defined as

(27)

in which $w$ [m] is the lane width. The safe set (27) represents the requirement that overtaking can occur only when the two vehicles are traveling in different lanes; otherwise, they shall keep a reasonable distance in the longitudinal direction to improve safety.
Fig. 2(a1)–(a4) show four subsequent steps in the simulation of the autonomous ego vehicle interacting with a level-1 human-driven vehicle, and Fig. 2(b1)–(b4) show those of interacting with a level-2 human-driven vehicle. We note that in this simulation the maximum speed of the human-driven vehicle is restricted to be smaller than that of the autonomous ego vehicle to ensure the possibility of an overtaking. When interacting with a level-1 driver, the autonomous ego vehicle completes the overtaking relatively quickly because, as can be seen in Fig. 2(a2), the level-1 driver drives slowly to let it cut in. When interacting with a level-2 driver, the autonomous ego vehicle needs to drive in the passing lane for a longer period of time before it can come back to the traveling lane.
V-C Merging
The last scenario we consider is a highway forced merging scenario. Differently from overtaking, which may improve travel speed but is usually unnecessary, merging oftentimes has to be accomplished within a certain road section. We consider the scenario shown in Fig. 3, where the autonomous ego vehicle (blue car), originally driving in the right lane, needs to merge into the traffic in the left lane. In particular, the merging can only be, and has to be, accomplished within the road section with the grey-dashed lane marking.
We consider the reward function
(28) 
in the receding-horizon optimization (14), where the first term is used to encourage the autonomous ego vehicle to maintain a reasonable travel speed and the second term is used to encourage it to merge into the left lane. For the safe set, we choose
(29)  
in the chance constraint (25), so that the autonomous ego vehicle has to merge into the left lane within the road section specified by (29). The planning horizon is again chosen of length $N$.
Subsequent steps in the simulation with a level-1 human-driven vehicle (red car) traveling in the left lane are shown in Fig. 3(a1)–(a4), and those with a level-2 human-driven vehicle are shown in Fig. 3(b1)–(b4). When the human-driven vehicle is level-1, which, on the basis of our level-0 model introduced at the beginning of Section V, represents a cautious/conservative driver, the autonomous ego vehicle decides to merge into the left lane ahead of the human-driven vehicle. When the human-driven vehicle is level-2, which represents an aggressive driver, the autonomous ego vehicle merges behind the human-driven vehicle, as it predicts that the human-driven vehicle will likely not yield.
VI Discussion and Conclusions
In this paper, we described a framework synergizing cognitive behavioral models, Bayesian inference, and receding-horizon optimal control for autonomous decision making in a dynamic and interactive environment with uncertainty.
In the current version of the framework, the environment, which responds to the ego agent's actions, is modeled as a single intelligent agent with a certain cognitive level k. Simulation examples representing traffic scenarios where an autonomous ego vehicle interacts with a human-driven vehicle illustrate the application of the current version of the framework. When the environment is composed of multiple intelligent agents, the proposed framework may be extended so that each of the other agents, j, is modeled separately as a level-k_j decision maker and the ego agent estimates each k_j according to agent j's historical behavior. We envision that such an extension is mainly a computational challenge rather than a theoretical one. Addressing it is left as a topic for future research.
References
 [1] (2016) Cognitive hierarchy theory for heterogeneous uplink multiple access in the Internet of Things. In International Symposium on Information Theory (ISIT), pp. 1252–1256.
 [2] (2014) On the origins of suboptimality in human probabilistic inference. PLoS Computational Biology 10 (6), e1003661.
 [3] (2012) Not noisy, just wrong: the role of suboptimal inference in behavioral variability. Neuron 74 (1), pp. 30–39.
 [4] (2004) A cognitive hierarchy model of games. The Quarterly Journal of Economics 119 (3), pp. 861–898.
 [5] (2009) Comparing models of strategic thinking in Van Huyck, Battalio, and Beil's coordination games. Journal of the European Economic Association 7 (2–3), pp. 365–376.
 [6] (2006) Cognition and behavior in two-person guessing games: an experimental study. American Economic Review 96 (5), pp. 1737–1768.
 [7] (2002) Bounded Rationality: The Adaptive Toolbox. MIT Press.
 [8] (2008) Human–robot interaction: a survey. Foundations and Trends in Human–Computer Interaction 1 (3), pp. 203–275.
 [9] (2017) Safety-critical advanced robots: a survey. Robotics and Autonomous Systems 94, pp. 43–52.
 [10] (2019) Nonequilibrium dynamic games and cyber-physical security: a cognitive hierarchy approach. Systems & Control Letters 125, pp. 59–66.
 [11] (2019) Stochastic predictive control for partially observable Markov decision processes with time-joint chance constraints and application to autonomous vehicle control. Journal of Dynamic Systems, Measurement, and Control 141 (7), 071007.
 [12] (2018) Game theoretic modeling of vehicle interactions at unsignalized intersections and application to autonomous vehicle control. In American Control Conference (ACC), pp. 3215–3220.
 [13] (2018) Tractable stochastic predictive control for partially observable Markov decision processes with time-joint chance constraints. In Conference on Decision and Control (CDC), pp. 3276–3282.
 [14] (2018) Game theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems. IEEE Transactions on Control Systems Technology 26 (5), pp. 1782–1797.
 [15] (2016) Unmanned aircraft systems airspace integration: a game theoretical framework for concept evaluations. Journal of Guidance, Control, and Dynamics.
 [16] (2013) Game Theory. Harvard University Press.
 [17] (1995) Unraveling in guessing games: an experimental study. The American Economic Review 85 (5), pp. 1313–1326.
 [18] (2018) Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems 1, pp. 187–210.
 [19] (1995) On players' models of other players: theory and experimental evidence. Games and Economic Behavior 10 (1), pp. 218–254.
 [20] (2018) Reinforcement Learning: An Introduction. MIT Press.
 [21] (2018) Adaptive game-theoretic decision making for autonomous vehicle control at roundabouts. In Conference on Decision and Control (CDC), pp. 321–326.
 [22] (2014) Predicting pilot behavior in medium-scale scenarios using game theory and reinforcement learning. Journal of Guidance, Control, and Dynamics.
 [23] (2018) A human-like game theory-based controller for automatic lane changing. Transportation Research Part C: Emerging Technologies 88, pp. 140–158.