Planning in Markov Stochastic Task Domains
Yong (Yates) Lin ylin@uta.edu
Computer Science & Engineering
University of Texas at Arlington
Arlington, TX 76019, USA
Fillia Makedon makedon@uta.edu
Computer Science & Engineering
University of Texas at Arlington
Arlington, TX 76019, USA
Abstract
In decision theoretic planning, a challenge for Markov decision processes (MDPs)
and partially observable Markov decision processes (POMDPs) is that many problem
domains contain large state spaces and complex tasks, which results in poor
solution performance. We develop a task analysis and modeling (TAM) approach,
in which the (PO)MDP model is separated into a task view and an action view. In
the task view, TAM models the problem domain using a task equivalence model,
with task-dependent abstract states and observations. We provide a learning
algorithm to obtain the parameter values of task equivalence models. We present
three typical examples to explain the TAM approach. Experimental results
indicate our approach can greatly improve the computational capacity of task
planning in Markov stochastic domains.
Keywords: Markov decision processes, POMDP, task planning, uncertainty, decision-making.
1. INTRODUCTION
We often refer to a specific process with goals or termination conditions as a task. Tasks are
highly related to situation assessment, decision making, planning and execution. For each task,
we achieve the goals by a series of actions. A complex task contains not only different kinds of
actions, but also various internal relationships, such as causality and hierarchy.
Existing (PO)MDP problems have often been constrained to small state spaces and simple
tasks. For example, Hallway is a task in which a robot tries to reach a target in a 15-grid
apartment [11]. From the perspective of the task, this process has only a single goal. The difficulty
comes from noisy observations produced by the imprecise sensors on the robot, rather than from the task itself.
Although (PO)MDPs have been accepted as successful mathematical approaches to model
planning and controlling processes, without an efficient solution for big state spaces and complex
tasks, we cannot apply these models to more general problems in the real world. In a simple task
of grasping an object, the number of states reaches |S| = 1253 [8]. If the task domain becomes
complex, it becomes even harder to utilize these models. Suppose an agent aims to build a house:
there will be thousands of tasks, with different configurations of states, actions and observations.
It is hard to rely on (PO)MDPs alone to solve such a problem domain. Compared to other task
planning approaches, such as STRIPS or the Hierarchical Task Network [10], (PO)MDPs consider
the optimization of every step of the planning. Therefore, (PO)MDPs are more suitable for
planning and controlling problems of intelligent agents. However, for task management, the
(PO)MDP framework is not as powerful as Hierarchical Task Network (HTN) planning [3]. HTN is
designed to handle problems with many tasks. Primitive tasks can be executed directly, and non-
primitive tasks are decomposed into subtasks until each of them becomes a primitive task. This
idea is adopted in the hierarchical partially observable Markov decision processes (HPOMDPs)
[12]. Actions in HPOMDPs are arranged in a tree. A task will be decomposed into subtasks. Each
subtask has an action set containing primitive actions and/or abstract actions. In fact, a
hierarchical framework for (PO)MDPs is an approach that builds up a hierarchical structure to
invoke the abstract action sub-functions. Although it inherits the merits of task management from
HTN, it does not specifically address the big state space problem.
Another solution considers multiple tasks as a merging problem using multiple simultaneous
MDPs [15]. This solution does not specifically consider the characteristics of different tasks, and it
limits the problem domains to MDPs.
To improve the computational capacity of complex task planning, we develop a task analysis
and modeling approach. We decompose the model into a task view and an action view. This
enables us to strip out the details, so that we can focus on the task view. After a learning
process from the action view, the task view becomes an independent task equivalence model,
with task-dependent abstract states and observations. If the problem domain is an MDP, it is
already solved by the task view learning algorithm. If it is a POMDP, we can solve it using any
existing POMDP algorithm, without considering the hierarchical relationship any further. We apply
the TAM approach on existing MDP and POMDP problems. Experimental results indicate the
TAM approach brings us closer to the optimum solution of multi-task planning and controlling
problems.
This paper is organized as follows. We begin with a brief review of MDPs and POMDPs. Then we
discuss how to utilize the TAM method on (PO)MDP problems. Three typical examples from
MDPs and POMDPs are presented in this part, to explain the design of task equivalence models.
In the following section, we present a solution based on knowledge acquisition and model-
learning for the task equivalence models. We provide our experimental results for the comparison
of the task equivalence model and the original POMDP model. Finally, we briefly introduce some
related work and conclude the paper.
2. BACKGROUND
A Markov decision process (MDP) is a tuple ⟨S, A, T, R, γ⟩, where S is a set of
states, A is a set of actions, T(s, a, s') is the transition probability from state s to s' using
action a, R(s, a) is the reward when executing action a in state s, and γ ∈ [0, 1) is the discount factor.
The optimal situation-action mapping for the t-th step, denoted as π*_t, can be reached through the
optimal (t-1)-step value function V*_{t-1}:

$$\pi_t^*(s) = \arg\max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V_{t-1}^*(s') \Big].$$
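As an illustration of this backup, the following is a minimal value iteration sketch in Python, assuming the MDP is given as plain dictionaries (the data layout and function name are ours, not the paper's):

    def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
        # S: list of states; A: list of actions;
        # T[s][a][s2]: transition probability p(s2 | s, a); R[s][a]: immediate reward.
        V = {s: 0.0 for s in S}
        while True:
            V_new = {s: max(R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in S)
                            for a in A)
                     for s in S}
            if max(abs(V_new[s] - V[s]) for s in S) < eps:
                break
            V = V_new
        # Greedy situation-action mapping extracted from the converged values.
        policy = {s: max(A, key=lambda a: R[s][a]
                         + gamma * sum(T[s][a][s2] * V_new[s2] for s2 in S))
                  for s in S}
        return V_new, policy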
A POMDP models an agent acting in an uncertain world. At each time step, the agent needs to
make a decision based on the historical information from previous executions. A policy is a
function of action selection under stochastic state transitions and noisy observations. A POMDP
can be represented as ⟨S, A, O, T, Ω, R⟩, where S is a finite set of states, A is
a set of actions, and O is a set of observations. In each time step, the agent lies in a state s ∈ S.
After taking an action a ∈ A, the agent goes into a new state s'. The transition is a conditional
probability function T(s, a, s') = p(s'|s, a), which gives the probability that the agent lies in s' after
taking action a in state s. The agent makes an observation o ∈ O to gather information. This
can be modeled as a conditional probability Ω(s, a, o) = p(o|s, a).
When belief states are taken into consideration, the original partially observable POMDP model
changes to a fully observable belief MDP, denoted as ⟨B, A, τ, R_B, b0⟩. Here,
B is the set of belief states, i.e., the belief space. τ(b, a, b') is the probability that the agent
changes from b to b' after taking action a. R_B(b, a) = Σ_s b(s) R(s, a) is the reward for belief
state b. The b0 is the initial belief state.
The POMDP framework is used as a control model of an agent. In a control problem, utility is
defined as a real-valued reward that determines the action of the agent in each time step, denoted as
R(s, a), a function of state s and action a. The optimal action selection becomes a
problem of finding a sequence of actions a_{1..t} that maximizes the expected sum of rewards
E[Σ_t R(s_t, a_t)]. In this process, what we are concerned with is the controlling effect, which comes from the
relative relationship of the values. When we use a discount factor γ ∈ [0, 1), the relative relationship
remains unchanged, but the values converge to a fixed number. When states are not fully
observable, the goal changes to maximizing the expected reward for each belief state. The n-th
horizon value function V can be built from the previous value function V' using a backup operator H,
i.e., V = HV'. The value function is formulated as the following Bellman equation:

$$V(b) = \max_{a \in A} \Big[ R_B(b, a) + \gamma \sum_{o \in O} p(o \mid b, a)\, V'(b') \Big].$$

Here, b' is the next step belief state,

$$b'(s') = \eta\, \Omega(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s),$$

where η is a normalizing constant.
When optimized exactly, this value function is always piece-wise linear and convex in the belief
space.
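The belief update above can be written as a small helper, sketched here under our own assumed dictionary layout for T and the observation function (names are ours):

    def belief_update(b, a, o, S, T, Omega):
        # b[s]: current belief; T[s][a][s2]: transition; Omega[s2][a][o]: p(o | s2, a).
        b_new = {s2: Omega[s2][a][o] * sum(T[s][a][s2] * b[s] for s in S) for s2 in S}
        norm = sum(b_new.values())               # reciprocal of the normalizing constant
        if norm == 0.0:
            raise ValueError("observation has zero probability under this belief")
        return {s2: v / norm for s2, v in b_new.items()}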
3. EQUIVALENCE MODELS ON TASK DOMAINS
Tasks serve as basic units of everyday activities of humans and intelligent agents. A task-
oriented agent builds its policies in the context of different tasks. Generally speaking, a task
contains a series of actions and certain relationships, with an initial state s0, where it starts
from, and one or multiple absorbing states sg (goals and/or termination states), where the task
ends. (RockSample [16] is a typical example that uses a termination state instead of goals.
Theoretically, an infinite task may have no goal or termination state; in that case we simply set sg = null.)
From this notion, every (PO)MDP problem can be described as a task (for a POMDP, the initial
state becomes b0, and the absorbing states become bg). To improve the computational capacity of
task planning, we develop a task analysis and modeling (TAM) approach.
3.1 Task Analysis
Due to the size of the state space and the complex relationships among task states, it is hard to
analyze tasks. Therefore, we separate a task, which is a tuple M, into a task view and an action
view. The task view, denoted as M^t, reflects how we define an abstract model for the original task.
Actions used in the task view are defined in an action view, denoted as M^a, which contains all of the
actions in the original task. Before further discussion of the task view and the action view, let
us first go over some terms used in this framework.
An action a is a single operational instruction, or a set of instructions, that an agent takes to finish a primitive task.
A Markov decision model is a framework to decide which action should be taken in each state. If
an action defined in a Markov stochastic domain is used by a primitive task, we assume it to be a
primitive action.
We define an abstract action a_c as a set of actions. We want to find the a_c that is related to a
set of abstract states in M^t by a transitional relationship. Only this kind of a_c contributes to our
model. In M^a, a_c is like a subtask: it has an initial state s0 and an absorbing state sg (goal or
termination state). In M^t, we treat a_c the same as a primitive action.
With the help of abstract actions, we build an equivalence model for the original task M on the
abstract task domain, using M^e = <M^t, M^a>. Our purpose is to find a set of policies such that the
overall cost is minimized and the solution is optimal. Denote the optimal policies as π*.
Definition 1. Given a Markov stochastic model M, if there exists a pair of a task view M^t and an action
view M^a such that the optimal policies of <M^t, M^a> achieve the same expected value as π* of M,
we say M^e = <M^t, M^a> is an equivalence model of M, denoted as M^e ≡ M.
Let us introduce the equivalence model for MDP and POMDP task domains respectively.
3.2 MDP Task
For an MDP task M = ⟨S, A, T, R⟩, a well-defined task view M^t and a well-defined
a_c in the action view M^a are both MDP models. For the task view, M^t = ⟨S^t, A^t, T^t, R^t⟩,
where S^t is the set of states for the task and A^t is the set of actions for the task, including primitive
actions and abstract actions a_c. T^t is the transition array, and R^t is the set of rewards. For
the action view, M^a is the set of sub-models of the abstract actions; each a_c is itself an MDP with its own states, actions, transitions and rewards.
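As one possible, purely hypothetical way to lay these pieces out in code (the field names are our own choice, not the paper's):

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class MDPModel:
        # A generic finite MDP: T[s][a][s2] is a probability, R[s][a] a reward.
        states: List[str]
        actions: List[str]
        T: Dict[str, Dict[str, Dict[str, float]]]
        R: Dict[str, Dict[str, float]]

    @dataclass
    class TaskView:
        # M^t: abstract task states and task actions (primitive and abstract).
        model: MDPModel
        abstract_actions: List[str]             # names of the a_c treated as actions here

    @dataclass
    class ActionView:
        # M^a: one sub-MDP per abstract action, each with its own s0 and sg.
        sub_mdps: Dict[str, MDPModel]
        endpoints: Dict[str, Tuple[str, str]]   # a_c -> (initial state, absorbing state)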
Up to now, we have not yet explained how to build the model of the task view M^t. In order to capture
the details of task states, in the TAM approach we develop a Task State Navigation (TSN) graph
to clearly depict the task relationships. A TSN graph contains a set of grids. Each grid represents a
state of the task, labeled by the state ID. Neighboring grids separated by an ordinary line have a
transitional relationship between their states; neighboring grids separated by a bold line have no
transitional relationship.
Let us take the taxi task [5] as an example to illustrate how to build the TSN graph, as well as
how to construct the equivalence model M^e. The taxi task is introduced as an episodic
task. A taxi inhabits a 5×5 grid world. There are four specially-designated locations {R, B, G, Y} in
the world. In each episode, the taxi starts in a randomly-chosen state. There is a passenger at
one of the four locations, chosen at random, and he wishes to be transported to one of the four
locations. The taxi has six actions {North, South, East, West, Pickup, Putdown}. The episode
ends when the passenger has been put down at the destination.
This is a classical problem, used by many hierarchical MDP algorithms to build the models. We
present the TAM solution here. First, we build the TSN graph for the taxi task in Figure 1. Label Te
represents that the taxi is empty, and Tu indicates that the taxi has a user. Lt is the start location of the taxi, Lu is
the location of the user, and Ld is the destination. There are 5 task states in the TSN graph,
{TeLt, TeLu, TuLu, TuLd, TeLd}. The initial state is s0 = TeLt, representing an empty taxi at a random
location. The absorbing state (goal) is sg = TeLd, representing an empty taxi at the user's
destination. We mark a star in the grid of the absorbing state. A reward of +20 is given for a
successful passenger delivery, a penalty of -10 for performing Pickup or Putdown at a wrong
location, and -1 for all other actions.
From the TSN graph, it is clear that the taxi task is a simple linear problem. The transition
probabilities for the neighboring states are 1. There are four actions in the task domain,
A^t = {a_c1, a_c2, Pickup, Putdown}, where a_c1 is the abstract action going from Lt to Lu, and
a_c2 is the abstract action going from Lu to Ld. This model has two abstract actions, a_c1 and a_c2,
and it is easy to see that they share the same underlying navigation model. However, a_c1 and a_c2
have different s0 and sg. We call this kind of abstract actions isomorphic actions.
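As an illustration, the taxi task view can be written down as plain data; the abstract-action names a_c1 and a_c2 and the dictionary layout below are our own convention, not the paper's:

    # Taxi task view M^t: 5 task states and 4 task actions (2 abstract, 2 primitive).
    TAXI_STATES = ["TeLt", "TeLu", "TuLu", "TuLd", "TeLd"]
    TAXI_ACTIONS = ["a_c1",        # abstract: drive from Lt to Lu
                    "a_c2",        # abstract: drive from Lu to Ld
                    "Pickup", "Putdown"]

    # Deterministic task-level transitions (probability 1 between neighboring states).
    TAXI_T = {
        ("TeLt", "a_c1"): "TeLu",
        ("TeLu", "Pickup"): "TuLu",
        ("TuLu", "a_c2"): "TuLd",
        ("TuLd", "Putdown"): "TeLd",    # absorbing goal state
    }

    def taxi_reward(s, a):
        # Task-level rewards from the description above. The rewards of the abstract
        # actions a_c1/a_c2 are exactly what the TMDP learning in Section 4 computes;
        # here they fall through to the default -1 as a placeholder.
        if s == "TuLd" and a == "Putdown":
            return 20.0                 # successful delivery
        if a == "Pickup" and s != "TeLu":
            return -10.0                # Pickup at a wrong location
        if a == "Putdown" and s != "TuLd":
            return -10.0                # Putdown at a wrong location
        return -1.0                     # all other actions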
Details about how to solve the equivalence model are discussed in the next section. In this section, we
continue to introduce the TAM approach for POMDP tasks.
3.3 POMDP Task
For a POMDP task M = ⟨S, A, O, T, Ω, R, b0, bg⟩, b0 is the initial belief state and bg
are the absorbing belief states (goal belief states and/or termination belief states). The
equivalence model can combine a task view and an action view in which each part is either fully or
partially observable. In this work, we only consider the most common model, in which the task view is
a POMDP and the abstract actions in the action view are MDPs.
Some existing POMDP problems have such a simple relationship in the task domain that there is no
abstract action. In that case, the equivalence model reduces to a single task view M^t. The coffee
task [1] has this kind of equivalence task model. It can be solved using the action network and
decision tree in [1]. Here, we apply the TAM approach to the coffee task. The TSN graph for the
coffee task is shown in Figure 2. The state L is appointed as the initial state in the coffee task. From this
beginning, since the weather has 0.8 probability to be rainy, denoted as R, and 0.2 probability
to be sunny, denoted as ¬R, we get the transition probabilities from L to R and ¬R. If the weather is
rainy, the agent needs to take an umbrella, succeeding with probability 0.8 and failing with probability 0.2. We
denote the agent with an umbrella as U, and ¬U if it fails to take the umbrella. If the agent has an umbrella,
it reaches ¬L/¬W (dry when it comes to the shop) with probability 1. If it has no umbrella and the
weather is rainy, the agent will be ¬L/¬W with probability 0.2 and ¬L/W (wet in shop) with probability 0.8. The ¬L/W
state goes to C/W (coffee wet) with probability 1, and ¬L/¬W goes to C/¬W with probability 1. Whether in
C/W or C/¬W, the agent has 0.9 probability to deliver the coffee to the user, reaching H/W (user has wet
coffee) or H/¬W (user has dry coffee), and 0.1 probability to fail to deliver the
coffee, reaching ¬H/W (user does not have coffee, coffee wet) or ¬H/¬W (user does not have coffee,
coffee dry).
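These TSN transition probabilities can be collected in a small table. The sketch below encodes only the transitions stated above (the sunny branch ¬R is omitted because its successors are not specified here), using our own "not" spelling for the ¬ labels:

    # Coffee task TSN transitions: state -> list of (next state, probability).
    COFFEE_T = {
        "L":         [("R", 0.8), ("notR", 0.2)],           # rainy / sunny weather
        "R":         [("U", 0.8), ("notU", 0.2)],           # umbrella taken / not taken
        "U":         [("notL/notW", 1.0)],                  # reaches the shop dry
        "notU":      [("notL/notW", 0.2), ("notL/W", 0.8)], # dry / wet at the shop
        "notL/W":    [("C/W", 1.0)],                        # buys coffee, coffee wet
        "notL/notW": [("C/notW", 1.0)],                     # buys coffee, coffee dry
        "C/W":       [("H/W", 0.9), ("notH/W", 0.1)],       # delivery succeeds / fails
        "C/notW":    [("H/notW", 0.9), ("notH/notW", 0.1)],
    }

    # Sanity check: outgoing probabilities sum to 1 for every listed state.
    assert all(abs(sum(p for _, p in succ) - 1.0) < 1e-9 for succ in COFFEE_T.values())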
There are 11 observations for this problem: r (rainy), ¬r (sunny), u (agent with umbrella), ¬u
(agent without umbrella), w (agent wet), ¬w (agent dry), nil (none), h/w (user with coffee and
coffee wet), ¬h/w (user without coffee and coffee wet), ¬h/¬w (user without coffee and coffee dry),
and h/¬w (user with coffee and coffee dry). The observation probability for rainy when it is raining is
0.8, and the observation probability for sunny when it is sunny is 0.8. The agent gets a reward
of 0.9 if the user has coffee and 0.1 if it stays dry.
The POMDP problem of RockSample[n, k] [16] has the task domain model M^e = <M^t, M^a>.
The difficulty of RockSample POMDP problems lies in their big state spaces. RockSample[n, k]
describes a rover that samples rocks in a map of size n×n. Each of the k rocks has equal probability to be
Good or Bad. If the rover samples a Good rock, the rock becomes Bad, and the rover
receives a reward of 10. If the rock is Bad, the rover receives a reward of -10. All other moves
have no cost or reward. The observation probability p for Checki is determined by the efficiency η,
which decreases exponentially as a function of the Euclidean distance to the target. With η=1 the check
always returns the correct value, and with η=0 it has equal chance to return Good or Bad.
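A hedged sketch of this check sensor follows; the half-efficiency distance d0 and the exact noise form p_correct = (1 + η)/2 are our assumptions consistent with the description above (η=1 always correct, η=0 a fair coin), not values taken from the paper:

    import random

    def check_observation(true_rock_state, distance, d0=2.0):
        # Efficiency decays exponentially with Euclidean distance to the rock.
        eta = 2.0 ** (-distance / d0)
        p_correct = 0.5 + 0.5 * eta          # eta=1 -> 1.0, eta=0 -> 0.5
        if random.random() < p_correct:
            return true_rock_state           # "Good" or "Bad"
        return "Bad" if true_rock_state == "Good" else "Good"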
In the TSN graph for RockSample[4, 4] (Figure 3), S0 represents the rover at its initial location,
Ri represents the rover staying with rocki, and Exit is the absorbing state. Except for the Exit,
there are 16 task states related with each grid, indicating the {Good, Bad} states of the 4 rocks. Thus,
|S^t| = 81. For observations, |O^t| = 3k+2: 1 observation for the rover residing at a place without a rock,
k observations for the rover residing with a rock, 2k observations for Good and Bad of each rock,
and 1 observation for the Exit.
There are 2k^2+k+1 actions in the task domain: Check1, ..., Checkk, Sample; for each Ri, there are
k-1 abstract actions going to Rj (i ≠ j), 1 abstract action going to Exit, k-1 abstract
actions going from Rj to Ri (i ≠ j), and 1 abstract action going from S0 to Ri. All abstract actions for a
specific RockSample[n, k] problem are isomorphic. It is possible that an isomorphic abstract
action a_c relates to multiple states. We assign an index to each state related with a_c, and call it
the y index, denoted as y(s), where s is the state.
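The counts above can be checked with a few lines; the formulas are read directly off the preceding description:

    def rocksample_task_sizes(n, k):
        # Task states: (k rock nodes R_i + the start node S0) x 2^k rock
        # configurations, plus the Exit state (independent of the map size n).
        num_states = (k + 1) * 2 ** k + 1
        # Observations: 1 (no rock here) + k (at rock i) + 2k (Good/Bad per rock) + 1 (Exit).
        num_observations = 3 * k + 2
        # Actions: k Checks + 1 Sample + the abstract navigation actions enumerated above.
        num_actions = 2 * k ** 2 + k + 1
        return num_states, num_observations, num_actions

    print(rocksample_task_sizes(4, 4))   # (81, 14, 37): matches |S^t| = 81 and |O^t| = 3k+2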
4. SOLVING TASK EQUIVALENCE MODELS BY KNOWLEDGE
ACQUISITION
For a simple task domain problem, such as the coffee task, there is only a task view M^t, without abstract actions.
The solution is then the same as for any other POMDP problem. The difference between task
equivalence models and the original POMDPs lies in the task models, not in the algorithms.
4.1 Learning Knowledge for the Model M^t from TMDP
Our purpose in designing task domains is to handle complex task problems efficiently. This
can be achieved by a learning process. The taxi task has an equivalence model M^e = <M^t, M^a>,
with two isomorphic abstract actions. The idea is that, with the knowledge of a_c acquired
from M^a, M^t can be solved using standard MDP algorithms. Currently, the only
knowledge missing for M^t is the reward R^t of the abstract actions. Next, we obtain R^t by a Task MDP
(TMDP) value iteration.
Details of TMDP value iteration are listed in Algorithm 1. The difference between the TMDP
algorithm and the MDP algorithm is that there is an action-view single-step value iteration,
shown in Algorithm 2. When T(s, a, s) = 1, action a is not related to state s; thus we
rely on T(s, a, s) < 1 to bypass these unrelated actions. For a state s in the MDP model of a_c, the
optimal policy in the action view is determined. The resulting value is influenced
by the y index. Finally, we obtain the reward R^t(s, a) by the difference of the corresponding values.
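Algorithms 1 and 2 themselves are not reproduced here. The following is a rough Python sketch of how such a TMDP value iteration could be organized; the interface is our own guess, and in particular action_view_value(s, a_c), standing in for the Algorithm 2 single-step action-view iteration, and the way R^t is filled in are simplifications rather than the paper's exact procedure:

    def tmdp_value_iteration(states, actions, T, R_prim, abstract_actions,
                             action_view_value, gamma=0.95, eps=1e-6):
        # T[(s, a)]: dict of next-state probabilities at the task level (absent when a
        # is unrelated to s, which plays the role of the T(s, a, s) = 1 test above).
        # R_prim[(s, a)]: known rewards of primitive task actions.
        # action_view_value(s, a_c): value of running abstract action a_c from task
        # state s, computed in the action view (stand-in for Algorithm 2).
        V = {s: 0.0 for s in states}
        R_t = {}                                  # learned task-view rewards R^t(s, a_c)
        while True:
            delta = 0.0
            for s in states:
                best = None
                for a in actions:
                    if (s, a) not in T:           # bypass unrelated actions
                        continue
                    if a in abstract_actions:
                        R_t[(s, a)] = action_view_value(s, a)
                        r = R_t[(s, a)]
                    else:
                        r = R_prim[(s, a)]
                    q = r + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    best = q if best is None else max(best, q)
                if best is not None:
                    delta = max(delta, abs(best - V[s]))
                    V[s] = best
            if delta < eps:
                return V, R_t

Once the loop converges, V is the task-view value function and R_t holds the learned rewards of the abstract actions, which is the knowledge the task view needs for later computation.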
4.2 Improved Computational Capacity by Task Equivalence Models
As a result of the learning process, we have the knowledge of R^t for the task view and of the
abstract-action policies for the action view. Thus, we can focus on the task view alone in future computation
of POMDP problems. After the fully observable task view is learned, the partially observable
task view becomes a general POMDP problem. We can solve it using any existing
POMDP algorithm.
RockSample[n, k] is an example of the equivalence model M^e = <M^t, M^a>. In our POMDP value
iteration algorithm, the computational cost is determined by the sizes of the state, action, and
observation arrays. The sizes of the different arrays for the two models are listed in
Table 1.
In each round of value iteration, by rough estimation, the computational complexity of the
equivalence model M^e is governed by the sizes of the task view, while that of the original model M is governed
by the sizes of the original state, action, and observation spaces. This conclusion can be applied to general
POMDP problems that can be transformed to a task equivalence model with an abstract action
covering part of the original state space, and a task view with k task nodes (not including the initial and
absorbing nodes). When the task view is much smaller than the original model, the equivalence model created by the TAM approach can
greatly improve the computational capacity. However, if the reduction is small, the equivalence model
cannot improve the performance; it may even slightly degrade the computational capacity.
We can take this conclusion as a condition for applying the TAM approach to POMDP tasks in order to
improve performance, although the approach can be used on every existing POMDP problem.
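As a rough illustration of this reduction on RockSample (the general formulas are our extrapolation of the RockSample[4, 4] construction in Section 3.3):

$$|S| = n^2 \cdot 2^k + 1, \qquad |S^t| = (k+1) \cdot 2^k + 1.$$

For RockSample[4, 4] this gives |S| = 257 versus |S^t| = 81; for RockSample[10, 10], |S| = 102,401 versus |S^t| = 11,265. The task view removes the dependence on the map size n, which is where the large savings come from.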
In sum, the TAM approach first analyzes the task model and creates a task view and an action
view. The action view is responsible for learning the model knowledge. The trained action view
is then saved for future computation in the task view. The task view is an equivalence model with
better computational capacity than the original POMDP model.
5. EXPERIMENTAL RESULTS
In order to provide a better understanding and a detailed evaluation of the TAM approach, we
run several experiments in simulation domains using MATLAB.
In the experiments, we aim to find the reward and execution time of the task equivalence
model for the aforementioned problems. Results are averaged over 10 executions of each
problem, except for RockSample[10, 10]. The execution of RockSample[10, 10]
takes over one week on our system; therefore, it is executed only once.
All of the experiments are implemented in the same software and hardware environment, with the
same POMDP algorithm. For the equivalence models, the action view M^a is pre-computed, and the system uses
its trained data. In the experiments, the performance of every task equivalence model
improves greatly over the original POMDP model, except for RockSample[5, 7].
The performance comparison of the different models is presented in Table 2. Since the execution
of the Taxi and Coffee domains is fast enough, we implement them directly with their task equivalence models. The detailed
comparison is made on the RockSample domains. The equivalence model has a much
smaller state space than the original plain POMDP. Although it increases the size of the
action space, the observation space is smaller than in the plain POMDP models. As a result, the
performance is improved for each domain. In particular, for RockSample[10, 10] the
execution time is only 1/707 of that of the original model.
These results agree well with our previous analysis. Considering RockSample[n, k], the
greater n and k are, the greater the performance advantage of the equivalence model compared with the original model.
Considering results from prior work, the HSVI2 algorithm is able to finish RockSample[10, 10] in
10014 seconds [17]. It is more efficient than the original model implemented on our platform, and close
to the equivalence model, which is a more effective model than the one used in HSVI2. As discussed in [9],
HSVI2 implements an α-vector masking technique, which opportunistically computes selected
entries in the α-vectors. This technique is beneficial for the special problem of RockSample, in
which the movement of the robot is certain and the position is fully observable. The effectiveness of the
masking technique will degrade for uncertain movements and noisy observations. Our
experimental results imply that, even without incorporating the masking technique in the implementation,
we can still achieve the same performance using the efficient task equivalence model. The task
equivalence model is a general approach; we can apply it to general problem domains to
improve performance.
6. RELATED WORK
Hierarchical Task Network (HTN) planning [3] is an approach concerning a set of tasks to be
carried out, together with constraints on the ordering of the tasks and the possible assignments of
task variables. HTN does not maintain the Markov properties we utilize in the POMDP
problem domains.
Several hierarchical approaches for POMDPs have been proposed [7, 12]. From some perspective,
our approach also has hierarchical features. However, we try to weaken the hierarchy in the
TAM approach. Our optimal solution is mainly achieved in the task view. The action view is ultimately
used as knowledge to build up the task view, through a learning process. Thus, the model we use to solve a
problem is not a hierarchical model. This is helpful for the modeling of
complex tasks, because complex tasks themselves may have inherent hierarchy or network
relationships, rather than the hierarchy between task and action.
MAXQ [5] is a successful approach defined for MDP problems. Primitive actions and
subtasks are all organized as nodes, called subroutines, in the MAXQ graph. An
alternative algorithm is HEXQ [13], which automates the decomposition of a problem
from the bottom up by finding repetitive regions of states and actions. Policy iteration is also used for
hierarchical planning, in the hierarchical Finite-State Controller (FSC) approach [7]. The FSC method
leverages a programmer-defined task hierarchy to decompose a POMDP into a number of
smaller, related POMDPs.
A similar approach to the solving of complex tasks is the decomposition technique for
POMDP problems [4], which decomposes global planning problems into a number of local problems
and solves these local problems separately.
Another approach that helps improve the efficiency of POMDPs is to reduce the state space,
called value-directed compression. A linear compression technique is proposed in [14].
This approach does not concern task domains or task relationships.
An equivalence model for MDPs is discussed in [6]. It utilizes a model minimization
technique to reduce the big state space. However, as stated in the same paper, most MDP
problems cannot use this approach to find minimized models.
The goal achievement issue for tasks is discussed in [2]. In that paper, the planning and
execution algorithm is defined within the scope of STRIPS.
7. CONCLUSIONS
We propose the TAM approach to create task equivalence models for MDP and POMDP
problems. Parameter values for a task equivalence model can be learned as model knowledge
using TMDP. As a result, we can solve the problem in the task view, which is no longer
hierarchical. We demonstrate the effectiveness of the task view approach for (PO)MDP problems. It
can greatly reduce the size of the state space and improve the computational capacity of (PO)MDP
algorithms.
Current research on (PO)MDP problems still mostly addresses simple tasks. We hope
the introduction of the TAM approach can be a breakthrough, so that (PO)MDPs can be applied
to the planning and execution of complex task domains.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their valuable comments and
suggestions. This work is supported in part by the National Science Foundation under award
numbers CT-ISG 0716261 and MRI 0923494.
REFERENCES
[1] Boutilier, C., Dearden, R., & Goldszmidt, M. (1995). Exploiting structure in policy construction.
In Proceedings of IJCAI, pp. 1104-1113.
[2] Chang, A., & Amir, E. (2006). Goal achievement in partially known, partially observable
domains. In Proceedings of ICAPS, pp. 203-211. AAAI.
[3] Deák, F., Kovács, A., Váncza, J., & Dobrowiecki, T. P. (2001). Hierarchical knowledge-based
process planning in manufacturing. In Proceedings of the IFIP 11 International PROLAMAT
Conference on Digital Enterprise, pp. 428-439.
[4] Dean, T., & Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains.
In Proceedings of IJCAI, pp. 1121-1127. Morgan Kaufmann.
[5] Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function
decomposition. Journal of Artificial Intelligence Research, 13, 227-303.
[6] Givan, R., Dean, T., & Grieg, M. (2003). Equivalence notions and model minimization in
Markov decision processes. Artificial Intelligence, 147 (1-2), 163-223.
[7] Hansen, E. A., & Zhou, R. (2003). Synthesis of hierarchical finite-state controllers for POMDPs.
In Proceedings of ICAPS, pp. 113-122. AAAI.
[8] Hsiao, K., Kaelbling, L. P., & Lozano-Pérez, T. (2007). Grasping POMDPs. In Proceedings of
ICRA, pp. 4685-4692.
[9] Kurniawati, H., Hsu, D., & Lee, W. S. (2008). SARSOP: Efficient point-based POMDP planning by
approximating optimally reachable belief spaces. In Proceedings of Robotics: Science and
Systems.
[10] Lekavý, M., & Návrat, P. (2007). Expressivity of STRIPS-like and HTN-like planning. In Agent
and Multi-Agent Systems: Technologies and Applications, First KES International Symposium, Vol.
4496, pp. 121-130. Springer.
[11] Littman, M. L., Cassandra, A. R., & Kaelbling, L. P. (1995). Learning policies for partially
observable environments: scaling up. In Proceedings of ICML, pp. 362-370.
[12] Pineau, J., Roy, N., & Thrun, S. (2001). A hierarchical approach to POMDP planning and
execution. In Workshop on Hierarchy and Memory in Reinforcement Learning (ICML).
[13] Potts, D., & Hengst, B. (2004). Discovering multiple levels of a task hierarchy concurrently.
Robotics and Autonomous Systems, 49 (1-2), 43-55.
[14] Poupart, P., & Boutilier, C. (2002). Value-directed compression of POMDPs. In Proceedings
of NIPS, pp. 1547-1554.
[15] Singh, S. P., & Cohn, D. (1997). How to dynamically merge Markov decision processes. In
Proceedings of NIPS.
[16] Smith, T., & Simmons, R. G. (2004). Heuristic search value iteration for POMDPs. In
Proceedings of UAI.
[17] Smith, T., & Simmons, R. G. (2005). Point-based POMDP algorithms: Improved analysis and
implementation. In Proceedings of UAI, pp. 542-547.
More Related Content

DOC
Hibridization of Reinforcement Learning Agents
PPTX
Optimization Using Evolutionary Computing Techniques
PDF
Differential evolution optimization technique
PDF
Applications of Markov Decision Processes (MDPs) in the Internet of Things (I...
PPT
Lecture 4 duality and sensitivity in lp
PDF
Markov decision process
PDF
Generalized Reinforcement Learning
Hibridization of Reinforcement Learning Agents
Optimization Using Evolutionary Computing Techniques
Differential evolution optimization technique
Applications of Markov Decision Processes (MDPs) in the Internet of Things (I...
Lecture 4 duality and sensitivity in lp
Markov decision process
Generalized Reinforcement Learning

What's hot (20)

PDF
International Refereed Journal of Engineering and Science (IRJES)
PDF
International Refereed Journal of Engineering and Science (IRJES)
PDF
Queuing model estimating response time goals feasibility_CMG_Proc_2009_9097
PPS
Revised DEMATEL1
PPT
Reinforcement learning 7313
PDF
Reinforcement learning
PDF
A0311010106
PPTX
An introduction to reinforcement learning (rl)
PPTX
Operation's research models
PDF
REINFORCEMENT LEARNING
DOCX
Linear programing
PPTX
Collaborative filtering with CCAM
PPTX
Linear Programming Problem
PDF
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...
PPTX
Reinforcement Learning : A Beginners Tutorial
PDF
Multiobjective optimization and Genetic algorithms in Scilab
PDF
Multiobjective optimization and trade offs using pareto optimality
PPTX
Concept of Duality
PDF
Nature-Inspired Metaheuristic Algorithms
PDF
A brief introduction to Searn Algorithm
International Refereed Journal of Engineering and Science (IRJES)
International Refereed Journal of Engineering and Science (IRJES)
Queuing model estimating response time goals feasibility_CMG_Proc_2009_9097
Revised DEMATEL1
Reinforcement learning 7313
Reinforcement learning
A0311010106
An introduction to reinforcement learning (rl)
Operation's research models
REINFORCEMENT LEARNING
Linear programing
Collaborative filtering with CCAM
Linear Programming Problem
Particle Swarm Optimization in the fine-tuning of Fuzzy Software Cost Estimat...
Reinforcement Learning : A Beginners Tutorial
Multiobjective optimization and Genetic algorithms in Scilab
Multiobjective optimization and trade offs using pareto optimality
Concept of Duality
Nature-Inspired Metaheuristic Algorithms
A brief introduction to Searn Algorithm
Ad

Viewers also liked (18)

PDF
Active Control of Tool Position in the Presence of Nonlinear Cutting Forces i...
PDF
Mathematical Derivation of Annuity Interest Rate and its Application
PDF
INDEX SORT
PDF
Decision Tree Classifiers to determine the patient’s Post-operative Recovery ...
PDF
Automatic E-Comic Content Adaptation
PDF
A Method to Provide Accessibility for Visual Components to Vision Impaired
PDF
Data mining visualization to support biochemical markers for liver fibrosis i...
PDF
From TION-EMO Theory To A New Beginning Of Artificial Emotions
PDF
Audio Art Authentication and Classification with Wavelet Statistics
PDF
A Simple Integrative Solution For Simultaneous Localization And Mapping
PDF
it & Economic Performance a Critical Review of the Empirical Data
PDF
Computer Assisted System for Enhancing the Application of Ergonomics in Manuf...
PDF
Design of A Spell Corrector For Hausa Language
PDF
Design and Implementation of Sliding Mode Algorithm: Applied to Robot Manipul...
PDF
Multi User Detection in CDMA System Using Linear and Non Linear Detector
PDF
A Business Review of E-Retailing in India
PDF
An Efficient Hybrid Successive Markov Model for Predicting Web User Usage Beh...
PDF
A Robotic Prototype System for Child Monitoring
Active Control of Tool Position in the Presence of Nonlinear Cutting Forces i...
Mathematical Derivation of Annuity Interest Rate and its Application
INDEX SORT
Decision Tree Classifiers to determine the patient’s Post-operative Recovery ...
Automatic E-Comic Content Adaptation
A Method to Provide Accessibility for Visual Components to Vision Impaired
Data mining visualization to support biochemical markers for liver fibrosis i...
From TION-EMO Theory To A New Beginning Of Artificial Emotions
Audio Art Authentication and Classification with Wavelet Statistics
A Simple Integrative Solution For Simultaneous Localization And Mapping
it & Economic Performance a Critical Review of the Empirical Data
Computer Assisted System for Enhancing the Application of Ergonomics in Manuf...
Design of A Spell Corrector For Hausa Language
Design and Implementation of Sliding Mode Algorithm: Applied to Robot Manipul...
Multi User Detection in CDMA System Using Linear and Non Linear Detector
A Business Review of E-Retailing in India
An Efficient Hybrid Successive Markov Model for Predicting Web User Usage Beh...
A Robotic Prototype System for Child Monitoring
Ad

Similar to Planning in Markov Stochastic Task Domains (20)

PPT
Cs221 lecture8-fall11
PPT
Hierarchical Pomdp Planning And Execution
PPT
Hierarchical Pomdp Planning And Execution
PDF
Derya_Sezen_POMDP_thesis
PPT
Hierarchical Reinforcement Learning
PPT
RL intro
PDF
Markovian sequential decision-making in non-stationary environments: applicat...
PDF
Results IPPC 2011: MDPS and POMDPS
PDF
Deep reinforcement learning from scratch
PDF
Introduction to Reinforcement Learning
PDF
Solving Hidden-Semi-Markov-Mode Markov Decision problems
PDF
Optimization of probabilistic argumentation with Markov processes
PDF
MarkovDecisionProcess&POMDP-MDP_PPTX.pdf
ODP
Partially observable Markov decision processes for spoken dialog systems
PDF
[한국어] Safe Multi-Agent Reinforcement Learning for Autonomous Driving
PDF
Intrusion Prevention through Optimal Stopping.
PDF
Reinforcement Learning 3. Finite Markov Decision Processes
PDF
REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION
PDF
REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION
PPTX
REINFORCEMENT_LEARNING POWER POINT PRESENTATION.pptx
Cs221 lecture8-fall11
Hierarchical Pomdp Planning And Execution
Hierarchical Pomdp Planning And Execution
Derya_Sezen_POMDP_thesis
Hierarchical Reinforcement Learning
RL intro
Markovian sequential decision-making in non-stationary environments: applicat...
Results IPPC 2011: MDPS and POMDPS
Deep reinforcement learning from scratch
Introduction to Reinforcement Learning
Solving Hidden-Semi-Markov-Mode Markov Decision problems
Optimization of probabilistic argumentation with Markov processes
MarkovDecisionProcess&POMDP-MDP_PPTX.pdf
Partially observable Markov decision processes for spoken dialog systems
[한국어] Safe Multi-Agent Reinforcement Learning for Autonomous Driving
Intrusion Prevention through Optimal Stopping.
Reinforcement Learning 3. Finite Markov Decision Processes
REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION
REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION
REINFORCEMENT_LEARNING POWER POINT PRESENTATION.pptx

More from Waqas Tariq (20)

PDF
The Use of Java Swing’s Components to Develop a Widget
PDF
3D Human Hand Posture Reconstruction Using a Single 2D Image
PDF
Camera as Mouse and Keyboard for Handicap Person with Troubleshooting Ability...
PDF
A Proposed Web Accessibility Framework for the Arab Disabled
PDF
Real Time Blinking Detection Based on Gabor Filter
PDF
Computer Input with Human Eyes-Only Using Two Purkinje Images Which Works in ...
PDF
Toward a More Robust Usability concept with Perceived Enjoyment in the contex...
PDF
Collaborative Learning of Organisational Knolwedge
PDF
A PNML extension for the HCI design
PDF
Development of Sign Signal Translation System Based on Altera’s FPGA DE2 Board
PDF
An overview on Advanced Research Works on Brain-Computer Interface
PDF
Exploring the Relationship Between Mobile Phone and Senior Citizens: A Malays...
PDF
Principles of Good Screen Design in Websites
PDF
Progress of Virtual Teams in Albania
PDF
Cognitive Approach Towards the Maintenance of Web-Sites Through Quality Evalu...
PDF
USEFul: A Framework to Mainstream Web Site Usability through Automated Evalua...
PDF
Robot Arm Utilized Having Meal Support System Based on Computer Input by Huma...
PDF
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
PDF
An Improved Approach for Word Ambiguity Removal
PDF
Parameters Optimization for Improving ASR Performance in Adverse Real World N...
The Use of Java Swing’s Components to Develop a Widget
3D Human Hand Posture Reconstruction Using a Single 2D Image
Camera as Mouse and Keyboard for Handicap Person with Troubleshooting Ability...
A Proposed Web Accessibility Framework for the Arab Disabled
Real Time Blinking Detection Based on Gabor Filter
Computer Input with Human Eyes-Only Using Two Purkinje Images Which Works in ...
Toward a More Robust Usability concept with Perceived Enjoyment in the contex...
Collaborative Learning of Organisational Knolwedge
A PNML extension for the HCI design
Development of Sign Signal Translation System Based on Altera’s FPGA DE2 Board
An overview on Advanced Research Works on Brain-Computer Interface
Exploring the Relationship Between Mobile Phone and Senior Citizens: A Malays...
Principles of Good Screen Design in Websites
Progress of Virtual Teams in Albania
Cognitive Approach Towards the Maintenance of Web-Sites Through Quality Evalu...
USEFul: A Framework to Mainstream Web Site Usability through Automated Evalua...
Robot Arm Utilized Having Meal Support System Based on Computer Input by Huma...
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
An Improved Approach for Word Ambiguity Removal
Parameters Optimization for Improving ASR Performance in Adverse Real World N...

Recently uploaded (20)

PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
Trump Administration's workforce development strategy
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
Yogi Goddess Pres Conference Studio Updates
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PPTX
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
PDF
RMMM.pdf make it easy to upload and study
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
GDM (1) (1).pptx small presentation for students
PPTX
Cell Types and Its function , kingdom of life
PDF
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Trump Administration's workforce development strategy
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
Final Presentation General Medicine 03-08-2024.pptx
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Final Presentation General Medicine 03-08-2024.pptx
human mycosis Human fungal infections are called human mycosis..pptx
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Yogi Goddess Pres Conference Studio Updates
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
Abdominal Access Techniques with Prof. Dr. R K Mishra
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
O7-L3 Supply Chain Operations - ICLT Program
Supply Chain Operations Speaking Notes -ICLT Program
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
RMMM.pdf make it easy to upload and study
STATICS OF THE RIGID BODIES Hibbelers.pdf
GDM (1) (1).pptx small presentation for students
Cell Types and Its function , kingdom of life
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf

Planning in Markov Stochastic Task Domains

  • 1. Yong (Yates) Lin & Fillia Makedon International Journal of Artificial Intelligence and Expert Systems (IJAE), Volume (1): Issue (3) 54 Planning in Markov Stochastic Task Domains Yong (Yates) Lin ylin@uta.edu Computer Science & Engineering University of Texas at Arlington Arlington, TX 76019, USA Fillia Makedon makedon@uta.edu Computer Science & Engineering University of Texas at Arlington Arlington, TX 76019, USA Abstract In decision theoretic planning, a challenge for Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) is, many problem domains contain big state spaces and complex tasks, which will result in poor solution performance. We develop a task analysis and modeling (TAM) approach, in which the (PO)MDP model is separated into a task view and an action view. In the task view, TAM models the problem domain using a task equivalence model, with task-dependent abstract states and observations. We provide a learning algorithm to obtain the parameter values of task equivalence models. We present three typical examples to explain the TAM approach. Experimental results indicate our approach can greatly improve the computational capacity of task planning in Markov stochastic domains. Keywords: Markov decision processes, POMDP, task planning, uncertainty, decision-making. 1. INTRODUCTION We often refer to a specific process with goals or termination conditions as a task. Tasks are highly related to situation assessment, decision making, planning and execution. For each task, we achieve the goals by a series of actions. Complex task contains not only different kinds of actions, but also various internal relationships, such as causality, hierarchy, etc. Existing problems of (PO)MDPs have often been constrained in small state spaces and simple tasks. For example, Hallway is a task in which a robot tries to reach a target in a 15-grids apartment [11]. From the perspective of task, this process has only a single goal. The difficulties come from noisy observations by imprecise sensors equipped on the robot, instead of task. Although (PO)MDPs have been accepted as successful mathematical approaches to model planning and controlling processes, without an efficient solution for big state spaces and complex tasks, we cannot apply these models on more general problems in the real world. In a simple task of grasping an object, the number of states reaches |S| =1253 [8]. If the task domain becomes complex, it will be even harder to utilize these models. Suppose an agent aims to build a house, there will be thousands of tasks, with different configurations of states, actions and observations. It is hardly to rely simply on (PO)MDPs to solve this problem domain. Compared to other task planning approaches, such as STRIPS or Hierarchical Task Network [10], (PO)MDPs consider
  • 2. Yong (Yates) Lin & Fillia Makedon International Journal of Artificial Intelligence and Expert Systems (IJAE), Volume (1): Issue (3) 55 the optimization for every step of the planning. Therefore, (PO)MDPs are more suitable for planning and controlling problems of intelligent agents. However, for task management, the (PO)MDP framework is not as powerful as Hierarchical Task Network (HTN) planning [3]. HTN is designed to handle problems with many tasks. Primitive tasks can be executed directly, and non- primitive tasks will be decomposed into subtasks, until everyone becomes a primitive task. This idea is adopted in the hierarchical partially observable Markov decision processes (HPOMDPs) [12]. Actions in HPOMDPs are arranged in a tree. A task will be decomposed into subtasks. Each subtask has an action set containing primitive actions and/or abstract actions. In fact, a hierarchical framework for (PO)MDPs is an approach that builds up a hierarchical structure to invoke the abstract action sub-functions. Although inherited the merits of task management from HTN, it does not specially address the solving of the big state space problem. Another solution considers multiple tasks as a merging problem using multiple simultaneous MDPs [15]. This solution does not specially consider the characteristic of different tasks, and it limits the problem domains to be MDPs. To improve the computational capacity of complex tasks planning, we develop a task analysis and modeling approach. We decompose the model into a task view and an action view. This enables us to strip out the details, such that we can focus on the task view. After a learning process from the action view, the task view becomes an independent task equivalence model, with task-dependent abstract states and observations. If the problem domain is MDP, we have already solved it by the task view learning algorithm. If it is POMDP, we can solve it using any existing POMDP algorithms, without considering the hierarchical relationship anymore. We apply the TAM approach on existing MDP and POMDP problems. Experimental results indicate the TAM approach brings us closer to the optimum solution of multi-task planning and controlling problems. This paper is organized as follows. We begin by a brief review of MDPs and POMDPs. Then we discuss how to utilize the TAM method on (PO)MDP problems. Three typical examples from MDPs and POMDPs are presented in this part, to explain the design of task equivalence models. In the following section, we present a solution based on knowledge acquisition and model- learning for the task equivalence models. We provide our experimental results for the comparison of the task equivalence model and the original POMDP model. Finally, we briefly introduce some related work and conclude the paper. 2. BACKGROUND A Markov decision process (MDP) is a tuple , where the S is a set of states, the A is a set of actions, the T(s, a, s0) is the transition probability from state s to s0 using action a, R(s, a) is the reward when executing action a in state s, and is the discount factor. The optimal situation-action mapping for the t th step, denoted as , can be reached by the optimal (t-1)-step value function : . A POMDP models an agent action in uncertainty world. At each time step, the agent needs to make a decision based on the historical information from previous executions. A policy is a function of action selection under stochastic state transitions and noisy observations. 
A POMDP can be represented as , where is a finite set of states, is a set of actions, is a set of observations. In each time step, the agent lies in a state . After taking an action , the agent goes into a new state s0. The transition is a conditional probability function T(s, a, s’) = p(s’|s, a), which presents the probability the agent lies in s’, after taking action a in state s. The agent makes an observation to gather information. This
  • 3. Yong (Yates) Lin & Fillia Makedon International Journal of Artificial Intelligence and Expert Systems (IJAE), Volume (1): Issue (3) 56 can be modeled as a conditional probability (s, a, o) = p(o|s, a). When belief state is taken into consideration, the original partially observable POMDP model changes to a fully observable MDP model, denoted as . Here, the B is a set of belief states, i.e. belief space. The is a probability the agent changes from b to b0 after taking action a. The is the reward for belief state b. The b0 is an initial belief state. The POMDP framework is used as a control model of an agent. In a control problem, utility is defined as a real-valued reward to determine the action of an agent in each time step, denoted as R(s, a), which is a function of state s and action a. The optimal action selection becomes a problem to find a sequence of actions a1..t, in order to maximize the expected sum of rewards . In this process, what we concern is the controlling effect, achieved from the relative relationship of the values. When we use a discount factor , the relative relationship remains unchanged, but the values can converge to a fixed number. When states are not fully observable, the goal is changed to maximize expected reward for each belief state. The n th horizon value function can be built from previous value n th using a backup operator H, i.e. V = HV’. The value function is formulated as the following Bellman equation . Here, b’ is the next step belief state, , where is a normalizing constant. When optimized exactly, this value function is always piece-wise linear and convex in the belief space. 3. EQUIVALENCE MODELS ON TASK DOMAINS Tasks serve as basic units of everyday activities of humans and intelligent agents. A task- oriented agent builds its policies on the context of different tasks. Generally speaking, a task contains a series of actions and some certain relationships, with an initial state s0, where it starts from, and one or multiple absorbing states sg (goals and/or termination states), where the task ends in. (RockSample [16] is a typical example using termination state instead of goals. Theoretically, infinite tasks may not have goal or termination state, we can simply set sg = null). From this notion, every (PO)MDP problem can be described as a task (For POMDP, the initial state becomes b0, and absorbing states become bg). To improve the computational capacity of task planning, we develop a task analysis and modeling (TAM) approach. 3.1 Task Analysis Due to the size of state space, and the complex relationships among task states, it is hard to analyze tasks. Therefore, we separate a task, which is a tuple M, into a task view and an action view. The task view, denoted as , reflects how we define an abstract model for the original task. Actions used in a task view is defined in an action view, denoted as . It contains all of the actions in the original task. Before further discussion about the task view and the action view, let us first go over some terms used in this framework. An action a is a single or a set of operational instructions an agent takes to finish a primitive task. A Markov decision model is a framework to decide which action should be taken in each state. If an action defined in a Markov stochastic domain is used by a primitive task, we assume it to be a primitive action.
  • 4. Yong (Yates) Lin & Fillia Makedon International Journal of Artificial Intelligence and Expert Systems (IJAE), Volume (1): Issue (3) 57 We define an abstract action ac as a set of actions. We want to find out the ac that is related with a set of abstract states in , with transitional relationship. Only this kind of ac will contribute to our model. In , ac is like a subtask. It has an initial state s0, an absorbing state sg (goal or termination state). In , we deal ac the same as a primitive action. With the help of abstract actions, we build an equivalence model for the original task M on the abstract task domain, using . Our purpose is to find out a set of policies, such that the overall cost is minimized, and the solution is optimum. Denote the optimal policies as . Definition 1. Given a Markov stochastic model , if there exist a pair of task view and action view , such that , we say is an equivalence model of , denoted as . Let us introduce the equivalence model for MDP and POMDP task domains respectively. 3.2 MDP Task For an MDP task , a well-defined task view , and a well-defined ac in action view are both MDP models. For the task view, , where is a set of states for the task, is a set of actions for the task, including primitive actions and abstract actions . The is the transition array, and is a set of rewards. For the action view, . For each . Up to now, we still have not explained how to build the model of task view . In order to know the details of task states, in the TAM approach, we develop a Task State Navigation (TSN) graph to clearly depict the task relationship. A TSN graph contains a set of grids. Each grid represents a state of the task, labeled by the state ID. Neighboring grids with ordinary line indicate there is a transitional relationship among these states. Neighboring grids with bold line indicate there is no transitional relationship. Let us take the taxi task [5] as an example, to interpret how to build the TSN graph, as well as how to construct the equivalence model . The taxi task is introduced as an episodic task. A taxi inhabits a 5×5 grid world. There are four specially-designated locations {R, B, G, Y} in the world. In each episode, the taxi starts in a randomly-chosen state. There is a passenger at one of the four randomly chosen locations, and he wishes to be transported to one of four locations. The taxi has six actions {North, South, East, West, Pickup, Putdown}. The episode ends when the passenger has been putdown at the destination. This is a classical problem, used by many hierarchical MDP algorithms to build the models. We present the TAM solution here. First, we build the TSN graph for taxi task in Figure 1. Label Te represents the taxi is empty, and Tu indicates the taxi has user. Lt is the start location of taxi, Lu is the location of user, and Ld is the location of destination. There are 5 task states in the TSN graph, {TeLt, TeLu, TuLu, TuLd, TeLd}. The initial state is s0 = TeLt, representing an empty taxi in the random location. The absorbing state (goal) is sg = TeLd, representing the taxi is empty and at the user's destination. We mark a star in the grid of the absorbing state. A reward of +20 is given for a successful passenger delivery, a penalty of -10 for performing Pickup or Putdown at wrong locations, and -1 for all other actions. From the TSN graph, it is clear that the taxi task is a simple linear problem. The transition probabilities for the neighboring states are 1. 
There are four actions in the task domain, Pickup, Putdown}, where is the abstract action going from Lt to Lu, and is the abstract action going from Lu to Ld. This model has two abstract actions , and it is easy to know that . However, and
  • 5. Yong (Yates) Lin & Fillia Makedon International Journal of Artificial Intelligence and Expert Systems (IJAE), Volume (1): Issue (3) 58 have different s0 and sg. We call this kind of abstract actions isomorphic action. Details about how to solve are discussed in next section. In this section, we will continue to introduce the TAM approach for POMDP task. 3.3 POMDP Task For a POMDP task , where b0 is the initial belief state, bg are the absorbing belief states (goal belief states and/or termination belief states). The equivalence model can be . In this work, we only consider the most common model, . Some existing POMDP problems have simple relationship in task domain, such that there is no abstract action. Thereafter, the equivalence model becomes a single model, . The coffee task [1] has this kind of equivalence task model. It can be solved using action network and decision tree in [1]. Here, we propose the TAM approach for coffee task. The TSN graph for coffee task is shown in Figure 2. The L is an appointed as the initial state in the coffee. From the beginning O, since the weather has 0.8 probability to be rainy, denoted as R, and 0.2 probability to be sunny, denoted as ¬R, we get the transition probability from L to R and ¬R. If the weather is rainy, the agent needs to take umbrella with successful probability of 0.8, and 0.2 to fail. We denote the agent with umbrella as U, and ¬U if it fails to take umbrella. If the agent has umbrella, it has probability 1 to be ¬L/¬W (dry when it comes to the shop). If it has no umbrella and the weather is rainy, the agent will be ¬L/¬W for 0.2, and be ¬L/W (wet in shop) for 0.8. The ¬L/W has probability 1 to be C/W (coffee wet), and the ¬L/¬W has probability 1 to be C/¬W. Whether it be C/W or C/¬W, the agent has 0.9 probability to deliver the coffee to the user H/W (user has wet coffee), and the H/¬W (user has dry coffee). The agent has 0.1 probability to fail to deliver the coffee ¬H/W (user does not have coffee and coffee wet), ¬H/¬W (user does not have coffee and coffee dry). There are 11 observations for this problem: r (rainy), ¬r (sunny), u (agent with umbrella), ¬u (agent without umbrella), w (agent wet), ¬w (agent dry), nil (none), h/w (user with coffee and coffee wet), ¬h/w (user without coffee and coffee wet), ¬h/¬w (user without coffee and coffee dry), h/¬w (user with coffee and coffee wet). The observation probability for rainy when it is raining is 0.8, and the observation probability is 0.8 for sunny when it is sunshine. The agent gets a reward of 0.9 if the user has coffee and 0.1 if it stays dry.
The POMDP problem RockSample[n, k] [16] has a task domain model with both a task view and an action view. The difficulty of RockSample POMDP problems lies in their big state spaces. RockSample[n, k] describes a rover that samples rocks in a map of size n×n. Each of the k rocks has equal probability of being Good or Bad. If the rover samples a Good rock, the rock becomes Bad and the rover receives a reward of 10; if the rock is Bad, the rover receives a reward of -10. All other moves have no cost or reward. The observation probability p for Checki is determined by the efficiency η, which decreases exponentially with the Euclidean distance from the target: with η = 1 the check always returns the correct value, and with η = 0 it returns Good or Bad with equal probability.

In the TSN graph for RockSample[4, 4] (Figure 3), S0 represents the rover at its initial location, Ri represents the rover located at rocki, and Exit is the absorbing state. Apart from the Exit, each of the 5 position grids (S0 and R1, ..., R4) is associated with 16 task states, one for each {Good, Bad} configuration of the 4 rocks; thus |St| = 5 × 16 + 1 = 81. For observations, |Ot| = 3k + 2: 1 observation for the rover being at a place without a rock, k observations for the rover being at a rock, 2k observations for the Good and Bad values of each rock, and 1 observation for the Exit. There are 2k^2 + k + 1 actions in the task domain: the primitive actions Check1, ..., Checkk and Sample, together with the abstract actions; for each Ri, there are k-1 abstract actions going to Rj (i ≠ j), 1 abstract action going to Exit, k-1 abstract actions going from Rj to Ri (i ≠ j), and 1 abstract action going from S0 to Ri. All abstract actions for a specific RockSample[n, k] problem are isomorphic. It is possible that an isomorphic abstract action ac relates to multiple states. We assign an index to each state related to ac, and call it the y index, denoted y(s), where s is the state. The sizes of the resulting task view are easy to compute, as sketched below.
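As a quick check of the counts above, the sizes of the RockSample task view can be computed directly. The helper below is our own illustration and assumes, as in the text, one initial grid S0, one grid Ri per rock, and one Exit state.

def rocksample_task_view_sizes(k: int):
    """Sizes of the RockSample[n, k] task view, following the counts in the text.

    States:       (k + 1) position grids (S0, R1..Rk), each combined with the
                  2**k Good/Bad rock configurations, plus the Exit state.
    Observations: 1 (no rock here) + k (at rock i) + 2k (Good/Bad per rock)
                  + 1 (Exit) = 3k + 2.
    """
    n_states = (k + 1) * (2 ** k) + 1
    n_observations = 3 * k + 2
    return n_states, n_observations

# RockSample[4, 4]: 5 * 16 + 1 = 81 task states and 14 task observations.
assert rocksample_task_view_sizes(4) == (81, 14)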
4. SOLVING TASK EQUIVALENCE MODELS BY KNOWLEDGE ACQUISITION

For a simple task domain problem, such as the coffee task, the equivalence model only has a task view Mt, without abstract actions. The solution is then the same as for any other POMDP problem. The difference between task equivalence models and the original POMDPs lies in the task models, not in the algorithms.

4.1 Learning Knowledge for the Model from TMDP

Our purpose in designing task domains is to handle complex task problems efficiently. This can be achieved by a learning process. The taxi task has an equivalence model (Mt, Ma) with two isomorphic abstract actions. The idea is that, once the knowledge of each ac has been acquired from the action view Ma, the task view Mt can be solved using standard MDP algorithms. Currently, the only knowledge missing from Mt is the reward Rt(s, ac) of the abstract actions. Next, we obtain it by a Task MDP (TMDP) value iteration.

Details of TMDP value iteration are listed in Algorithm 1. The difference between the TMDP algorithm and a standard MDP algorithm is that TMDP contains an action-view single-step value iteration, shown in Algorithm 2. When T(s, a, s) = 1, action a does not change state s, i.e., it is not related to s; we therefore rely on the test T(s, a, s) < 1 to bypass these unrelated actions. For each state s in the MDP model of ac, the optimal policy in the action view is determined, and its value is influenced by the y index. Finally, we obtain the reward Rt(s, ac) from the resulting value difference. A sketch of this procedure is given below.
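The following is a minimal sketch of value iteration over one abstract action's MDP, using the T(s, a, s) < 1 test to skip actions unrelated to a state. It is our own reconstruction for illustration; the function name, the dictionary encoding of T and R, and the parameters gamma and eps are assumptions, and the paper's Algorithm 1 (TMDP value iteration) and Algorithm 2 (action-view single-step update), including the y-index handling and the exact value difference that yields Rt(s, ac), are not reproduced here.

from collections import defaultdict

def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
    """Value iteration for one abstract action's MDP (the action view).

    T[(s, a)] is a list of (s_next, prob) pairs; R[(s, a)] is the immediate
    reward.  Actions whose only effect on s is a self-transition with
    probability 1 are skipped, as described in the text.
    """
    V = defaultdict(float)
    while True:
        delta = 0.0
        for s in states:
            best = None
            for a in actions:
                succ = T.get((s, a), [])
                if not succ:
                    continue
                # Skip unrelated actions: T(s, a, s) = 1 means a pure self-loop.
                if any(s_next == s and p >= 1.0 - 1e-12 for s_next, p in succ):
                    continue
                q = R.get((s, a), 0.0) + gamma * sum(p * V[s_next] for s_next, p in succ)
                best = q if best is None else max(best, q)
            if best is not None:
                delta = max(delta, abs(best - V[s]))
                V[s] = best
        if delta < eps:
            return dict(V)

Under these assumptions, the values of the subtask's initial states can then be used to derive Rt(s, ac) for the task view; the precise difference used for this purpose is the one specified in the paper's Algorithm 1.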
4.2 Improved Computational Capacity by Task Equivalence Models

As a result of the learning process, we have obtained the reward knowledge for the task view and the trained models of the abstract actions for the action view. Thus, we can focus on the task view alone in future computation of POMDP problems. After the fully observable task view is learned, the partially observable task view becomes a general POMDP problem, and we can solve it using any existing POMDP algorithm. RockSample[n, k] is an example of an equivalence model with both a task view and an action view. In our POMDP value iteration algorithm, the computational cost per iteration is determined by the sizes of the model arrays; the sizes of the different arrays are listed in Table 1. In each round of value iteration, by rough estimation, we can compare the computational complexity of the task view Mt with that of the original model M. This conclusion can be applied to general POMDP problems that can be transformed into a task equivalence model with abstract actions, whose task view has k task nodes (not including the initial and absorbing nodes). When the original state space is sufficiently large relative to the resulting task view, the equivalence model created by the TAM approach can greatly improve the computational capacity; otherwise, the equivalence model cannot improve the performance, and may even slightly degrade the computational capacity. We can take this conclusion as a condition for when applying the TAM approach to a POMDP task will improve performance, although the approach itself can be used on every existing POMDP problem.

In sum, the TAM approach first analyzes the task model and creates a task view and an action view. The action view is responsible for learning the model knowledge. The trained action view is then saved for future computation in the task view. The task view is an equivalence model with better computational capacity than the original POMDP model.

5. EXPERIMENTAL RESULTS

In order to provide a better understanding and a detailed evaluation of the TAM approach, we implement several experiments in simulation domains using MATLAB. In the experiments, we aim to measure the reward and execution time of the task equivalence models for the aforementioned problems. Results are averaged over 10 executions of each problem, except for RockSample[10, 10]; since a single execution of RockSample[10, 10] takes over one week on our system, it is executed only once. All of the experiments are run in the same software and hardware environment, with the same POMDP algorithm. For the equivalence models, the action view Ma is pre-computed, and the system uses its trained data. In the experiments, the performance of every task equivalence model improves greatly over that of the original POMDP model, except for RockSample[5, 7]. The performance comparison of the different models is presented in Table 2. Since the Taxi and Coffee domains execute fast enough, we implement them directly; the detailed comparison is made on the RockSample domains.
The equivalence model has a much smaller state space than the original plain POMDP. Although it increases the size of the action space, its observation space is also smaller than that of the plain POMDP model. As a result, the performance is improved in each domain; in particular, for RockSample[10, 10], the execution time is only 1/707 of that of the original model. These results agree well with our previous analysis: for RockSample[n, k], the greater n and k are, the greater the performance advantage of the equivalence model over the original model.

Considering results from prior work, the HSVI2 algorithm is able to finish RockSample[10, 10] in 10014 seconds [17]. It is more efficient than the original model implemented on our platform, and close to the task equivalence model, which is a more effective model than the plain one used in HSVI2. As discussed in [9], HSVI2 implements an α-vector masking technique, which opportunistically computes selected entries in the α-vectors. This technique is beneficial for the special structure of RockSample, in which the movement of the robot is deterministic and its position is fully observable; its effectiveness degrades for uncertain movements and noisy observations. Our experimental results imply that, even without incorporating the masking technique in the implementation, we can still achieve the same performance using the efficient task equivalence model. The task equivalence model is a general approach; we can apply it to general problem domains to improve performance.

6. RELATED WORK

Hierarchical Task Network (HTN) planning [3] is an approach concerning a set of tasks to be carried out, together with constraints on the ordering of the tasks and the possible assignments of task variables. HTN does not maintain the Markov properties we utilize in POMDP problem domains. Several hierarchical approaches for POMDPs have been proposed [7, 12]. From some perspective, our approach also has hierarchical features. However, we try to weaken the hierarchy in the TAM approach: the optimal solution is mainly achieved in the task view, and the action view is ultimately used, through a learning process, as knowledge to build up the task view. Thus, the model we actually solve is not a hierarchical model. This is helpful for modeling complex tasks, because complex tasks themselves may have inherent hierarchy or network relationships, rather than the hierarchy between task and action.

MAXQ [5] is a successful approach defined for MDP problems. Primitive actions and subtasks are all organized as nodes, called subroutines, in the MAXQ graph. An alternative algorithm is HEXQ [13], which automates the decomposition of a problem bottom-up by finding repetitive regions of states and actions. Policy iteration has also been used for hierarchical planning with hierarchical finite-state controllers (FSCs) [7]; the FSC method leverages a programmer-defined task hierarchy to decompose a POMDP into a number of smaller, related POMDPs. A similar approach to solving complex tasks is the decomposition technique for POMDP problems [4], which decomposes a global planning problem into a number of local problems and solves these local problems separately. Another approach that helps improve the efficiency of POMDPs is to reduce the state space, called value-directed compression; a linear-loss compression technique is proposed in [14].
This approach does not consider task domains or task relationships. An equivalence model for MDPs is discussed in [6], which utilizes a model-minimization technique to reduce large state spaces. However, as stated in the same paper, for most MDP problems this approach cannot find minimized models.
The goal achievement issue for tasks is discussed in [2]. In that paper, the planning and execution algorithm is defined within the scope of STRIPS.

7. CONCLUSIONS

We propose the TAM approach to create task equivalence models for MDP and POMDP problems. Parameter values for a task equivalence model can be learned as model knowledge using TMDP. As a result, we can solve the problem in the task view, which is no longer hierarchical. We demonstrate the effectiveness of the task view approach for (PO)MDP problems: it can greatly reduce the size of the state space and improve the computational capacity of (PO)MDP algorithms. Current research on (PO)MDP problems still addresses mostly simple tasks. We hope the introduction of the TAM approach can be a breakthrough, so that (PO)MDPs can be applied to the planning and execution of complex task domains.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is supported in part by the National Science Foundation under award numbers CT-ISG 0716261 and MRI 0923494.

REFERENCES

[1] Boutilier, C., Dearden, R., & Goldszmidt, M. (1995). Exploiting structure in policy construction. In Proceedings of IJCAI, pp. 1104-1113.
[2] Chang, A., & Amir, E. (2006). Goal achievement in partially known, partially observable domains. In Proceedings of ICAPS, pp. 203-211. AAAI.
[3] Deák, F., Kovács, A., Váncza, J., & Dobrowiecki, T. P. (2001). Hierarchical knowledge-based process planning in manufacturing. In Proceedings of the IFIP 11th International PROLAMAT Conference on Digital Enterprise, pp. 428-439.
[4] Dean, T., & Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains. In Proceedings of IJCAI, pp. 1121-1127. Morgan Kaufmann.
[5] Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227-303.
[6] Givan, R., Dean, T., & Grieg, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2), 163-223.
[7] Hansen, E. A., & Zhou, R. (2003). Synthesis of hierarchical finite-state controllers for POMDPs. In Proceedings of ICAPS, pp. 113-122. AAAI.
[8] Hsiao, K., Kaelbling, L. P., & Lozano-Pérez, T. (2007). Grasping POMDPs. In Proceedings of ICRA, pp. 4685-4692.
[9] Kurniawati, H., Hsu, D., & Lee, W. S. (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of Robotics: Science and Systems.
[10] Lekavý, M., & Návrat, P. (2007). Expressivity of STRIPS-like and HTN-like planning. In Agent and Multi-Agent Systems: Technologies and Applications, First KES International Symposium, Vol. 4496, pp. 121-130. Springer.
[11] Littman, M. L., Cassandra, A. R., & Kaelbling, L. P. (1995). Learning policies for partially observable environments: scaling up. In Proceedings of ICML, pp. 362-370.
[12] Pineau, J., Roy, N., & Thrun, S. (2001). A hierarchical approach to POMDP planning and execution. In Workshop on Hierarchy and Memory in Reinforcement Learning (ICML).
[13] Potts, D., & Hengst, B. (2004). Discovering multiple levels of a task hierarchy concurrently. Robotics and Autonomous Systems, 49(1-2), 43-55.
[14] Poupart, P., & Boutilier, C. (2002). Value-directed compression of POMDPs. In Proceedings of NIPS, pp. 1547-1554.
[15] Singh, S. P., & Cohn, D. (1997). How to dynamically merge Markov decision processes. In Proceedings of NIPS.
[16] Smith, T., & Simmons, R. G. (2004). Heuristic search value iteration for POMDPs. In Proceedings of UAI.
[17] Smith, T., & Simmons, R. G. (2005). Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of UAI, pp. 542-547.