Return to Index
Operations Research Models and Methods
 
Computation Section
Subunit Dynamic Programming Examples
 - Doors Problem

The inspiration for this MDP version of the Random Walk problem is the paper Combinatorial Design of a Stochastic Markov Decision Process, by Nedialko B. Dimitrov and David P. Morton (in Operations Research and Cyber-Infrastructure, M.J.\ Saltzman, J.W. Chinnek and B.\ Kristjansson (eds.), Springer, New York, 2009, 167-193). It is the MDP option of the Random Walk problem described in the DP Data section of this site.

One of the problems considered by this paper is described by the authors in this slightly modified quotation.

Suppose an individual moves randomly on the grid network depicted in Figure 1.1 (repeated below). The top-left cell (northwest corner) and the bottom-right cell (southeast corner) are special. We seek to guide the individual to the northwest corner, but he vanishes from the grid if he first reaches the southeast corner. The set of grid cells, i.e., nodes in the individual’s spatial network, form the state space. There are five actions available in each cell: do nothing, close a one-way door blocking the north exit, the south exit, the east exit, or the west exit. If we do nothing, the individual has a 1/4 probability of transitioning to the adjacent cell to the north, south, east, or west in the next time period. If we close a one-way door blocking one of the cell’s exits, the individual has a 1/3 probability of exiting to each of the remaining three neighboring cells. The doors are one-way since they block movement out of, but not into, the cell. The special cells are different: If the individual reaches the northwest corner, or the southeast corner he departs the grid in the next time step, regardless of the action we take. In the former case we receive a unit reward. The one-step reward for the latter case, along with that of all other cells, is 0. As a result, the goal is to maximize the probability the individual reaches the northwest corner prior to reaching the southeast corner and vanishing. Equivalently, the goal can viewed as “protecting” the southeast corner, i.e., as minimizing the probability the individual is allowed to transit into that cell.

The add-in does not solve exactly the problem considered in the paper, and the paper goes far beyond the simplest of the problems considered here.

The MDP model adds the blocking action to the Markov Model of the random walk problem. At each state, we allow the action of blocking movement in one of the directions. The figure below illustrates the problem. For this case we assume that the grid lines wrap. When the walker reaches one of the boundaries and attempts to move beyond it, the walker goes to the opposite boundary. For example with the walker at (0, 1) and attempts to move north, he will find himself at the next step at (3, 1). The wrapping is illustrated by the oval shapes that appear to go behind the grid.

Two states are designated as final states, (0,0) and (3,3). If the walker wanders to state (0,0) indicated by the gold color, we win a reward of $1 and the walker leaves the grid and goes to a final state that is external to the grid. If the walker wanders to state (3,3) indicated by the black color, we win no reward and the walker leaves the grid and again goes to the final state. This is called a transient MDP because the walker will eventually leave the grid.

The actions of the MDP involve closing the door against movement in one of the four directions. In each state we may decide to restrict movement or not. If restricted, we must choose the north, south, west or east direction. We combine the decision to restrict with the four movement directions by including the null decision along with the four directions. If not restricted, the walker moves in the four directions with equal probability of 1/4. When the walker is restricted, the probability of moving in the restricted direction is 0, and the probability of moving in each of the other directions is increased to 1/3. Doors are one-way in that they only restrict movement from the state restricted, not from some other state moving into the state restricted.

We illustrate a door by the short blue line close to the node representing the state. The solution shown has the black node entirely sealed off. All avenues of direct movement into the black node are blocked. Note that all four doors of the solution are not simultaneously present. A door is closed only when the walker is in the associated state.

 

Data

 

 

The dialog for the MDP version of Random Walk data type is shown at the left. Only the Final option is allowed for exit states. For this example we choose the wrap option for boundaries, and specify 2 as the number of exit states.

The data form for the Doors problem shows the data that is fixed for the model with yellow cells.

The decision to place a door at a cell is called the blocking decision and this form has data defining the cost of placing the block.

We have filled in the specific data for the problem. Note the prize for the final state (0,0) is set to 1. In contrast to the problem presented in the paper we indicate a block prize of -0.001. This is a small penalty placed on using a door. Otherwise, there are many alternative optimum solutions to the problem. The small penalty will encourage a solution with as few doors as possible.

There are five moving decisions, one for each direction and one for the no-move option. To correctly reflect the doors problem the probability of the null move must set at 0.

 

States

 

Clicking the Build Model button creates a worksheet for the model. The MDP model has all three elements, States, Actions and Events. The following discusses each element and illustrates the associated lists that define the problem for the DP Solver.

 
The state space is identified by enumerating all integer indices from 0 to 15 in cell E14. We choose the index 12 and the resultant state (3,0) for illustration.

The state list, obtained by enumeration, has sixteen states representing the nodes on the grid and one additional state called the final state. This new state is included for any problem that has exiting states designated as a final state.

The final states are designated in the state subset range of the worksheet shown below. Here states (0,0) and (3,3) are defined as exit states with the word TRUE in V37 and V43. The prize of 1 for exiting at state (0,0) is in V36 of the first subset. The third subset includes all the non-exit states.

 

Actions

 

The five actions are identified with the integers 1 through 5. For example, the integer 4 is the block action of closing a door to the east. The penalty for using the blocking action is in M21. The cell holds a formula that links to the corresponding value on the data worksheet.

The blocked probability cell in M22 is transferred from the move probability cells on the data worksheet. The value shown is the move to the east probability.

The action list is obtained by the enumerating the actions.

 

Events

 

There are four move events. The interesting thing about this problem is the computation of the event probability in cell R22. Cell R22 holds a formula that sums P22 and Q22. Cell Q22 holds the value of the move probability transferred from the data worksheet.

Cell P22 holds the contribution from the block decision. When it is the same as the move event, then the negative of the blocked probability is entered in P22 so that the net probability is 0.

When the move is different than the block decision as is shown at the left, 1/3 of the blocked probability is placed in P22. The net probability of the westward move is the desired value of 1/3. The evaluation of P22, Q22, and R22 are all accomplished through Excel formulas. The results are valid with any valid move probabilities entered on the data worksheet.

The formula in P22 is the only reference to the action required on the model worksheet.

The event list is obtained by the enumerating the events.

 

Decisions

 

The decision list is created by enumerating all states and all actions. Only feasible combinations are included. There is only one decision for state (0,0). Since (0,0) is final state, no actions are enumerated and the action in state (0,0) is specified as Null. The prize for state (0,0) as well as the penalties for all blocking decisions are in column Z. There are no decision blocks for this model, so the list is a straight forward enumeration of all state/action options, except for states (0,0) and (3,3).

The last few decisions are shown below to illustrate the transition at the exit state (3,3). Row 72 shows the transition to the final state with the associated prize. The final state is in row 73.

 

Transitions

 

The transitions are defined by the transition blocks of the model. Although the actions are included in the transition block definition, the same transition blocks are used as were used in the Markov Chain models. The actions play no role in the transitions except in the determination of the probability of transition in the event description. These are transferred to the transition blocks and ultimately appear as the transition probabilities in column AG.

Each decision has a set of transitions, and each member of the set is the result of a different event. For the example, the first row indicates the transition from state (0,0) to the final state. Transitions 2 through 4 are for the action of blocking north from state (0,1). The move north event is not shown because its probability is zero. The other three move options each have a probability of 1/3. Transition 5 illustrates a wrap transition where the state (0,1) moves to state (3,1) with a northward move. Only the null action has four move events, each with probability 1/4.

There are 227 transitions for the model. The transition from (3,3) to the final state is 226. The final state (state 17) is an absorbing state with 0 transition cost and probability equal to 1.

 

Click the Transfer to DP Solver button at the top of the models worksheet to automatically build the Solver model and fill the solver worksheet with the five lists shown above. The lists of states, actions, events, decisions and transitions comprise the input data for DP Solver model. The solution obtained is described on the next page.

 
  
Return to Top

tree roots

Operations Research Models and Methods
Internet
by Paul A. Jensen
Copyright 2004 - All rights reserved