Return to Index
Operations Research Models and Methods
 
Computation Section
Subunit Dynamic Programming Examples
 - Log Gambler: Solution

Clicking the Transfer to DP Solver button on the model worksheet causes the DP Solver worksheet to be constructed to hold the information included in the lists derived by the enumeration process. The upper left corner of the solver worksheet is shown below. The worksheet stores the final values in column H. This example illustrates the use of the final value column with finite iterations.

 
solver worksheet
  We have chosen to represent the transition probabilities in a matrix. The beginning of the transition probability matrix is shown in the three columns on the right of the figure below. There are 20 columns in this matrix, one for each state, and 110 rows, one for each decision.
solver decisions

 

Solution

 
iteration dialog This is a finite horizon problem since the gambler plays a fixed number of times. We choose to evaluate 15 plays. With finite horizon, only the value method of DP analysis is appropriate. The Policy method is only useful for infinite horizon problems. We choose to maximize the total value.
show dialog

A finite horizon problem should occasion the need to show results from the individual steps of the solution. Otherwise only the final values remain on the DP worksheet at the end of the iteration steps. When the Show All Iterations from the iteration form, the dialog at the left directs the set of information to show on the States worksheet. Here we choose to show the action name and the state values.

The display is automatically cleared when the Initialize buttons are clicked on the iteration dialog.

  After 15 iterations the process stops and the results of the final iteration are displayed on the solver worksheet. Recall the DP problems are solved backward in time. The results show the optimum solution 15 plays before the end of the game. The optimum action is to bet $1 for all wealth values between 2 and 19 except when the wealth is 11. There we bet $2. At the wealths of 1 and 20, no more bets occur.
iteration 15
 

Ross solves the analytic version of this problem for continuous state and action space. The single state variable, the gambler's wealth is nonnegative and unbounded from above. The optimum solution is to place no bets if the probability of winning is 0.5 or less. When the probability of winning is more than 0.5, the optimum bet is:

Bet = Wealth*(p - q)

where p is the probability of winning and q is the probability of losing.

This solution is valid for every stage. For this example, the value of p is 0.6 and the value of q is 0.4, so the optimal bet is 20% of the wealth. The optimal bets for the range in wealth considered in this example are listed below. It is important to note that the analytic solution does not limit the total wealth to a finite value. Thus we see the size of the bet increasing proportionally to the wealth.

bet

In contrast the finite policy obtained with the add-in is restricts wealth to a maximum of 20 and bets in integer values. We review the decisions as the steps go backward from the final reward. Step 1 is one step before the gambler receives his payment. The restricted bets in step 1 follow roughly the optimum for the first 15 states, but rather than increasing for the higher states, the bets decrease. The restricted problem does not allow bets that might result in a wealth higher than 20. As the number of steps to the end increase, the effect of the boundary becomes more and more obvious even at the lower wealth values.

iteration 1_4
 

The effect of the boundaries become more pronounced as we step backward in time. The final strategy is to bet only $1 in states 1 through 19, except in state 11 where we bet $2.

steps 11 to 15
  This effect is common when approximating continuous, unbounded state and action space by the discretized, bounded model necessary for computation. The analytic solution is powerful but limited. It provides an simple, optimum policy good for all time. It is limited because it only works for the LOG gambler. If the value of the final wealth is other than ln(wealth), the solution is not valid.

 

The Effect of a Smaller State and Action Interval

  The DP Model add-in has a feature that allows non-integral step sizes. A second version of the gambler model is constructed with the Include Step Size box checked.
 
  The model is constructed with a new row that specifies a step size, row 21. For the state and action element we use a step size of 0.25 and modify the bounds to reflect the new step size. The step for event remains the same, 1, because it is an indicator variable for win and lose.
step elements
  The state and action lists are four times longer than the corresponding lists with a step size of 1. The first few values are shown below.
  The DP solution was obtained as before. We show the first step back from the final payout and the 15th step. The optimum continuous solution the optimal bet is 20% of the wealth. This holds pretty well for the first step, but not so well for the 15th step. The effects of the boundary on the finite, discrete model becomes more and more apparent as the time horizon increases.
 
step results

 

Summary

  The Log gambler problem is interesting because it is a finite horizon problem where the only evaluation is the wealth at the final state. It is also interesting because there is a simple analytical solution. For computation, it is necessary to discretize the state and action spaces and include finite ranges. As seen by this example, the limiting boundaries do effect the approximate solution. To reduce the effects, use smaller step sizes and large upper bounds for the computational model. Of course that increases the size of the model in terms of states and decisions.
 
  
Return to Top

tree roots

Operations Research Models and Methods
Internet
by Paul A. Jensen
Copyright 2004 - All rights reserved