Optimizing Jack's Car Rental

While enjoying an excellent book, “Reinforcement Learning” by Richard Sutton and Andrew Barto, I came across a toy problem on page 81 that drew my attention. It is titled “Jack’s Car Rental” and goes as follows:
Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited 10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of 2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables, meaning that the probability that the number
is $n$ is $\frac{\lambda^n}{n!} e^{-\lambda}$, where $\lambda$ is the expected number. Suppose $\lambda$ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. We take the discount rate to be $\gamma = 0.9$ and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight.
The main objective of the problem is to determine an optimal policy - the number of cars to be transferred from location 2 to location 1 (or vice versa) in each state - in order to maximise the expected profit. Below I would like to share my solution.
Let's start by introducing some mathematical notation to describe Jack's daily routine:
- At the end of the day Jack counts $N_1$ cars at the first location (P1) and $N_2$ cars at the second (P2), so the state is $(N_1, N_2)$.
- Next, Jack decides to take an action and move $a$ cars from P1 to P2 (a negative $a$ means moving cars from P2 to P1), which gets him to the car arrangement $(N_1'', N_2'') = (N_1 - a, N_2 + a)$ at the start of the next day.
- At the end of the next day, at each location $i$ ($i = 1, 2$), $x_i$ cars were rented out and $y_i$ were returned. The new state at the end of the next day is

$$(N_1', N_2'), \qquad N_i' = \min\bigl(N_i'' - x_i + y_i,\ 20\bigr).$$

The reward (profit) at the end of the next day is

$$r = 10\,(x_1 + x_2) - 2\,|a|,$$

where $x_i$ cannot exceed the $N_i''$ cars available at location $i$ that morning.
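To make this bookkeeping concrete, here is a tiny helper that plays out a single day under these rules. It is only an illustration I added; the function one_day and the sample request/return numbers are not part of the original solution.

def one_day(N1, N2, a, requests, returns, n_max=20):
    """Apply Jack's overnight transfer, then one day of rentals and returns.
    Returns the new end-of-day state and the day's profit."""
    # Overnight: move a cars from P1 to P2 (negative a moves P2 -> P1), at 2 per car
    N1, N2 = N1 - a, N2 + a
    profit = -2 * abs(a)
    new_state = []
    for n, req, ret in zip((N1, N2), requests, returns):
        rented = min(req, n)                             # cannot rent cars that are not on the lot
        profit += 10 * rented
        new_state.append(min(n - rented + ret, n_max))   # excess cars disappear
    return tuple(new_state), profit

# Example: start with (8, 4), move 2 cars to P2, then 5 & 3 requests and 2 & 1 returns
print(one_day(8, 4, 2, requests=(5, 3), returns=(2, 1)))   # ((3, 4), 76)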
Policy and Value
For the sake of consistency, let's briefly review the key quantities we are looking for: the policy and the value. The policy $\pi$ determines which action $a = \pi(N_1, N_2)$ Jack takes in each state $(N_1, N_2)$.
Intuitively it is clear that the agent seeks the optimal policy that maximises the reward. But which reward is it? The immediate reward for the next day, or a combined reward for the next 10 years? Following a “long-term greedy” principle, we seek to maximise the expected discounted total reward over all days that follow,

$$v_\pi(N_1, N_2) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \;\middle|\; (N_1, N_2)\right],$$

which is called the value of the state $(N_1, N_2)$ under the policy $\pi$.
The value that corresponds to an optimal policy satisfies the Bellman optimality equation (see page 63 of the “Reinforcement Learning” book). It is a non-linear equation, which can be solved using an iterative algorithm: at every iteration both the value and the policy approach their optima.
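Written out in the notation introduced above, the Bellman optimality equation for this problem reads

$$v_*(N_1, N_2) = \max_{a} \sum_{N_1', N_2'} p\bigl(N_1', N_2' \mid N_1, N_2, a\bigr)\Bigl[r\bigl(N_1, N_2, a, N_1', N_2'\bigr) + \gamma\, v_*(N_1', N_2')\Bigr],$$

where the maximum is taken over all admissible transfers $a \in \{-5, \dots, 5\}$, $p$ is the state transition probability and $r$ is the expected reward (both defined below).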
Evaluate and Improve
Consider the iterative algorithm for optimising the car transfer policy on page 80. The algorithm evaluates the value function under the current policy,

$$v(N_1, N_2) \leftarrow \sum_{N_1', N_2'} p\bigl(N_1', N_2' \mid N_1, N_2, \pi(N_1, N_2)\bigr)\Bigl[r\bigl(N_1, N_2, \pi(N_1, N_2), N_1', N_2'\bigr) + \gamma\, v(N_1', N_2')\Bigr],$$

and then uses it to improve the policy,

$$\pi(N_1, N_2) \leftarrow \operatorname*{arg\,max}_{a} \sum_{N_1', N_2'} p\bigl(N_1', N_2' \mid N_1, N_2, a\bigr)\Bigl[r\bigl(N_1, N_2, a, N_1', N_2'\bigr) + \gamma\, v(N_1', N_2')\Bigr].$$

Here $p(N_1', N_2' \mid N_1, N_2, a)$ is the probability of moving from state $(N_1, N_2)$ to state $(N_1', N_2')$ when $a$ cars are transferred overnight, and $r(N_1, N_2, a, N_1', N_2')$ is the corresponding expected reward.
Since there are 11 different car-moving options (up to 5 cars in either direction) and both states $(N_1, N_2)$ and $(N_1', N_2')$ take $21 \times 21 = 441$ values, storing $p$ and $r$ naively would require 5D tensors with $441 \times 11 \times 441 \approx 2.1$ million entries each.
Looking at the two updates above, a couple of simplifications come to mind.
Firstly, notice that the right-hand sides of both equations are mostly the same: they depend on $(N_1, N_2)$ and $a$ only through the morning arrangement $(N_1'', N_2'') = (N_1 - a, N_2 + a)$ and the transfer cost, and can be simplified to

$$\mathrm{RHS}(N_1'', N_2'') - 2\,|a|,$$

where

$$\mathrm{RHS}(N_1'', N_2'') = \bar r(N_1'', N_2'') + \gamma \sum_{N_1', N_2'} p\bigl(N_1', N_2' \mid N_1'', N_2''\bigr)\, v(N_1', N_2')$$

and $\bar r(N_1'', N_2'')$ is the expected rental income for a day that starts with $N_1''$ and $N_2''$ cars at the two locations.
Expected reward and state transition probability
The expected reward tensor therefore reduces to the expected rental income for each morning arrangement,

$$\bar r(N_1'', N_2'') = 10 \sum_{i=1,2} \mathbb{E}\bigl[\min(x_i, N_i'')\bigr] = 10 \sum_{i=1,2} \left[\sum_{k=0}^{N_i''-1} k\, P_{\lambda_i}(k) + N_i'' \sum_{k \ge N_i''} P_{\lambda_i}(k)\right],$$

where

$$P_\lambda(k) = \frac{\lambda^k}{k!}\, e^{-\lambda}$$

is the probability of $k$ rental requests at a location with expected number of requests $\lambda$.
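As a quick numerical illustration of this formula, the expectation $\mathbb{E}[\min(x_i, N_i'')]$ can be evaluated with scipy. The helper expected_rentals is something I added for illustration only (poisson.pmf and poisson.sf are standard scipy.stats calls):

import numpy as np
from scipy.stats import poisson

def expected_rentals(n_cars, lam):
    """E[min(requests, n_cars)] when requests ~ Poisson(lam)."""
    k = np.arange(n_cars)                       # 0, 1, ..., n_cars - 1
    pmf = poisson.pmf(k, lam)
    # Requests below n_cars contribute k rentals; larger requests saturate at n_cars.
    return np.sum(k * pmf) + n_cars * poisson.sf(n_cars - 1, lam)

# With 5 cars on the lot and an average of 3 requests per day:
print(expected_rentals(5, 3))   # ~2.86 expected rentals, i.e. ~28.6 in expected income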
Next, let's further simplify the transition probability. Like the first two terms in the above equation for the expected reward, the transition probability depends on the two locations independently, so it factorises:

$$p\bigl(N_1', N_2' \mid N_1'', N_2''\bigr) = p_1\bigl(N_1' \mid N_1''\bigr)\, p_2\bigl(N_2' \mid N_2''\bigr).$$

For a single location, the probability of ending the day with $N'$ cars given that it started with $N''$ is

$$p_i\bigl(N' \mid N''\bigr) = \sum_{x \ge 0} \sum_{y \ge 0} P_{\lambda_i}(x)\, P_{\mu_i}(y)\, \bigl[\min\bigl(N'' - \min(x, N'') + y,\ 20\bigr) = N'\bigr],$$

where $P_{\mu_i}(y)$ is the Poisson probability of $y$ returned cars with mean $\mu_i$, and the bracket $[\,\cdot\,]$ equals 1 when the condition inside holds and 0 otherwise.
Therefore, the 5D tensor of transition probabilities never has to be formed explicitly: it factorises into two $21 \times 21$ matrices, $P_1$ and $P_2$, one per location.
Implementation
The maths above lets us write both the expected reward and the transition probability in a compact form,

$$\bar r(N_1'', N_2'') = \bar r_1(N_1'') + \bar r_2(N_2''), \qquad p\bigl(N_1', N_2' \mid N_1'', N_2''\bigr) = (P_1)_{N_1' N_1''}\, (P_2)_{N_2' N_2''}.$$

This enables us to write the right-hand side of the Bellman equation as a matrix expression,

$$\mathrm{RHS} = \bar R + \gamma\, P_1^{\mathsf T}\, V\, P_2,$$

where the matrix $\bar R$, with entries $\bar R_{N_1'' N_2''} = \bar r(N_1'', N_2'')$, holds the expected rental income, the matrix $V$ holds the current value estimate $v(N_1', N_2')$, and $P_1$, $P_2$ are the per-location transition matrices with entries $(P_i)_{N' N''} = p_i(N' \mid N'')$.
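As a quick, self-contained check that the matrix expression reproduces the explicit double sum, one can compare the two on random stand-in matrices. This check is my own addition; P1_chk, P2_chk, R_chk and V_chk only mimic the shapes of the real quantities, which are built below.

import numpy as np

rng = np.random.default_rng(0)
n = 21
# Column-stochastic stand-ins for P1 and P2: entry [i, j] plays the role of p(i | j)
P1_chk = rng.random((n, n)); P1_chk /= P1_chk.sum(axis=0, keepdims=True)
P2_chk = rng.random((n, n)); P2_chk /= P2_chk.sum(axis=0, keepdims=True)
R_chk = rng.random((n, n))   # stand-in for the expected reward matrix
V_chk = rng.random((n, n))   # stand-in for the current value estimate
gamma = 0.9

# Matrix form of the right-hand side
rhs_matrix = R_chk + gamma * P1_chk.T.dot(V_chk).dot(P2_chk)

# Explicit double sum over the next-day state (N1', N2')
rhs_loops = np.zeros((n, n))
for N1pp in range(n):            # N1'' (morning count at P1)
    for N2pp in range(n):        # N2'' (morning count at P2)
        total = sum(P1_chk[N1p, N1pp] * P2_chk[N2p, N2pp] * V_chk[N1p, N2p]
                    for N1p in range(n) for N2p in range(n))
        rhs_loops[N1pp, N2pp] = R_chk[N1pp, N2pp] + gamma * total

assert np.allclose(rhs_matrix, rhs_loops)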
Here is a list of the parameters given by the problem, together with the numpy import used throughout
import numpy as np

# Maximum number of cars at each location
N_MAX_CARS = 20
# Maximum number of cars that can be transferred by Jack
A_MAX = 5
# Mean number of cars rented out at each location
LAMBDA = [3, 4]
# Mean number of cars returned to each location
MU = [3, 2]
# Discount rate
GAMMA = 0.9
# Reward for each car Jack decides to transfer
R_A = -2
# Reward for each car that gets rented
R_X = 10
The right-hand-side matrix, RHS_N__ in the code, is calculated as

RHS_N__ = R_N__ + GAMMA * P1.T.dot(value).dot(P2)

where R_N__ holds the expected rental income $\bar r(N_1'', N_2'')$, P1 ($P_1$) and P2 ($P_2$) are the $21 \times 21$ per-location transition matrices with entries $(P_i)_{N' N''} = p_i(N' \mid N'')$, and value is the current estimate of $v(N_1', N_2')$.
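The post does not show how R_N__, P1 and P2 are computed, so here is a minimal sketch of one way to build them from the parameters above. The helper rental_matrices and the Poisson cutoff of 30 requests/returns per day are my own choices, not part of the original code; scipy is assumed to be available.

from scipy.stats import poisson

def rental_matrices(lam, mu, n_max=N_MAX_CARS, cutoff=30):
    """Expected rental income r[n] for a morning stock of n cars, and the
    transition matrix P[n_new, n] = p(n_new | n) for a single location."""
    k = np.arange(cutoff + 1)
    p_req = poisson.pmf(k, lam)      # probability of k rental requests
    p_ret = poisson.pmf(k, mu)       # probability of k returns
    r = np.zeros(n_max + 1)
    P = np.zeros((n_max + 1, n_max + 1))
    for n in range(n_max + 1):
        rented = np.minimum(k, n)    # cannot rent out more cars than are on the lot
        r[n] = R_X * np.sum(p_req * rented)
        for x, px in zip(k, p_req):
            for y, py in zip(k, p_ret):
                n_new = min(n - min(x, n) + y, n_max)   # extra cars disappear
                P[n_new, n] += px * py
    return r, P

r1, P1 = rental_matrices(LAMBDA[0], MU[0])
r2, P2 = rental_matrices(LAMBDA[1], MU[1])
# Expected rental income for every morning arrangement (N1'', N2'')
R_N__ = r1[:, None] + r2[None, :]

A quick sanity check is that every column of P1 and P2 sums to (almost) 1; the small deficit comes from truncating the Poisson tails at the cutoff.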
The Value Iteration code looks like this
def IterValue(value, policy, toll=1e-3, max_iter=100):
    """
    Iterate the value until the maximum difference between successive
    iterations is under a tolerance threshold.
    """
    delta = 1
    it = 0
    while delta > toll and it < max_iter:
        it += 1
        value_new = np.zeros(value.shape)
        # Combine all matrices in (N1__, N2__) on the right-hand side of
        # the Bellman equation
        RHS_N__ = R_N__ + GAMMA * P1.T.dot(value).dot(P2)
        for N1 in range(value.shape[0]):
            for N2 in range(value.shape[1]):
                a = policy[N1, N2]
                N1__ = N1 - a
                N2__ = N2 + a
                if N1__ < 0 or N1__ > N_MAX_CARS: continue
                if N2__ < 0 or N2__ > N_MAX_CARS: continue
                value_new[N1, N2] = RHS_N__[N1__, N2__] + R_A * abs(a)
        delta = np.max(np.abs(value - value_new))
        value = value_new
    if delta <= toll:
        print(f"Value iteration converged in {it} iterations.")
    else:
        print(f"Too many iterations. Final delta --- {delta}.")
    return value
Using Value and Policy arrays initialized with zeros, it can be run like so
Value = np.zeros((N_MAX_CARS + 1, N_MAX_CARS + 1))
Policy = np.zeros((N_MAX_CARS + 1, N_MAX_CARS + 1), dtype=np.int16)
Value = IterValue(Value, Policy)
and takes about 0.25 seconds to run 100 iterations on my desktop. Next, let's consider the policy improvement code
def ImprovePolicy(value, policy):
    """
    Apply the policy improvement part of the algorithm.
    """
    # Combine all matrices in (N1__, N2__) on the right-hand side of
    # the Bellman equation
    RHS_N__ = R_N__ + GAMMA * P1.T.dot(value).dot(P2)
    for N1 in range(value.shape[0]):
        for N2 in range(value.shape[1]):
            v_max = -np.inf
            a_max = 0
            for a in range(-A_MAX, A_MAX + 1):
                N1__ = N1 - a
                N2__ = N2 + a
                if N1__ < 0 or N1__ > N_MAX_CARS: continue
                if N2__ < 0 or N2__ > N_MAX_CARS: continue
                v_candidate = RHS_N__[N1__, N2__] + R_A * abs(a)
                if v_candidate > v_max:
                    v_max = v_candidate
                    a_max = a
            policy[N1, N2] = a_max
    return policy
Running both stages sequentially gives a single iteration of the Evaluate and Improve algorithm:
Value = IterValue(Value, Policy, toll=1e-3)
Policy = ImprovePolicy(Value, Policy)
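The post runs these two stages by hand; a complete driver that repeats them until the policy stops changing might look like the sketch below (the policy-stability check and the iteration cap are my own additions):

Value = np.zeros((N_MAX_CARS + 1, N_MAX_CARS + 1))
Policy = np.zeros((N_MAX_CARS + 1, N_MAX_CARS + 1), dtype=np.int16)

for i in range(20):                              # generous cap on Evaluate and Improve rounds
    Value = IterValue(Value, Policy, toll=1e-3)  # policy evaluation
    old_policy = Policy.copy()
    Policy = ImprovePolicy(Value, Policy)        # policy improvement
    if np.array_equal(Policy, old_policy):       # policy is stable, so it is optimal
        print(f"Policy converged after {i + 1} Evaluate and Improve iterations.")
        break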
Results
The algorithm converges in 5 iterations. The result for Policy looks like this
Jack can use the above map much like a chart of Texas Hold'em starting hands. The cell that corresponds to the number of cars at the end of the day at each location gives the number of cars to transfer from P1 to P2 in order to achieve the optimal expected Value.
Under this optimal Policy the Value that Jack should expect to get looks like this:
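For completeness, here is a minimal matplotlib sketch that could be used to draw the Policy map and the Value surface (the plotting code is my own addition, not from the original post):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

# Optimal policy: cars to move from P1 to P2 for each end-of-day state (N1, N2)
im1 = ax1.imshow(Policy, origin="lower", cmap="coolwarm")
ax1.set_xlabel("Cars at P2"); ax1.set_ylabel("Cars at P1"); ax1.set_title("Policy")
fig.colorbar(im1, ax=ax1)

# Expected discounted profit under that policy
im2 = ax2.imshow(Value, origin="lower", cmap="viridis")
ax2.set_xlabel("Cars at P2"); ax2.set_ylabel("Cars at P1"); ax2.set_title("Value")
fig.colorbar(im2, ax=ax2)

plt.tight_layout()
plt.show()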