Reinforcement Learning For Adaptive Traffic Signal Control With Limited Information
Machine Learning (CS 229) Final Project, Fall 2015
Jeff Glick (jdglick@stanford.edu), M.S. Candidate, Department of Management Science and Engineering, Stanford University
Recent research has effectively applied reinforcement learning to
adaptive traffic signal control problems. Learning agents perform
best when they can observe the state of the environment accurately
enough to choose the right action for each state, and most studies
have therefore given the agent perfect state information. In a
real-life deployment, however, that level of observability would
require extensive physical instrumentation. This study explores
training a control agent that has access only to information from a
limited number of vehicles, obtained from cell phone geo-location
data, and compares its performance against a legacy fixed phase
timing policy and against regimes where the agent has access to
perfect information.
Algorithm & Key Parameters
[Figure: Reinforcement Learning Cycle]
• Q-learning develops a quality value Q(s, a) for each state-action pair, which is an estimate of the true value Q*(s, a)
• Continuous asynchronous updating for on-line learning; convergence assumes every state is visited infinitely often
• Q-learning update: Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]
• States: discretized state space; roughly 25,000 states for this problem
• Learning rate α: initially α = 1, which ignores previous experience; as α → 0, previous experience is weighted more heavily
• Discount factor γ: used to prevent myopic decision making
• Control policy: given a state s, try action a with probability proportional to exp(Q(s, a)/τ) (softmax distribution), where τ controls exploration; if τ is large, actions are chosen with nearly equal probability; as τ → 0, the policy becomes deterministic and chooses argmax_a Q(s, a) with probability 1
• Reward r: determined by the objective function (a short sketch of the update and policy follows this list)
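A minimal tabular sketch of the update rule and softmax policy described above, assuming a dictionary Q-table and two actions (0 = continue the current phase, 1 = transition, matching the controller's action definition); the parameter values are illustrative rather than the ones used in the project:

import math
import random
from collections import defaultdict

Q = defaultdict(float)             # Q[(state, action)] -> estimated value
alpha, gamma, tau = 1.0, 0.9, 1.0  # learning rate, discount factor, exploration temperature
ACTIONS = (0, 1)                   # 0 = continue current phase, 1 = transition

def q_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def softmax_action(s):
    # Try action a with probability proportional to exp(Q(s,a) / tau)
    weights = [math.exp(Q[(s, a)] / tau) for a in ACTIONS]
    total = sum(weights)
    u, cumulative = random.random() * total, 0.0
    for a, w in zip(ACTIONS, weights):
        cumulative += w
        if u <= cumulative:
            return a
    return ACTIONS[-1]

As τ shrinks toward 0 the exponentials concentrate on the best action, recovering the deterministic argmax policy noted above.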
Simulation Build & Data Generation
Simulation Setup:
• OpenStreetMap data and the Java OpenStreetMap Editor
• Simulation of Urban Mobility (SUMO)
• Realistic, variable arrival and turn rates for a single 8-phase, 4-way intersection (arrivalGen.py)
Simulation Architecture & Learning Pipeline
map.osm
• Prepared in Java OSM Editor
in.net.xml
• Lanes, intersections, lights, speed limits
• Generated from map.osm with SUMO NETCONVERT
in.add.xml
• Induction Loops
• Misc. simulation inputs
in.rou.xml
• Vehicle routes & times
arrivalGen.py
• Fit polynomial arrival rate functions to synthetic data
• Generate random vehicle arrival schedule
• Tag selected vehicles if GEOLOCATION is ON (~30%)
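A hedged sketch of the kind of schedule arrivalGen.py produces: vehicle arrival times drawn from a non-homogeneous Poisson process with a polynomial rate (via thinning), with roughly 30% of vehicles tagged as geolocated. The coefficients, horizon, and field names below are illustrative, not the script's actual values:

import random

COEFFS = [0.05, 4e-5, -1e-8]   # illustrative polynomial rate lambda(t), vehicles/second
HORIZON = 3600                 # one simulated hour
GEO_SHARE = 0.30               # share of vehicles tagged when GEOLOCATION is ON

def rate(t):
    return max(sum(c * t ** k for k, c in enumerate(COEFFS)), 0.0)

lam_max = max(rate(t) for t in range(HORIZON + 1))

arrivals, t = [], 0.0
while True:
    # Thinning: propose candidate times at the peak rate, accept with prob rate(t)/lam_max
    t += random.expovariate(lam_max)
    if t >= HORIZON:
        break
    if random.random() < rate(t) / lam_max:
        arrivals.append({"depart": round(t, 1),
                         "geolocated": random.random() < GEO_SHARE})

print(f"{len(arrivals)} vehicles generated, "
      f"{sum(v['geolocated'] for v in arrivals)} geolocated")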
palm.sumocfg
• Simulation control file
• Run in SUMO GUI or Command Line
out.queue.xml
• Lane queue sizes at time t
out.fcd.xml
• Vehicle status at time t
out.full.xml
• Full simulation output (lane throughput, occupancy)
out.ind.xml
• Induction loop counts and status at time t
parseFull.py
• Validate simulation
• Analyze & visualize output
• Assess performance of the learning algorithm & adjust tuning parameters
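A hedged sketch of the kind of post-processing parseFull.py performs, here reading a SUMO queue-output file and reporting the mean queue length per lane; the attribute names (timestep, queueing_length) follow SUMO's queue export but may differ by version, so treat them as assumptions:

import xml.etree.ElementTree as ET
from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)

# Stream out.queue.xml rather than loading the whole file into memory.
for _, elem in ET.iterparse("out.queue.xml", events=("end",)):
    if elem.tag == "lane":
        lane_id = elem.get("id")
        totals[lane_id] += float(elem.get("queueing_length", 0.0))
        counts[lane_id] += 1
        elem.clear()

for lane_id in sorted(totals):
    print(f"{lane_id}: mean queue length {totals[lane_id] / counts[lane_id]:.1f}")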
controller.py (CLIENT)
• Decide light phase changes
• Collect reward based on the objective function
• Learn the optimal policy via Q-learning
• Control the stop light through the SUMO server via the Traffic Control Interface (TraCI) API
detectState.py (CLIENT)
• Maximum likelihood estimates of non-homogeneous arrival rates, queue sizes & waiting times
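A hedged sketch of the state-estimation idea behind detectState.py: when only a known fraction of vehicles report geolocation (about 30% here), the maximum likelihood estimate of a piecewise-constant arrival rate is the observed count scaled up by the penetration rate and the interval length. The function name and numbers are illustrative:

def estimate_arrival_rates(observed_counts, interval_s=60.0, penetration=0.30):
    # MLE for a Poisson count c observed over interval_s seconds when only a
    # fraction `penetration` of vehicles is visible: lambda_hat = c / (penetration * interval_s)
    return [c / (penetration * interval_s) for c in observed_counts]

# Counts of geolocated vehicles seen on one approach in successive 60 s windows.
print(estimate_arrival_rates([3, 5, 8, 6]))   # about [0.17, 0.28, 0.44, 0.33] veh/s

Queue sizes and waiting times would be estimated in the same spirit, scaling up what the observed subset of vehicles reveals.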
in.set.xml
• SUMO GUI settings
Action: every s seconds, the controller either transitions to the next phase (1) or continues the current phase (0)
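A hedged sketch of the controller.py loop over the TraCI API. The calls shown (traci.start, simulationStep, lane.getLastStepHaltingNumber, trafficlight.getPhase/setPhase) exist in SUMO's Python client (older releases name the module traci.trafficlights), but the traffic-light ID, lane IDs, decision interval, and stand-in objective are assumptions, and a fixed heuristic stands in for the learned softmax policy so the sketch is self-contained:

import traci

DECISION_INTERVAL = 5                              # "every s seconds" in the action definition
TLS_ID = "center"                                  # hypothetical traffic-light ID
LANES = ["n_in_0", "s_in_0", "e_in_0", "w_in_0"]   # hypothetical approach lanes

traci.start(["sumo", "-c", "palm.sumocfg"])        # SUMO acts as the server
try:
    while traci.simulation.getMinExpectedNumber() > 0:
        for _ in range(DECISION_INTERVAL):
            traci.simulationStep()
        queues = [traci.lane.getLastStepHaltingNumber(l) for l in LANES]
        reward = -sum(queues)                      # stand-in objective: penalize queued vehicles
        # The learned policy (softmax over Q-values of the discretized state) would
        # choose the action here; a threshold heuristic stands in for illustration.
        action = 1 if max(queues) > 5 else 0       # 1 = transition, 0 = continue current phase
        if action == 1:
            current = traci.trafficlight.getPhase(TLS_ID)
            traci.trafficlight.setPhase(TLS_ID, (current + 1) % 8)   # 8-phase program
finally:
    traci.close()

In the limited-information setting, the raw lane queries above would presumably be replaced by the estimates produced by detectState.py.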
Initial Results
• Validated traffic dynamics; selected bucket thresholds for discretizing queue sizes and waiting times
• Queues blowing up
• Learning rate shrinking too quickly
• Crude discretization (most of the ~25,000 states never visited)
• Challenges with volatility
• Reward should equal the change in the objective function (reward improvement); see the sketch at the end of these results
• Throttled the learning rate; system still performing better during off-peak
• Some improvement from increasing bucket thresholds and delaying the progression of the learning rate
• Increased τ (important when rewards are negative)
• Still performance issues; the MDP assumption may not hold
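A hedged sketch of the two adjustments discussed in these results: coarser bucket thresholds for discretizing queue sizes and waiting times, and a reward defined as the change in the objective function rather than its raw value. The threshold values are illustrative, not the ones chosen for the project:

import bisect

QUEUE_BUCKETS = [2, 5, 10, 20]   # vehicles; illustrative thresholds
WAIT_BUCKETS = [15, 60, 180]     # seconds; illustrative thresholds

def bucketize(value, thresholds):
    # Map a continuous measurement to a small bucket index (0 .. len(thresholds))
    return bisect.bisect_right(thresholds, value)

class DeltaReward:
    # Reward = improvement in the objective since the previous decision point.
    def __init__(self):
        self.previous = None

    def __call__(self, objective):
        r = 0.0 if self.previous is None else objective - self.previous
        self.previous = objective
        return r

# Example: a queue of 7 vehicles falls in bucket 2, a 90 s wait in bucket 2.
print(bucketize(7, QUEUE_BUCKETS), bucketize(90, WAIT_BUCKETS))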
Next Steps
• Continue to experiment with the learning strategy, parameters, and objective function; improve the discretization
• Work on the state detection problem (limited information); learn arrival rates or include hour of day in the state space
• Change the arrival rate dynamics to test the robustness of the process
Acknowledgements: Michael Bennon (Stanford Global Projects Center), Allen Huang (CS229 Student), Jesiska Tandy (CS229 Student)