Interactive visualization of a Deep Q-Network, showing how neural networks learn to make decisions through reinforcement learning.
Learns an action-value function Q(s, a) that gives the expected return of taking action a in state s
Approximates the Q-function with a neural network with weights θ, overcoming the scalability limits of tabular methods
Stores transitions (s, a, r, s′) in a memory buffer and samples from it to break correlations between consecutive samples (see the sketch below)
A separate network with weights θ⁻ that is only periodically updated from the online weights θ, which stabilizes learning
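To make the function-approximation, experience-replay, and target-network pieces concrete, here is a minimal sketch in PyTorch. The class names (QNetwork, ReplayBuffer), the small MLP architecture, and the placeholder dimensions are illustrative assumptions; the original DQN used a convolutional network over stacked Atari frames.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a; θ): input is a state, output is one Q-value per action."""

    def __init__(self, state_dim, n_actions):
        super().__init__()
        # Small MLP for illustration; the original DQN used a CNN over image frames.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

class ReplayBuffer:
    """Fixed-capacity memory D storing transitions (s, a, r, s', done)."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks correlations between consecutive transitions.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

# Target network Q̂: a second copy of the weights (θ⁻) that is only refreshed every C steps.
q_net = QNetwork(state_dim=4, n_actions=2)       # dimensions are placeholders
target_net = QNetwork(state_dim=4, n_actions=2)
target_net.load_state_dict(q_net.state_dict())   # θ⁻ ← θ
```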
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
for episode = 1, M do
Initialize sequence s₁ = {x₁} and preprocessed sequence φ₁ = φ(s₁)
for t = 1, T do
With probability ε select a random action aₜ
otherwise select aₜ = argmaxₐ Q(φ(sₜ), a; θ)
Execute action aₜ in emulator and observe reward rₜ and image xₜ₊₁
Set sₜ₊₁ = sₜ, aₜ, xₜ₊₁ and preprocess φₜ₊₁ = φ(sₜ₊₁)
Store transition (φₜ, aₜ, rₜ, φₜ₊₁) in D
Sample random minibatch of transitions (φⱼ, aⱼ, rⱼ, φⱼ₊₁) from D
Set yⱼ = rⱼ if episode terminates at step j+1, otherwise yⱼ = rⱼ + γ maxₐ′ Q̂(φⱼ₊₁, a′; θ⁻)
Perform a gradient descent step on (yⱼ − Q(φⱼ, aⱼ; θ))² with respect to the network parameters θ
Every C steps reset Q̂ = Q
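The inner loop of this pseudocode maps onto code roughly as follows, continuing the sketch above (q_net, target_net, ReplayBuffer). The Gymnasium-style environment, the hyperparameter values, and the use of Adam (the original paper used RMSProp) are assumptions for illustration, not part of the algorithm listing.

```python
import random

import numpy as np
import torch
import torch.nn.functional as F

gamma, epsilon, batch_size, sync_every = 0.99, 0.1, 32, 1000   # illustrative values
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def select_action(state, n_actions):
    """ε-greedy: random action with probability ε, otherwise argmaxₐ Q(s, a; θ)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())

def train_step(batch):
    """One gradient descent step on (yⱼ − Q(φⱼ, aⱼ; θ))²."""
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # yⱼ = rⱼ for terminal transitions, else rⱼ + γ maxₐ′ Q̂(φⱼ₊₁, a′; θ⁻)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps, reset Q̂ = Q by copying the online weights into the target network:
#   target_net.load_state_dict(q_net.state_dict())
```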