General
What is TD?
Temporal difference (TD) learning relates to our prediction of the value at a state $x$ given our knowledge of the Bellman structure.
The TD target is that prediction: an estimate of the value built from the Bellman structure and an observed transition.
So if we are trying to learn the value function for a specific policy, we know that
$$
V(x_k) = \mathbb{E}[r(x_k, u_k) + \lambda V(x_{k+1})]
$$
So we can say the TD target is
$$
R_k + \lambda V_t(x_{k+1})
$$
where $R_k$ is the observed reward sample. The reason we do this is to move from an expectation to a sample. If we know the distribution of future states, as we do when we have a model of the environment, we can just use value or policy iteration, where we compute the actual expectation and substitute it in. In TD learning we are essentially taking a Monte Carlo estimate of this expectation by observing the environment transitions.
and the TD error is the target minus our current estimate
$$
R_k + \lambda V_t(x_{k+1}) - V_t(x_k)
$$
Then we learn by updating $V$ toward the TD target with some learning rate $\alpha$.
$$
V_{t+1}(x_k) = V_t(x_k) + \alpha [R_k + \lambda V_t(x_{k+1}) - V_t(x_k)]
$$
or equivalently
$$
V_{t+1}(x_k) = (1-\alpha) V_t(x_k)+ \alpha [R_k + \lambda V_t(x_{k+1})]
$$
In general that is TD learning.
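As a minimal sketch of tabular TD(0) policy evaluation (the environment interface, `policy` function, and hyperparameters below are hypothetical placeholders):

```python
import numpy as np

def td0_evaluate(env, policy, n_states, n_steps=10_000, alpha=0.1, lam=0.99):
    """Tabular TD(0) policy evaluation.

    Hypothetical environment interface: env.reset() -> x and
    env.step(u) -> (x_next, r, done), with integer states in range(n_states);
    policy(x) returns an action. lam is the discount factor (lambda above).
    """
    V = np.zeros(n_states)
    x = env.reset()
    for _ in range(n_steps):
        u = policy(x)
        x_next, r, done = env.step(u)
        # TD target: a sampled version of the Bellman right-hand side
        target = r if done else r + lam * V[x_next]
        # move the estimate toward the target by alpha times the TD error
        V[x] += alpha * (target - V[x])
        x = env.reset() if done else x_next
    return V
```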
SARSA also fits into this TD learning idea: it applies the same update to the action-value function $Q$, bootstrapping from the next action actually taken by the policy (which is what makes it on-policy).
$$
Q_{t+1}(x_k, u_k) = (1-\alpha) Q_t(x_k, u_k) + \alpha [r(x_k, u_k) + \lambda Q_t(x_{k+1}, u_{k+1})]
$$
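A matching sketch of tabular SARSA, assuming the same hypothetical environment interface and an epsilon-greedy policy derived from $Q$:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_steps=10_000, alpha=0.1, lam=0.99, eps=0.1):
    """Tabular SARSA with epsilon-greedy exploration (hypothetical env interface)."""
    Q = np.zeros((n_states, n_actions))

    def act(x):
        # epsilon-greedy with respect to the current Q estimate
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[x]))

    x = env.reset()
    u = act(x)
    for _ in range(n_steps):
        x_next, r, done = env.step(u)
        u_next = act(x_next)
        # on-policy target: bootstrap from the action we will actually take next
        target = r if done else r + lam * Q[x_next, u_next]
        Q[x, u] += alpha * (target - Q[x, u])
        if done:
            x = env.reset()
            u = act(x)
        else:
            x, u = x_next, u_next
    return Q
```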
Algorithms and use-cases. For each algorithm, note:
On- vs off-policy
Model-based or model-free
Discrete vs continuous state and action spaces
General use-case statements
Value iteration
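To contrast with the sampled TD target above, here is a minimal value iteration sketch where the Bellman expectation is computed exactly from a known model; the two-state MDP below is made up for illustration:

```python
import numpy as np

# made-up 2-state, 2-action MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
lam = 0.9  # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: the expectation over next states is computed exactly from P
    Q = R + lam * (P @ V)      # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```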
Policy iteration
DDPG
Off-policy, model-free, requires a continuous action space; the state space can be discrete or continuous.
Use something like DDPG when we need high sample efficiency.
DDPG is very difficult to tune. If you are going to use DDPG, use TD3 or SAC instead. TD3 is essentially DDPG with stability fixes (twin critics, delayed policy updates, target policy smoothing). SAC is also off-policy, but more in line with PPO in terms of robustness and ease of tuning, while still requiring a continuous action space.
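For reference, a rough sketch of a single DDPG update step, assuming PyTorch; the network sizes and the stand-in batch are placeholders, and exploration noise, the replay buffer, and termination handling are omitted:

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim, lam, tau = 3, 1, 0.99, 0.005  # dimensions and hyperparameters are made up

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# stand-in for a replay-buffer batch of (state, action, reward, next state)
s, a = torch.randn(32, state_dim), torch.rand(32, action_dim) * 2 - 1
r, s2 = torch.randn(32, 1), torch.randn(32, state_dim)

# critic update: the TD target bootstraps with the *target* networks (off-policy)
with torch.no_grad():
    y = r + lam * target_critic(torch.cat([s2, target_actor(s2)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# actor update: push the policy toward actions the critic scores highly
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Polyak (soft) update of the target networks
for net, target in ((critic, target_critic), (actor, target_actor)):
    for p, tp in zip(net.parameters(), target.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
```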