A sequence of Observations, Actions, and Rewards that the Agent has seen so far:
Ht = A1, O1, R1, ..., At, Ot, Rt
All observable variables up to time t.
What happens next depends on the history:
Agent (our algorithm) selects Actions
Environment selects Observations/Rewards
State:
The full History is rarely used directly, since it is a very long stream of data.
State is the information used to determine what happens next.
State is a function of the History: St = f(Ht)
E.g., take only the last 3 observations.
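A short sketch of the St = f(Ht) example above; the history entries are made-up values:

```python
def last_k_observations(history, k=3):
    """f(Ht): reduce the (Action, Observation, Reward) stream to the
    k most recent Observations."""
    observations = [o for (_, o, _) in history]
    return tuple(observations[-k:])

history = [("left", 0.2, 0.0), ("right", -0.5, 1.0),
           ("left", 0.1, 0.0), ("right", 0.9, -1.0)]
print(last_k_observations(history))  # (-0.5, 0.1, 0.9): only the last 3 observations
```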
Environment/World State:
The state the Environment itself uses to determine how to generate the next Observation and Reward.
Usually not accessible to the Agent.
Even if visible, it may not contain any information useful to the Agent.
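A toy sketch of this separation: the environment's internal variables drive the next Observation and Reward, but only those two values ever reach the Agent. The class and its attributes are illustrative, not from the notes:

```python
import random

class ToyEnvironment:
    def __init__(self, seed=0):
        self._rng = random.Random(seed)  # environment state: hidden from the Agent
        self._hidden_level = 0           # environment state: hidden from the Agent

    def step(self, action):
        # The hidden state determines how the next Observation and Reward
        # are generated; the Agent never sees _hidden_level directly.
        self._hidden_level += 1 if action == "up" else -1
        observation = self._hidden_level + self._rng.gauss(0.0, 0.5)  # noisy view
        reward = 1.0 if self._hidden_level > 0 else -1.0
        return observation, reward  # only these reach the Agent

env = ToyEnvironment()
print(env.step("up"))  # Agent sees (Observation, Reward), not the hidden state
```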
Agent State:
The Agent's internal representation.
Information used by the Agent (our algorithm) to pick the next Action.
Can be any function of the History.
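One common way to realize such a function is a recurrent-style update, St = f(St-1, At, Ot, Rt), so the Agent never has to store the whole History. The running average of Observations below is a hypothetical choice of f, just for illustration:

```python
def update_agent_state(prev_state, action, observation, reward):
    """St = f(St-1, At, Ot, Rt): fold the newest step into the agent state."""
    mean, count = prev_state
    count += 1
    mean += (observation - mean) / count  # incremental average of Observations
    return (mean, count)

state = (0.0, 0)  # initial agent state
for a, o, r in [("left", 0.2, 0.0), ("right", -0.4, 1.0), ("left", 0.6, 0.0)]:
    state = update_agent_state(state, a, o, r)
print(state)  # approximately (0.133, 3)
```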
Markov Assumption:
Information State: the State used by the Agent is a sufficient statistic of the History. To predict the future, you only need the current state of the environment.
A State St is Markov if and only if:
P(St+1 | St, At) = P(St+1 | S1, S2, ..., St, At)
The future is independent of the past given the present.
E.g., the State defined by a Trading Algorithm used by HFT traders.
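A small simulation sketch of the Markov property on a +/-1 random walk, where the State is the current position. Because position is Markov, conditioning on extra past states should not change the estimated next-state distribution (the setup and names are illustrative):

```python
import random
from collections import Counter

rng = random.Random(0)

def sample_walk():
    """Return (S1, S2, S3) for a +/-1 random walk started at 0."""
    s1 = rng.choice([-1, 1])
    s2 = s1 + rng.choice([-1, 1])
    s3 = s2 + rng.choice([-1, 1])
    return s1, s2, s3

given_state, given_history = Counter(), Counter()
for _ in range(200_000):
    s1, s2, s3 = sample_walk()
    if s2 == 0:
        given_state[s3] += 1        # condition on the present only: S2 = 0
        if s1 == 1:
            given_history[s3] += 1  # condition on past and present: S1 = 1, S2 = 0

def normalize(counts):
    total = sum(counts.values())
    return {s: round(n / total, 3) for s, n in sorted(counts.items())}

print(normalize(given_state))    # ~{-1: 0.5, 1: 0.5}
print(normalize(given_history))  # ~{-1: 0.5, 1: 0.5}: the past adds nothing
```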
Detailed explanations and illustrative examples can be found on YouTube: