Characteristics:
- No supervisor, only a reward signal
- Delayed Feedback
- Data depends on Agent's actions
Examples:
- Learning to play chess
- Making a robot walk
- Flying stunt manoeuvres in a helicopter
Reward (Rt):
- A scalar feedback signal
- Indicates how well the agent is doing at time step t
- The agent tries to maximize cumulative reward over time (see the return sketch after the examples below)
E.g.,
- Chess : +ve reward for winning the game, -ve for losing
- Robot walk : +ve for forward movement, -ve for falling
- Helicopter manoeuvres : +ve for following trajectory, -ve for crashing
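A minimal sketch of what "cumulative reward" means: the function below sums per-step rewards into a discounted return. The reward sequence and the discount factor gamma are made-up values, purely for illustration.

```python
# Minimal sketch: cumulative (discounted) reward over an episode.
# The reward sequence and gamma below are made-up illustrative values.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^k * R_{t+k+1} over the episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# e.g. a chess-like episode: no reward until the final win
rewards = [0, 0, 0, 0, 1]          # hypothetical episode
print(discounted_return(rewards))  # ~0.96 with gamma=0.99
```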
Sequential Decision Making:
- Select actions to maximize total future rewards
- Long-term consequences
- May need to sacrifice immediate reward for better long-term reward (a toy illustration follows below)
E.g.,
- Some moves in chess
- Financial investments
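A toy illustration with invented numbers: the "patient" reward sequence gives up reward early but ends with a higher total than the "greedy" one.

```python
# Toy illustration (made-up numbers): sacrificing immediate reward
# can yield a larger total reward over the whole episode.

greedy  = [+5, 0, 0, 0]     # grab reward now, nothing later
patient = [-1, -1, +2, +8]  # pay a cost up front, collect more later

print(sum(greedy))   # 5
print(sum(patient))  # 8 -> higher total despite negative early rewards
```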
Agent and Environment:
At each step t, the Agent:
- Executes action At
- Receives observation Ot
- Receives scalar reward Rt
Environment:
- Receives action At
- Emits observation Ot+1
- Emits reward Rt+1
- t increments at each environment step (a minimal loop sketch follows below)
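A minimal sketch of this interaction loop. The Agent and Environment classes here are placeholders (not from any particular library), just to show where At, Ot+1 and Rt+1 are produced and where t increments.

```python
import random

# Placeholder classes (not a real library) sketching the loop above.
class Environment:
    def step(self, action):
        # Environment receives A_t, emits O_{t+1} and R_{t+1}.
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    def act(self, observation, reward):
        # Agent receives O_t, R_t and executes A_t.
        return random.choice([0, 1])

env, agent = Environment(), Agent()
obs, reward = 0.0, 0.0
for t in range(5):                   # t increments at each environment step
    action = agent.act(obs, reward)  # A_t
    obs, reward = env.step(action)   # O_{t+1}, R_{t+1}
```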
History & State:
History:
- A sequence of Observations, Actions and Rewards that the Agent has seen so far.
Ht = A1, O1, R1, …, At, Ot, Rt
- All observable variables up to time t
- What happens next depends on the history:
- Agent(our Algorithm) selects Actions
- Environment selects Observations/Rewards
State:
- The full History is not very practical to use directly, as it is a very long stream of data.
- State is the information used to determine what happens next.
- Function of History
St = f(Ht)
E.g., take only the last 3 observations (sketched below)
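A small sketch of the "last 3 observations" idea: the state is just a fixed-size window over the history. The observation values and the window size are made up.

```python
# Sketch: St = f(Ht), where f keeps only the last 3 observations.
def make_state(history, k=3):
    """history: list of observations seen so far."""
    return tuple(history[-k:])

history = []
state = None
for obs in [0.1, 0.7, 0.3, 0.9, 0.5]:  # made-up observation stream
    history.append(obs)
    state = make_state(history)
print(state)  # (0.3, 0.9, 0.5)
```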
Environment/World State:
- State of environment used to determine how to generate next Observation and Reward.
- Usually not accessible to Agent.
- Even if visible, it may not contain any information useful to the Agent
Agent State:
- Agent's internal representation
- Information used by Agent(our algorithm) to pick next Action
- Can be any function of History
Markov Assumption:
- Information State: the State used by the Agent is a sufficient statistic of the History. To predict the future, you only need the current state of the environment.
- State St is Markov if and only if:
  P(St+1 | St, At) = P(St+1 | S1, S2, …, St, At)
- The Future is independent of the Past given the Present (a toy random-walk sketch follows below)
- E.g., State defined by a Trading Algorithm used by HFT traders.
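A toy sketch of the Markov property, using an invented random walk: the next state is sampled from the current state alone, so conditioning on earlier states adds nothing.

```python
import random

# Invented example: a random walk where S_t (the current position)
# is Markov -- the next state depends only on the current state,
# not on how the walk got there.
def step(state):
    return state + random.choice([-1, +1])

s = 0
trajectory = [s]
for _ in range(10):
    s = step(s)          # P(S_{t+1} | S_t): no reference to S_1..S_{t-1}
    trajectory.append(s)
print(trajectory)
```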
Environment:
Fully Observable Environment:
- Agent directly observes the environment state:
  O(t) = Sa(t) = Se(t)
- Agent State = Environment State
- Markov Decision Process (MDP)
Partially Observable Environment:
- Agent indirectly observes the Environment:
- An HFT trader observes only current prices
- A robot with camera vision doesn't observe its absolute location
- Poker playing agent observes only public cards
- Agent State != Environment State
- Partially Observable Markov Decision Process (POMDP)
- Agent has to construct its own representation of State
E.g.,
- Sa(t) = H(t) (the complete history)
- Sa(t) = (P[Se(t) = s1], …, P[Se(t) = sn]) (a belief over environment states; see the sketch below)
- Sa(t) = f(Sa(t-1), O(t)) (a recurrent update)
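A sketch of the second option above, treating the agent state as a belief (P[Se(t) = s1], …, P[Se(t) = sn]) updated from the previous belief and the new observation. The two hidden states, the transition matrix, and the observation likelihoods are all made-up values for illustration.

```python
# Sketch of Sa(t) = f(Sa(t-1), O(t)) as a belief state.
# Two hidden environment states; made-up transition and observation models.
TRANSITION = {                      # P(s' | s)
    's1': {'s1': 0.9, 's2': 0.1},
    's2': {'s1': 0.2, 's2': 0.8},
}
OBS_LIKELIHOOD = {                  # P(o | s) for observations 'hot'/'cold'
    's1': {'hot': 0.8, 'cold': 0.2},
    's2': {'hot': 0.3, 'cold': 0.7},
}

def update_belief(belief, obs):
    """One predict-then-correct step on the belief over {s1, s2}."""
    predicted = {
        s_next: sum(belief[s_prev] * TRANSITION[s_prev][s_next]
                    for s_prev in belief)
        for s_next in ('s1', 's2')
    }
    unnorm = {s: predicted[s] * OBS_LIKELIHOOD[s][obs] for s in predicted}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

belief = {'s1': 0.5, 's2': 0.5}           # initial agent state
for obs in ['hot', 'hot', 'cold']:        # hypothetical observation stream
    belief = update_belief(belief, obs)   # Sa(t) = f(Sa(t-1), O(t))
print(belief)
```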