Wednesday, April 17, 2019

Reinforcement Learning : Introduction







Characteristics:

  1. No supervisor, only a reward signal
  2. Delayed feedback
  3. Data depends on the Agent's actions




Examples:

  1. Learning to play chess
  2. Making a robot walk
  3. Flying stunt manoeuvres with a helicopter




Reward (Rt):

  • A scalar feedback signal
  • Indicates how well the Agent is doing at time t
  • The Agent tries to maximize cumulative reward over time (see the sketch below)
E.g.,
  • Chess : +ve reward for winning the game, -ve for losing
  • Robot walk : +ve for forward movement, -ve for falling
  • Helicopter manoeuvres : +ve for following trajectory, -ve for crashing
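
As a rough illustration of "cumulative reward", here is a minimal Python sketch; the reward numbers are made up purely for this example:

# Minimal sketch: the cumulative reward over one episode.
# The reward values below are invented for illustration only.
rewards = [0, 0, -1, 0, 5]        # Rt received at each time step t
cumulative_reward = sum(rewards)  # the quantity the Agent tries to maximize
print(cumulative_reward)          # -> 4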



Sequential Decision Making:

  • Select actions to maximize total future rewards
  • Long-term consequences
  • May need to sacrifice immediate rewards to obtain better long-term rewards
E.g.,
Some moves in chess; long-term financial investments (see the sketch below)
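
A minimal sketch of this trade-off (the discount factor and the reward numbers are assumptions chosen just for illustration): the Agent compares action sequences by their total discounted future reward, so a small sacrifice now can win if it leads to larger rewards later.

# Minimal sketch (illustrative numbers): total discounted future reward
#   G = R1 + gamma*R2 + gamma^2*R3 + ...
gamma = 0.9  # assumed discount factor

def discounted_return(rewards, gamma):
    return sum(r * gamma ** t for t, r in enumerate(rewards))

greedy  = [1, 0, 0, 0]     # grab the immediate reward, nothing afterwards
patient = [-1, 0, 0, 10]   # sacrifice now, larger reward later

print(discounted_return(greedy, gamma))   # 1.0
print(discounted_return(patient, gamma))  # -1 + 10 * 0.9**3 = 6.29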





Agent and Environment:

At each step t, the Agent:
  • Executes action At
  • Receives observation Ot
  • Receives scalar reward Rt

Environment:
  • Receives action At
  • Emits observation Ot+1
  • Emits reward Rt+1
t increments at each environment step
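
The interaction above can be written as a small loop. This is only a sketch; the env object and its reset()/step() methods are a hypothetical interface, not a specific library API:

import random

# Minimal sketch of the Agent-Environment loop described above.
# `env` is hypothetical: anything exposing reset() -> first observation and
# step(action) -> (next observation, reward) fits this pattern.
def run_episode(env, actions, num_steps=100):
    total_reward = 0.0
    observation = env.reset()                    # initial observation
    for t in range(num_steps):
        action = random.choice(actions)          # Agent executes action At (random policy here)
        observation, reward = env.step(action)   # Environment emits Ot+1 and Rt+1
        total_reward += reward
    return total_reward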




History & State:
History:
  • A sequence of Observations, Actions and Rewards that the Agent has seen so far.
            Ht = A1, O1, R1, ..., At, Ot, Rt

  • All observable variables up to time t

  • What happens next depends on the history:
    • Agent (our algorithm) selects Actions
    • Environment selects Observations/Rewards

State:
  • The full History is usually too long a stream of data to be used directly.

  • State is the information used to determine what happens next.

  • Function of History
St = f(Ht)
E.g., use only the last 3 observations
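
One very simple choice of f, matching the example above, is to keep only the last few observations and throw the rest of the History away (the helper below is just an illustration):

from collections import deque

# Minimal sketch: St = f(Ht) where f keeps only the last k observations.
class LastKObservations:
    def __init__(self, k=3):
        self.recent = deque(maxlen=k)   # older observations are dropped automatically

    def update(self, observation):
        self.recent.append(observation)

    def state(self):
        return tuple(self.recent)       # the Agent's state St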

Environment/World State:
  • The state the Environment uses internally to generate the next Observation and Reward.

  • Usually not accessible to the Agent.

  • Even if it were visible, it may not contain information useful to the Agent

Agent State:
  • The Agent's internal representation

  • Information used by the Agent (our algorithm) to pick the next Action

  • Can be any function of the History


Markov Assumption:
  • Information State: the State used by the Agent is a sufficient statistic of the History. In order to predict the future, you only need the current state of the environment.

  • State St is Markov if and only if:
P(St+1 | St, At) = P(St+1 | S1, S2, ..., St, At)

  • Future is independent of Past given Present

  • E.g., State defined by a Trading Algorithm used by HFT traders.
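
A common intuition for the Markov property: for a moving object, the position alone is not a Markov state (you would also need earlier positions to recover the speed), but the pair (position, velocity) is, because the next pair depends only on the current one. A tiny sketch with made-up 1-D dynamics:

# Minimal sketch (invented 1-D dynamics): (position, velocity) is Markov,
# because the next state depends only on the current state and the action.
def next_state(state, acceleration, dt=1.0):
    position, velocity = state
    new_position = position + velocity * dt
    new_velocity = velocity + acceleration * dt
    return (new_position, new_velocity)

s = (0.0, 1.0)                       # start at position 0 with velocity 1
s = next_state(s, acceleration=0.5)  # -> (1.0, 1.5); no earlier history needed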


Environment:
Fully Observable Environment:

  • Agent directly observes environment state
           O(t) = Sa(t) = Se(t)

  • Agent State = Environment State

  • Markov Decision Process (MDP)

Partially Observable Environment:

  • Agent indirectly observes Environment:
    • An HFT trader observes only current prices
    • A robot with camera vision doesn't observe its absolute location
    • A poker-playing agent observes only the public cards


  • Agent State != Environment State

  • Partially Observable Markov Decision Process (POMDP)

  • Agent has to construct its own representation of State
E.g.,
  • Sa(t) = H(t)  (keep the complete History)
  • Sa(t) = (P[Se(t) = s1], ..., P[Se(t) = sn])  (beliefs over possible environment states)
  • Sa(t) = f(Sa(t-1), O(t))  (a recurrent update of the previous state, sketched below)
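
The last form just means the Agent keeps a running summary that it updates with every new observation. A minimal sketch; the moving-average update used as f here is an arbitrary illustrative choice (in practice f is often learned, e.g. with a recurrent neural network):

# Minimal sketch of a recurrent agent state  Sa(t) = f(Sa(t-1), O(t)).
# The exponential moving average used as f is an arbitrary choice for illustration.
def update_agent_state(previous_state, observation, alpha=0.1):
    return (1 - alpha) * previous_state + alpha * observation

state = 0.0
for observation in [1.0, 2.0, 1.5]:   # partial observations from the environment
    state = update_agent_state(state, observation)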

