Characteristics:
- No supervisor, only a reward signal
- Delayed Feedback
- Data depends on Agent's actions
Examples:
- Learning to play chess
- Making a robot walk
- Flying stunt manoeuvres in a helicopter
Reward (Rt):
- A scalar feedback signal
- Indicates how well the agent is doing at time step t
- The agent tries to maximize cumulative reward over time (see the return sketch after the examples below)
E.g.,
- Chess : +ve reward for winning the game, -ve for losing
- Robot walk : +ve for forward movement, -ve for falling
- Helicopter manoeuvres : +ve for following trajectory, -ve for crashing
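A minimal sketch of what "cumulative reward" means: the function below sums per-step rewards into a discounted return. The reward sequence and the discount factor gamma are made-up values, purely for illustration.

```python
# Minimal sketch: cumulative (discounted) reward over an episode.
# The reward sequence and gamma below are made-up illustrative values.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^k * R_{t+k+1} over the episode."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# e.g. a chess-like episode: no reward until the final win
rewards = [0, 0, 0, 0, 1]          # hypothetical episode
print(discounted_return(rewards))  # ~0.96 with gamma=0.99
```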
Sequential Decision Making:
- Select actions to maximize total future rewards
- Long-term consequences
- May need to sacrifice immediate reward for better long-term reward (a toy illustration follows below)
E.g.,
- Some moves in chess
- Financial investments
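A toy illustration with invented numbers: the "patient" reward sequence gives up reward early but ends with a higher total than the "greedy" one.

```python
# Toy illustration (made-up numbers): sacrificing immediate reward
# can yield a larger total reward over the whole episode.

greedy  = [+5, 0, 0, 0]     # grab reward now, nothing later
patient = [-1, -1, +2, +8]  # pay a cost up front, collect more later

print(sum(greedy))   # 5
print(sum(patient))  # 8 -> higher total despite negative early rewards
```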
Agent and Environment:
At each step t, the Agent:
- Executes action At
- Receives observation Ot
- Receives scalar reward Rt
Environment:
- Receives action At
- Emits observation Ot+1
- Emits reward Rt+1
- t increments at each environment step (a minimal loop sketch follows below)
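A minimal sketch of this interaction loop. The Agent and Environment classes here are placeholders (not from any particular library), just to show where At, Ot+1 and Rt+1 are produced and where t increments.

```python
import random

# Placeholder classes (not a real library) sketching the loop above.
class Environment:
    def step(self, action):
        # Environment receives A_t, emits O_{t+1} and R_{t+1}.
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    def act(self, observation, reward):
        # Agent receives O_t, R_t and executes A_t.
        return random.choice([0, 1])

env, agent = Environment(), Agent()
obs, reward = 0.0, 0.0
for t in range(5):                   # t increments at each environment step
    action = agent.act(obs, reward)  # A_t
    obs, reward = env.step(action)   # O_{t+1}, R_{t+1}
```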
History & State:
History:
- A sequence of Observations, Actions and Rewards that the Agent has seen so far.
Ht = A1, O1, R1, …, At, Ot, Rt
- All observable variables up to time t
- What happens next depends on the history:
- Agent(our Algorithm) selects Actions
- Environment selects Observations/Rewards
State:
- The full History is not very practical to use directly, as it is a very long stream of data.
- State is the information used to determine what happens next.
- Function of History
St = f(Ht)
E.g., take only the last 3 observations (sketched below)
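A small sketch of the "last 3 observations" idea: the state is just a fixed-size window over the history. The observation values and the window size are made up.

```python
# Sketch: St = f(Ht), where f keeps only the last 3 observations.
def make_state(history, k=3):
    """history: list of observations seen so far."""
    return tuple(history[-k:])

history = []
state = None
for obs in [0.1, 0.7, 0.3, 0.9, 0.5]:  # made-up observation stream
    history.append(obs)
    state = make_state(history)
print(state)  # (0.3, 0.9, 0.5)
```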
Environment/World State:
- State of environment used to determine how to generate next Observation and Reward.
- Usually not accessible to Agent.
- Even if visible, it may not contain any information useful to the Agent
Agent State:
- Agent's internal representation
- Information used by Agent(our algorithm) to pick next Action
- Can be any function of History
Markov Assumption:
- Information State: the State used by the Agent is a sufficient statistic of the History. To predict the future, you only need the current state of the environment.
- State St is Markov if and only if:
  P(St+1 | St, At) = P(St+1 | S1, S2, …, St, At)
- The Future is independent of the Past given the Present (a toy random-walk sketch follows below)
- E.g., State defined by a Trading Algorithm used by HFT traders.
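A toy sketch of the Markov property, using an invented random walk: the next state is sampled from the current state alone, so conditioning on earlier states adds nothing.

```python
import random

# Invented example: a random walk where S_t (the current position)
# is Markov -- the next state depends only on the current state,
# not on how the walk got there.
def step(state):
    return state + random.choice([-1, +1])

s = 0
trajectory = [s]
for _ in range(10):
    s = step(s)          # P(S_{t+1} | S_t): no reference to S_1..S_{t-1}
    trajectory.append(s)
print(trajectory)
```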
Environment:
Fully Observable Environment:
- Agent directly observes the environment state:
  O(t) = Sa(t) = Se(t)
- Agent State = Environment State
- Markov Decision Process (MDP)
Partially Observable Environment:
- Agent indirectly observes the Environment:
- An HFT trader observes only current prices
- A robot with camera vision doesn't observe its absolute location
- Poker playing agent observes only public cards
- Agent State != Environment State
- Partially Observable Markov Decision Process (POMDP)
- Agent has to construct its own representation of State
E.g.,
- Sa(t) = H(t) (the complete history)
- Sa(t) = (P[Se(t) = s1], …, P[Se(t) = sn]) (a belief over environment states; see the sketch below)
- Sa(t) = f(Sa(t-1), O(t)) (a recurrent update)
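A sketch of the second option above, treating the agent state as a belief (P[Se(t) = s1], …, P[Se(t) = sn]) updated from the previous belief and the new observation. The two hidden states, the transition matrix, and the observation likelihoods are all made-up values for illustration.

```python
# Sketch of Sa(t) = f(Sa(t-1), O(t)) as a belief state.
# Two hidden environment states; made-up transition and observation models.
TRANSITION = {                      # P(s' | s)
    's1': {'s1': 0.9, 's2': 0.1},
    's2': {'s1': 0.2, 's2': 0.8},
}
OBS_LIKELIHOOD = {                  # P(o | s) for observations 'hot'/'cold'
    's1': {'hot': 0.8, 'cold': 0.2},
    's2': {'hot': 0.3, 'cold': 0.7},
}

def update_belief(belief, obs):
    """One predict-then-correct step on the belief over {s1, s2}."""
    predicted = {
        s_next: sum(belief[s_prev] * TRANSITION[s_prev][s_next]
                    for s_prev in belief)
        for s_next in ('s1', 's2')
    }
    unnorm = {s: predicted[s] * OBS_LIKELIHOOD[s][obs] for s in predicted}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

belief = {'s1': 0.5, 's2': 0.5}           # initial agent state
for obs in ['hot', 'hot', 'cold']:        # hypothetical observation stream
    belief = update_belief(belief, obs)   # Sa(t) = f(Sa(t-1), O(t))
print(belief)
```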