DDPG with gymnasium: Day 2

Previously

Last time we got as far as running the HalfCheetah environment with random actions.

The action is chosen by random sampling from the action space:

action = env.action_space.sample()

This is what is called a "policy".

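By defining a policy, the agent (the one that acts) selects an action according to the current situation (the environment and its observations). That decision may be stochastic, or it may be unique, that is, deterministic.

The goal of reinforcement learning methods such as DDPG is to tune and improve this policy with some algorithm so that the maximum return is obtained.

Perhaps "improving the policy so that the return is maximized" is the more accurate way to put it.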
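To make the stochastic/deterministic distinction concrete, here is a minimal sketch. The function names, the three-element observation, the clipped-mean rule and the action range [-1, 1] are all assumptions made for illustration; only the idea of "sample an action" versus "compute one fixed action" comes from the text above.

```python
import random

def stochastic_policy(obs, rng):
    """Ignore obs and sample each of 6 action components uniformly from [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(6)]

def deterministic_policy(obs):
    """Map the observation to one fixed action: same obs in, same action out."""
    # Hypothetical rule: every component is the clipped mean of the observation.
    mean = sum(obs) / len(obs)
    clipped = max(-1.0, min(1.0, mean))
    return [clipped] * 6

rng = random.Random(0)          # fixed seed so the run is reproducible
obs = [0.2, -0.4, 0.5]          # made-up observation

a1 = stochastic_policy(obs, rng)
a2 = stochastic_policy(obs, rng)
d1 = deterministic_policy(obs)
d2 = deterministic_policy(obs)

# The stochastic policy gives different actions for the same obs,
# the deterministic one gives the same action every time.
print(a1 != a2, d1 == d2)
```

The random policy used so far is the stochastic kind; the actor that DDPG learns is the deterministic kind.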

Improvement plan

Create a new Agent class, instantiate it as agent, and add methods step by step so that it can learn in the DDPG way.

Now let's create the AgentDDPG class.

class AgentDDPG:
    def __init__(self):
        print('AgentDDPG is working')
    def choose_action(self, obs):
        action = env.action_space.sample()
        return action

In the main script, create the instance with agent = AgentDDPG() and then call action = agent.choose_action(obs). The method picks one action at random from the action space action_space and returns it. At this point there is nothing DDPG-specific in the code yet.
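One thing to note about the class above is that choose_action reads the global variable env. A possible alternative, sketched below, is to hand the action space to the constructor. DummyActionSpace and RandomAgent are hypothetical names introduced only so that the sketch runs without gymnasium; in the real script you would pass env.action_space instead.

```python
import random

class DummyActionSpace:
    """Hypothetical stand-in for env.action_space (only sample() is mimicked)."""
    def __init__(self, dim, low=-1.0, high=1.0, seed=0):
        self.dim, self.low, self.high = dim, low, high
        self._rng = random.Random(seed)

    def sample(self):
        return [self._rng.uniform(self.low, self.high) for _ in range(self.dim)]

class RandomAgent:
    """Same behaviour as the random agent above, but without a global env."""
    def __init__(self, action_space):
        self.action_space = action_space

    def choose_action(self, obs):
        # obs is still ignored at this stage; the action is purely random.
        return self.action_space.sample()

agent = RandomAgent(DummyActionSpace(dim=6))
action = agent.choose_action(obs=[0.0] * 17)
print(len(action))
```

Passing the dependency in makes the agent usable with any environment, and easier to test on its own.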

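Building the DDPG part

DDPG builds on the policy gradient method: starting from an arbitrary policy π, it adjusts the policy automatically, little by little, so as to maximize the objective J(θ), the expected return. The gradient used for this is grad J(θ) = E[ Σ G(τ) * grad log π(θ) ].

The gradient grad J is used to optimize the parameters θ, but how should it be expressed? Tracing how the policy gradient method evolved, I read it as follows.

Basic policy gradient: E[ Σ G(τ) * grad log π(θ) ]
REINFORCE-style noise reduction in the return (use the return from step t onward): E[ Σ G(t) * grad log π(θ) ]
With a baseline: E[ Σ (G(t) - b) * grad log π(θ) ]
With the value function as the baseline: E[ Σ (G(t) - V(w)[s_t]) * grad log π(θ) ]
Using the TD method: (r + γ V(w)[s_t+1] - V(w)[s_t]) * grad log π(θ)
Represent the policy π(θ) with a neural network: this is called the actor. Input s, output π(a|s) (sometimes written as the action probability prob).
Represent the value function V(w) with a neural network: this is called the critic. Note: the true value function v is never computed; the policy π can be learned even when V is only a rough estimate. The input is the same s as the actor, and in addition the actor's output is fed to the critic as an input, so the DDPG critic effectively estimates an action value Q(s, a).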
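The quantities that multiply grad log π(θ) in the list above can be computed by hand on a toy episode. The following is a minimal sketch, not the author's implementation: the reward sequence, γ = 0.5, the value estimates and the choice of "mean return" as the constant baseline b are all made-up assumptions.

```python
def discounted_returns(rewards, gamma):
    """G(t) = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):      # accumulate from the last step backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def td_error(reward, gamma, v_next, v_now):
    """delta = r + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_now

rewards = [1.0, 1.0, 1.0]            # toy episode with three steps
gamma = 0.5

returns = discounted_returns(rewards, gamma)       # [1.75, 1.5, 1.0]
baseline = sum(returns) / len(returns)             # hypothetical constant baseline b
advantages = [g - baseline for g in returns]       # G(t) - b

# If V happens to equal the true returns (V(s_0)=1.75, V(s_1)=1.5),
# the TD error of the first step is exactly zero.
delta = td_error(reward=1.0, gamma=gamma, v_next=1.5, v_now=1.75)
print(returns, advantages, delta)
```

Subtracting the baseline leaves the weights centered around zero, which is the variance reduction that the baseline step is after.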

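Feeding the actor's output into the critic is what lets DDPG handle continuous actions. Without that element the output is discrete, and I have been calling that algorithm actor-critic. Actor-critic is used when, for example, there are two choices, left and right, and the agent has to decide "go right".

Because DDPG works with continuous values, it can produce an output such as "turn the steering wheel 15.2° to the right while pressing the brake 20%". (At least it should!)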
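A continuous output still has to stay inside the range the environment accepts (HalfCheetah's action components lie in [-1, 1]). One common way to do that, which the code in this article does not include, is to squash the raw network output with tanh and rescale it. The sketch below assumes a hypothetical steering range of ±30 degrees and a hypothetical helper name scale_action.

```python
import math

def scale_action(raw_outputs, low, high):
    """Squash unbounded network outputs with tanh, then rescale to [low, high]."""
    scaled = []
    for x in raw_outputs:
        squashed = math.tanh(x)                          # now in (-1, 1)
        scaled.append(low + (squashed + 1.0) * 0.5 * (high - low))
    return scaled

raw = [-3.0, 0.0, 0.4, 10.0]                             # made-up raw actor outputs
steering = scale_action(raw, low=-30.0, high=30.0)       # hypothetical range in degrees
print(steering)
```

A raw output of 0 maps to the middle of the range, and very large raw values saturate near the limits instead of exceeding them.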

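Now, coding little by little

The changes are made in small steps so that beginners in Python and programming can follow along.

1. Change the code so that the action is obtained from the agent instance's choose_action method.

Before: action = env.action_space.sample()

After: action = agent.choose_action(obs)

2. Create an instance of the agent class

agent = AgentDDPG()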

3. Define the agent class

class AgentDDPG:
    def __init__(self):
        pass

    def choose_action(self, obs):
        pass

4. Represent the policy (actor) with a neural network. Create a new ActorNN class and use it through an instance named actor.

    def choose_action(self, obs):
        action = self.actor.forward(obs)
        return action

5. Create an instance of the ActorNN class

    def __init__(self):
        self.actor = ActorNN()

6. Create the new ActorNN class

class ActorNN:
    def __init__(self):
        pass

    def forward(self, obs):
        action = [0.0 for i in range(6)]
        return action

The neural network structure has not been built yet, so obs is not used as an input, and action is a placeholder output of [0,0,0,0,0,0].

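Fleshing out the ActorNN class

It is a three-layer fully connected network (input layer, one hidden layer of 64 units, output layer), with relu as the activation function.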

import torch.nn as nn
import torch.nn.functional as F

class ActorNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)   # input -> 64-unit hidden layer
        self.fc2 = nn.Linear(64, output_dim)  # hidden layer -> one value per action dimension

    def forward(self, obs):
        x = self.fc1(obs)
        x = F.relu(x)
        action = self.fc2(x)
        print('action :', action)
        return action
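To see what fc1, relu and fc2 actually compute, the same data flow can be traced by hand with plain Python lists. This is only an illustrative sketch: the helper names and the tiny weights (2 inputs, 3 hidden units, 1 output) are made up, whereas the real network above has 64 hidden units and randomly initialized weights.

```python
def linear(x, weights, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in x]

def actor_forward(obs, w1, b1, w2, b2):
    """Same data flow as ActorNN.forward: fc1 -> relu -> fc2."""
    hidden = relu(linear(obs, w1, b1))
    return linear(hidden, w2, b2)

# Tiny hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 2.0, 3.0]]
b2 = [0.5]

# fc1 gives [2.0, -1.0, -1.0], relu gives [2.0, 0.0, 0.0], fc2 gives 2.0 + 0.5.
action = actor_forward([2.0, -1.0], w1, b1, w2, b2)
print(action)
```

Note that the final layer has no activation, so the output is unbounded; nothing in this network limits the action to the environment's range.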

Fleshing out the AgentDDPG class

# 3. Define the agent class
class AgentDDPG:
    def __init__(self, input_dim, output_dim):
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.actor = ActorNN(input_dim=self.input_dim, output_dim=self.output_dim)

    def choose_action(self, obs):
        print('AgentDDPG.choose_action is working.')

        # obs must be a torch tensor; the actor returns a tensor as well
        action = self.actor.forward(obs)
        # Detach from the autograd graph and convert to NumPy for env.step()
        action = action.detach().numpy()

        return action

Bringing the main script in line

import time

import gymnasium as gym
import torch as T

# HalfCheetah: 17-dimensional observation, 6-dimensional action
agent = AgentDDPG(input_dim=17, output_dim=6)

env = gym.make("HalfCheetah-v4", render_mode='human')

EPISODES = 10
DELAY_TIME = 0.00 # sec
total_rewards = []
for episode in range(EPISODES):
    # reset() returns (observation, info)
    obs, info = env.reset()
    obs = T.tensor(obs, dtype=T.float)
    print('observation_space :', env.observation_space)
    print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(10):
        env.render()

        # This is the part being replaced with DDPG
        # action = env.action_space.sample()
        action = agent.choose_action(obs)

        # step() returns (observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)

        total_reward += reward

        time.sleep(DELAY_TIME)
    total_rewards.append(total_reward)

print('total_rewards : ', total_rewards)

env.close()
print('script is done.')

That is how I have built it up so far.

About training

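The actor is trained so that the objective J(θ) is maximized. In practice the computation minimizes -J(θ), because optimizers minimize a loss.
The critic is trained in a supervised-learning style, with a target value computed through target_actor serving as the label. In standard DDPG that label is y = r + γ Q'(s', μ'(s')), where μ' is target_actor and Q' is a matching target copy of the critic.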
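The two training signals can be written down with plain numbers before any network is involved. The sketch below is an assumption-laden illustration rather than the training code of this series: q_next stands for the target critic's value at the action chosen by target_actor, and the soft update with rate tau is the usual DDPG way of moving target networks, which this article has not introduced yet. All function names and numbers are hypothetical.

```python
def critic_target(reward, gamma, done, q_next):
    """y = r + gamma * Q'(s', mu'(s')); the bootstrap term is dropped at episode end."""
    return reward + (0.0 if done else gamma * q_next)

def mse(predictions, targets):
    """Mean squared error, the regression loss for the critic."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def actor_loss(q_values):
    """Maximising J is done by minimising -J, here the negated mean of Q(s, mu(s))."""
    return -sum(q_values) / len(q_values)

def soft_update(target_params, online_params, tau):
    """Move each target parameter a small step toward its online counterpart."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]

# Hypothetical numbers for a batch of two transitions.
targets = [critic_target(1.0, 0.5, False, q_next=2.0),    # 1 + 0.5 * 2 = 2.0
           critic_target(1.0, 0.5, True, q_next=2.0)]     # terminal step: 1.0
critic_loss = mse(predictions=[1.0, 1.0], targets=targets)  # ((1-2)^2 + 0) / 2 = 0.5
loss_actor = actor_loss([1.0, 3.0])                          # -(1 + 3) / 2 = -2.0
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)    # [0.5, 1.0]
print(targets, critic_loss, loss_actor, new_target)
```

The labels for the critic come from slowly changing target networks, so the regression target does not shift with every gradient step.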

About the author
Keita N
