DDPG by gymnasium: Day 9

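Training has been going well as of the previous post, so this time let's try saving and loading the parameters.

For example, once the agent has run 100 episodes and obtained a policy that moves the HalfCheetah forward to some degree, we save the parameters.

When we stop the program and run it again later, it loads the saved parameters and starts from the trained state.

This also keeps the damage to a minimum if the program suddenly stops partway through.

Saving the parameters

In the main script, insert the following code at the end of each episode, just before the next episode begins.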

    # 47. Save the parameters of each neural network every 10 episodes
    print('episode, total_reward : ', episode , total_reward)
    if episode % 10 == 0:
        T.save(agent.actor.state_dict(), 'actor_params.pt')
        T.save(agent.critic.state_dict(), 'critic_params.pt')
        T.save(agent.target_actor.state_dict(), 'target_actor_params.pt')
        T.save(agent.target_critic.state_dict(), 'target_critic_params.pt')
        print('==== params were saved. ====')
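The snippet above writes four separate files directly. As a variation, here is a minimal sketch that bundles all state dicts, together with the optimizer states, into one file and writes it through a temporary file, so that an interruption during saving should not leave a half-written checkpoint behind. The function name `save_checkpoint`, the file name `ddpg_checkpoint.pt`, the dictionary keys, and the `TinyNet` stand-in class are all hypothetical and not part of the original script.

```python
import os

import torch as T
import torch.nn as nn
import torch.optim as optim


def save_checkpoint(networks, path, episode):
    """Save several networks (and their optimizers) into one file.

    `networks` maps a name to a module that has an `.optimizer` attribute,
    as ActorNN and CriticNN do in this series.
    """
    checkpoint = {'episode': episode}
    for name, net in networks.items():
        checkpoint[name] = net.state_dict()
        checkpoint[name + '_optimizer'] = net.optimizer.state_dict()

    # Write to a temporary file first, then rename it over the target,
    # so `path` always holds either the old or the new complete checkpoint.
    tmp_path = path + '.tmp'
    T.save(checkpoint, tmp_path)
    os.replace(tmp_path, path)


class TinyNet(nn.Module):
    """Hypothetical stand-in for ActorNN / CriticNN."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 2)
        self.optimizer = optim.Adam(self.parameters(), lr=0.001)
```

With the agent from this series, the call might look like `save_checkpoint({'actor': agent.actor, 'critic': agent.critic, 'target_actor': agent.target_actor, 'target_critic': agent.target_critic}, 'ddpg_checkpoint.pt', episode)`.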

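Loading the parameters

Put the following inside __init__() of the ActorNN(nn.Module) class so that the actor and target_actor instances take over the saved parameters as soon as they are created. Note that both instances read the same actor_params.pt.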

        # 48. Load the actor parameters
        # If a saved parameter file exists, initialize the network from it.
        # Map the saved tensors onto the GPU if one is available, otherwise the CPU.
        if T.cuda.is_available():
            map_location = 'cuda'
        else:
            map_location = 'cpu'

        # Check whether the parameter file exists
        if os.path.isfile('actor_params.pt'):
            # Load it if it does
            self.load_state_dict(T.load('actor_params.pt', map_location=map_location))
            print("Loaded parameter file:", 'actor_params.pt')
        else:
            print("Parameter file not found:", 'actor_params.pt')

Do the same for the CriticNN(nn.Module) class.

        # 49. Load the critic parameters
        # If a saved parameter file exists, initialize the network from it.
        # Map the saved tensors onto the GPU if one is available, otherwise the CPU.
        if T.cuda.is_available():
            map_location = 'cuda'
        else:
            map_location = 'cpu'

        # Check whether the parameter file exists
        if os.path.isfile('critic_params.pt'):
            # Load it if it does
            self.load_state_dict(T.load('critic_params.pt', map_location=map_location))
            print("Loaded parameter file:", 'critic_params.pt')
        else:
            print("Parameter file not found:", 'critic_params.pt')

With this, it works.
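Because the load code above hard-codes actor_params.pt and critic_params.pt inside the classes, the target networks never read the target_actor_params.pt and target_critic_params.pt files that the save code writes. One possible alternative, sketched below, is a helper that takes the file name as an argument so each network can read its own file. The helper name `load_params_if_exists` is hypothetical and not part of the original script.

```python
import os

import torch as T
import torch.nn as nn


def load_params_if_exists(net, path):
    """Load a state_dict from `path` into `net` if the file exists.

    Returns True when parameters were loaded, False otherwise.
    """
    if not os.path.isfile(path):
        print('Parameter file not found:', path)
        return False
    # Map the saved tensors onto the GPU if one is available, otherwise the CPU.
    map_location = 'cuda' if T.cuda.is_available() else 'cpu'
    net.load_state_dict(T.load(path, map_location=map_location))
    print('Loaded parameter file:', path)
    return True
```

It would be called once per network after the agent is created, for example `load_params_if_exists(agent.target_actor, 'target_actor_params.pt')`.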

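Training know-how

I worked out a trick for training.

Early in training, the agent's movements are small and it hardly moves forward.

At first I kept the number of steps per episode small, around 10 to 100, so that it learned only the start dash.

I then stopped the run and increased the number of steps to 200 and then 400, which gave a HalfCheetah that keeps running stably.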
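The staged increase described above was done by hand, stopping and restarting the script. As a rough sketch of the same idea inside a single run, the step limit could be picked from the episode number. The function name and the episode thresholds (50 and 150) are made-up values for illustration; only the step counts 100, 200, and 400 come from the text.

```python
def steps_for_episode(episode, schedule=((0, 100), (50, 200), (150, 400))):
    """Return the step limit for `episode`.

    `schedule` is a sequence of (first_episode, steps) pairs in ascending
    order; the last pair whose first_episode <= episode is used.
    """
    steps = schedule[0][1]
    for first_episode, n_steps in schedule:
        if episode >= first_episode:
            steps = n_steps
    return steps


for episode in (0, 49, 50, 149, 150, 1000):
    print(episode, steps_for_episode(episode))
```

In the main loop, `for j in range(STEPS):` would then become `for j in range(steps_for_episode(episode)):`.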

Speeding up computation (turning off the 3D model display)

In the line

    env = gym.make("HalfCheetah-v4", render_mode='human')

changing render_mode to 'depth_array' is all it takes.

Real-time visualization of training progress

Only the episode number and its total reward are displayed.

    print('episode, total_reward : ', episode , total_reward)

The parameters are saved every 10 episodes.

    if episode % 10 == 0:
        print('==== params were saved. ====')

↓↓↓ Output

episode, total_reward :  498 259.49914064186675
episode, total_reward :  499 360.5913236729034
episode, total_reward :  500 493.48175790551875
==== params were saved. ====
episode, total_reward :  501 -38.671055063944046
episode, total_reward :  502 260.85835046795177
episode, total_reward :  503 327.2605501737727
episode, total_reward :  504 355.8699644344001
episode, total_reward :  505 381.4410905904467
episode, total_reward :  506 342.0158764598478
episode, total_reward :  507 314.5073192976202
episode, total_reward :  508 82.38588849764184
episode, total_reward :  509 355.04443705923273
episode, total_reward :  510 337.79242797960603
==== params were saved. ====
episode, total_reward :  511 380.88193759326816
episode, total_reward :  512 437.873030625086
episode, total_reward :  513 277.1307694476558

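Issues solved

Saving and loading parameters
Being able to resume after the program stops partway through
Speeding up computation (disabling print statements)
Speeding up computation (turning off the 3D model display)
Real-time visualization of training progress
An appropriate number of steps
An appropriate neural network structure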

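Unsolved issues and open questions

Speeding up computation (using the GPU)
An appropriate number of episodes
An appropriate replay buffer size
I don't understand how to use model.train() and model.eval().
Normalization of the network inputs? Of the parameters?
Is it all right for actor and target_actor, and likewise critic and target_critic, to load the same saved parameters?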
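On the model.train() and model.eval() question: in PyTorch these calls switch the behavior of layers such as nn.Dropout and nn.BatchNorm between training and evaluation; they do not turn gradient computation on or off. The networks in this script use only nn.Linear, ReLU, and tanh, so the calls should make no difference here. A small illustration with a Dropout layer, which is not part of the original networks:

```python
import torch as T
import torch.nn as nn

layer = nn.Dropout(p=0.5)
x = T.ones(1000)

layer.train()        # training mode: each element is zeroed with probability p,
y_train = layer(x)   # and the survivors are scaled by 1 / (1 - p) = 2

layer.eval()         # evaluation mode: Dropout passes the input through unchanged
y_eval = layer(x)

print('zeros in train mode:', int((y_train == 0).sum()))
print('zeros in eval mode :', int((y_eval == 0).sum()))
```

If a Dropout or BatchNorm layer were added to ActorNN or CriticNN later, eval() would matter when choosing actions and train() when calling learn().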

The script so far

import gymnasium as gym
import time
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import os

""" pip install
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install gymnasium
pip install matplotlib
pip install mujoco
pip install gymnasium[mujoco]
"""
# 44. Create the OUActionNoise class
class OUActionNoise(object):
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
        self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)


# 10. Create the ReplayBuffer class
class ReplayBuffer:
    def __init__(self, max_memory_size, n_obs_space, n_action_space):
        self.max_memory_size = max_memory_size
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.memory_count = 0

        self.state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.action_memory =  np.zeros((self.max_memory_size, self.n_action_space))
        self.reward_memory =  np.zeros(self.max_memory_size)
        self.next_state_memory =  np.zeros((self.max_memory_size, self.n_obs_space))
        self.terminal_memory =  np.zeros(self.max_memory_size)
        #self.terminal_memory = np.zeros(self.max_memory_size, dtype=np.bool)

    # 11. Create the store_transition method to save transitions
    def store_transition(self, obs, action, reward, next_state, done):
        #print('store_transition is working.')
        index = self.memory_count % self.max_memory_size # once the buffer is full, overwrite starting from the oldest data
        #print('obs.detach().numpy().flatten():',obs.detach().numpy().flatten())
        self.state_memory[index] = obs.detach().numpy().flatten()
        self.action_memory[index] = action.flatten()
        self.reward_memory[index] = float(reward) # the reward is a scalar, so store it as a plain float
        self.next_state_memory[index] = next_state.flatten()
        self.terminal_memory[index] = 1 - int(done) # stores 0 when the episode is done, 1 otherwise
        #print('state_memory :', self.state_memory)
        #print('action_memory :', self.action_memory)
        #print('reward_memory :', self.reward_memory)
        #print('next_state_memory :', self.next_state_memory)
        #print('memory.state_memory :', self.terminal_memory)

        #print('type of state_memory :', type(self.state_memory[0][0]))
        #print('type of action_memory :', type(self.action_memory[0][0]))
        #print('type of reward_memory :', type(self.reward_memory[0]))
        #print('type of next_state_memory :', type(self.next_state_memory[0][0]))
        #print('type of memory.state_memory :', type(self.terminal_memory[0]))

        self.memory_count += 1
        #print('memory_count :', agent.memory.memory_count)

    # 16. Sample randomly from the buffer memory
    def sample_buffer(self, batch_size):
        # Handle the case where the buffer has not yet been filled to its maximum size.
        max_index = min(self.max_memory_size, self.memory_count)
        choosed_index = np.random.choice(max_index, batch_size)
        
        observations = self.state_memory[choosed_index]
        actions = self.action_memory[choosed_index]
        rewards = self.reward_memory[choosed_index]
        next_states = self.next_state_memory[choosed_index]
        terminals = self.terminal_memory[choosed_index]

        return observations, actions, rewards, next_states, terminals


# 6. Create the ActorNN class
class ActorNN(nn.Module):
    def __init__(self, alpha=0.001, n_obs_space=17, n_action_space=6,
                                    layer1_size=256, layer2_size=256, batch_size=64):
        #print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(n_obs_space, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, n_action_space)

        # 26. Set Adam as the optimizer
        self.optimizer = optim.Adam(self.parameters(), lr=alpha)

        # 48. Load the actor parameters
        # If a saved parameter file exists, initialize the network from it.
        # Map the saved tensors onto the GPU if one is available, otherwise the CPU.
        if T.cuda.is_available():
            map_location = 'cuda'
        else:
            map_location = 'cpu'

        # Check whether the parameter file exists
        if os.path.isfile('actor_params.pt'):
            # Load it if it does
            self.load_state_dict(T.load('actor_params.pt', map_location=map_location))
            print("Loaded parameter file:", 'actor_params.pt')
        else:
            print("Parameter file not found:", 'actor_params.pt')

    def forward(self, obs):
        #print('AgetDDPG.ActorNN.forward is working')
        #print('==== OK up to here 1 ====')
        x = self.fc1(obs)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        action = F.tanh(x)

        return action

# 22. Create the CriticNN class
class CriticNN(nn.Module):

    def __init__(self, beta=0.001, n_obs_space=17, n_action_space=6,
                 layer1_size=256, layer2_size=256, batch_size=64):
        #print('CriticNN.__init__ is working.')
        super(CriticNN, self).__init__()

        # The critic NN takes two inputs: the observation and the action
        input_dim = n_obs_space + n_action_space
        self.fc1 = nn.Linear(input_dim, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, 1) # a single output is enough

        # 27. Set Adam as the optimizer
        self.optimizer = optim.Adam(self.parameters(), lr=beta)

        # 49. Load the critic parameters
        # If a saved parameter file exists, initialize the network from it.
        # Map the saved tensors onto the GPU if one is available, otherwise the CPU.
        if T.cuda.is_available():
            map_location = 'cuda'
        else:
            map_location = 'cpu'

        # Check whether the parameter file exists
        if os.path.isfile('critic_params.pt'):
            # Load it if it does
            self.load_state_dict(T.load('critic_params.pt', map_location=map_location))
            print("Loaded parameter file:", 'critic_params.pt')
        else:
            print("Parameter file not found:", 'critic_params.pt')

    def forward(self, obs, action):
        input_data = T.cat([obs, action], dim=1)
        x = self.fc1(input_data)
        x = F.relu(x)
        x =self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x # outputs a single value: the estimated value of the (obs, action) pair

# 3. Define the agent class
class AgentDDPG:

    def __init__(self, alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64):
        #print('AgentDDPG.__init__ is working.')
        # 5. Create an instance of the ActorNN class
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.tau = tau
        
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.n_state_action_value = n_state_action_value

        self.layer1_size = layer1_size
        self.layer2_size = layer2_size

        # 13. Fix the batch size
        self.batch_size = batch_size 

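        # NOTE: the networks below are built with hard-coded values, so the alpha, beta
        # and layer sizes passed to AgentDDPG.__init__ are not what they actually use.
        self.actor = ActorNN(alpha=0.000025, n_obs_space=17, n_action_space=6,
                            layer1_size=64, layer2_size=64, batch_size=64)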
        
        # 9. Add the memory instance
        self.MAX_MEMORY_SIZE = 10000
        self.memory = ReplayBuffer(max_memory_size=self.MAX_MEMORY_SIZE,
                                   n_obs_space=self.n_obs_space,
                                   n_action_space=self.n_action_space)
        
        # 19. Create target_actor, the target actor network instance
        # actor and target_actor can use the same ActorNN architecture
        self.target_actor = ActorNN(alpha=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64)

        
        # 21. Create target_critic, the target critic network instance
        self.target_critic = CriticNN(beta=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64)
        
        # 24. Create critic, the critic network instance.
        self.critic = CriticNN(beta=0.000025, n_obs_space=17, n_action_space=6,
                               layer1_size=64, layer2_size=64, batch_size=64)
        
        # Actor loss and critic loss
        self.actor_loss = 0
        self.critic_loss = 0
        
        # 45. Instantiate the action noise
        self.noise = OUActionNoise(mu=np.zeros(n_action_space))

    def choose_action(self, obs):
        #print('AgentDDPG.choose_action is working.')
        # 4. The policy (actor) is represented by a neural network.
        #   The ActorNN class is created and used as the instance actor.
        action = self.actor.forward(obs)
        

        # 46. Add action noise to improve exploration.
        action += T.tensor(self.noise(), dtype=T.float32)
        action = action.detach().numpy()

        return action
    
    # 8. Add the remember method
    def remember(self, obs, action, reward, next_state, done):
        self.memory.store_transition(obs, action, reward, next_state, done)

    # 13. Add the learn method
    def learn(self):
        # 14. Do nothing until a batch's worth of transitions has been collected.
        if self.memory.memory_count < self.batch_size:
            return
        
        # 15. Pull data out of the memory buffer with sample_buffer()
        # The data is batched, so the variable names are plural
        observations, actions, rewards, next_states, terminals = self.memory.sample_buffer(self.batch_size)
        #print('s:', observations)
        #print(observations.shape)
        #print('a :', actions)
        #print('r :', rewards)
        #print('s_ :', next_states)
        #print('terminal :', terminals)

        # 17. Convert the sampled data to torch.tensor so that PyTorch can differentiate through it
        observations = T.tensor(observations, dtype=T.float32)
        actions = T.tensor(actions, dtype=T.float32)
        rewards = T.tensor(rewards, dtype=T.float32)
        next_states = T.tensor(next_states, dtype=T.float32)
        terminals = T.tensor(terminals, dtype=T.float32)
       
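        # 18. Feed the next states next_states into target_actor, the target actor network
        # instance, and take the output as the target actions target_actions.
        # This target network tracks the parameters of the non-target network (see update_network_parameters).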
        #print('next_states :', next_states)
        target_actions = self.target_actor.forward(next_states)

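        # 20. Feed the next states next_states and the target actions computed above
        # into target_critic, the target critic network instance, and
        # get the target value, an estimate of the value function.
        # This is the r + γ*V(w)[s_t+1] part, i.e. the TD target.
        # The target critic value uses the target actor network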
        target_critic_values = self.target_critic.forward(next_states, target_actions)

        
        # 23. Feed the current states observations and the actions actions into the critic
        # network (the value function V(w)[s_t] network), which serves as the baseline,
        # and compute the critic values
        critic_values = self.critic.forward(observations, actions)

        # 25. Compute the TD target: r + γ*V(w)[s_t+1]
        td_targets = []
        for i in range(self.batch_size):
            td_target = rewards[i] + self.gamma * target_critic_values[i] * terminals[i]
            td_targets.append(td_target)
        
        # Shape the TD targets into a batch
        td_targets = T.tensor(td_targets, dtype=T.float32)
        td_targets = td_targets.view(self.batch_size, 1) # view works like reshape; this views the data as 64x1
        #print('td_targets :', td_targets)

        
        # Pick up from here next time 2023/5/16
        # ==== (1) Training the critic ====

        #self.critic.train()

        # 28. Reset the critic's gradients to zero
        self.critic.optimizer.zero_grad()

        # 29. Compute the squared error between the TD targets and the critic values and use it as the critic's loss function. The batch size is 64
        critic_loss = F.mse_loss(td_targets, critic_values)
        self.critic_loss = critic_loss
        #print('critic_loss : ', critic_loss) # tensor(0.0485, grad_fn=<MseLossBackward0>)

        # 30. Differentiate the critic's loss function to compute the gradients
        critic_loss.backward()

        # 31. Let the optimizer update the critic's parameters (weights and biases) from the gradients
        self.critic.optimizer.step()

        # self.critic.eval()


        # ==== (2) Training the actor ====

        # 32. Reset the actor's gradients to zero
        self.actor.optimizer.zero_grad()

        # 33. Feed the observations into the actor to compute actions. The batch size is 64
        predicted_actions = self.actor.forward(observations)

        #self.actor.train()

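        # 34. Compute the actor's loss function
        #    The actor's goal is to choose actions that maximize the output of the critic network (the action value).
        #    So the output of the whole actorNN -> criticNN DDPG structure is used as actor_loss, backward is run through both actorNN and criticNN,
        #    and only the actor's parameters are updated, which trains the actor.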

        actor_loss = -self.critic.forward(observations, predicted_actions)
        actor_loss = T.mean(actor_loss)
        self.actor_loss = actor_loss

        #print(f'actor_loss: {actor_loss}, critic_loss: {critic_loss}')

        # 35. Differentiate actor_loss, the loss function of the whole DDPG structure, to compute the gradients
        actor_loss.backward()

        # 36. Let the optimizer update only the actor's parameters (weights and biases) from the gradients
        self.actor.optimizer.step()
    
        # 37. Update the parameters of the target networks (soft update).
        self.update_network_parameters()

    # 37. Parameter update method.
    def update_network_parameters(self, tau=None):
        if tau is None:
            tau = self.tau
    
        # 38. Get all parameters (weights and biases) and their names for the actor, critic, target_actor, and target_critic networks
        # actor and critic hold the parameters that were just updated
        actor_params = self.actor.named_parameters()
        critic_params = self.critic.named_parameters()
        target_actor_params = self.target_actor.named_parameters()
        target_critic_params = self.target_critic.named_parameters()
        #print('actor_params : ', actor_params) # actor_params :  <generator object Module.named_parameters at 0x000001661B2D9D48>

        # 39. Extract the parameters as dictionaries.
        actor_params_dict = dict(actor_params)
        critic_params_dict = dict(critic_params)
        target_actor_params_dict = dict(target_actor_params)
        target_critic_params_dict = dict(target_critic_params)
        #print('actor_params_dict : ', actor_params_dict)
        #print(actor_params_dict.keys())
        """
        actor_params_dict :  {'fc1.weight': Parameter containing:
                                   tensor([[-0.1895, -0.0343,  0.1138,  ...,  0.2157,  0.0527, -0.1173],/

        dict_keys(['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias'])
        """

        # 40. For each critic parameter, blend in the critic value with weight tau so that the target_critic parameter moves slightly toward the critic parameter.
        for name in critic_params_dict:
            critic_params_dict[name] = tau * critic_params_dict[name].clone() + \
                                       (1-tau) * target_critic_params_dict[name].clone()
            
        # 41. Load the blended parameters as the target_critic parameters.
        self.target_critic.load_state_dict(critic_params_dict)

        # 42. For each actor parameter, blend in the actor value with weight tau so that the target_actor parameter moves slightly toward the actor parameter.
        for name in actor_params_dict:
            actor_params_dict[name] = tau * actor_params_dict[name].clone() + \
                                      (1 - tau) * target_actor_params_dict[name].clone()
            
        # 43. Load the blended parameters as the target_actor parameters.
        self.target_actor.load_state_dict(actor_params_dict)

    """
    # ===================== Continue from here next time ============================
    def save_models(self):
        self.actor.save_checkpoint()
        self.critic.save_checkpoint()        
        self.target_actor.save_checkpoint()
        self.target_critic.save_checkpoint()

    def load_models(self):
        self.actor.load_checkpoint()
        self.critic.load_checkpoint()        
        self.target_actor.load_checkpoint()
        self.target_critic.load_checkpoint() 
    """
# 2. Create an instance of the agent class
agent = AgentDDPG(alpha=0.01, beta=0.01, gamma=0.99, tau=0.01,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=256, layer2_size=256, batch_size=64)

#env = gym.make("HalfCheetah-v4", render_mode= 'human')
env = gym.make("HalfCheetah-v4", render_mode='depth_array')

EPISODES = 1001# episodes
STEPS = 400    # steps
DELAY_TIME = 0.00 # sec

total_rewards = []
actor_losses = []
critic_losses = []
for episode in range(EPISODES):
    obs = env.reset()
    obs = T.tensor(obs[0], dtype=T.float)
    # tensor([ 0.0040,  0.0199, -0.0622,  0.0594, -0.0605,  0.0577, -0.0056,  0.0333,        -0.0072,  0.0532, -0.0512,  0.0173, -0.0529, -0.1104,  0.0946, -0.0559,         0.0824])
    #print(type(obs))
    # observation_space :  Box(-inf, inf, (17,), float64)
    #print('observation_space : ', env.observation_space)
    #print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(STEPS):
        env.render()
        
        # Replace this part with DDPG
        action = agent.choose_action(obs) # 1. Define the Agent class
        #action :  [ 0.06660474 -0.11753064  0.02527559  0.06465236  0.1050786   0.05048539]
        #print('==== OK up to here 4 ====')
        #print('action_space : ', env.action_space)
        #print('action : ', action)

        next_state, reward, done, _, info = env.step(action)
        #print('next_state, reward, done, _, info :', next_state, reward, done, _, info)
        """
        action :  [ 0.06660474 -0.11753064  0.02527559  0.06465236  0.1050786   0.05048539]

        next_state, reward, done, _, info : 
        [-0.00265179  0.0229547   0.00463243 -0.04729936 -0.00959038  0.04734605
        0.03672746  0.02857842  0.09980254 -0.32065693  0.04221647  1.58668951
        -2.31089174  1.30338924 -0.25465526  1.08250465 -0.14134398]
        0.07553858359316026
        False
        False
        {'x_position': -0.09233384215910741, 'x_velocity': 0.07920445513883267, 'reward_run': 0.07920445513883267, 'reward_ctrl': -0.0036658715456724168}
        """
        #print('==== OK up to here 5 ====')

        #7. Save the trajectory. Experience replay (ReplayBuffer)
        agent.remember(obs, action, reward, next_state, int(done))

        #12. Train the neural networks
        agent.learn()
        
        # 26. Accumulate the reward within the episode
        total_reward += reward
        
        # 27. Continue with next_state as obs
        #print('next_state:', next_state)
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)
        # 28. Sleep so the cheetah's movement can be watched
        time.sleep(DELAY_TIME)

    #print('total_reward : ', total_reward)
    total_rewards.append(total_reward)

    actor_losses.append(float(agent.actor_loss)) 
    critic_losses.append(float(agent.critic_loss)) 

    # print('epsisode', i, 'score %.2f' % score, '100 game sverage %.2f' % np.mean(score_history[-100:]))
 
    # 47. Save the parameters of each neural network every 10 episodes
    print('episode, total_reward : ', episode , total_reward)
    if episode % 10 == 0:
        T.save(agent.actor.state_dict(), 'actor_params.pt')
        T.save(agent.critic.state_dict(), 'critic_params.pt')
        T.save(agent.target_actor.state_dict(), 'target_actor_params.pt')
        T.save(agent.target_critic.state_dict(), 'target_critic_params.pt')
        print('==== params were saved. ====')

#print('total_rewards : ', total_rewards)
#plt.plot(total_rewards)

#plt.plot(actor_losses, label='actor_losses')
#plt.plot(critic_losses, label='critic_losses')
plt.plot(total_rewards, label='total_rewards')
plt.legend()
plt.grid(True)
plt.ioff()
plt.show()

env.close() # (it's empty, though...)

print('script is done.')

# https://gymnasium.farama.org/
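
One more note on speed: step 25 of learn() builds the TD targets in a Python loop over the batch. The same quantity can be computed with tensor operations in a single expression. The following is a sketch with small stand-in tensors, not a change the original script makes; the variable names mirror those in learn(), and the loop here uses .item() for simplicity rather than reproducing step 25 exactly.

```python
import torch as T

batch_size = 4
gamma = 0.99

# Stand-ins for the tensors used in learn()
rewards = T.tensor([1.0, 0.5, -0.2, 2.0])                       # shape (batch,)
terminals = T.tensor([1.0, 1.0, 0.0, 1.0])                      # 0 where the episode ended
target_critic_values = T.tensor([[0.3], [0.1], [0.7], [-0.4]])  # shape (batch, 1)

# Loop version, in the spirit of step 25
td_loop = []
for i in range(batch_size):
    td = rewards[i] + gamma * target_critic_values[i] * terminals[i]
    td_loop.append(td.item())
td_loop = T.tensor(td_loop, dtype=T.float32).view(batch_size, 1)

# Vectorized version: reshape the 1-D tensors to (batch, 1) and broadcast.
# detach() keeps the target out of the gradient graph; the loop version ends up
# without gradient history as well, because it rebuilds the tensor from plain values.
td_vec = (rewards.view(-1, 1)
          + gamma * target_critic_values.detach() * terminals.view(-1, 1))

print(T.allclose(td_loop, td_vec))
```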


def choose_action(self, obs):

#print('AgentDDPG.choose_action is working.')

# 4.方策（アクター）はニューラルネットワークで表現する。

# ActorNNクラスを新規作成し、インスタンスactorとして使用する。

action = self.actor.forward(obs)

# 46.行動ノイズを入れて探索性を向上させる。

action += T.tensor(self.noise(), dtype=T.float32)

action = action.detach().numpy()

return action

# 8.remenberメソドを追加

def remember(self, obs, action, reward, next_state, done):

self.memory.store_transition(obs, action, reward, next_state, done)

# 13.learnメソドを追加

def learn(self):

# 14.バッチサイズ分のトランジションが集まるまでは何も実行しない。

if self.memory.memory_count < self.batch_size:

return

# 15.メモリバッファからデータを抜き出す sample_buffer()

# バッチ化されているので変数名を複数形にする

observations, actions, rewards, next_states, terminals = self.memory.sample_buffer(self.batch_size)

#print('s:', observations)

#print(observations.shape)

#print('a :', actions)

#print('r :', rewards)

#print('s_ :', next_states)

#print('terminal :', terminals)

# 17.抜き出したデータをpytorchで微分可能なようにtorch.tensor化する

observations = T.tensor(observations, dtype=T.float32)

actions = T.tensor(actions, dtype=T.float32)

rewards = T.tensor(rewards, dtype=T.float32)

next_states = T.tensor(next_states, dtype=T.float32)

terminals = T.tensor(terminals, dtype=T.float32)

# 18.ターゲットアクターネットワークインスタンスtarget_actorに

# 次の状態next_satesを入れて、ターゲットアクションtarget_actionsとして取り出す。

# このターゲットネットワークはターゲットでないネットワークとNNパラメータを共有させる。

#print('next_states :', next_states)

target_actions = self.target_actor.forward(next_states)

# 20.ターゲットクリティックネットワークインスタンスtarget_criticに

# 次の状態next_statesと上記より算出したターゲットアクションの２つを入力して

# 価値関数の推定値ターゲットバリューを出力する。

# TDターゲット：r + γ*V(w)[s_t+1] の部分のこと。

# ターゲットクリティックバリューはターゲットアクターネットワークを使う

target_critic_values = self.target_critic.forward(next_states, target_actions)

# 23.ベースラインとして機能するクリティックネットワーク（価値関数V(w)[s_t]ネットワーク）に

# 現在の状態observationsと行動actionsを入力して

# クリティックバリューを算出する

critic_values = self.critic.forward(observations, actions)

# 25.TDターゲットを算出する：r + γ*V(w)[s_t+1]

td_targets = []

for i in range(self.batch_size):

td_target = rewards[i] + self.gamma * target_critic_values[i] * terminals[i]

td_targets.append(td_target)

# TDターゲットの形をバッチに整える

td_targets = T.tensor(td_targets, dtype=T.float32)

td_targets = td_targets.view(self.batch_size, 1) #viewはreshapeと同じ。64x1に見え方を変更した、という意味

#print('td_targets :', td_targets)

#ここから次回はやっていこう 2023/5/16

# ==== （１）クリティックの学習 ====

#self.critic.train()

# 28.クリティックの勾配をゼロに初期化する

self.critic.optimizer.zero_grad()

# 29. TDターゲットと状態価値の二乗誤差を算出して、クリティックの損失関数とする。バッチサイズは６４個

critic_loss = F.mse_loss(td_targets, critic_values)

self.critic_loss = critic_loss

#print('critic_loss : ', critic_loss) # tensor(0.0485, grad_fn=<MseLossBackward0>)

# 30. クリティックの損失関数を微分して、勾配を算出する

critic_loss.backward()

# 31. 勾配からオプティマイザーによってクリティックのパラメータ（重みとバイアス）を更新する

self.critic.optimizer.step()

# self.critic.eval()

# ==== （２）アクターの学習 ====

# 32. アクターの勾配をゼロに初期化する

self.actor.optimizer.zero_grad()

# 33. アクターに観測情報を入力して行動を算出する。バッチサイズは６４個

predicted_actions = self.actor.forward(observations)

#self.actor.train()

# 34.アクターの損失関数を算出する

# Actorの目的は、Criticネットワークの出力（行動価値）を最大化するような行動を選択すること。

# なので、actorNN→criticNNのDDPG構造全体の出力結果をactor_lossとして、actorNNとcriticNNの両方をbackwardし、

# actorだけをパラメータ更新することによりactorの学習をすることができる。

actor_loss = -self.critic.forward(observations, predicted_actions)

actor_loss = T.mean(actor_loss)

self.actor_loss = actor_loss

#print(f'actor_loss: {actor_loss}, critic_loss: {critic_loss}')

# 35. DDPG構造全体の損失関数actor_lossを微分し、勾配を算出する

actor_loss.backward()

# 36. 勾配からオプティマイザーによってアクターのパラメータだけを（重みとバイアス）を更新する

self.actor.optimizer.step()

# 37. 全ニューラルネットワークのパラメータを更新する。

self.update_network_parameters()

# 37. パラメータ更新メソド。

def update_network_parameters(self, tau=None):

if tau is None:

tau = self.tau

# 38. actor, critic, target_actor, target_criticのネットワーク内の全てのパラメータ（重みとバイアス）とその名前を取得する

# actorとcriticは先ほど更新されたばかりのパラメーター

actor_params = self.actor.named_parameters()

critic_params = self.critic.named_parameters()

target_actor_params = self.target_actor.named_parameters()

target_critic_params = self.target_critic.named_parameters()

#print('actor_params : ', actor_params) # actor_params : <generator object Module.named_parameters at 0x000001661B2D9D48>

# 39. パラメータをディクショナリとして取り出す。

actor_params_dict = dict(actor_params)

critic_params_dict = dict(critic_params)

target_actor_params_dict = dict(target_actor_params)

target_critic_params_dict = dict(target_critic_params)

#print('actor_params_dict : ', actor_params_dict)

#print(actor_params_dict.keys())

"""

actor_params_dict : {'fc1.weight': Parameter containing:

tensor([[-0.1895, -0.0343, 0.1138, ..., 0.2157, 0.0527, -0.1173],/

dict_keys(['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias'])

"""

# 40. クリティックの各パラメーター毎に更新重みtau=0.0001の分だけほんの少しcriticパラメータをtarget_criticパラメータに近づける。

for name in critic_params_dict:

critic_params_dict[name] = tau * critic_params_dict[name].clone() + \

(1-tau) * target_critic_params_dict[name].clone()

# 41. 更新したcriticパラメータをtarget_criticのパラメータとしてロードする。

self.target_critic.load_state_dict(critic_params_dict)

# 42.アクターの各パラメーター毎に更新重みtau=0.0001の分だけほんの少しactorパラメータをtarget_actorパラメータに近づける。

for name in actor_params_dict:

actor_params_dict[name] = tau * actor_params_dict[name].clone() + \

(1 - tau) * target_actor_params_dict[name].clone()

# 43. 更新したactorパラメータをtarget_actorのパラメータとしてロードする。

self.target_actor.load_state_dict(actor_params_dict)

"""

# =====================次はここから============================

def save_models(self):

self.actor.save_checkpoint()

self.critic.save_checkpoint()

self.target_actor.save_checkpoint()

self.target_critic.save_checkpoint()

def load_models(self):

self.actor.load_checkpoint()

self.critic.load_checkpoint()

self.target_actor.load_checkpoint()

self.target_critic.load_checkpoint()

"""

# 2.エージェントクラスのインスタンスを生成する

agent = AgentDDPG(alpha=0.01, beta=0.01, gamma=0.99, tau=0.01,

n_obs_space=17 , n_action_space=6, n_state_action_value=1,

layer1_size=256, layer2_size=256, batch_size=64)

#env = gym.make("HalfCheetah-v4", render_mode= 'human')

env = gym.make("HalfCheetah-v4", render_mode='depth_array')

EPISODES = 1001# episodes

STEPS = 400 # steps

DELAY_TIME = 0.00 # sec

total_rewards = []

actor_losses = []

critic_losses = []

for episode in range(EPISODES):

obs = env.reset()

obs = T.tensor(obs[0], dtype=T.float)

# tensor([ 0.0040, 0.0199, -0.0622, 0.0594, -0.0605, 0.0577, -0.0056, 0.0333, -0.0072, 0.0532, -0.0512, 0.0173, -0.0529, -0.1104, 0.0946, -0.0559, 0.0824])

#print(type(obs))

# observation_space : Box(-inf, inf, (17,), float64)

#print('observation_space : ', env.observation_space)

#print('obs :', obs)

reward: float = 0

total_reward: float = 0

done: bool = False

for j in range(STEPS):

env.render()

# ここをDDPGに置き換えていく

action = agent.choose_action(obs) # 1.Agentクラスを定義していく

#action : [ 0.06660474 -0.11753064 0.02527559 0.06465236 0.1050786 0.05048539]

#print('====ここまではOK4====')

#print('action_space : ', env.action_space)

#print('action : ', action)

next_state, reward, done, _, info = env.step(action)

#print('next_state, reward, done, _, info :', next_state, reward, done, _, info)

"""

action : [ 0.06660474 -0.11753064 0.02527559 0.06465236 0.1050786 0.05048539]

next_state, reward, done, _, info :

[-0.00265179 0.0229547 0.00463243 -0.04729936 -0.00959038 0.04734605

0.03672746 0.02857842 0.09980254 -0.32065693 0.04221647 1.58668951

-2.31089174 1.30338924 -0.25465526 1.08250465 -0.14134398]

0.07553858359316026

False

{'x_position': -0.09233384215910741, 'x_velocity': 0.07920445513883267, 'reward_run': 0.07920445513883267, 'reward_ctrl': -0.0036658715456724168}

"""

#print('====ここまではOK5====')

#7. トラジェクトを保存する。経験再生(ReplayBuffer)

agent.remember(obs, action, reward, next_state, int(done))

#12. ニューラルネットワークを学習する

agent.learn()

# 26.エピソード内での報酬を累積していく

total_reward += reward

# 27. next_stateをobsとして再出発する

#print('next_state:', next_state)

obs = next_state

obs = T.tensor(obs, dtype=T.float)

# 28. チーターの動きを見たいのでスリープを入れる

time.sleep(DELAY_TIME)

#print('total_reward : ', total_reward)

total_rewards.append(total_reward)

actor_losses.append(float(agent.actor_loss))

critic_losses.append(float(agent.critic_loss))

# print('epsisode', i, 'score %.2f' % score, '100 game sverage %.2f' % np.mean(score_history[-100:]))

# 47. 各ニューラルネットワークのパラメータを１０エピソード毎に保存する

print('episode, total_reward : ', episode , total_reward)

if episode % 10 == 0:

T.save(agent.actor.state_dict(), 'actor_params.pt')

T.save(agent.critic.state_dict(), 'critic_params.pt')

T.save(agent.target_actor.state_dict(), 'target_actor_params.pt')

T.save(agent.target_critic.state_dict(), 'target_critic_params.pt')

print('==== params were saved. ====')

#print('total_rewards : ', total_rewards)

#plt.plot(total_rewards)

#plt.plot(actor_losses, label='actor_losses')

#plt.plot(critic_losses, label='critic_losses')

plt.plot(total_rewards, label='total_rewards')

plt.legend()

plt.grid(True)

plt.ioff()

plt.show()

env.close() # 空なんですけど・・・

print('script is done.')

# https://gymnasium.farama.org/

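DDPG by gymnasium: Day 8

This time I try the neural network structure that ChatGPT suggested.

For actorNN, I increased the hidden-layer nodes from 64 to 256.

For criticNN, I also increased the hidden-layer nodes from 64 to 256.

The actorNN activation functions stay the same: relu, relu, then tanh on the output.

The criticNN activation functions are relu, relu, with no activation on the final layer. This is also unchanged.

So the basic structure was apparently not bad to begin with.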

# 6. Create the ActorNN class
class ActorNN(nn.Module):
    def __init__(self, alpha=0.000025, n_obs_space=17, n_action_space=6,
                 layer1_size=256, layer2_size=256, batch_size=64):
        #print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(n_obs_space, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, n_action_space)

        # 26. Use Adam as the optimizer
        self.optimizer = optim.Adam(self.parameters(), lr=alpha)

    def forward(self, obs):
        #print('AgentDDPG.ActorNN.forward is working')
        #print('==== OK up to here 1 ====')
        x = self.fc1(obs)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        mu = F.tanh(x)
        #print('action μ:', mu)
        #print('==== OK up to here 2 ====')

        # Add noise later if needed: action = mu + noise
        action = mu
        return action


# 22. Create the CriticNN class
class CriticNN(nn.Module):

    def __init__(self, beta=0.000025, n_obs_space=17, n_action_space=6,
                 layer1_size=256, layer2_size=256, batch_size=64):
        #print('CriticNN.__init__ is working.')
        super(CriticNN, self).__init__()

        # The critic NN takes both the observation and the action as its input
        input_dim = n_obs_space + n_action_space
        self.fc1 = nn.Linear(input_dim, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, 1) # a single output is enough

        # 27. Use Adam as the optimizer
        self.optimizer = optim.Adam(self.parameters(), lr=beta)

    def forward(self, obs, action):
        input_data = T.cat([obs, action], dim=1)
        x = self.fc1(input_data)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x # outputs one value for the (state, action) pair
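
To get a feel for how much bigger the networks become when the hidden layers go from 64 to 256 nodes, here is a small sketch that counts weights and biases. The helper name mlp_param_count is my own and not part of the script; it only assumes that each nn.Linear(n_in, n_out) holds n_in * n_out weights plus n_out biases.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    Each linear layer (n_in -> n_out) holds n_in * n_out weights and n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


# Actor: 17 observations -> hidden -> hidden -> 6 actions
actor_small = mlp_param_count([17, 64, 64, 6])
actor_large = mlp_param_count([17, 256, 256, 6])

# Critic: (17 observations + 6 actions) -> hidden -> hidden -> 1 value
critic_small = mlp_param_count([17 + 6, 64, 64, 1])
critic_large = mlp_param_count([17 + 6, 256, 256, 1])

print('actor :', actor_small, '->', actor_large)
print('critic:', critic_small, '->', critic_large)
```

The actor grows from 5,702 to 71,942 parameters and the critic from 5,761 to 72,193, so each network becomes roughly 12 times larger.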


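actor_losses is still trending upward.

Next, I increase the number of steps from 10 to 50.

It seems to have learned to lean forward, and it occasionally flips over.

However, actor_losses and critic_losses are still trending upward. Even so, it does look like it is trying hard to move forward. Could a sign be flipped somewhere?

At this point I add noise to the actions to increase exploration of the environment and see whether learning moves in a better direction.

In DDPG, the OUActionNoise class generates noise for the actions based on an Ornstein-Uhlenbeck process. This noise is added to the action to increase exploration of the environment.

When the class is instantiated, you specify the mean (mu), the noise scale (sigma), the time step width (dt), the mean-reversion coefficient (theta), and the initial value (x0). The __call__ method generates and returns the noise.

During DDPG training, the noise from OUActionNoise is added to the action produced by the Actor network to increase exploration. This helps balance exploration against convergence and encourages the search for a better policy.

44. Create the OUActionNoise class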

# 44. Create the OUActionNoise class
class OUActionNoise(object):
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
        self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
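
To see what the update in __call__ does, here is a scalar sketch of the same step written with the standard library only. The function names ou_step and simulate are hypothetical and not part of the script; the defaults are copied from the class above. With sigma set to 0 only the pull toward mu remains, and with noise each sample stays close to the previous one, which is what makes this noise correlated in time.

```python
import math
import random


def ou_step(x_prev, mu=0.0, sigma=0.15, theta=0.2, dt=1e-2, rng=random):
    """One Euler step of a scalar Ornstein-Uhlenbeck process."""
    drift = theta * (mu - x_prev) * dt                       # pulls x back toward mu
    diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # fresh Gaussian kick
    return x_prev + drift + diffusion


def simulate(x0, n_steps, **kwargs):
    """Apply ou_step n_steps times starting from x0 and return the final value."""
    x = x0
    for _ in range(n_steps):
        x = ou_step(x, **kwargs)
    return x


# With sigma=0 only the drift remains, so x decays geometrically toward mu.
x_no_noise = simulate(1.0, 1000, sigma=0.0)

# With noise, consecutive samples differ only by a small step.
rng = random.Random(0)
x, largest_jump = 0.0, 0.0
for _ in range(1000):
    x_next = ou_step(x, rng=rng)
    largest_jump = max(largest_jump, abs(x_next - x))
    x = x_next

print('after 1000 noiseless steps:', x_no_noise)
print('largest single-step change:', largest_jump)
```

Note that with theta=0.2 and dt=1e-2 the pull back toward mu is weak: each step removes only 0.2% of the distance to mu.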


Add this to the AgentDDPG class.

        # 45. Instantiate the action noise
        self.noise = OUActionNoise(mu=np.zeros(n_action_space))

Then add the noise to the action inside the choose_action method.

    def choose_action(self, obs):
        #print('AgentDDPG.choose_action is working.')
        # 4. The policy (actor) is represented by a neural network.
        #   Create the ActorNN class and use it as the instance actor.
        action = self.actor.forward(obs)

        # 46. Add action noise to improve exploration.
        action += T.tensor(self.noise(), dtype=T.float32)

        action = action.detach().numpy()

        return action


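Whether for the actor or the critic, the target is always moving, so I am starting to think the loss does not necessarily get smaller.

A motion of hopping forward has started to appear. Perhaps that is thanks to the noise.

EPISODES = 1000 # episodes

STEPS = 100 # steps

How does it do with these settings?

After about 30,000 steps, the half cheetah started shooting forward within the first second of an episode.

However, it does not shoot forward consistently. Runs where it fumbles a little and then advances leaning forward are mixed in. Even so, it no longer moves backward right after the start, so it is definitely improving.

The neural network parameter updates appear to be working.

The replay buffer size is still only 1000. Would a larger one be better? I suspect that if it is too large, old transitions take a long time to be replaced and learning slows down.

How should I decide the best buffer size for a batch size of 64? That remains an open question.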
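
To make the buffer-size question a little more concrete, here is a standalone sketch of the overwrite behavior using only the standard library. The RingBuffer class is a hypothetical stand-in for the ReplayBuffer in the script, reduced to the modulo index and uniform sampling; it is meant as an illustration, not a replacement.

```python
import random


class RingBuffer:
    """Fixed-size buffer that overwrites its oldest entry once it is full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = [None] * max_size
        self.count = 0  # total number of items ever stored

    def store(self, item):
        index = self.count % self.max_size  # wraps around to the oldest slot
        self.data[index] = item
        self.count += 1

    def sample(self, batch_size, rng=random):
        filled = min(self.count, self.max_size)
        # Sampling with replacement, as np.random.choice(max_index, batch_size) does
        return [self.data[rng.randrange(filled)] for _ in range(batch_size)]


buffer = RingBuffer(max_size=1000)
steps_per_episode = 100
for step in range(15 * steps_per_episode):  # 15 episodes of 100 steps each
    buffer.store(step)

oldest_step_kept = min(buffer.data)
episodes_covered = buffer.max_size // steps_per_episode
batch = buffer.sample(64)

print('oldest step still in the buffer:', oldest_step_kept)
print('episodes the buffer can hold   :', episodes_covered)
```

With a size of 1000 and 100 steps per episode, the buffer only ever holds the most recent 10 episodes, so every batch of 64 is drawn from fairly fresh experience. A larger buffer keeps older episodes in the sampling pool for longer.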

Next time, I will look at saving and loading the neural network model parameters.

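Previously

From the data the actor collected by acting, I used experience replay so that target_actor outputs the "next action" from the "next state", and target_critic outputs the "next value" (the value of the next state-action pair) from the "next action" and the "next state". From these I computed the TD target.

Meanwhile, the critic used experience replay to compute the "current value", also called the baseline, from the "current state" and the "action taken at that time".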
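
As a quick reminder of the arithmetic, the TD target can be sketched with plain lists. The function name td_targets and the numbers are made up for illustration; next_values stands in for the outputs of target_critic, and not_done plays the role of the terminals array in the script, which stores 1 - done.

```python
def td_targets(rewards, next_values, not_done, gamma=0.99):
    """Return r + gamma * Q_target(s', a') * not_done for each transition.

    not_done is 1.0 while the episode continues and 0.0 on a terminal step,
    which removes the bootstrap term when there is no next state to value.
    """
    return [r + gamma * q * m for r, q, m in zip(rewards, next_values, not_done)]


# Hypothetical batch of three transitions; the last one ends the episode.
rewards = [0.5, -0.2, 1.0]
next_values = [2.0, 1.0, 3.0]   # stand-ins for target_critic outputs
not_done = [1.0, 1.0, 0.0]

targets = td_targets(rewards, next_values, not_done)
print(targets)
```

For the terminal transition the target is just the reward, because the multiplication by 0 drops the value of the next state.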

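This time

From here on, I do the learning and training in the real sense, that is, the parameter updates.

Defining the optimizer

Define the optimizer in __init__() of the ActorNN and CriticNN classes.

[ActorNN] #26

self.optimizer = optim.Adam(self.parameters(), lr=alpha)

[CriticNN] #27

self.optimizer = optim.Adam(self.parameters(), lr=beta)

The argument self.parameters() is the set of weights and biases that the model itself holds. self.optimizer is an instance that optimizes them (minimizes the loss function) with Adam at learning rate lr.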
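
To show what "optimizing with Adam at learning rate lr" means for one parameter, here is a sketch of the standard Adam update rule for a single float. The ScalarAdam class is hypothetical and written only for illustration; optim.Adam applies the same kind of update to every tensor returned by self.parameters(). The beta1, beta2, and eps values are the commonly used defaults.

```python
import math


class ScalarAdam:
    """Adam for a single float parameter, following the standard update rule."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients
        self.v = 0.0  # running mean of squared gradients
        self.t = 0    # step counter used for bias correction

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


# Minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
optimizer = ScalarAdam(lr=0.01)
for _ in range(3000):
    grad = 2.0 * (w - 3.0)
    w = optimizer.step(w, grad)

print('w after training:', w)
```

Early on each step moves w by about lr regardless of how large the gradient is, which is one reason the learning rates alpha and beta matter so much in this script.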

Training the critic

I continue inside the learn() method, right after the TD target computation.

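# ==== (1) Training the critic ====

# 28. Reset the critic's gradients to zero
self.critic.optimizer.zero_grad()

# 29. Compute the mean squared error between the TD targets and the critic values and use it as the critic loss. The batch holds 64 samples.
critic_loss = F.mse_loss(td_targets, critic_values)

# 30. Differentiate the critic loss to compute the gradients
critic_loss.backward()

# 31. Let the optimizer update the critic's parameters (weights and biases) from the gradients
self.critic.optimizer.step()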


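The critic loss is printed in the form tensor(0.0485, grad_fn=<MseLossBackward0>).

Training the actor

Next, train the actor.

The actor's goal is to choose actions that maximize the output of the Critic network (the action value).

So the negated output of the whole actorNN → criticNN DDPG structure is used as actor_loss. Backpropagating through both criticNN and actorNN carries the gradient information back to the actor, so its parameters can be updated.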

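# ==== (2) Training the actor ====

# 32. Reset the actor's gradients to zero
self.actor.optimizer.zero_grad()

# 33. Feed the observations into the actor to compute actions. The batch holds 64 samples.
predicted_actions = self.actor.forward(observations)

#self.actor.train()

# 34. Compute the actor loss
#    The actor's goal is to choose actions that maximize the output of the Critic network (the action value).
#    So the negated output of the whole actorNN → criticNN DDPG structure is used as actor_loss,
#    the gradient is backpropagated through both actorNN and criticNN,
#    and only the actor's parameters are updated. That is how the actor learns.

actor_loss = -self.critic.forward(observations, predicted_actions)
actor_loss = T.mean(actor_loss)

# 35. Differentiate actor_loss, the loss of the whole DDPG structure, to compute the gradients
actor_loss.backward()

# 36. Let the optimizer update only the actor's parameters (weights and biases) from the gradients
self.actor.optimizer.step()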
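
The idea of "backpropagate through the critic, but update only the actor" can be checked by hand on a toy problem. Everything in this sketch is made up for illustration: the critic is a fixed function that prefers the action 2 * state, the actor has a single parameter w with action = w * state, and the chain rule is written out explicitly instead of calling backward().

```python
def critic(state, action):
    """Frozen stand-in critic: it rates the action 2 * state highest."""
    return -(action - 2.0 * state) ** 2


def critic_grad_wrt_action(state, action):
    """dQ/da for the stand-in critic above."""
    return -2.0 * (action - 2.0 * state)


states = [0.5, 1.0, -1.0]
w = 0.0      # the actor's only parameter: action = w * state
lr = 0.1

for _ in range(200):
    grad_w = 0.0
    for s in states:
        a = w * s                                  # actor forward pass
        dloss_da = -critic_grad_wrt_action(s, a)   # loss = -Q, so flip the sign
        da_dw = s                                  # derivative of the actor itself
        grad_w += dloss_da * da_dw                 # chain rule through the critic
    grad_w /= len(states)
    w -= lr * grad_w                               # update the actor only

mean_q = sum(critic(s, w * s) for s in states) / len(states)
print('actor parameter w:', w)
print('mean critic value:', mean_q)
```

The critic itself is never changed in the loop, yet the actor's parameter ends up at the value the critic rates highest. In the real script the same thing happens because only self.actor.optimizer.step() is called after actor_loss.backward().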


Updating the parameters of all the neural networks

To finish the learn() method, call self.update_network_parameters() right after step 36, and define it as a method.

    # 37. Parameter update method.
    def update_network_parameters(self, tau=None):
        if tau is None:
            tau = self.tau

        # 38. Get every parameter (weights and biases) of the actor, critic, target_actor and target_critic networks together with its name
        # actor and critic hold the parameters that were just updated
        actor_params = self.actor.named_parameters()
        critic_params = self.critic.named_parameters()
        target_actor_params = self.target_actor.named_parameters()
        target_critic_params = self.target_critic.named_parameters()
        print('actor_params : ', actor_params) # actor_params :  <generator object Module.named_parameters at 0x000001661B2D9D48>

        # 39. Convert the parameters into dictionaries.
        actor_params_dict = dict(actor_params)
        critic_params_dict = dict(critic_params)
        target_actor_params_dict = dict(target_actor_params)
        target_critic_params_dict = dict(target_critic_params)
        print('actor_params_dict : ', actor_params_dict)
        print(actor_params_dict.keys())
        """
        actor_params_dict :  {'fc1.weight': Parameter containing:
                                   tensor([[-0.1895, -0.0343,  0.1138,  ...,  0.2157,  0.0527, -0.1173],/

        dict_keys(['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias'])
        """

        # 40. For each critic parameter, blend with the update weight tau so that the target_critic parameter moves just slightly toward the critic parameter.
        for name in critic_params_dict:
            critic_params_dict[name] = tau * critic_params_dict[name].clone() + \
                                       (1-tau) * target_critic_params_dict[name].clone()

        # 41. Load the blended parameters into target_critic.
        self.target_critic.load_state_dict(critic_params_dict)

        # 42. For each actor parameter, blend with the update weight tau so that the target_actor parameter moves just slightly toward the actor parameter.
        for name in actor_params_dict:
            actor_params_dict[name] = tau * actor_params_dict[name].clone() + \
                                      (1 - tau) * target_actor_params_dict[name].clone()

        # 43. Load the blended parameters into target_actor.
        self.target_actor.load_state_dict(actor_params_dict)
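
The blending in steps 40 and 42 is easier to follow with one number per parameter. The soft_update function and the parameter values in this sketch are hypothetical; it uses plain dictionaries of floats instead of tensors, but the formula is the same as in the method above.

```python
def soft_update(online, target, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return {name: tau * online[name] + (1.0 - tau) * target[name] for name in online}


# Hypothetical one-number-per-parameter networks, for illustration only.
online = {'fc1.weight': 1.0, 'fc1.bias': -2.0}
target = {'fc1.weight': 0.0, 'fc1.bias': 0.0}

tau = 0.001
one_step = soft_update(online, target, tau)

# Repeating the update moves the target a fraction tau closer each time,
# so the remaining gap shrinks by a factor (1 - tau) per step.
for _ in range(1000):
    target = soft_update(online, target, tau)
remaining_gap = online['fc1.weight'] - target['fc1.weight']

print('after one update   :', one_step)
print('gap after 1000 steps:', remaining_gap)
```

With tau=0.001, even 1000 updates against a fixed online network leave about 37% of the original gap, which shows how slowly the target networks follow. A larger tau makes the targets track faster but also makes the TD target less stable.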


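Checking the training results

I first ran it with only the critic and target critic parameter updates enabled, without updating the actor and target actor parameters.

It was meant only as a sanity check, but learning seems to be progressing.

After training for 10 steps x 100 episodes, the half cheetah acquired a behavior of toppling forward right after the episode starts, and it began earning reward that way. With 10 steps, it seems impossible to acquire a sustained running motion.

Next, I added the actor and target actor parameter updates and did the same thing.

This time it appears to earn reward by folding its legs into a low posture. However, the results get worse from around episode 80. Learning does not seem to be progressing well.

Well, the neural network structure is still thrown together rather casually, so there is room for improvement.

Also, plotting how actor_losses and critic_losses change shows that both are heading in the worsening direction.

Open issues

Faster computation (disabling print statements)
A suitable neural network structure
A suitable number of steps
A suitable number of episodes
Saving and loading parameters
Being able to resume after the program stops partway
Faster computation (using the GPU)
Real-time visualization of training progress

The script so far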

import gymnasium as gym
import time
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# 10. Create the ReplayBuffer class
class ReplayBuffer:
    def __init__(self, max_memory_size, n_obs_space, n_action_space):
        self.max_memory_size = max_memory_size
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.memory_count = 0

        self.state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.action_memory =  np.zeros((self.max_memory_size, self.n_action_space))
        self.reward_memory =  np.zeros(self.max_memory_size)
        self.next_state_memory =  np.zeros((self.max_memory_size, self.n_obs_space))
        self.terminal_memory =  np.zeros(self.max_memory_size)
        #self.terminal_memory = np.zeros(self.max_memory_size, dtype=np.bool)

    # 11. Add the store_transition method to save transitions
    def store_transition(self, obs, action, reward, next_state, done):
        #print('store_transition is working.')
        index = self.memory_count % self.max_memory_size # once the buffer is full, the oldest entries are overwritten first
        #print('obs.detach().numpy().flatten():',obs.detach().numpy().flatten())
        self.state_memory[index] = obs.detach().numpy().flatten()
        self.action_memory[index] = action.flatten()
        self.reward_memory[index] = reward.flatten()
        self.next_state_memory[index] = next_state.flatten()
        self.terminal_memory[index] = 1 - int(done) # stored as 0 when the episode has ended, 1 otherwise
        #print('state_memory :', self.state_memory)
        #print('action_memory :', self.action_memory)
        #print('reward_memory :', self.reward_memory)
        #print('next_state_memory :', self.next_state_memory)
        #print('memory.state_memory :', self.terminal_memory)

        #print('type of state_memory :', type(self.state_memory[0][0]))
        #print('type of action_memory :', type(self.action_memory[0][0]))
        #print('type of reward_memory :', type(self.reward_memory[0]))
        #print('type of next_state_memory :', type(self.next_state_memory[0][0]))
        #print('type of memory.state_memory :', type(self.terminal_memory[0]))

        self.memory_count += 1
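        # Use self here; the original read the global agent, which only works after agent has been created
        print('memory_count :', self.memory_count)

    # 16. Sample randomly from the buffer
    def sample_buffer(self, batch_size):
        # While the buffer is not yet full, sample only from the filled part.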
        max_index = min(self.max_memory_size, self.memory_count)
        choosed_index = np.random.choice(max_index, batch_size)
        
        observations = self.state_memory[choosed_index]
        actions = self.action_memory[choosed_index]
        rewards = self.reward_memory[choosed_index]
        next_states = self.next_state_memory[choosed_index]
        terminals = self.terminal_memory[choosed_index]

        return observations, actions, rewards, next_states, terminals


# 6. Create the ActorNN class
class ActorNN(nn.Module):
    def __init__(self, alpha=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64):
        #print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(n_obs_space, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, n_action_space)

        # 26. Use Adam as the optimizer
        self.optimizer = optim.Adam(self.parameters(), lr=alpha)

    def forward(self, obs):
        #print('AgentDDPG.ActorNN.forward is working')
        #print('==== OK up to here 1 ====')
        x = self.fc1(obs)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        mu = F.tanh(x)
        #print('action μ:', mu)
        #print('==== OK up to here 2 ====')

        # Add noise later if needed: action = mu + noise
        action = mu
        return action
        

# 22. Create the CriticNN class
class CriticNN(nn.Module):

    def __init__(self, beta=0.000025, n_obs_space=17, n_action_space=6,
                 layer1_size=64, layer2_size=64, batch_size=64):
        #print('CriticNN.__init__ is working.')
        super(CriticNN, self).__init__()

        # The critic NN takes both the observation and the action as its input
        input_dim = n_obs_space + n_action_space
        self.fc1 = nn.Linear(input_dim, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, 1) # a single output is enough

        # 27. Use Adam as the optimizer
        self.optimizer = optim.Adam(self.parameters(), lr=beta)

    def forward(self, obs, action):
        input_data = T.cat([obs, action], dim=1)
        x = self.fc1(input_data)
        x = F.relu(x)
        x =self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x # outputs one value for the (state, action) pair

# 3. Define the agent class
class AgentDDPG:

    def __init__(self, alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64):
        #print('AgentDDPG.__init__ is working.')
        # 5. Create an instance of the ActorNN class
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.tau = tau
        
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.n_state_action_value = n_state_action_value

        self.layer1_size = layer1_size
        self.layer2_size = layer2_size

        # 13. Fix the batch size
        self.batch_size = batch_size 

        self.actor = ActorNN(alpha=0.000025, n_obs_space=17, n_action_space=6,
                            layer1_size=64, layer2_size=64, batch_size=64)        
        
        # 9. Add the memory instance
        self.MAX_MEMORY_SIZE = 1000
        self.memory = ReplayBuffer(max_memory_size=self.MAX_MEMORY_SIZE,
                                   n_obs_space=self.n_obs_space,
                                   n_action_space=self.n_action_space)
        
        # 19. Create the target actor network instance target_actor
        # actor and target_actor can use the same ActorNN class
        self.target_actor = ActorNN(alpha=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64)

        
        # 21. Create the target critic network instance target_critic
        self.target_critic = CriticNN(beta=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64)
        
        # 24. Create the critic network instance critic.
        self.critic = CriticNN(beta=0.000025, n_obs_space=17, n_action_space=6,
                               layer1_size=64, layer2_size=64, batch_size=64)
        
        # Actor loss and critic loss
        self.actor_loss = 0
        self.critic_loss = 0
        

    def choose_action(self, obs):
        #print('AgentDDPG.choose_action is working.')
        # 4. The policy (actor) is represented by a neural network.
        #    Create the ActorNN class and use its instance, actor.
        action = self.actor.forward(obs)
        action = action.detach().numpy()
        #print('==== OK up to here 3 ====')
        return action
    
    # 8. Add the remember method
    def remember(self, obs, action, reward, next_state, done):
        self.memory.store_transition(obs, action, reward, next_state, done)

    # 13. Add the learn method
    def learn(self):
        # 14. Do nothing until a full batch of transitions has been collected.
        if self.memory.memory_count < self.batch_size:
            return

        # 15. Draw data from the memory buffer with sample_buffer()
        # The data is batched, so the variable names are plural
        observations, actions, rewards, next_states, terminals = self.memory.sample_buffer(self.batch_size)
        #print('s:', observations)
        #print(observations.shape)
        #print('a :', actions)
        #print('r :', rewards)
        #print('s_ :', next_states)
        #print('terminal :', terminals)

        # 17. Convert the sampled data to torch tensors so PyTorch can use them in differentiable computations
        observations = T.tensor(observations, dtype=T.float32)
        actions = T.tensor(actions, dtype=T.float32)
        rewards = T.tensor(rewards, dtype=T.float32)
        next_states = T.tensor(next_states, dtype=T.float32)
        terminals = T.tensor(terminals, dtype=T.float32)
       
        # 18. Feed the next states (next_states) into target_actor, the target actor network,
        # and take the output as target_actions.
        # This target network receives its parameters from the non-target network (via the update in step 37).
        #print('next_states :', next_states)
        target_actions = self.target_actor.forward(next_states)

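        # 20. Feed the next states (next_states) and the target actions computed above
        # into target_critic, the target critic network,
        # to get the target values, i.e. the value-function estimates.
        # This is the V(w)[s_t+1] part of the TD target r + γ*V(w)[s_t+1].
        # The target critic values rely on the target actor network.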
        target_critic_values = self.target_critic.forward(next_states, target_actions)

        
        # 23. Feed the current states (observations) and actions into the critic network,
        # which serves as the baseline (the value function V(w)[s_t] network),
        # to compute the critic values.
        critic_values = self.critic.forward(observations, actions)

        # 25. Compute the TD target: r + γ*V(w)[s_t+1]
        td_targets = []
        for i in range(self.batch_size):
            td_target = rewards[i] + self.gamma * target_critic_values[i] * terminals[i]
            td_targets.append(td_target)
        
        # Reshape the TD targets into a batch
        td_targets = T.tensor(td_targets, dtype=T.float32)
        td_targets = td_targets.view(self.batch_size, 1) # view works like reshape; the data is now viewed as 64x1
        #print('td_targets :', td_targets)


        # Pick up from here next time 2023/5/16
        # ==== (1) Train the critic ====

        #self.critic.train()

        # 28. Reset the critic's gradients to zero
        self.critic.optimizer.zero_grad()

        # 29. Use the mean squared error between the TD targets and the critic values as the critic loss (batch of 64).
        critic_loss = F.mse_loss(td_targets, critic_values)
        self.critic_loss = critic_loss
        #print('critic_loss : ', critic_loss) # tensor(0.0485, grad_fn=<MseLossBackward0>)

        # 30. Differentiate the critic loss to compute the gradients
        critic_loss.backward()

        # 31. Let the optimizer update the critic's parameters (weights and biases) from the gradients
        self.critic.optimizer.step()

        # self.critic.eval()


        # ==== (2) Train the actor ====

        # 32. Reset the actor's gradients to zero
        self.actor.optimizer.zero_grad()

        # 33. Feed the observations into the actor to compute actions (batch of 64).
        predicted_actions = self.actor.forward(observations)

        #self.actor.train()

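        # 34. Compute the actor's loss
        #    The actor's goal is to choose actions that maximize the critic network's output (the action value).
        #    So the negated output of the whole actorNN -> criticNN chain is used as actor_loss, backward() runs through both actorNN and criticNN,
        #    and only the actor's parameters are updated, which trains the actor.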

        actor_loss = -self.critic.forward(observations, predicted_actions)
        actor_loss = T.mean(actor_loss)
        self.actor_loss = actor_loss

        print(f'actor_loss: {actor_loss}, critic_loss: {critic_loss}')

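        # 35. Differentiate actor_loss, the loss of the whole DDPG chain, to compute the gradients
        actor_loss.backward()

        # 36. Let the optimizer update only the actor's parameters (weights and biases) from the gradients
        self.actor.optimizer.step()

        # 37. Update the parameters of the target networks.
        self.update_network_parameters()

    # 37. Parameter update method.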
    def update_network_parameters(self, tau=None):
        if tau is None:
            tau = self.tau
    
        # 38. Get every parameter (weights and biases) and its name from the actor, critic, target_actor, and target_critic networks
        # The actor and critic parameters are the ones that were just updated
        actor_params = self.actor.named_parameters()
        critic_params = self.critic.named_parameters()
        target_actor_params = self.target_actor.named_parameters()
        target_critic_params = self.target_critic.named_parameters()
        #print('actor_params : ', actor_params) # actor_params :  <generator object Module.named_parameters at 0x000001661B2D9D48>

        # 39. Collect the parameters into dictionaries.
        actor_params_dict = dict(actor_params)
        critic_params_dict = dict(critic_params)
        target_actor_params_dict = dict(target_actor_params)
        target_critic_params_dict = dict(target_critic_params)
        #print('actor_params_dict : ', actor_params_dict)
        #print(actor_params_dict.keys())
        """
        actor_params_dict :  {'fc1.weight': Parameter containing:
                                   tensor([[-0.1895, -0.0343,  0.1138,  ...,  0.2157,  0.0527, -0.1173],/

        dict_keys(['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias'])
        """

        # 40. For each critic parameter, blend with weight tau (0.001 by default) so that target_critic moves only slightly toward critic.
        for name in critic_params_dict:
            critic_params_dict[name] = tau * critic_params_dict[name].clone() + \
                                       (1-tau) * target_critic_params_dict[name].clone()
            
        # 41. Load the blended parameters into target_critic.
        self.target_critic.load_state_dict(critic_params_dict)

        # 42. For each actor parameter, blend with weight tau (0.001 by default) so that target_actor moves only slightly toward actor.
        for name in actor_params_dict:
            actor_params_dict[name] = tau * actor_params_dict[name].clone() + \
                                      (1 - tau) * target_actor_params_dict[name].clone()
            
        # 43. Load the blended parameters into target_actor.
        self.target_actor.load_state_dict(actor_params_dict)
    

# 2. Create an instance of the agent class
agent = AgentDDPG(alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64)

env = gym.make("HalfCheetah-v4", render_mode= 'human')

EPISODES = 100 # episodes
STEPS = 10    # steps
DELAY_TIME = 0.00 # sec


total_rewards = []
actor_losses = []
critic_losses = []
for episode in range(EPISODES):
    obs = env.reset()
    obs = T.tensor(obs[0], dtype=T.float)
    # tensor([ 0.0040,  0.0199, -0.0622,  0.0594, -0.0605,  0.0577, -0.0056,  0.0333,        -0.0072,  0.0532, -0.0512,  0.0173, -0.0529, -0.1104,  0.0946, -0.0559,         0.0824])
    #print(type(obs))
    # observation_space :  Box(-inf, inf, (17,), float64)
    #print('observation_space : ', env.observation_space)
    #print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(STEPS):
        env.render()
        
        # Replace this part with DDPG
        action = agent.choose_action(obs) # 1. Define the Agent class
        #action :  [ 0.06660474 -0.11753064  0.02527559  0.06465236  0.1050786   0.05048539]
        #print('==== OK up to here 4 ====')
        #print('action_space : ', env.action_space)
        #print('action : ', action)

        next_state, reward, done, _, info = env.step(action)
        #print('next_state, reward, done, _, info :', next_state, reward, done, _, info)
        """
        action :  [ 0.06660474 -0.11753064  0.02527559  0.06465236  0.1050786   0.05048539]

        next_state, reward, done, _, info : 
        [-0.00265179  0.0229547   0.00463243 -0.04729936 -0.00959038  0.04734605
        0.03672746  0.02857842  0.09980254 -0.32065693  0.04221647  1.58668951
        -2.31089174  1.30338924 -0.25465526  1.08250465 -0.14134398]
        0.07553858359316026
        False
        False
        {'x_position': -0.09233384215910741, 'x_velocity': 0.07920445513883267, 'reward_run': 0.07920445513883267, 'reward_ctrl': -0.0036658715456724168}
        """
        #print('==== OK up to here 5 ====')

        # 7. Store the transition for experience replay (ReplayBuffer)
        agent.remember(obs, action, reward, next_state, int(done))

        # 12. Train the neural networks
        agent.learn()
        
        # 26. Accumulate the reward within the episode
        total_reward += reward

        # 27. Continue with next_state as the new obs
        #print('next_state:', next_state)
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)
        # 28. Sleep briefly so we can watch the cheetah move
        time.sleep(DELAY_TIME)

    #print('total_reward : ', total_reward)
    total_rewards.append(total_reward)

    actor_losses.append(float(agent.actor_loss)) 
    critic_losses.append(float(agent.critic_loss)) 

#print('total_rewards : ', total_rewards)
# plt.plot(total_rewards)

plt.plot(actor_losses)
plt.plot(critic_losses)
#plt.plot(critic_losses.detach().numpy())
plt.show()

env.close() # it's empty, though...

print('script is done.')

# https://gymnasium.farama.org/
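
As an aside, the blending done in steps 40 to 43 is often called a soft (Polyak) update. Below is a minimal sketch of the same arithmetic written as a standalone helper. The names soft_update, online_net, and target_net are hypothetical and are not part of the script above; the sketch updates the target parameters in place under T.no_grad() instead of going through load_state_dict.

```python
import torch as T
import torch.nn as nn


def soft_update(online: nn.Module, target: nn.Module, tau: float) -> None:
    """Move each target parameter a fraction tau toward the online parameter."""
    with T.no_grad():
        for p_online, p_target in zip(online.parameters(), target.parameters()):
            # target <- tau * online + (1 - tau) * target
            p_target.mul_(1.0 - tau).add_(tau * p_online)


# Two small stand-in networks with the same layout (hypothetical example data).
online_net = nn.Linear(3, 2)
target_net = nn.Linear(3, 2)

before = target_net.weight.detach().clone()
soft_update(online_net, target_net, tau=0.001)

# What the blended weight should be, computed by hand for comparison.
expected = 0.001 * online_net.weight.detach() + 0.999 * before
```

With tau=0.001 the target network barely moves on each call, which is the point: the TD target changes slowly while the critic is being trained against it.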

That's all for today. Next time, let's revisit the neural network structure.

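What the code did up to last time:

The agent receives obs, and choose_action runs a forward pass of the actor (NN) to output an action.
env.step takes the action and returns next_state, reward, and done.
agent.remember stores the result: obs, action, reward, next_state, int(done).
Once remember has collected 64 transitions, agent.learn starts training.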

Learning:

sample_buffer draws 64 transitions at random.
They are converted to tensors with dtype=T.float32.
The 64 next_states are fed into target_actor to obtain 64 target_actions.
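
The three learning steps above can be tried in isolation. The following is a rough sketch, assuming a random array in place of the real replay buffer and a small nn.Sequential stand-in for target_actor. Both are made up for illustration; only the sizes 64, 17, and 6 come from the script.

```python
import numpy as np
import torch as T
import torch.nn as nn

BATCH_SIZE, N_OBS, N_ACTIONS = 64, 17, 6

# Stand-in for the replay buffer's next_state_memory (100 stored transitions).
rng = np.random.default_rng(0)
next_state_memory = rng.normal(size=(100, N_OBS))

# Pick 64 random row indices (with replacement), as sample_buffer does.
indices = rng.choice(100, BATCH_SIZE)

# NumPy stores float64; convert to a float32 tensor for the network.
next_states = T.tensor(next_state_memory[indices], dtype=T.float32)

# Stand-in for target_actor: 17 observations in, 6 actions in [-1, 1] out.
target_actor = nn.Sequential(
    nn.Linear(N_OBS, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS), nn.Tanh(),
)
target_actions = target_actor(next_states)
print(next_states.dtype, target_actions.shape)
```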

That is what we have built so far.

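This time

We pick up from feeding these target_actions, together with next_states, into target_critic.

This is the core of DDPG, the part that lets it handle continuous values, so it is worth understanding thoroughly.

We continue working inside the learn() method.

Let's get started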

Inside the AgentDDPG.learn() method, directly below

target_actions = self.target_actor.forward(next_states)

add the following line:

target_critic_values = self.target_critic.forward(next_states, target_actions)

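It takes the freshly computed target_actions and next_states as its two inputs and outputs target_critic_values.

A brief explanation: DDPG is a TD method, so we use r + γ*V(w)[s_t+1] as the TD target. target_critic_values is the V(w)[s_t+1] term of this expression.

This value function V is represented by the target_critic network.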
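
Written out for one sampled transition i, the TD target could be sketched as follows. The symbols y_i (TD target), m_i (the stored mask, 1 - done), μ' (target_actor), and w' (the target_critic weights) are notation assumed here for illustration and do not appear in the script.

```latex
y_i = r_i + \gamma \, m_i \, V_{w'}\!\left(s_{i+1},\, \mu'(s_{i+1})\right),
\qquad m_i = 1 - \mathrm{done}_i
```

When the episode has ended (m_i = 0) only the reward r_i remains, which matches the multiplication by terminals[i] in the script.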

# 21. Create the target critic network instance, target_critic. Define it inside AgentDDPG.__init__().

Compared with target_actor, the learning-rate argument alpha is renamed to beta for the critic.

self.target_critic = CriticNN(beta=0.000025,
                              n_obs_space=17, n_action_space=6,
                              layer1_size=64, layer2_size=64,
                              batch_size=64)

# 22. Create the CriticNN class.

# 22. Create the CriticNN class
class CriticNN(nn.Module):

    def __init__(self, beta=0.000025, n_obs_space=17, n_action_space=6,
                 layer1_size=64, layer2_size=64, batch_size=64):
        print('CriticNN.__init__ is working.')
        super(CriticNN, self).__init__()

        # The critic NN takes two inputs: the observation space plus the action space
        input_dim = n_obs_space + n_action_space
        self.fc1 = nn.Linear(input_dim, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, 1) # a single output is enough

    def forward(self, obs, action):
        input_data = T.cat([obs, action], dim=1)
        x = self.fc1(input_data)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x  # Outputs a single value estimate.

With this in place, target_critic_values is returned.
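
To check the shapes, here is a small sketch that assumes random inputs. TinyCritic is a hypothetical copy of the CriticNN layout above, and the random next_states and target_actions merely stand in for real data.

```python
import torch as T
import torch.nn as nn
import torch.nn.functional as F


class TinyCritic(nn.Module):
    """Same layout as CriticNN above: (obs, action) -> one value per row."""

    def __init__(self, n_obs=17, n_actions=6, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(n_obs + n_actions, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 1)

    def forward(self, obs, action):
        # Concatenate along the feature axis: (64, 17) + (64, 6) -> (64, 23)
        x = T.cat([obs, action], dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


next_states = T.randn(64, 17)            # stand-in batch of next states
target_actions = T.tanh(T.randn(64, 6))  # stand-in actions in [-1, 1]
values = TinyCritic()(next_states, target_actions)
print(values.shape)  # torch.Size([64, 1])
```

Because of dim=1 in T.cat, both inputs must be 2-D batches; a single unbatched observation would need an extra leading dimension first.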

# 23. Feed the current states (observations) and actions into the critic network, which serves as the baseline (the value function V(w)[s_t] network), to compute critic_values.

Back in the AgentDDPG.learn() method, directly below the line added a moment ago

target_critic_values = self.target_critic.forward(next_states, target_actions)

add

critic_values = self.critic.forward(observations, actions)

The critic instance has not been created yet, so add it to AgentDDPG.__init__().

# 24. Add creation of the critic instance to AgentDDPG.__init__()

self.critic = CriticNN(beta=0.000025, n_obs_space=17, n_action_space=6,
                       layer1_size=64, layer2_size=64, batch_size=64)

All four neural networks are now in place.

# 25. Compute the TD target (= r + γ*V(w)[s_t+1]) from target_critic.

Back in the AgentDDPG.learn() method, add:

td_targets = []
for i in range(self.batch_size):
    td_target = rewards[i] + self.gamma * target_critic_values[i] * terminals[i]
    td_targets.append(td_target)

# Reshape the TD targets into a batch
td_targets = T.tensor(td_targets, dtype=T.float32)
td_targets = td_targets.view(self.batch_size, 1)
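
The loop in step 25 can also be written without a Python loop. The following is a sketch on made-up numbers (a batch of 4 rather than 64) that compares the loop version with a vectorized one. The vectorized form and the explicit detach() are an alternative shown for illustration, not what the script uses.

```python
import torch as T

gamma = 0.99
batch_size = 4

# Made-up batch: rewards, masks, and target critic outputs.
rewards = T.tensor([1.0, 0.5, -0.2, 2.0])
# 1 = episode continues, 0 = episode ended (the buffer stores 1 - done)
terminals = T.tensor([1.0, 1.0, 0.0, 1.0])
target_critic_values = T.tensor([[10.0], [20.0], [30.0], [40.0]])

# Loop version, following step 25
loop_targets = T.tensor(
    [float(rewards[i] + gamma * target_critic_values[i, 0] * terminals[i])
     for i in range(batch_size)]
).view(batch_size, 1)

# Vectorized version: reshape rewards and masks to (4, 1) so they line up
# with the critic output; detach() keeps gradients out of the target networks.
vec_targets = (rewards.view(-1, 1)
               + gamma * target_critic_values.detach() * terminals.view(-1, 1))
```

Both versions yield a (4, 1) tensor, and the third row equals the reward alone because its mask is 0.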

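Summary and the script so far

Starting from the data the actor collected by acting, we used experience replay to take the "next state", had target_actor output the "next action" from it, had target_critic output the "next state value" from the "next action" and the "next state", and computed the TD target. Note that the "next action" here is only a hypothetical action generated by target_actor.

The critic, on the other hand, used experience replay to compute the "current state value" from the "current state" and the "action taken at that time". Note that this "action taken at that time" is data from an action the actor actually took, stored in the replay buffer.

This "current state value" is also called the baseline. The "TD target - baseline" computation appears from next time on, so keep an eye on it.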
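
As a small preview of that "TD target - baseline" computation, here is a minimal sketch on three made-up values. It assumes the loss is the mean squared error, as computed by F.mse_loss; the numbers are invented for illustration.

```python
import torch as T
import torch.nn.functional as F

# Made-up TD targets and baseline (critic) values for a batch of 3.
td_targets = T.tensor([[1.0], [2.0], [3.0]])
critic_values = T.tensor([[0.5], [2.0], [4.0]], requires_grad=True)

# "TD target - baseline" for each sample
td_errors = td_targets - critic_values

# Mean of the squared TD errors: (0.25 + 0 + 1) / 3
critic_loss = F.mse_loss(critic_values, td_targets)

# Gradient of the loss with respect to each critic value: 2 * (value - target) / 3
critic_loss.backward()
```

The gradient pushes each critic value toward its TD target, and a value that already matches its target receives zero gradient.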

import gymnasium as gym
import time
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# 10. Create the ReplayBuffer class
class ReplayBuffer:
    def __init__(self, max_memory_size, n_obs_space, n_action_space):
        self.max_memory_size = max_memory_size
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.memory_count = 0

        self.state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.action_memory =  np.zeros((self.max_memory_size, self.n_action_space))
        self.reward_memory =  np.zeros(self.max_memory_size)
        self.next_state_memory =  np.zeros((self.max_memory_size, self.n_obs_space))
        self.terminal_memory =  np.zeros(self.max_memory_size)
        #self.terminal_memory = np.zeros(self.max_memory_size, dtype=np.bool)

    # 11. Create the store_transition method to save transitions
    def store_transition(self, obs, action, reward, next_state, done):
        print('store_transition is working.')
        index = self.memory_count % self.max_memory_size # once the buffer is full, the oldest entries are overwritten first
        print('obs.detach().numpy().flatten():',obs.detach().numpy().flatten())
        self.state_memory[index] = obs.detach().numpy().flatten()
        self.action_memory[index] = action.flatten()
        self.reward_memory[index] = reward.flatten()
        self.next_state_memory[index] = next_state.flatten()
        self.terminal_memory[index] = 1 - int(done) # so that terminal = 0 when the episode has ended
        print('state_memory :', self.state_memory)
        print('action_memory :', self.action_memory)
        print('reward_memory :', self.reward_memory)
        print('next_state_memory :', self.next_state_memory)
        print('memory.state_memory :', self.terminal_memory)

        print('type of state_memory :', type(self.state_memory[0][0]))
        print('type of action_memory :', type(self.action_memory[0][0]))
        print('type of reward_memory :', type(self.reward_memory[0]))
        print('type of next_state_memory :', type(self.next_state_memory[0][0]))
        print('type of memory.state_memory :', type(self.terminal_memory[0]))

        self.memory_count += 1
        print('memory_count :', self.memory_count)

    # 16. Randomly sample from the buffer memory
    def sample_buffer(self, batch_size):
        # Handle the case where the buffer has not yet been filled.
        max_index = min(self.max_memory_size, self.memory_count)
        choosed_index = np.random.choice(max_index, batch_size)
        
        observations = self.state_memory[choosed_index]
        actions = self.action_memory[choosed_index]
        rewards = self.reward_memory[choosed_index]
        next_states = self.next_state_memory[choosed_index]
        terminals = self.terminal_memory[choosed_index]

        return observations, actions, rewards, next_states, terminals


# 6. Create the ActorNN class
class ActorNN(nn.Module):
    def __init__(self, alpha=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64):
        print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(n_obs_space, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, n_action_space)

    def forward(self, obs):
        print('ActorNN.forward is working')
        print('==== OK up to here 1 ====')
        x = self.fc1(obs)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        mu = T.tanh(x)  # squash the output into [-1, 1]
        print('action μ:', mu)
        print('==== OK up to here 2 ====')

        # Add noise later if needed: action = mu + noise
        action = mu
        return action
        

# 22. Create the CriticNN class
class CriticNN(nn.Module):

    def __init__(self, beta=0.000025, n_obs_space=17, n_action_space=6,
                 layer1_size=64, layer2_size=64, batch_size=64):
        print('CriticNN.__init__ is working.')
        super(CriticNN, self).__init__()

        # The critic takes two inputs: the observation and the action
        input_dim = n_obs_space + n_action_space
        self.fc1 = nn.Linear(input_dim, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, 1)  # a single output node is enough

    def forward(self, obs, action):
        input_data = T.cat([obs, action], dim=1)
        x = self.fc1(input_data)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x  # one value estimate per (observation, action) pair

# 3. Define the agent class
class AgentDDPG:

    def __init__(self, alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64):
        print('AgentDDPG.__init__ is working.')
        # 5. Create an instance of the ActorNN class (see self.actor below)
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.tau = tau
        
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.n_state_action_value = n_state_action_value

        self.layer1_size = layer1_size
        self.layer2_size = layer2_size

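        # 13. Set the batch size
        self.batch_size = batch_size

        # Pass the agent's own settings instead of repeating the literal values
        self.actor = ActorNN(alpha=self.alpha, n_obs_space=self.n_obs_space,
                             n_action_space=self.n_action_space,
                             layer1_size=self.layer1_size, layer2_size=self.layer2_size,
                             batch_size=self.batch_size)

        # 9. Add the memory instance
        self.MAX_MEMORY_SIZE = 1000
        self.memory = ReplayBuffer(max_memory_size=self.MAX_MEMORY_SIZE,
                                   n_obs_space=self.n_obs_space,
                                   n_action_space=self.n_action_space)

        # 19. Create the target actor network instance, target_actor
        # actor and target_actor can use the same ActorNN architecture
        self.target_actor = ActorNN(alpha=self.alpha, n_obs_space=self.n_obs_space,
                                    n_action_space=self.n_action_space,
                                    layer1_size=self.layer1_size, layer2_size=self.layer2_size,
                                    batch_size=self.batch_size)

        # 21. Create the target critic network instance, target_critic
        self.target_critic = CriticNN(beta=self.beta, n_obs_space=self.n_obs_space,
                                      n_action_space=self.n_action_space,
                                      layer1_size=self.layer1_size, layer2_size=self.layer2_size,
                                      batch_size=self.batch_size)

        # 24. Create the critic network instance, critic
        self.critic = CriticNN(beta=self.beta, n_obs_space=self.n_obs_space,
                               n_action_space=self.n_action_space,
                               layer1_size=self.layer1_size, layer2_size=self.layer2_size,
                               batch_size=self.batch_size)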

    def choose_action(self, obs):
        print('AgentDDPG.choose_action is working.')
        # 4. The policy (actor) is represented by a neural network.
        #    ActorNN is defined as a new class and used through the actor instance.
        action = self.actor.forward(obs)
        action = action.detach().numpy()
        print('==== OK up to here 3 ====')
        return action
    
    # 8. Add the remember method
    def remember(self, obs, action, reward, next_state, done):
        self.memory.store_transition(obs, action, reward, next_state, done)

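    # 13. Add the learn method
    def learn(self):
        # 14. Do nothing until at least batch_size transitions have been collected.
        if self.memory.memory_count < self.batch_size:
            return

        # 15. Pull data out of the memory buffer with sample_buffer()
        # The values are batched, so the variable names are plural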
        observations, actions, rewards, next_states, terminals = self.memory.sample_buffer(self.batch_size)
        print('s:', observations)
        print(observations.shape)
        print('a :', actions)
        print('r :', rewards)
        print('s_ :', next_states)
        print('terminal :', terminals)

        # 17. Convert the sampled data to torch.tensor (float32, matching the network weights) so PyTorch can use it
        observations = T.tensor(observations, dtype=T.float32)
        actions = T.tensor(actions, dtype=T.float32)
        rewards = T.tensor(rewards, dtype=T.float32)
        next_states = T.tensor(next_states, dtype=T.float32)
        terminals = T.tensor(terminals, dtype=T.float32)
       
        # 18. Feed the next states, next_states, into the target actor network, target_actor,
        # and take the output as the target actions, target_actions.
        # The target network is meant to share its NN parameters with the non-target network.
        print('next_states :', next_states)
        target_actions = self.target_actor.forward(next_states)

        # 20. Feed next_states and the target actions computed above into the
        # target critic network, target_critic, to get the target value,
        # an estimate of the value function.
        # This is the V(w)[s_t+1] part of the TD target r + γ*V(w)[s_t+1].
        # The target critic value uses the actions from the target actor network.
        target_critic_values = self.target_critic.forward(next_states, target_actions)


        # 23. Feed the current states, observations, and the actions, actions, into the
        # critic network (the value function V(w)[s_t] network), which acts as the baseline,
        # to compute the critic values
        critic_values = self.critic.forward(observations, actions)

        # 25. Compute the TD target: r + γ*V(w)[s_t+1]
        # terminals is 0 for a finished episode, which drops the bootstrap term
        td_targets = []
        for i in range(self.batch_size):
            td_target = rewards[i] + self.gamma * target_critic_values[i] * terminals[i]
            td_targets.append(td_target)

        # Reshape the TD targets into a batch
        td_targets = T.tensor(td_targets, dtype=T.float32)
        td_targets = td_targets.view(self.batch_size, 1)
        print('td_targets :', td_targets)

 
# 2. Create an instance of the agent class
agent = AgentDDPG(alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64)

env = gym.make("HalfCheetah-v4", render_mode= 'human')

EPISODES = 10 # episodes
STEPS = 10    # steps
DELAY_TIME = 0.00 # sec


total_rewards = []
for episode in range(EPISODES):
    obs, info = env.reset()  # reset() returns (observation, info)
    obs = T.tensor(obs, dtype=T.float)
    # tensor([ 0.0040,  0.0199, -0.0622,  0.0594, -0.0605,  0.0577, -0.0056,  0.0333,        -0.0072,  0.0532, -0.0512,  0.0173, -0.0529, -0.1104,  0.0946, -0.0559,         0.0824])
    print(type(obs))
    # observation_space :  Box(-inf, inf, (17,), float64)
    print('observation_space : ', env.observation_space)
    print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(STEPS):
        env.render()
        
        # This part is being replaced with DDPG
        action = agent.choose_action(obs)  # 1. Define the Agent class
        #action :  [ 0.06660474 -0.11753064  0.02527559  0.06465236  0.1050786   0.05048539]
        print('==== OK up to here 4 ====')
        print('action_space : ', env.action_space)
        print('action : ', action)

        next_state, reward, done, _, info = env.step(action)
        print('next_state, reward, done, _, info :', next_state, reward, done, _, info)

        print('==== OK up to here 5 ====')

        # 7. Store the transition for experience replay (ReplayBuffer)
        agent.remember(obs, action, reward, next_state, int(done))

        # 12. Train the neural networks
        agent.learn()

        # 26. Accumulate the reward within the episode
        total_reward += reward

        # 27. Continue from next_state as the new obs
        print('next_state:', next_state)
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)
        # 28. Sleep so the cheetah's movement can be watched
        time.sleep(DELAY_TIME)

    print('total_reward : ', total_reward)
    total_rewards.append(total_reward)

print('total_rewards : ', total_rewards)

env.close() # (it's empty, though...)
print('script is done.')
# https://gymnasium.farama.org/
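
The TD-target loop in learn() can also be written without a Python loop. The following is only a sketch, not the code used in this series: the tensors are random stand-ins with the same shapes as the sampled batch, the names td_targets_loop and td_targets_vec are made up for the comparison, and wrapping the target in T.no_grad() is an assumption about how the target should be treated.

```python
import torch as T

T.manual_seed(0)
batch_size = 64
gamma = 0.99

# Stand-ins for the sampled batch (shapes follow the script above)
rewards = T.randn(batch_size)                        # (64,)
terminals = T.randint(0, 2, (batch_size,)).float()   # (64,), 0 where the episode ended
target_critic_values = T.randn(batch_size, 1)        # (64, 1), like the target_critic output

# Loop version, following learn()
td_targets_loop = []
for i in range(batch_size):
    td_target = rewards[i] + gamma * target_critic_values[i] * terminals[i]
    td_targets_loop.append(td_target)
td_targets_loop = T.stack(td_targets_loop).view(batch_size, 1)

# Vectorized version: reshape rewards and terminals to (64, 1) so they line up
with T.no_grad():
    td_targets_vec = rewards.view(-1, 1) + gamma * target_critic_values * terminals.view(-1, 1)

print(td_targets_vec.shape)
print(T.allclose(td_targets_loop, td_targets_vec))
```

Reshaping to (64, 1) first matters: adding a (64,) tensor to a (64, 1) tensor would broadcast to (64, 64) instead.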

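See you next time.

DDPG by gymnasium: Day 5

Last time, I fixed the arguments passed to ActorNN, adjusted the AgentDDPG arguments to match, and fixed what gets stored in the replay buffer.

This time, an error occurs when the learn() method runs after 64 transitions have been collected, so let's fix it.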

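From ChatGPT:

This error message indicates that the two tensors being multiplied by the F.linear function have different data types. In this case, the input tensor and the weight tensor appear to have different dtypes.

To resolve the problem, you can try the following steps.

Check the data types of the input tensor and the weight tensor. You can print their dtype attributes to see whether they match.
If the data types differ, use the to() method to convert the input tensor to the same data type as the weight tensor. For example, if the weight tensor is a float32 tensor, you can call input.to(torch.float32) to convert the input tensor to float32.
Alternatively, you can convert the weight tensor to the same data type as the input tensor. For example, if the input tensor is a float64 tensor, you can call weight.to(torch.float64) to convert the weight tensor to float64.
Make sure that any other operations performed on the tensors keep the same data type as well.

By making sure the tensor data types are consistent, you should be able to resolve the RuntimeError you encountered.

That was the answer. I see: the input data has to match the dtype that PyTorch is using.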
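
The mismatch described above can be reproduced in a few lines. This is a minimal sketch under assumptions: the layer sizes 17 and 64 mirror fc1 in this series, the batch is filled with zeros rather than real observations, and the exact wording of the error message depends on the PyTorch version.

```python
import torch as T
import torch.nn as nn

fc1 = nn.Linear(17, 64)                      # weights are torch.float32 by default
batch64 = T.zeros(64, 17, dtype=T.float64)   # float64 input, like a tensor built from a NumPy float64 array

try:
    fc1(batch64)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print('RuntimeError:', err)

# Convert the input to the dtype of the weights and try again
out = fc1(batch64.to(fc1.weight.dtype))
print(out.dtype, out.shape)
```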

Checking the current state

When data is saved, the store_transition method does this:

self.state_memory[index] = obs.detach().numpy().flatten()
self.action_memory[index] = action.flatten()
self.reward_memory[index] = reward.flatten()
self.next_state_memory[index] = next_state.flatten()
self.terminal_memory[index] = 1 - int(done)

so let's look at the types with type().

print('type of state_memory :', type(self.state_memory[0][0]))
print('type of action_memory :', type(self.action_memory[0][0]))
print('type of reward_memory :', type(self.reward_memory[0]))
print('type of next_state_memory :', type(self.next_state_memory[0][0]))
print('type of memory.state_memory :', type(self.terminal_memory[0]))

As a result, all the values turn out to be numpy.float64.

type of state_memory : <class 'numpy.float64'>
type of action_memory : <class 'numpy.float64'>
type of reward_memory : <class 'numpy.float64'>
type of next_state_memory : <class 'numpy.float64'>
type of memory.state_memory : <class 'numpy.float64'>
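
The float64 comes from the buffers themselves: np.zeros allocates float64 unless told otherwise, and anything written into such a buffer is upcast. The sketch below shows this with a buffer shaped like state_memory. The float32 variant, state_memory32, is a hypothetical alternative and not what this series does.

```python
import numpy as np

max_memory_size, n_obs_space = 1000, 17

state_memory = np.zeros((max_memory_size, n_obs_space))                       # default dtype
state_memory32 = np.zeros((max_memory_size, n_obs_space), dtype=np.float32)   # explicit float32 (hypothetical)

# Writing a float32 row into the float64 buffer silently upcasts it
state_memory[0] = np.ones(n_obs_space, dtype=np.float32)

print(state_memory.dtype, type(state_memory[0][0]))
print(state_memory32.dtype, type(state_memory32[0][0]))
```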

When the data is read back, the sample_buffer(self, batch_size) method also returns it as

observations = self.state_memory[choosed_index]

so the type does not change.

The return values are then converted to PyTorch tensors. With dtype=float, they become torch.float64.

observations = T.tensor(observations, dtype=float)
actions = T.tensor(actions, dtype=float)
rewards = T.tensor(rewards, dtype=float)
next_states = T.tensor(next_states, dtype=float)
terminals = T.tensor(terminals, dtype=float)
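
A quick way to confirm what each conversion produces is to print the resulting dtype. This sketch uses a zero-filled array as a stand-in for a sampled batch. The variable names are illustrative only.

```python
import numpy as np
import torch as T

observations = np.zeros((64, 17))   # NumPy float64, like a batch from sample_buffer

as_python_float = T.tensor(observations, dtype=float)   # Python float maps to torch.float64
as_float32 = T.tensor(observations, dtype=T.float32)    # explicit float32
shared = T.from_numpy(observations)                     # keeps the NumPy dtype (float64)

print(as_python_float.dtype, as_float32.dtype, shared.dtype)
```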

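Is the error raised when these tensors are passed into the following call?

target_actions = self.target_actor.forward(next_states)

class ActorNN(nn.Module):

__init__: self.fc1 = nn.Linear(n_obs_space, layer1_size)

forward : x = self.fc1(obs)  <- the error is raised here

obs is a batch of 64 x 17 (batch size x observation space), and fc1 expects 17 input nodes feeding 64 nodes in the next layer. The shapes are fine.

Since the complaint is that the types do not match, let's check the dtype of the weight parameters.

agent.actor.fc1.weight.dtype → torch.float32

agent.target_actor.fc1.weight.dtype → torch.float32

I see. The input apparently has to be torch.float32, so let's change it.

Before: observations = T.tensor(observations, dtype=float)

After: observations = T.tensor(observations, dtype=T.float32)

With this change, the script runs.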
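
Instead of hard-coding T.float32 at every conversion, the dtype can be read from the network itself. The helper below, to_model_tensor, is a hypothetical name and not part of this series. It is a sketch that assumes all parameters of the model share one dtype.

```python
import numpy as np
import torch as T
import torch.nn as nn

def to_model_tensor(array, model):
    """Convert a NumPy array to a tensor with the dtype of the model's parameters."""
    param_dtype = next(model.parameters()).dtype
    return T.as_tensor(array, dtype=param_dtype)

fc1 = nn.Linear(17, 64)
next_states = np.zeros((64, 17))   # float64 batch, as returned by sample_buffer
batch = to_model_tensor(next_states, fc1)
out = fc1(batch)
print(fc1.weight.dtype, batch.dtype, out.shape)
```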

The script so far

Feeding a batch of 64 next_states into target_actor now yields 64 target_actions.

From here on, every time the actor takes one step, 64 next_states are fed into target_actor and 64 target_actions come out.

Next time, we'll pick up from feeding these target_actions, together with next_states, into target_critic.

import gymnasium as gym
import time
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# 10. Create the ReplayBuffer class
class ReplayBuffer:
    def __init__(self, max_memory_size, n_obs_space, n_action_space):
        self.max_memory_size = max_memory_size
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.memory_count = 0

        self.state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.action_memory =  np.zeros((self.max_memory_size, self.n_action_space))
        self.reward_memory =  np.zeros(self.max_memory_size)
        self.next_state_memory =  np.zeros((self.max_memory_size, self.n_obs_space))
        self.terminal_memory =  np.zeros(self.max_memory_size)
        #self.terminal_memory = np.zeros(self.max_memory_size, dtype=np.bool)

    # 11. Add the store_transition method to save one transition
    def store_transition(self, obs, action, reward, next_state, done):
        print('store_transition is working.')
        index = self.memory_count % self.max_memory_size  # once the buffer is full, the oldest entries are overwritten first
        print('obs.detach().numpy().flatten():', obs.detach().numpy().flatten())
        self.state_memory[index] = obs.detach().numpy().flatten()
        self.action_memory[index] = action.flatten()
        self.reward_memory[index] = reward.flatten()
        self.next_state_memory[index] = next_state.flatten()
        self.terminal_memory[index] = 1 - int(done)  # terminal = 0 when the episode has ended
        print('state_memory :', self.state_memory)
        print('action_memory :', self.action_memory)
        print('reward_memory :', self.reward_memory)
        print('next_state_memory :', self.next_state_memory)
        print('memory.state_memory :', self.terminal_memory)

        print('type of state_memory :', type(self.state_memory[0][0]))
        print('type of action_memory :', type(self.action_memory[0][0]))
        print('type of reward_memory :', type(self.reward_memory[0]))
        print('type of next_state_memory :', type(self.next_state_memory[0][0]))
        print('type of memory.state_memory :', type(self.terminal_memory[0]))

        self.memory_count += 1
        print('memory_count :', self.memory_count)  # read the counter from self, not from the global agent

    # 16. Sample transitions at random from the buffer
    def sample_buffer(self, batch_size):
        # Only sample from slots that have been filled, in case the buffer is not full yet.
        max_index = min(self.max_memory_size, self.memory_count)
        choosed_index = np.random.choice(max_index, batch_size)
        
        observations = self.state_memory[choosed_index]
        actions = self.action_memory[choosed_index]
        rewards = self.reward_memory[choosed_index]
        next_states = self.next_state_memory[choosed_index]
        terminals = self.terminal_memory[choosed_index]

        return observations, actions, rewards, next_states, terminals


# 6. Create the ActorNN class
class ActorNN(nn.Module):
    def __init__(self, alpha=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64):
        print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(n_obs_space, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, n_action_space)

    def forward(self, obs):
        print('ActorNN.forward is working')
        print('==== OK up to here 1 ====')
        x = self.fc1(obs)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        mu = T.tanh(x)  # squash the output into [-1, 1]
        print('action μ:', mu)
        print('==== OK up to here 2 ====')

        # Add noise later if needed: action = mu + noise
        action = mu
        return action
        

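# 22. Create the CriticNN class (work in progress: not instantiated yet)
class CriticNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        print('CriticNN.__init__ is working.')
        super(CriticNN, self).__init__()

        # Actor-like part: same structure as ActorNN
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, output_dim)

        # Critic part: still to be written
        # self.fc11 = nn.Linear(n_action)  # incomplete: n_action is undefined and nn.Linear needs (in_features, out_features)

    def forward(self, obs, action):
        # Actor-like part: same structure as ActorNN
        x = self.fc1(obs)
        x = F.relu(x)
        mu = self.fc2(x)
        return mu  # placeholder until the critic part is written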

# 3. Define the agent class
class AgentDDPG:

    def __init__(self, alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64):
        print('AgentDDPG.__init__ is working.')
        # 5. Create an instance of the ActorNN class (see self.actor below)
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.tau = tau
        
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.n_state_action_value = n_state_action_value

        self.layer1_size = layer1_size
        self.layer2_size = layer2_size

        # 13. Set the batch size
        self.batch_size = batch_size

        # Pass the agent's own settings instead of repeating the literal values
        self.actor = ActorNN(alpha=self.alpha, n_obs_space=self.n_obs_space,
                             n_action_space=self.n_action_space,
                             layer1_size=self.layer1_size, layer2_size=self.layer2_size,
                             batch_size=self.batch_size)

        # 9. Add the memory instance
        self.MAX_MEMORY_SIZE = 1000
        self.memory = ReplayBuffer(max_memory_size=self.MAX_MEMORY_SIZE,
                                   n_obs_space=self.n_obs_space,
                                   n_action_space=self.n_action_space)

        # 19. Create the target actor network instance, target_actor
        # actor and target_actor can use the same ActorNN architecture
        self.target_actor = ActorNN(alpha=self.alpha, n_obs_space=self.n_obs_space,
                                    n_action_space=self.n_action_space,
                                    layer1_size=self.layer1_size, layer2_size=self.layer2_size,
                                    batch_size=self.batch_size)

        # 21. Create the target critic network instance, target_critic
        # (disabled until CriticNN is finished)
        #self.target_critic = CriticNN(input_dim=self.input_dim, output_dim=self.output_dim)

        #self.critic = CriticNN(input_dim=self.input_dim, output_dim=self.output_dim)

    def choose_action(self, obs):
        print('AgentDDPG.choose_action is working.')
        # 4. The policy (actor) is represented by a neural network.
        #    ActorNN is defined as a new class and used through the actor instance.
        action = self.actor.forward(obs)
        action = action.detach().numpy()
        print('==== OK up to here 3 ====')
        return action
    
    # 8. Add the remember method
    def remember(self, obs, action, reward, next_state, done):
        self.memory.store_transition(obs, action, reward, next_state, done)

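    # 13. Add the learn method
    def learn(self):
        # 14. Do nothing until at least batch_size transitions have been collected.
        if self.memory.memory_count < self.batch_size:
            return

        # 15. Pull data out of the memory buffer with sample_buffer()
        # The values are batched, so the variable names are plural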
        observations, actions, rewards, next_states, terminals = self.memory.sample_buffer(self.batch_size)
        print('s:', observations)
        print(observations.shape)
        print('a :', actions)
        print('r :', rewards)
        print('s_ :', next_states)
        print('terminal :', terminals)

        # 17. Convert the sampled data to torch.tensor (float32, matching the network weights) so PyTorch can use it
        observations = T.tensor(observations, dtype=T.float32)
        actions = T.tensor(actions, dtype=T.float32)
        rewards = T.tensor(rewards, dtype=T.float32)
        next_states = T.tensor(next_states, dtype=T.float32)
        terminals = T.tensor(terminals, dtype=T.float32)
       
        # 18. Feed the next states, next_states, into the target actor network, target_actor,
        # and take the output as the target actions, target_actions.
        # The target network is meant to share its NN parameters with the non-target network.
        print('next_states :', next_states)
        target_actions = self.target_actor.forward(next_states)
        """
        Draft of the next steps (disabled for now)

        # 20. Feed next_states and the target actions computed above into the
        # target critic network, target_critic, to get the target value,
        # an estimate of the value function.
        # This is the V(w)[s_t+1] part of the TD target r + γ*V(w)[s_t+1].
        # The target critic value uses the actions from the target actor network.
        target_critic_values = self.target_critic.forward(next_states, target_actions)

        # 22. Feed the current states, observations, and the actions, actions, into the
        # critic network (the value function V(w)[s_t] network), which acts as the baseline,
        # to compute the critic values
        critic_values = self.critic.forward(observations, actions)

        # 333. Compute the TD target: r + γ*V(w)[s_t+1]
        GAMMA = 0.01
        self.gamma = GAMMA

        td_targets = []
        for i in range(self.batch_size):
            td_target = rewards[i] + GAMMA * target_critic_values[i] * terminals[i]
            td_targets.append(td_target)

        # Reshape the TD targets into a batch
        td_targets = T.tensor(td_targets)
        td_targets = td_targets.view(self.batch_size, 1)
        """

# 2. Create an instance of the agent class
agent = AgentDDPG(alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64)

env = gym.make("HalfCheetah-v4", render_mode= 'human')

EPISODES = 2
DELAY_TIME = 0.00 # sec
total_rewards = []
for episode in range(EPISODES):
    obs, info = env.reset()  # reset() returns (observation, info)
    obs = T.tensor(obs, dtype=T.float)
    # tensor([ 0.0040,  0.0199, -0.0622,  0.0594, -0.0605,  0.0577, -0.0056,  0.0333,        -0.0072,  0.0532, -0.0512,  0.0173, -0.0529, -0.1104,  0.0946, -0.0559,         0.0824])
    print(type(obs))
    # observation_space :  Box(-inf, inf, (17,), float64)
    print('observation_space : ', env.observation_space)
    print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(40):
        env.render()
        
        # This part is being replaced with DDPG
        action = agent.choose_action(obs)  # 1. Define the Agent class
        #action :  [ 0.06660474 -0.11753064  0.02527559  0.06465236  0.1050786   0.05048539]
        print('==== OK up to here 4 ====')
        print('action_space : ', env.action_space)
        print('action : ', action)

        next_state, reward, done, _, info = env.step(action)
        print('next_state, reward, done, _, info :', next_state, reward, done, _, info)
        """
        action :  [ 0.06660474 -0.11753064  0.02527559  0.06465236  0.1050786   0.05048539]

        next_state, reward, done, _, info : 
        [-0.00265179  0.0229547   0.00463243 -0.04729936 -0.00959038  0.04734605
        0.03672746  0.02857842  0.09980254 -0.32065693  0.04221647  1.58668951
        -2.31089174  1.30338924 -0.25465526  1.08250465 -0.14134398]
        0.07553858359316026
        False
        False
        {'x_position': -0.09233384215910741, 'x_velocity': 0.07920445513883267, 'reward_run': 0.07920445513883267, 'reward_ctrl': -0.0036658715456724168}
        """
        print('==== OK up to here 5 ====')

        # 7. Store the transition for experience replay (ReplayBuffer)
        agent.remember(obs, action, reward, next_state, int(done))

        # 12. Train the neural network
        agent.learn()

        print('next_state:', next_state)
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)

        total_reward += reward
        
        time.sleep(DELAY_TIME)

    print('total_reward : ', total_reward)
    total_rewards.append(total_reward)

print('total_rewards : ', total_rewards)

env.close() # it's empty, though...
print('script is done.')
# https://gymnasium.farama.org/

DDPG by gymnasium Day 4

Last time we built the replay buffer.

This time we finally build the heart of the neural network: the learning step.

12. In the main script, add a call to agent.learn() directly below agent.remember().

13. Add a new learn() method to the AgentDDPG class.

14. Do nothing until at least one batch of transitions has been collected.

    if self.memory.memory_count < self.batch_size:
        return

15. Pull a batch of data out of the memory buffer with sample_buffer().

    obs, action, reward, new_state, done = self.memory.sample_buffer(self.batch_size)

16. Add the sample_buffer method to ReplayBuffer.

    def sample_buffer(self, batch_size):
        # Handle the case where the buffer has not yet been filled to its maximum size.
        max_index = min(self.max_memory_size, self.memory_count)
        choosed_index = np.random.choice(max_index, batch_size)

        observations = self.state_memory[choosed_index]
        actions = self.action_memory[choosed_index]
        rewards = self.reward_memory[choosed_index]
        next_states = self.next_state_memory[choosed_index]
        terminals = self.terminal_memory[choosed_index]

        return observations, actions, rewards, next_states, terminals
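As a rough illustration of step 16, the sketch below samples row indices from a partially filled buffer. The miniature buffer size, the `sample_indices` helper, and the seeded `rng` are assumptions made for this example only and are not part of the original script.

```python
import numpy as np

# Hypothetical miniature buffer: 10 slots, 17-dimensional observations.
max_memory_size = 10
memory_count = 4  # only 4 transitions have been stored so far
state_memory = np.zeros((max_memory_size, 17))
# Fill row i with the value i so that sampled rows are easy to recognise.
state_memory[:memory_count] = np.arange(memory_count).reshape(-1, 1)

def sample_indices(max_memory_size, memory_count, batch_size, rng):
    # Only rows that have actually been written are eligible.
    max_index = min(max_memory_size, memory_count)
    # Sampling is with replacement, so batch_size may exceed max_index.
    return rng.choice(max_index, batch_size)

rng = np.random.default_rng(0)
idx = sample_indices(max_memory_size, memory_count, batch_size=8, rng=rng)
batch = state_memory[idx]
print(batch.shape)
```

Because the indices are drawn with replacement, the same transition can appear more than once in a batch while the buffer is still small.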

17. Convert the extracted data to torch tensors so that PyTorch can differentiate through it: torch.tensor(obs, dtype=float).

    obs = T.tensor(obs, dtype=float)
    action = T.tensor(action, dtype=float)
    reward = T.tensor(reward, dtype=float)
    new_state = T.tensor(new_state, dtype=float)
    done = T.tensor(done, dtype=float)
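One detail worth checking in step 17 is the resulting dtype. The sketch below is only an illustration, with a made-up zero batch: passing Python's `float` as `dtype` yields a 64-bit tensor, while `nn.Linear` parameters are 32-bit by default, so a mismatch is a possible pitfall when the batch is later fed to a network.

```python
import numpy as np
import torch as T

batch = np.zeros((64, 17))  # NumPy arrays default to float64

as_py_float = T.tensor(batch, dtype=float)      # Python float maps to torch.float64
as_float32 = T.tensor(batch, dtype=T.float32)   # matches the default layer dtype

layer = T.nn.Linear(17, 64)  # parameters are float32 by default
print(as_py_float.dtype, as_float32.dtype, layer.weight.dtype)

out = layer(as_float32)  # dtypes match, so the forward pass works
print(out.shape)
```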

18. Feed the next states next_states into the target actor network instance target_actor and take the output as the target actions target_actions.

    target_actions = self.target_actor.forward(next_states)

19. Create the target actor network instance target_actor in the AgentDDPG class. actor and target_actor can use the same ActorNN architecture.

    self.target_actor = ActorNN(input_dim=self.input_dim, output_dim=self.output_dim)
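Since the two networks share one architecture, a target network can start out as an exact copy of the online one. The following is a minimal sketch of that idea; `make_actor` is a hypothetical stand-in for ActorNN (same layer sizes as used in this series), not the author's class.

```python
import torch as T
import torch.nn as nn

def make_actor(n_obs_space=17, n_action_space=6, hidden=64):
    # Hypothetical stand-in for ActorNN with the same layer layout.
    return nn.Sequential(
        nn.Linear(n_obs_space, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_action_space), nn.Tanh(),
    )

actor = make_actor()
target_actor = make_actor()
# Copy the online actor's parameters into the target actor.
target_actor.load_state_dict(actor.state_dict())

obs = T.zeros(1, 17)
same = T.equal(actor(obs), target_actor(obs))
print(same)
```

Without the `load_state_dict` call the two networks would be initialised with different random weights and would generally produce different actions for the same observation.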

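20. Feed two inputs into the target critic network instance target_critic: the next states next_states and the target actions computed above. It outputs the target value, an estimate of the value function.

    # This is the value part of the TD target: r + γ*V(w)[s_t+1].
    # The target critic value uses the output of the target actor network.
    target_critic_values = self.target_critic.forward(next_states, target_actions)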
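The TD target r + γ*V(w)[s_t+1] can be computed for a whole batch at once. The sketch below uses made-up numbers in place of real network outputs; the tensor values and `batch_size = 4` are assumptions for illustration, and `terminals` follows the buffer's convention of storing 1 - done.

```python
import torch as T

gamma = 0.99
batch_size = 4
rewards = T.tensor([1.0, 0.5, -0.2, 2.0])
# 1 - done, as stored by the replay buffer: 0 marks a finished episode.
terminals = T.tensor([1.0, 1.0, 0.0, 1.0])
# Placeholder for target_critic(next_states, target_actions), shape (batch, 1).
target_critic_values = T.tensor([[10.0], [20.0], [30.0], [40.0]])

# Reshape rewards and terminals to (batch, 1) so every term lines up row by row.
td_targets = (rewards.view(batch_size, 1)
              + gamma * target_critic_values * terminals.view(batch_size, 1))
print(td_targets.shape)
```

For the third transition the mask is 0, so its target reduces to the reward alone.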

21. Create the target critic network instance target_critic in the AgentDDPG class.

    self.target_critic = CriticNN(input_dim=self.input_dim, output_dim=self.output_dim)

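A problem appears

In step 2, where the agent class instance is created, I had written

    agent = AgentDDPG(input_dim=17, output_dim=6)

but I realized that these arguments alone cannot describe DDPG.

ActorNN only takes obs as input and outputs action, so 17 and 6 were enough. CriticNN, however, takes two inputs, action and obs, and produces a single output, the state value state_value, so the two networks need different numbers of inputs and outputs.

So I will change the code so that everything likely to differ between the actor and the critic (input shape, learning rate, number of layers, nodes per layer, and so on) is passed in from the agent class.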
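To make the difference in inputs and outputs concrete, here is one possible shape for a critic. `CriticSketch` is a hypothetical class written for this illustration (it concatenates obs and action before the first layer); it is not the CriticNN that the series goes on to build.

```python
import torch as T
import torch.nn as nn
import torch.nn.functional as F

class CriticSketch(nn.Module):
    # Hypothetical critic: takes obs and action, outputs a single value.
    def __init__(self, n_obs_space=17, n_action_space=6,
                 layer1_size=64, layer2_size=64, n_state_action_value=1):
        super().__init__()
        self.fc1 = nn.Linear(n_obs_space + n_action_space, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, n_state_action_value)

    def forward(self, obs, action):
        # Join the two inputs along the feature axis: (batch, 17 + 6).
        x = T.cat([obs, action], dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

critic = CriticSketch()
q = critic(T.zeros(64, 17), T.zeros(64, 6))
print(q.shape)
```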

I added many more arguments to the agent class, as follows.

    agent = AgentDDPG(alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001, n_obs_space=17, n_action_space=6, layer1_size=64, layer2_size=64, batch_size=64)

I also changed the arguments passed to the actor neural network class.

    self.actor = ActorNN(alpha=0.000025, n_obs_space=17, n_action_space=6, layer1_size=64, layer2_size=64, batch_size=64)

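A further problem

"""Error message

This error message indicates a problem with a matrix multiplication. Specifically, the code is trying to multiply a 1×64 matrix by a 17×64 matrix, which is not allowed, because the number of columns of the first matrix (64) differs from the number of rows of the second (17).

To fix this, check the dimensions of the matrices being multiplied and adjust them so that they are compatible for matrix multiplication. Alternatively, consider using an operation or transformation that suits the dimensions of the matrices.

"""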
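The mismatch can be reproduced in isolation. This sketch assumes a first layer of `nn.Linear(17, 64)`, as in ActorNN, and feeds it a 1-D tensor of 64 values next to a correctly shaped 64×17 batch; the exact wording of the error may differ between PyTorch versions, so only the exception type is relied on here.

```python
import torch as T
import torch.nn as nn

fc1 = nn.Linear(17, 64)

bad_batch = T.zeros(64)       # one scalar per transition: 64 values in total
good_batch = T.zeros(64, 17)  # 64 observations of 17 values each

try:
    fc1(bad_batch)
    failed = False
except RuntimeError as err:
    failed = True
    print('RuntimeError:', err)

out = fc1(good_batch)
print(out.shape)
```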

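The error occurred at the point where the observations obs stored in the buffer memory were pulled out as a batch of 64 with

    observations = self.state_memory[choosed_index]

and fed into the neural network.

Let's go over the flow once more.

Feeding one observation obs into ActorNN produces one action.

Each obs has shape 1×17, so pulling a batch of 64 out of the buffer should give 64×17. The error, however, reports 1×64, so something is fundamentally wrong.

The culprit was the initialization of the replay buffer.

    self.state_memory = np.zeros((self.max_memory_size, self.n_obs_space))

This is what I meant to write, allocating max_memory_size observations × 17 observation dimensions, but the code actually read

    self.state_memory = np.zeros(self.max_memory_size)

so only a single value per stored observation was allocated, with no room for the 17 values of the observation space.

To make matters worse, the data was stored with

    self.state_memory[index] = obs.detach().numpy().flatten()[0]

so only the first of the 17 observation values was saved.

I changed this to

    self.state_memory[index] = obs.detach().numpy().flatten()

The same fix is needed elsewhere as well, for example for self.next_state_memory[index].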
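The effect of the two allocations can be compared directly in NumPy. In this sketch `obs` is a made-up stand-in for one 17-dimensional observation, and the variable names `state_memory_1d` and `state_memory_2d` are chosen for the example only.

```python
import numpy as np

max_memory_size, n_obs_space = 1000, 17
obs = np.arange(n_obs_space, dtype=np.float64)  # stand-in for one observation

# Before the fix: one scalar slot per transition.
state_memory_1d = np.zeros(max_memory_size)
state_memory_1d[0] = obs.flatten()[0]  # only the first of the 17 values survives

# After the fix: one row of 17 values per transition.
state_memory_2d = np.zeros((max_memory_size, n_obs_space))
state_memory_2d[0] = obs.flatten()

# A batch of 64 taken from each layout.
print(state_memory_1d[:64].shape, state_memory_2d[:64].shape)
```

Only the second layout yields the 64×17 batch that a layer with 17 inputs expects.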

The fixed script so far

It runs until 64 transitions have accumulated in the replay buffer.

Once the learn() method starts doing work, an error occurs.

Next time we will resolve that.

import gymnasium as gym
import time
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# 10. Create the new ReplayBuffer class
class ReplayBuffer:
    def __init__(self, max_memory_size, n_obs_space, n_action_space):
        self.max_memory_size = max_memory_size
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.memory_count = 0

        self.state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.action_memory = np.zeros((self.max_memory_size, self.n_action_space))
        self.reward_memory = np.zeros(self.max_memory_size)
        self.next_state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.terminal_memory = np.zeros(self.max_memory_size)

    # 11. Create the store_transition method for saving transitions
    def store_transition(self, obs, action, reward, next_state, done):
        print('store_transition is working.')
        # Wrap around so that the oldest data is overwritten once the buffer is full
        index = self.memory_count % self.max_memory_size
        print('obs.detach().numpy().flatten():', obs.detach().numpy().flatten())
        self.state_memory[index] = obs.detach().numpy().flatten()
        self.action_memory[index] = action.flatten()
        self.reward_memory[index] = reward  # a scalar, so no flatten() is needed
        self.next_state_memory[index] = next_state.flatten()
        self.terminal_memory[index] = 1 - int(done) # terminal = 0 when the episode is done (goal reached)
        print('state_memory :', self.state_memory)
        print('action_memory :', self.action_memory)
        print('reward_memory :', self.reward_memory)
        print('next_state_memory :', self.next_state_memory)
        print('terminal_memory :', self.terminal_memory)

        self.memory_count += 1
        print('memory_count :', self.memory_count)

    # 16. Sample at random from the buffer memory
    def sample_buffer(self, batch_size):
        # Handle the case where the buffer has not yet been filled to its maximum size.
        max_index = min(self.max_memory_size, self.memory_count)
        choosed_index = np.random.choice(max_index, batch_size)

        observations = self.state_memory[choosed_index]
        actions = self.action_memory[choosed_index]
        rewards = self.reward_memory[choosed_index]
        next_states = self.next_state_memory[choosed_index]
        terminals = self.terminal_memory[choosed_index]

        return observations, actions, rewards, next_states, terminals


# 6. Create the new ActorNN class
class ActorNN(nn.Module):
    def __init__(self, alpha=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64):
        print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(n_obs_space, layer1_size)
        self.fc2 = nn.Linear(layer1_size, layer2_size)
        self.fc3 = nn.Linear(layer2_size, n_action_space)

    def forward(self, obs):
        print('AgentDDPG.ActorNN.forward is working')
        print('==== OK up to here 1 ====')
        x = self.fc1(obs)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        mu = F.tanh(x)
        print('action μ:', mu)
        print('==== OK up to here 2 ====')

        # Add noise later if needed: action = mu + noise
        action = mu
        return action

# 3. Define the agent class
class AgentDDPG:

    def __init__(self, alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64):
        print('AgentDDPG.__init__ is working.')
        # 5. Create an instance of the ActorNN class
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.tau = tau

        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.n_state_action_value = n_state_action_value

        self.layer1_size = layer1_size
        self.layer2_size = layer2_size

        # 13. Fix the batch size
        self.batch_size = batch_size

        self.actor = ActorNN(alpha=0.000025, n_obs_space=17, n_action_space=6,
                            layer1_size=64, layer2_size=64, batch_size=64)

        # 9. Add the memory instance
        self.MAX_MEMORY_SIZE = 1000
        self.memory = ReplayBuffer(max_memory_size=self.MAX_MEMORY_SIZE,
                                   n_obs_space=self.n_obs_space,
                                   n_action_space=self.n_action_space)

        # 19. Create the target actor network instance target_actor
        # actor and target_actor can use the same ActorNN network
        self.target_actor = ActorNN(alpha=0.000025, n_obs_space=17, n_action_space=6,
                                    layer1_size=64, layer2_size=64, batch_size=64)


        # 21. Create the target critic network instance target_critic
        #self.target_critic = CriticNN(input_dim=self.input_dim, output_dim=self.output_dim)

        #self.critic = CriticNN(input_dim=self.input_dim, output_dim=self.output_dim)

    def choose_action(self, obs):
        print('AgentDDPG.choose_action is working.')
        # 4. The policy (actor) is represented by a neural network.
        #    Create the new ActorNN class and use it through the actor instance.
        action = self.actor.forward(obs)
        action = action.detach().numpy()
        print('==== OK up to here 3 ====')
        return action

    # 8. Add the remember method
    def remember(self, obs, action, reward, next_state, done):
        self.memory.store_transition(obs, action, reward, next_state, done)

    # 13. Add the learn method
    def learn(self):
        # 14. Do nothing until at least one batch of transitions has been collected.
        if self.memory.memory_count < self.batch_size:
            return

        # 15. Pull data out of the memory buffer with sample_buffer()
        # The data is batched, so the variable names are plural
        observations, actions, rewards, next_states, terminals = self.memory.sample_buffer(self.batch_size)
        print('s:', observations)
        print(observations.shape)
        print('a :', actions)
        print('r :', rewards)
        print('s_ :', next_states)
        print('terminal :', terminals)

        # 17. Convert the extracted data to torch tensors so that PyTorch can differentiate through it
        observations = T.tensor(observations, dtype=float)
        actions = T.tensor(actions, dtype=float)
        rewards = T.tensor(rewards, dtype=float)
        next_states = T.tensor(next_states, dtype=float)
        terminals = T.tensor(terminals, dtype=float)

        # 18. Feed the next states next_states into the target actor network instance target_actor
        # and take the output as the target actions target_actions.
        # This target network is meant to share its NN parameters with the non-target network.
        print('next_states :', next_states)
        target_actions = self.target_actor.forward(next_states)


# 2. Create an instance of the agent class
agent = AgentDDPG(alpha=0.000025, beta=0.00025, gamma=0.99, tau=0.001,
                  n_obs_space=17 , n_action_space=6, n_state_action_value=1,
                  layer1_size=64, layer2_size=64, batch_size=64)

env = gym.make("HalfCheetah-v4", render_mode= 'human')

EPISODES = 2
DELAY_TIME = 0.00 # sec
total_rewards = []
for episode in range(EPISODES):
    obs = env.reset()
    obs = T.tensor(obs[0], dtype=T.float)

    # observation_space :  Box(-inf, inf, (17,), float64)
    print('observation_space : ', env.observation_space)
    print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(40):
        env.render()

        action = agent.choose_action(obs) # 1. Define the Agent class

        print('==== OK up to here 4 ====')
        print('action_space : ', env.action_space)
        print('action : ', action)

        next_state, reward, done, _, info = env.step(action)
        print('next_state, reward, done, _, info :', next_state, reward, done, _, info)

        print('==== OK up to here 5 ====')

        # 7. Store the transition for experience replay (ReplayBuffer)
        agent.remember(obs, action, reward, next_state, int(done))

        # 12. Train the neural network
        agent.learn()

        print('next_state:', next_state)
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)

        total_reward += reward

        time.sleep(DELAY_TIME)

    print('total_reward : ', total_reward)
    total_rewards.append(total_reward)

print('total_rewards : ', total_rewards)

env.close() # it's empty, though...
print('script is done.')
# https://gymnasium.farama.org/

print(len(agent.memory.reward_memory))
print(agent.memory.reward_memory)

For the parts of PyTorch that I did not understand, I referred to the following site.

https://qiita.com/tatsuya11bbs/items/86141fe3ca35bdae7338

DDPG by gymnasium Day 3

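Last time we built the Actor neural network and got as far as obtaining an action by feeding in the observation obs.

This time we build the part that stores the result, the transition: obs, action, reward, next_state, done.

This technique is called experience replay (ReplayBuffer). The results of acting according to the policy π are first stored in a memory buffer, and when the neural network parameters are trained, samples are drawn from that buffer at random and used as input data. If we skipped this and fed in the data in the order the actions happened, the network would be trained on runs of very similar data, and the parameters would not be optimized well.

So we put the data into the buffer first and run batch learning on it later.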
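The storing half of this idea can be sketched with a tiny fixed-size buffer. The size of 3 and the reward values below are made up for illustration; the index arithmetic is the same wrap-around used by store_transition in the script.

```python
import numpy as np

max_memory_size = 3
reward_memory = np.zeros(max_memory_size)
memory_count = 0

for reward in [1.0, 2.0, 3.0, 4.0, 5.0]:
    # The index wraps back to 0 once the buffer is full,
    # so the oldest entries are overwritten first.
    index = memory_count % max_memory_size
    reward_memory[index] = reward
    memory_count += 1

print(reward_memory)  # [4. 5. 3.]
```

After five writes into three slots, the two oldest rewards have been replaced by the two newest ones.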

Create the agent.remember(s, a, r, s', done) method

    # 7. Store the transition for experience replay (ReplayBuffer)
    agent.remember(obs, action, reward, next_state, int(done))

    # 8. Add the remember method to the AgentDDPG class
    def remember(self, obs, action, reward, next_state, done):
        self.memory.store_transition(obs, action, reward, next_state, done)

# 9. Add the memory instance in AgentDDPG.__init__().

# 10. Create the ReplayBuffer class, from which the memory instance is built.

# 11. Create the store_transition method of the ReplayBuffer class, which does the actual work of saving a transition.

Script so far

import gymnasium as gym
import time
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# 10. Create the ReplayBuffer class
class ReplayBuffer:
    def __init__(self, max_memory_size, n_obs_space, n_action_space):
        self.max_memory_size = max_memory_size
        self.n_obs_space = n_obs_space
        self.n_action_space = n_action_space

        self.memory_count = 0

        # One row per transition, so the full observation and action vectors are kept
        self.state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.action_memory = np.zeros((self.max_memory_size, self.n_action_space))
        self.reward_memory = np.zeros(self.max_memory_size)
        self.next_state_memory = np.zeros((self.max_memory_size, self.n_obs_space))
        self.terminal = np.zeros(self.max_memory_size)

    # 11. Create the store_transition method that stores one transition
    def store_transition(self, obs, action, reward, next_state, done):
        index = self.memory_count % self.max_memory_size # once the buffer is full, the oldest data is overwritten first
        self.state_memory[index] = obs.detach().numpy().flatten()
        self.action_memory[index] = np.asarray(action).flatten()
        self.reward_memory[index] = reward
        self.next_state_memory[index] = np.asarray(next_state).flatten()
        self.terminal[index] = 1 - int(done) # so that terminal = 0 when the episode has ended

        self.memory_count += 1


# 6. Create the ActorNN class
class ActorNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, output_dim)

    def forward(self, obs):
        print('AgentDDPG.ActorNN.forward is working')
        print('==== OK up to here 1 ====')
        x = self.fc1(obs)
        x = F.relu(x)
        mu = self.fc2(x)
        print('action μ:', mu)
        print('==== OK up to here 2 ====')
        action = mu
        return action

# 3. Define the agent class
class AgentDDPG:
    def __init__(self, input_dim, output_dim):
        print('AgentDDPG.__init__ is working.')
        # 5. Create an instance of the ActorNN class
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.actor = ActorNN(input_dim=self.input_dim, output_dim=self.output_dim)

        # 9. Add the memory instance
        MAX_MEMORY_SIZE = 1000
        self.memory = ReplayBuffer(max_memory_size=MAX_MEMORY_SIZE,
                                   n_obs_space=self.input_dim,
                                   n_action_space=self.output_dim)

    def choose_action(self, obs):
        print('AgentDDPG.choose_action is working.')
        # 4. The policy (actor) is represented by a neural network.
        #    Create the ActorNN class and use its instance, actor.
        action = self.actor.forward(obs)
        action = action.detach().numpy()
        print('==== OK up to here 3 ====')
        return action
    
    # 8. Add the remember method
    def remember(self, obs, action, reward, next_state, done):
        self.memory.store_transition(obs, action, reward, next_state, done)

# 2. Create an instance of the agent class
agent = AgentDDPG(input_dim=17, output_dim=6)    

env = gym.make("HalfCheetah-v4", render_mode= 'human')

EPISODES = 10
DELAY_TIME = 0.00 # sec
total_rewards = []
for episode in range(EPISODES):
    obs = env.reset()
    obs = T.tensor(obs[0], dtype=T.float)
    print(type(obs))
    print('observation_space : ', env.observation_space)
    print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(100):
        env.render()
        
        # Replace this part with DDPG step by step
        action = agent.choose_action(obs) # 1. Define the Agent class
        print('==== OK up to here 4 ====')
        print('action_space : ', env.action_space)
        print('action : ', action)

        next_state, reward, done, _, info = env.step(action)
        print('next_state, reward, done, _, info :', next_state, reward, done, _, info)
        print('==== OK up to here 5 ====')

        #7. Store the transition in the experience replay buffer (ReplayBuffer)
        agent.remember(obs, action, reward, next_state, int(done))
        
        print('next_state:', next_state)
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)

        total_reward += reward
        
        time.sleep(DELAY_TIME)

    print('total_reward : ', total_reward)
    total_rewards.append(total_reward)

print('total_rewards : ', total_rewards)

env.close() # this method is empty, though...
print('script is done.')
# https://gymnasium.farama.org/

print(len(agent.memory.reward_memory))
print(agent.memory.reward_memory)

DDPG by gymnasium Day 2

Last time

Last time we got as far as running the HalfCheetah environment with random actions.

The action is decided by random sampling from the action space:

action = env.action_space.sample()

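This rule for choosing actions is what is called a "policy".

The agent (the one who acts) follows a policy and selects an action according to the situation (the environment, the observation). That choice may be stochastic, or it may be unique, that is, deterministic.

The goal of reinforcement learning methods such as DDPG is to adjust and improve this policy with some algorithm so that the maximum return is obtained.

Perhaps "improve the policy so that the return is maximized" is the more accurate way to put it.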
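
To make the difference between a stochastic and a deterministic policy concrete, here is a small sketch in plain Python. Both functions are hypothetical toy rules, not part of this series' code: the first mimics env.action_space.sample() by ignoring the observation, and the second maps the same observation to the same action every time.

```python
import random

ACTION_DIM = 6  # HalfCheetah has 6 action dimensions


def random_policy(obs, rng):
    """Stochastic: ignores obs and samples each action value uniformly from [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(ACTION_DIM)]


def deterministic_policy(obs):
    """Deterministic toy rule: the same obs always gives the same action."""
    mean_obs = sum(obs) / len(obs)
    value = max(-1.0, min(1.0, mean_obs))  # keep the action inside [-1, 1]
    return [value] * ACTION_DIM


rng = random.Random(0)
obs = [0.1] * 17  # dummy 17-dimensional observation

a1 = random_policy(obs, rng)
a2 = random_policy(obs, rng)
d1 = deterministic_policy(obs)
d2 = deterministic_policy(obs)

print('stochastic   :', a1 != a2)   # two calls give different actions
print('deterministic:', d1 == d2)   # two calls give the same action
```

DDPG learns a policy of the second kind, and exploration is then usually added as noise on top of its output.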

Plan for improvement

Create a new Agent class, make an instance agent, and add methods to it that enable DDPG-style learning.

So let's create the AgentDDPG class.

class AgentDDPG:
    def __init__(self):
        print('AgentDDPG is working')
    def choose_action(self, obs):
        action = env.action_space.sample()
        return action

In the main script, create an instance with agent = AgentDDPG() and then call action = agent.choose_action(obs). This picks one action at random from the action space action_space and returns it. At this point no DDPG-specific element has been added yet.

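Building the DDPG part

DDPG is based on the policy gradient method. Starting from an arbitrary policy π, it automatically adjusts the policy little by little so as to maximize the objective J, whose gradient is grad J = E[ Σ G(τ) * grad log π(θ) ].

We use the gradient grad J to optimize the parameters θ, but how should it be expressed? Following the way policy gradient methods have developed, it reads as follows.

Basic policy gradient: E[ Σ G(τ) * grad log π(θ) ]
REINFORCE-style removal of noise in the return: E[ Σ G(t) * grad log π(θ) ]
With a baseline: E[ Σ (G(t) - b) * grad log π(θ) ]
Using the value function as the baseline: E[ Σ (G(t) - V(w)) * grad log π(θ) ]
Using the TD method: (r + γ V(w)[s_t+1] - V(w)[s_t]) * grad log π(θ)
Representing the policy π(θ) with a neural network: this is called the actor. Input s, output π(a|s) (sometimes written as the action probability prob).
Representing the value function V(w) with a neural network: this is called the critic. Note that the true value function v is not computed; the policy π can be learned even when V is only a rough estimate. Its input is the same s as the actor, and in addition the actor's output is used as an input to the critic.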
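
The quantities that weight grad log π in the list above can be computed by hand on a tiny example. The sketch below assumes a three-step episode with made-up rewards and value estimates; the function names rewards_to_go and td_error are hypothetical, and the baseline is simply the mean return rather than a learned V(w).

```python
def rewards_to_go(rewards, gamma):
    """Return G(t) = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_error(reward, v_s, v_next, gamma):
    """TD error: r + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_s


rewards = [1.0, 1.0, 1.0]                   # made-up rewards
returns = rewards_to_go(rewards, gamma=0.5)

baseline = sum(returns) / len(returns)      # constant baseline b
advantages = [g - baseline for g in returns]

delta = td_error(reward=1.0, v_s=0.5, v_next=2.0, gamma=0.5)

print('G(t)      :', returns)
print('G(t) - b  :', advantages)
print('TD error  :', delta)
```

Subtracting the baseline leaves the ordering of the returns unchanged while centering them around zero, which is what reduces the variance of the gradient estimate.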

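The part where "the actor's output is used as the critic's input" is the point that lets DDPG handle continuous values. Without this element the output becomes discrete, and that is the algorithm referred to here as actor-critic. Actor-critic is used in situations such as deciding "go right" when there are two choices of action, left and right.

Since DDPG works with continuous values, it can produce an output such as "turn the steering wheel 15.2° to the right while pressing the brake 20%." (Or so it should!)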
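
One common way to turn unbounded network outputs into bounded continuous actions like the steering and brake example is to squash them with tanh and rescale. This is a sketch under that assumption only: the ActorNN in this series does not apply tanh, and the function name squash_to_range, the raw outputs, and the ranges below are all made up for illustration.

```python
import math


def squash_to_range(raw, low, high):
    """Map an unbounded value into [low, high] using tanh."""
    unit = math.tanh(raw)                            # between -1 and 1
    return low + (unit + 1.0) * 0.5 * (high - low)   # rescale to [low, high]


raw_outputs = [0.3, -5.0]  # made-up raw actor outputs

steering_deg = squash_to_range(raw_outputs[0], -30.0, 30.0)  # hypothetical steering range
brake_ratio = squash_to_range(raw_outputs[1], 0.0, 1.0)      # hypothetical brake range

print('steering [deg]:', steering_deg)
print('brake ratio   :', brake_ratio)
```

For HalfCheetah the action range is -1 to +1 in every dimension, so a plain tanh on the last layer would already be enough.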

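Now let's code it a little at a time.

I will change things in small steps so that beginners in Python and programming can follow along.

1. Change the code so that the action is obtained through the agent instance's choose_action method.

Before: action = env.action_space.sample()

After: action = agent.choose_action(obs)

2. Create an instance of the agent class

agent = AgentDDPG()

3. Define the agent class

class AgentDDPG:
    def __init__(self):
        pass

    def choose_action(self, obs):
        pass

4. The policy (actor) is represented by a neural network. Create a new ActorNN class and use its instance, actor.

def choose_action(self, obs):
    action = self.actor.forward(obs)
    return action

5. Create an instance of the ActorNN class

def __init__(self):
    self.actor = ActorNN()

6. Create the ActorNN class

class ActorNN:
    def __init__(self):
        pass

    def forward(self, obs):
        action = [0.0 for i in range(6)]
        return action

The neural network structure has not been built yet, so obs is not used as an input and action is a placeholder output of [0, 0, 0, 0, 0, 0].

Fleshing out the ActorNN class

It is a fully connected network with three layers (input, one 64-unit hidden layer, output), and the activation function is ReLU.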

class ActorNN(nn.Module):
    def __init__(self, input_dim, output_dim):
        print('ActorNN.__init__ is working.')
        super(ActorNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, output_dim)

    def forward(self, obs):

        x = self.fc1(obs)
        x = F.relu(x)
        action = self.fc2(x)
        print('action :', action)

        return action
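
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the dimensions used in the main script (input_dim=17, output_dim=6) and the 64 hidden units from the class; the helper name linear_param_count is hypothetical.

```python
def linear_param_count(in_features, out_features):
    """A fully connected layer has one weight per input-output pair plus one bias per output."""
    return in_features * out_features + out_features


layer_sizes = [17, 64, 6]  # observation dim -> hidden units -> action dim
counts = [
    linear_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(counts)

print('parameters per layer:', counts)
print('total parameters    :', total)
```

fc1 contributes 17 * 64 + 64 = 1152 parameters and fc2 contributes 64 * 6 + 6 = 390, for 1542 in total.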

Fleshing out the AgentDDPG class

# 3. Define the agent class
class AgentDDPG:
    def __init__(self, input_dim, output_dim):
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.actor = ActorNN(input_dim=self.input_dim, output_dim=self.output_dim)

    def choose_action(self, obs):
        print('AgentDDPG.choose_action is working.')

        action = self.actor.forward(obs)
        action = action.detach().numpy()

        return action

Making the main script consistent with these classes

agent = AgentDDPG(input_dim=17, output_dim=6)    

env = gym.make("HalfCheetah-v4", render_mode= 'human')

EPISODES = 10
DELAY_TIME = 0.00 # sec
total_rewards = []
for episode in range(EPISODES):
    obs = env.reset()
    obs = T.tensor(obs[0], dtype=T.float)
    print('observation_space : ', env.observation_space)
    print('obs :', obs)

    reward: float = 0
    total_reward: float = 0
    done: bool = False
    for j in range(10):
        env.render()
        
        # Replace this part with DDPG step by step
        # action = env.action_space.sample()
        action = agent.choose_action(obs)

        next_state, reward, done, _, info = env.step(action)
        obs = next_state
        obs = T.tensor(obs, dtype=T.float)

        total_reward += reward
        
        time.sleep(DELAY_TIME)
    total_rewards.append(total_reward)

print('total_rewards : ', total_rewards)

env.close() 
print('script is done.')

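That is roughly how I built it.

About training

The actor is trained so that the objective J(θ) is maximized. In the actual computation it is trained so that -J(θ) is minimized.
The critic is trained by supervised learning, using target_actor to produce the "correct answer" data.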
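
The training notes above can be written out as a few lines of arithmetic. This is a sketch on plain numbers, assuming the usual DDPG formulation: the critic's target is built from the target networks, the actor's loss is the negated Q value, and the target networks follow the online ones through a soft update with rate tau. All function names and numbers are made up for illustration.

```python
def critic_target(reward, q_next_target, gamma, not_done):
    """TD target y = r + gamma * not_done * Q'(s', mu'(s')).

    q_next_target stands for the target critic evaluated at the target actor's action.
    """
    return reward + gamma * not_done * q_next_target


def actor_loss(q_values):
    """Minimizing -mean(Q) is the same as maximizing mean(Q)."""
    return -sum(q_values) / len(q_values)


def soft_update(target_param, online_param, tau):
    """Move the target parameter a fraction tau of the way toward the online one."""
    return tau * online_param + (1.0 - tau) * target_param


y_running = critic_target(reward=1.0, q_next_target=2.0, gamma=0.5, not_done=1)
y_terminal = critic_target(reward=1.0, q_next_target=2.0, gamma=0.5, not_done=0)
loss = actor_loss([1.0, 3.0])
new_target = soft_update(target_param=0.0, online_param=1.0, tau=0.25)

print('target (running) :', y_running)
print('target (terminal):', y_terminal)
print('actor loss       :', loss)
print('soft update      :', new_target)
```

The not_done factor plays the role of the terminal array stored in the replay buffer: at the end of an episode the bootstrapped term is switched off and the target is just the reward.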

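DDPG by gymnasium Day 1

Let's do deep reinforcement learning

Starting today, I will try out Deep Deterministic Policy Gradient (DDPG) using gymnasium, a training ground for reinforcement-learning AI.

Environment

Windows 10
8 GB memory
Core i7 7700
RTX 3070 Ti
Visual Studio Code
Python 3.7.9
venv virtual environment

Modules

This list will probably grow.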

Package              Version
-------------------- -------
absl-py              1.4.0
cloudpickle          2.2.1
Farama-Notifications 0.0.4
glfw                 2.5.9
gym                  0.26.2
gym-notices          0.0.8
gymnasium            0.28.1
imageio              2.28.1
importlib-metadata   6.1.0
jax-jumpy            1.0.0
mujoco               2.3.5
numpy                1.21.6
Pillow               9.5.0
pip                  23.1.2
pygame               2.3.0
PyOpenGL             3.1.6
setuptools           47.1.0
swig                 4.1.1
typing_extensions    4.5.0
zipp                 3.15.0

For now, let's run it without any learning.

We will go as far as running and rendering the agent model of gymnasium's HalfCheetah-v4 (half a cheetah, the animal?).

import gymnasium as gym
import time


env = gym.make("HalfCheetah-v4", render_mode= 'human')

EPISODES = 2
DELAY_TIME = 0.00 # sec
total_rewards = []
for episode in range(EPISODES):
    obs = env.reset()
    print('observation_space : ', env.observation_space)
    print('obs :', obs)
    reward = 0
    total_reward = 0
    done = False
    for j in range(100):
        env.render()
        action = env.action_space.sample()
        print('action_space : ', env.action_space)
        print('action : ', action)

        next_state, reward, done, _, info = env.step(action)
        print('next_state, reward, done, _, info :', next_state, reward, done, _, info)

        obs = next_state
        total_reward += reward
        time.sleep(DELAY_TIME)

    print('total_reward : ', total_reward)
    total_rewards.append(total_reward)

print('total_rewards : ', total_rewards)

env.close() # this method is empty, though...
print('script is done.')
# https://gymnasium.farama.org/

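Result

This is the output after changing the number of episodes to 1 and the number of steps to 1.

The observation space has 17 dimensions, each a continuous value from minus infinity to plus infinity.

The action space has 6 dimensions, each a continuous value from -1 to +1.

Since script is done. is printed at the end, there seems to be no particular problem.

However...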

observation_space :  Box(-inf, inf, (17,), float64)
obs : (array([ 0.02485869, -0.08434862, -0.04511497, -0.08579491,  0.00076974,
       -0.05903329,  0.09622602, -0.0461476 ,  0.07756127,  0.11203682,
        0.15562385, -0.22147787,  0.01759578,  0.07849912, -0.04187814,
        0.06876929,  0.00292594]), {})
action_space :  Box(-1.0, 1.0, (6,), float32)
action :  [-0.3660471   0.07370865 -0.33536777 -0.00359232 -0.06751072 -0.27815822]
next_state, reward, done, _, info : [ 7.62149187e-03 -8.01887540e-02 -1.37769448e-01  3.27668373e-02
 -1.22453461e-01 -1.61535935e-02  7.06252930e-04 -9.38258318e-02
 -2.00889555e-02 -5.68076861e-01  1.14528769e-01 -2.20827479e+00
  2.45804234e+00 -3.00804250e+00  1.02273244e+00 -2.36049060e+00
 -1.37703787e+00] -0.03053368795621509 False False {'x_position': -0.013300085731865507, 'x_velocity': 0.0028500719000089728, 'reward_run': 0.0028500719000089728, 'reward_ctrl': -0.03338375985622406}       
total_reward :  -0.03053368795621509
total_rewards :  [-0.03053368795621509]
script is done.
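
The output above shows that env.reset() returns a tuple of (observation, info) and that env.step() returns five values. The sketch below mimics only the shape of those return values with a hypothetical StubEnv class, so it runs without gymnasium or MuJoCo installed; the numbers inside it are made up.

```python
class StubEnv:
    """Hypothetical stand-in that mimics the shape of Gymnasium's reset/step return values."""

    def reset(self):
        obs = [0.0] * 17   # 17-dimensional observation
        info = {}
        return obs, info

    def step(self, action):
        next_obs = [0.1] * 17
        reward = -0.03
        terminated = False  # the episode reached a terminal state
        truncated = False   # the episode was cut off, e.g. by a time limit
        info = {'x_position': 0.0}
        return next_obs, reward, terminated, truncated, info


env = StubEnv()

# Unpack the tuple directly instead of indexing obs[0] afterwards
obs, info = env.reset()

next_state, reward, terminated, truncated, info = env.step([0.0] * 6)
done = terminated or truncated

print('obs length:', len(obs))
print('done      :', done)
```

In the scripts in this series, the variable named done receives the terminated flag and the underscore receives truncated.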

An error occurs

The script itself runs through to the last line and prints script is done. without trouble,

but something else shows up.

Exception ignored in: <function WindowViewer.__del__ at 0x0000025F7AC781F8>
Traceback (most recent call last):
  File "C:\dev\gym_test\gym_env\lib\site-packages\gymnasium\envs\mujoco\mujoco_rendering.py", line 335, in __del__
  File "C:\dev\gym_test\gym_env\lib\site-packages\gymnasium\envs\mujoco\mujoco_rendering.py", line 330, in free
  File "C:\dev\gym_test\gym_env\lib\site-packages\glfw\__init__.py", line 1278, in destroy_window
  File "C:\dev\gym_test\gym_env\lib\site-packages\glfw\__init__.py", line 691, in errcheck
  File "C:\dev\gym_test\gym_env\lib\site-packages\glfw\__init__.py", line 70, in _reraise
  File "C:\dev\gym_test\gym_env\lib\site-packages\glfw\__init__.py", line 670, in callback_wrapper
  File "C:\dev\gym_test\gym_env\lib\site-packages\glfw\__init__.py", line 916, in _handle_glfw_errors
TypeError: 'NoneType' object is not callable

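According to ChatGPT:

This error occurred when the WindowViewer object was destroyed during program execution. The WindowViewer object is used to render the Mujoco environment. The error may have occurred because of a problem with Mujoco rendering.

This error is usually safe to simply ignore. However, if it occurs frequently, there is a problem with rendering the Mujoco environment that needs to be resolved.

To resolve the problem, you can try the following.

Update the system's graphics driver.
Reinstall the virtual environment.
Change the rendering settings.
Upgrading the version of Mujoco may also resolve the problem.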

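That is what it said, but nothing helped, so I ignored the error.

env.close() is supposed to prevent this kind of error, but when I looked inside the method, it contained only a description and no code...

Next time

We will build DDPG little by little.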

maddpg_pytorch

This is MADDPG. It does not work properly either, but I am keeping it as a memo.

import os
import numpy as np
import torch as T
import torch.nn.functional as F
import torch.optim as optim
from make_env import make_env

""" pip install
pip uninstall torch
pip install torch==1.4.0

pip uninstall numpy
pip install numpy==1.14.5

"""

"""
#env = make_env('simple')
env = make_env('simple_adversary')
observation = env.reset()


print(observation)
print(observation[0])
print(env.action_space)
print(env.action_space[0])
"""

# Create the replay buffer class
class MultiAgentReplayBuffer:
    def __init__(self, max_size, critic_dims, actor_dims,
                 n_actions, n_agents, batch_size):
        # Save the arguments as attributes
        self.mem_cntr = 0 # memory counter

        self.mem_size = max_size # memory size
        self.critic_dims = critic_dims
        self.actor_dims = actor_dims
        self.n_actions = n_actions
        self.n_agents = n_agents
        self.batch_size = batch_size

        # Allocate the memory arrays
        self.state_memory = np.zeros((self.mem_size, critic_dims))
        self.new_state_memory = np.zeros((self.mem_size, critic_dims)) # same shape
        self.reward_memory = np.zeros((self.mem_size, n_agents))
        self.terminal_memory = np.zeros((self.mem_size, n_agents), dtype=bool) # mask so that terminal states contribute no value

        # Initialize the actor memory (the method has to be defined below)
        self.init_actor_memory()

        """ memo
        print(np.zeros((2,3)))
        [[0. 0. 0.]
        [0. 0. 0.]]
        """
    # Initialize the actor memory
    def init_actor_memory(self):
        self.actor_state_memory = []
        self.actor_new_state_memory = []
        self.actor_action_memory = []

        for i in range(self.n_agents):
            self.actor_state_memory.append(
                np.zeros((self.mem_size, self.actor_dims[i])))
            self.actor_new_state_memory.append(
                np.zeros((self.mem_size, self.actor_dims[i])))
            self.actor_action_memory.append(
                np.zeros((self.mem_size, self.n_actions)))

    # Store a transition
    def store_transition(self, raw_obs, state, action, reward,
                         raw_obs_, state_, done):
        if self.mem_cntr % self.mem_size == 0 and self.mem_cntr > 0:
            self.init_actor_memory()
        
        index = self.mem_cntr % self.mem_size

        for agent_idx in range(self.n_agents):
            self.actor_state_memory[agent_idx][index] = raw_obs[agent_idx]
            self.actor_new_state_memory[agent_idx][index] = raw_obs_[agent_idx]
            self.actor_action_memory[agent_idx][index] = action[agent_idx]

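        self.state_memory[index] = state # the 8-dimensional and 28-dimensional shapes disagree here
        self.new_state_memory[index] = state_
        self.reward_memory[index] = reward # store the reward too; sample_buffer reads reward_memory
        self.terminal_memory[index] = done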

        self.mem_cntr += 1

    # Sample a batch from the buffer
    def sample_buffer(self):
        max_mem = min(self.mem_cntr, self.mem_size)

        batch = np.random.choice(max_mem, self.batch_size, replace=False)

        states = self.state_memory[batch]
        rewards = self.reward_memory[batch]
        states_ = self.new_state_memory[batch]
        terminal = self.terminal_memory[batch]

        actor_states = []
        actor_new_states = []
        actions = []

        for agent_idx in range(self.n_agents):
            actor_states.append(self.actor_state_memory[agent_idx][batch])
            actor_new_states.append(self.actor_new_state_memory[agent_idx][batch])
            actions.append(self.actor_action_memory[agent_idx][batch])

        return actor_states, states, actions, rewards , \
                actor_new_states, states_, terminal
        
    def ready(self):
        if self.mem_cntr >= self.batch_size:
            return True
        return False

# Create the critic neural network
class CriticNetwork(T.nn.Module):
    def __init__(self, beta, input_dims, fc1_dims, fc2_dims,
                 n_agents, n_actions, chkpt_dir, name):
        # Call __init__() of T.nn.Module, the parent class of CriticNetwork, to initialize it
        super(CriticNetwork, self).__init__() # super() calls the parent class's __init__()
        #super(CriticNetwork, self).__init__() # parent class = super(current class name, current instance)

        self.chkpt_dir = chkpt_dir
        self.chkpt_file = os.path.join(chkpt_dir, name)

        self.fc1 = T.nn.Linear(input_dims+n_agents*n_actions, fc1_dims)
        self.fc2 = T.nn.Linear(fc1_dims, fc2_dims)
        self.q = T.nn.Linear(fc2_dims, 1) # single output

        self.optimizer = optim.Adam(self.parameters(), lr=beta)
        self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu')
        self.to(self.device)
        if not os.path.exists(self.chkpt_file):
            with open(self.chkpt_file, 'w'):
               pass
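    # Forward pass
    def forward(self, state, action):
        # The Q-function network takes two inputs: the current state and the action output by the policy (pi) network.
        x = T.cat((state, action), dim=1) # concatenate along the column dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        q = self.q(x) # single output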

        return q
    
    def save_checkpoint(self):

        if not os.path.exists(self.chkpt_file):
            with open(self.chkpt_file, 'w'):
               pass
        # state_dict is a Python dict object that holds the model's parameters
        T.save(self.state_dict(), self.chkpt_file)
    
    def load_checkpoint(self):
        self.load_state_dict(T.load(self.chkpt_file))

# Create the actor neural network
class ActorNetwork(T.nn.Module):
    def __init__(self, alpha, input_dims, fc1_dims, fc2_dims,
                 n_actions, chkpt_dir, name):
        # Call __init__() of T.nn.Module, the parent class of ActorNetwork, to initialize it
        super(ActorNetwork, self).__init__() # parent class = super(current class name, current instance)


        self.chkpt_file = os.path.join(chkpt_dir, name)

        self.fc1 = T.nn.Linear(input_dims, fc1_dims)
        self.fc2 = T.nn.Linear(fc1_dims, fc2_dims)
        self.pi = T.nn.Linear(fc2_dims, n_actions) # the policy pi has one output per action choice

        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.device = T.device('cuda' if T.cuda.is_available() else 'cpu')
        self.to(self.device)
        if not os.path.exists(self.chkpt_file):
            with open(self.chkpt_file, 'w'):
               pass
    # Forward pass
    def forward(self, state):
        # The policy (pi) network only needs state as input; its output is pi
        x = F.relu(self.fc1(state)) # an error occurs here
        x = F.relu(self.fc2(x))
        pi = T.softmax(self.pi(x), dim=1) # also along the column dimension, dim=1

        return pi
     
    def save_checkpoint(self):
        if not os.path.exists(self.chkpt_file):
            with open(self.chkpt_file, 'w'):
               pass
        # state_dict is a Python dict object that holds the model's parameters
        T.save(self.state_dict(), self.chkpt_file)
    
    def load_checkpoint(self):
        self.load_state_dict(T.load(self.chkpt_file))

class Agent:
    def __init__(self, agent_idx,
                 actor_dims, critic_dims,
                 n_agents, n_actions,
                 fc1=64, fc2=64,
                 alpha=0.01, beta=0.01,
                 gamma=0.95, tau=0.01,
                 chkpt_dir='tmp/maddpg/'):
        
        self.gamma = gamma
        self.tau = tau
        self.n_agents = n_agents
        self.n_actions = n_actions
        self.agent_name = 'agent_%s' % agent_idx
        self.chkpt_dir = chkpt_dir
        
        # Instantiate the actor and critic networks
        self.actor = ActorNetwork(alpha, actor_dims, fc1, fc2, n_actions,
                                  chkpt_dir=self.chkpt_dir, name=self.agent_name+'_actor')
        self.critic = CriticNetwork(beta, critic_dims, fc1, fc2, n_agents, n_actions,
                                    chkpt_dir=self.chkpt_dir, name=self.agent_name+'_critic')

        
        # Instantiate the target actor and target critic networks
        self.target_actor = ActorNetwork(alpha, actor_dims, fc1, fc2, n_actions,
                                          chkpt_dir=self.chkpt_dir, name=self.agent_name+'_target_actor')
        self.target_critic = CriticNetwork(beta, critic_dims, fc1, fc2, n_agents, n_actions,
                                           chkpt_dir=self.chkpt_dir, name=self.agent_name+'_target_critic')

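        # Update the network parameters (tau=1 copies the online weights into the targets)
        self.update_network_parameters(tau=1)

    # Update the network parameters
    def update_network_parameters(self, tau=None):
        if tau is None:
            tau = self.tau

        # Update the target actor's parameters from the actor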
        target_actor_params = self.target_actor.named_parameters()
        actor_params = self.actor.named_parameters()

        target_actor_state_dict = dict(target_actor_params)
        actor_state_dict = dict(actor_params)

        for name in actor_state_dict:
            actor_state_dict[name] = tau*actor_state_dict[name].clone() + \
                    (1-tau)*target_actor_state_dict[name].clone()
        self.target_actor.load_state_dict(actor_state_dict)

        # Update the target critic's parameters from the critic
        target_critic_params = self.target_critic.named_parameters()
        critic_params = self.critic.named_parameters()

        target_critic_state_dict = dict(target_critic_params)
        critic_state_dict = dict(critic_params)

        for name in critic_state_dict:
            critic_state_dict[name] = tau*critic_state_dict[name].clone() + \
                    (1-tau)*target_critic_state_dict[name].clone()
        self.target_critic.load_state_dict(critic_state_dict)

    # Choose an action
    def choose_action(self, observation):
        state = T.tensor(np.array([observation]), dtype=T.float).to(self.actor.device)
        actions = self.actor.forward(state)
        noise = T.rand(T.tensor(self.n_actions).to(self.actor.device)) # n_actions is an integer, so it is converted to a tensor
        action = actions + noise

        return action.detach().cpu().numpy()[0]
    
    # Save the models
    def save_models(self):
        self.actor.save_checkpoint()
        self.target_actor.save_checkpoint()
        self.critic.save_checkpoint()
        self.target_critic.save_checkpoint()
    
    # Load the models
    def load_models(self):
        self.actor.load_checkpoint()
        self.target_actor.load_checkpoint()
        self.critic.load_checkpoint()
        self.target_critic.load_checkpoint()

# Define the MADDPG class
class MADDPG:

    def __init__(self,
                 actor_dims, critic_dims, n_agents, n_actions,
                 fc1=64, fc2=64,
                 alpha=0.01, beta=0.01,
                 gamma=0.99, tau=0.01,
                 scenario='simple_adversary',
                 chkpt_dir='tmp/maddpg/'):
       
        self.agents = []
        self.n_agents = n_agents
        self.n_actions = n_actions
        chkpt_dir += scenario


        for agent_idx in range(self.n_agents):

            self.agents.append(Agent(agent_idx, 
                                     actor_dims[agent_idx], critic_dims,
                                     n_agents, n_actions,
                                     # Pass the constructor arguments through instead of hard-coding them
                                     fc1=fc1, fc2=fc2,
                                     alpha=alpha, beta=beta,
                                     gamma=gamma, tau=tau,
                                     chkpt_dir=chkpt_dir))


    def save_checkpoint(self):
        print('==== saving checkpoint ====')
        for agent in self.agents:
            agent.save_models()

    def load_checkpoint(self):
        print('==== loading checkpoint ====')
        for agent in self.agents:
            agent.load_models()
    
    def choose_action(self, raw_obs):
        actions = []
        for agent_idx, agent in enumerate(self.agents):
            action = agent.choose_action(raw_obs[agent_idx])
            actions.append(action)

        return actions
    
    def learn(self, memory):
        if not memory.ready():
            return
        
        # Sample a batch of transitions from the replay buffer
        actor_states, states, actions, rewards,\
        actor_new_states, states_, dones = memory.sample_buffer()

        # Use the same device as the networks (CUDA when available)
        device = self.agents[0].actor.device

        states = T.tensor(states, dtype=T.float).to(device)
        actions = T.tensor(actions, dtype=T.float).to(device)
        rewards = T.tensor(rewards, dtype=T.float).to(device)
        states_ = T.tensor(states_, dtype=T.float).to(device)
        dones = T.tensor(dones).to(device)

        # Containers for every agent's actions
        all_agents_new_actions = []
        all_agents_new_mu_actions = []
        old_agents_actions = []
        
        # Append each agent's actions to the lists above
        for agent_idx, agent in enumerate(self.agents):
            # First, build the next-state tensor new_states
            new_states = T.tensor(actor_new_states[agent_idx],
                                  dtype=T.float).to(device)
            # Forward pass through the target actor
            new_pi = agent.target_actor.forward(new_states) # calls ActorNetwork.forward (input is 1024x8 for agent 0)
            
            # Append the target policy's action new_pi
            all_agents_new_actions.append(new_pi)

            # Next, mu_states: mu is the current policy's action μ(θ) at the current states
            mu_states = T.tensor(actor_states[agent_idx],
                                 dtype=T.float).to(device)
            # Forward pass through the actor
            pi = agent.actor.forward(mu_states) # calls ActorNetwork.forward (input is 1024x8 for agent 0)
        
            # Append the current policy's action pi
            all_agents_new_mu_actions.append(pi)

            old_agents_actions.append(actions[agent_idx])

        new_actions = T.cat([acts for acts in all_agents_new_actions], dim=1)
        mu = T.cat([acts for acts in all_agents_new_mu_actions], dim=1)
        old_actions = T.cat([acts for acts in old_agents_actions], dim=1)

        for agent_idx, agent in enumerate(self.agents):
            
            # Target Q (1024x1): target critic on (next state, next action), inputs 1024x28 and 1024x15
            critic_value_ = agent.target_critic(states_, new_actions).flatten()
            critic_value_[dones[:,0]] = 0.0 # zero the value wherever agent 0's done flag is set in the batch of 1024
            
            # Compute Q
            critic_value = agent.critic(states, old_actions).flatten()
            
            # Discounted return: target = immediate reward r + (discount factor γ x next state-action value q)
            target = rewards[:, agent_idx] + (agent.gamma * critic_value_)
            
            # Critic loss
            critic_loss = F.mse_loss(target.detach(), critic_value)
            
            # Backpropagate the critic loss
            agent.critic.optimizer.zero_grad() # reset the gradients
            critic_loss.backward(retain_graph=True) # compute gradients from the loss
            agent.critic.optimizer.step() 

            # =================================

            # Actor loss
            actor_loss = agent.critic.forward(states, mu)#.flatten()
            actor_loss = - actor_loss.mean() # is mean really correct here?
            #actor_loss = - actor_loss

            # Backpropagate the actor loss
            agent.actor.optimizer.zero_grad()
            actor_loss.backward(retain_graph=True) # the error occurs here
            #actor_loss = actor_loss.detach() # my own addition: use detach() to cut off the computation graph
            agent.actor.optimizer.step()


            """Attempted improvement; it did not work, so I reverted to the code above
            with T.no_grad():
                agent.actor.optimizer.zero_grad()
                actor_loss_copy = actor_loss.clone() # make a copy
                actor_loss_copy.backward(retain_graph=True) # backpropagate through the copy
                actor_loss = actor_loss_copy.detach() # my own addition: use detach() to cut off the computation graph
                actor_loss = actor_loss_copy.detach() # use detach() to cut off the computation graph
                agent.actor.optimizer.step()
            """

            # Run the agent's parameter update (soft update of target_actor and target_critic)
            agent.update_network_parameters()
            # Repeat the above for all three agents

def obsavation_list_to_state_vector(observation):
    state = np.array([])
    for obs in observation:
        # Concatenate each agent's observation end to end
        state = np.concatenate([state, obs])
    return state
    
# Main script starts here
if __name__ == '__main__':

    # Turn on autograd anomaly detection
    #T.autograd.set_detect_anomaly(True)

    # Choose the scenario
    #scenario = 'simple'
    scenario = 'simple_adversary'

    # Create the environment
    env = make_env(scenario)
    # Number of agents
    n_agents = env.n # 3
    print('n_agents : ', n_agents) # 3
    # Initialize the list of actor input dimensions
    actor_dims = []

    # Loop over the agents
    for i in range(n_agents):
        # Append the size of agent i's observation space
        actor_dims.append(env.observation_space[i].shape[0]) #8, 10, 10
    print(f'actor_dims : {actor_dims}') # actor_dims : [8, 10, 10]
        
    # The critic's input dimension is the sum of every agent's observation size.
    # I wondered whether that sum should be the actor's input dimension instead, but in MADDPG
    # the critic is centralized and sees all agents' observations, while each actor sees only its own.
    critic_dims = sum(actor_dims) # 28 = 8 + 10 + 10
    
    # Size of the action space
    n_actions = env.action_space[0].n # 5

    # Create the MADDPG agents
    # args: actor dims [8,10,10], critic dims 28, number of agents 3, number of actions 5,
    #       64 nodes in the first layer, 64 nodes in the second layer, actor learning rate 0.01, critic learning rate 0.01,
    #       scenario simple_adversary, checkpoint directory

    maddpg_agents = MADDPG(actor_dims, critic_dims,
                           n_agents, n_actions,
                           fc1=64, fc2=64,
                           alpha=0.01, beta=0.01,
                           gamma=0.99, tau=0.01,
                           scenario=scenario,
                           chkpt_dir='tmp/maddpg/')
     
    # Create the replay buffer
    memory = MultiAgentReplayBuffer(1000000, critic_dims, actor_dims,
                                    n_actions, n_agents, batch_size=1024)
    
    # How often to print progress (in episodes)
    PRINT_INTERVAL = 500

    # Number of episodes
    N_GAMES = 30000

    # Maximum number of steps per episode
    MAX_STEPS = 25

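    # Initialization
    total_steps = 0
    best_score = 0
    score_history = []  # kept across episodes so the 100-episode average is meaningful
    
    # Training = False, evaluation = True
    evaluate = False # or True

    # When evaluating, load the trained model parameters
    if evaluate:
        maddpg_agents.load_checkpoint()

    # Repeat for the number of episodes
    for i in range(N_GAMES):
        # Reset the gym environment (initial positions and conditions)
        obs = env.reset() 
        score = 0
        done = [False] * n_agents # one flag per agent
        episode_step = 0

        # Keep stepping until at least one agent's done flag is True.
        # The step-limit check inside the loop sets every flag to True, which ends the episode.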
        while not any(done):
            if evaluate:
                env.render()
            
            # Pick each agent's action from the actor output for the current observations obs
            actions = maddpg_agents.choose_action(obs)

            # Step the environment to get the next observations, rewards, done flags, and extra info
            obs_, reward, done, info = env.step(actions)

            # Flatten the current observations obs into the state vector
            state = obsavation_list_to_state_vector(obs)
            
            # Flatten the next observations obs_ into the next state vector state_
            state_ = obsavation_list_to_state_vector(obs_)

            # Once the step limit is exceeded, force every agent's done flag to True
            if episode_step > MAX_STEPS:
                done = [True] * n_agents

            # Store the transition in the replay buffer
            memory.store_transition(obs, state, actions, reward, obs_, state_, done)

            # Learn once every 100 steps (training mode only)
            if total_steps % 100 == 0 and not evaluate:
                # Pass the replay buffer instance; learn() samples its batch from it
                maddpg_agents.learn(memory)

            # The next observations become the current observations
            obs = obs_  

            # Add all agents' rewards to the score
            score += sum(reward)
            
            # Advance the global step counter
            total_steps += 1

            # Advance the per-episode step counter
            episode_step += 1

        # Append the score to the history
        score_history.append(score)

        # Average score over the most recent 100 entries of the history
        avg_score = np.mean(score_history[-100:])

        if not evaluate:
            # If the average score beats the best score so far
            if avg_score > best_score:
                # Save a checkpoint
                maddpg_agents.save_checkpoint()
                # Record the average score as the new best score
                best_score = avg_score

        if i % PRINT_INTERVAL == 0 and i > 0:
            print('(episode)', i, 'average_score {:.1f}'.format(avg_score))


print('Script is done')        
#1:47:03
# #https://www.youtube.com/watch?v=tZTQ6S9PfkE
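
The target-network update in update_network_parameters() is a soft (Polyak) update: target <- tau * online + (1 - tau) * target. The following is a minimal sketch of that rule only, using NumPy arrays in plain dictionaries as stand-ins for the networks' parameters. The function name soft_update and the parameter names are hypothetical and are not part of the script above.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """Blend online parameters into the target dictionary.

    For every entry: target <- tau * online + (1 - tau) * target
    """
    for name, online_value in online_params.items():
        target_params[name] = tau * online_value + (1.0 - tau) * target_params[name]
    return target_params


# Hypothetical parameter dictionaries standing in for the networks' parameters.
online = {"fc1.weight": np.ones((2, 2)), "fc1.bias": np.ones(2)}
target = {"fc1.weight": np.zeros((2, 2)), "fc1.bias": np.zeros(2)}

# tau=1 copies the online parameters outright (done once at construction time).
soft_update(target, online, tau=1.0)
hard_copy = {name: value.copy() for name, value in target.items()}

# Pretend one training step changed the online bias.
online["fc1.bias"] = np.full(2, 3.0)

# tau=0.01 moves the target only 1% of the way toward the online values.
soft_update(target, online, tau=0.01)
print(target["fc1.bias"])
```

With tau=0.01 the target bias becomes 0.01 * 3 + 0.99 * 1 = 1.02, so the target networks trail the online networks slowly, which is the point of keeping them separate.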

Main

"""
Code for creating a multiagent environment with one of the scenarios listed
in ./scenarios/.
Can be called by using, for example:
    env = make_env('simple_speaker_listener')
After producing the env object, can be used similarly to an OpenAI gym
environment.

A policy using this environment must output actions in the form of a list
for all agents. Each element of the list should be a numpy array,
of size (env.world.dim_p + env.world.dim_c, 1). Physical actions precede
communication actions in this array. See environment.py for more details.
"""

def make_env(scenario_name, benchmark=False):
    '''
    Creates a MultiAgentEnv object as env. This can be used similar to a gym
    environment by calling env.reset() and env.step().
    Use env.render() to view the environment on the screen.

    Input:
        scenario_name   :   name of the scenario from ./scenarios/ to be loaded
                            (without the .py extension)
        benchmark       :   whether you want to produce benchmarking data
                            (usually only done during evaluation)

    Some useful env properties (see environment.py):
        .observation_space  :   Returns the observation space for each agent
        .action_space       :   Returns the action space for each agent
        .n                  :   Returns the number of Agents
    '''
    from multiagent.environment import MultiAgentEnv
    import multiagent.scenarios as scenarios

    # load scenario from script
    scenario = scenarios.load(scenario_name + ".py").Scenario()
    # create world
    world = scenario.make_world()
    # create multiagent environment
    if benchmark:        
        env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, scenario.benchmark_data)
    else:
        env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)
    return env
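
As the docstring says, the environment exchanges per-agent lists: one observation array per agent coming out, one action array per agent going in. The sketch below only illustrates those shapes with random NumPy data, so it runs without the multiagent package. The sizes [8, 10, 10] and 5 are the values the main script prints for 'simple_adversary' and are assumed here rather than read from a real env object.

```python
import numpy as np

# Assumed per-agent observation sizes and action count for 'simple_adversary'.
actor_dims = [8, 10, 10]
n_actions = 5

rng = np.random.default_rng(0)

# One observation array per agent, shaped like the list env.reset() is expected to return.
observations = [rng.standard_normal(dim) for dim in actor_dims]

# Joint state for the centralized critic: every agent's observation concatenated.
state = np.concatenate(observations)

# One action array per agent, in agent order, as the list handed to env.step().
actions = [rng.random(n_actions) for _ in actor_dims]

print(state.shape)                      # (28,)
print(len(actions), actions[0].shape)   # 3 (5,)
```

The concatenated length 28 = 8 + 10 + 10 is the critic_dims value computed in the main script.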

