Table of Contents
- Background and references:
- Reading the code in order:
- BC_Trainer
- **BCAgent**
- MLP_policy.py
- ReplayBuffer
- RL_Trainer
- collect_training_trajectories
- do_relabel_with_expert
- Analysis of results after completing the code
- BC Behaviour cloning
- Q1.3 analysis
- Training result comparison
- Problem collection
- numpy.core._exceptions.MemoryError: Unable to allocate 1.40 GiB for an array with shape (1000, 500, 1000, 3) and data type uint8
- Images never showing up
- pickle.load
- ptu.to_numpy() strips grad_fn
- **References**
- Installing Mujoco on Ubuntu
- Introduction
- Prerequisites
- License application
- Setting paths
Background and references:
- Lecture videos on Bilibili: https://www.bilibili.com/video/BV1dJ411W78A
- Official course page: http://rail.eecs.berkeley.edu/deeprlcourse/
- My code: https://gitee.com/kin_zhang/drl-hwprogramm/tree/kin/hw1/hw1
- Please read readme.md and installation.md in the original repo first. The lectures are from Fall 2019, but I did the homework directly from the latest Fall 2020 release.
- References (mainly code I consulted) are listed in the last section.
By the way, the RL book I flagged last time still hasn't been finished. This course was recommended by my classmate pjc and looked interesting, so I started watching the lectures and doing the homework, partly as a preview of the whole workflow. I will keep working on Carla later and set up an environment for some DRL/RL experiments. Let's see whether I can organize the course notes a bit better... right now probably only I can read them, haha.
Original notes on Notion
Reading the code in order:
- scripts/run_hw1.py (you should read this file, but you don’t need to edit it)
- infrastructure/rl_trainer.py
- agents/bc_agent.py (another read-only file)
- policies/MLP_policy.py
- infrastructure/replay_buffer.py
- infrastructure/utils.py
- infrastructure/pytorch_utils.py
First, read run_hw1.py in order; you can start from main() (a rough sketch follows the list below):
- Add the command-line arguments
- do_dagger: whether to keep querying the expert (DAgger) instead of plain BC
- Set up the logging directory, etc.
- Key step: build BC_Trainer
- Key step: run the training loop
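Here is a rough sketch of that main() flow, written from memory of the hw1 starter code; the exact argument names and logdir logic may differ slightly, so treat it as an outline rather than the real file:

```python
# rough outline of run_hw1.py's main(); argument names are from memory and may differ slightly
import argparse
import os
import time

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--expert_policy_file', '-epf', type=str, required=True)
    parser.add_argument('--expert_data', '-ed', type=str, required=True)
    parser.add_argument('--env_name', '-env', type=str, required=True)
    parser.add_argument('--do_dagger', action='store_true')     # DAgger vs. plain BC
    parser.add_argument('--n_iter', '-n', type=int, default=1)  # BC uses 1, DAgger needs >1
    parser.add_argument('--ep_len', type=int)
    parser.add_argument('--video_log_freq', type=int, default=5)
    args = parser.parse_args()
    params = vars(args)  # convert the args to a plain dict

    # set up the logging directory (simplified)
    logdir = params['env_name'] + '_' + time.strftime('%d-%m-%Y_%H-%M-%S')
    params['logdir'] = os.path.join('data', logdir)
    os.makedirs(params['logdir'], exist_ok=True)

    # the two key steps (BC_Trainer is defined further down in run_hw1.py):
    # trainer = BC_Trainer(params)
    # trainer.run_training_loop()

if __name__ == '__main__':
    main()
```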
BC_Trainer
- Take in the parameters
- Build the BCAgent
- Build the RL training object (RL_Trainer)
- Load the expert policy
Following that order, let's first step into how BCAgent is constructed.
BCAgent
It initializes the environment and parameters and sets up the actor, which takes us into the file below:
MLP_policy.py
self.actor.update(ob_no, ac_na) # HW1: you will modify this
TODO
From this call we know the actor/policy being constructed is MLPPolicySL. Jumping into it, we find we need to implement the policy update and the loss.
The loss is given by self.loss = nn.MSELoss(); for how to use it, I recommend going through the PyTorch 60-minute blitz first: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html
```python
output = net(input)
target = torch.randn(10)     # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)
```
So here we fill it in like this: the first argument is the action predicted from the current observations (which also tells us what to fill into get_action), and the second converts the expert actions from numpy to a tensor:

```python
loss = self.loss(self.forward(ptu.from_numpy(observations)), ptu.from_numpy(actions))
# the reason why we cannot use get_action: to_numpy would strip grad_fn
```
Next, fill in get_action: feed the current observation through the network to get the action; note the output has to be converted back to numpy:

```python
return ptu.to_numpy(self.forward(ptu.from_numpy(observation)))
```
Then fill in forward. From __init__ we know that when self.discrete is True the policy is the logits_na network, otherwise it is the mean_net network. So forward is:
```python
def forward(self, observation: torch.FloatTensor) -> Any:
    # raise NotImplementedError
    if self.discrete:
        return self.logits_na(observation)
    else:
        return self.mean_net(observation)
```
Finally, for backprop, the full workflow from the official PyTorch tutorial is:
```python
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()        # Does the update
```
By mimicking this procedure, the whole part can be completed:
```python
class MLPPolicySL(MLPPolicy):
    def __init__(self, ac_dim, ob_dim, n_layers, size, **kwargs):
        super().__init__(ac_dim, ob_dim, n_layers, size, **kwargs)
        self.loss = nn.MSELoss()

    def update(self, observations, actions,
               adv_n=None, acs_labels_na=None, qvals=None):
        # DONE TODO: update the policy and return the loss
        self.optimizer.zero_grad()  # zeroes the gradient buffers of all parameters
        # use forward (not get_action), since to_numpy would strip grad_fn
        loss = self.loss(self.forward(ptu.from_numpy(observations)), ptu.from_numpy(actions))
        loss.backward()             # backprop
        self.optimizer.step()       # does the update
        return {
            # You can add extra logging information here, but keep this line
            'Training Loss': ptu.to_numpy(loss),
        }
```
ReplayBuffer
What needs to be completed here is random sampling. Following the hint, we pick random entries from each of the 5 component arrays, i.e. using np.random.permutation(len(self)). Since we need batch_size such indices, the whole thing becomes:
```python
def sample_random_data(self, batch_size):
    assert (
        self.obs.shape[0]
        == self.acs.shape[0]
        == self.rews.shape[0]
        == self.next_obs.shape[0]
        == self.terminals.shape[0]
    )
    ## TODO return batch_size number of random entries from each of the 5 component arrays above
    ## HINT 1: use np.random.permutation to sample random indices
    ## HINT 2: return corresponding data points from each array (i.e., not different indices from each array)
    ## HINT 3: look at the sample_recent_data function below
    indices = np.random.permutation(len(self))[:batch_size]
    return self.obs[indices], self.acs[indices], self.rews[indices], self.next_obs[indices], self.terminals[indices]
```
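As a tiny standalone illustration of HINT 2 (use the same random indices for every array), with dummy arrays instead of the real buffer:

```python
import numpy as np

# dummy stand-ins for two of the five component arrays
obs = np.arange(10).reshape(10, 1)
acs = np.arange(10).reshape(10, 1) * 10
batch_size = 3

indices = np.random.permutation(len(obs))[:batch_size]
print(indices)                                     # e.g. [7 2 5]
print(obs[indices].ravel(), acs[indices].ravel())  # rows stay paired because the indices are shared
```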
With that, the TODOs in this part are done.
RL_Trainer
Back in run_hw1.py: after the BCAgent is constructed, we need to build what is inside RL_Trainer:
- Initialization: read the params, set up the logger, create the TF/logging part
- Environment setup
- Agent creation
Then you can see that run_hw1.py directly calls self.rl_trainer.run_training_loop, and the part we need to complete there is collecting the training trajectories.
collect_training_trajectories
The hints already explain that on the first iteration we should load the expert paths:
```python
# DONE TODO decide whether to load training data or use the current policy to collect more data
# HINT: depending on if it's the first iteration or not, decide whether to either
# (1) load the data. In this case you can directly return as follows
#     ``` return loaded_paths, 0, None ```
# (2) collect `self.params['batch_size']` transitions
if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    return loaded_paths, 0, None

# DONE TODO collect `batch_size` samples to be used for training
# HINT1: use sample_trajectories from utils
# HINT2: you want each of these collected rollouts to be of length self.params['ep_len']
print("\nCollecting data to be used for training...")
paths, envsteps_this_batch = utils.sample_trajectories(
    self.env, collect_policy, batch_size, self.params['ep_len'],
    render=False, render_mode=('rgb_array'))
# signature: sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array'))
```
The second TODO is spelled out pretty clearly by the hints: call that function from utils, jump over there to see what parameters it needs, and pass in what we already know. While jumping in, we find that sample_trajectories itself also has a TODO to complete. Following its hints, use sample_trajectory to get each path and get_pathlength to count the timesteps:
```python
def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect rollouts until we have collected min_timesteps_per_batch steps.

    TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    Hint2: use get_pathlength to count the timesteps collected in each path
    """
    timesteps_this_batch = 0
    paths = []
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length)
        paths.append(path)
        timesteps_this_batch += get_pathlength(path)
    return paths, timesteps_this_batch
```
Then let's look at sample_trajectory. First you need the basic usage of gym; see the official site https://gym.openai.com/:
```python
import gym

env = gym.make("CartPole-v1")
observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()  # your agent here (this takes random actions)
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()
```
Completing it follows the same pattern: reset the environment with ob = env.reset(), and so on; see the gitee link for the full version, and a rough sketch follows below:
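A minimal sketch of how sample_trajectory can be filled in, assuming the Path(...) helper that the starter utils.py uses to pack the lists into a dict; names are approximate and the gitee repo has the real version:

```python
def sample_trajectory(env, policy, max_path_length, render=False, render_mode=('rgb_array')):
    # reset the env for the beginning of a new rollout
    ob = env.reset()
    obs, acs, rewards, next_obs, terminals, image_obs = [], [], [], [], [], []
    steps = 0
    while True:
        if render and 'rgb_array' in render_mode:
            image_obs.append(env.render(mode='rgb_array'))
        obs.append(ob)
        ac = policy.get_action(ob)[0]   # get_action returns a batch, take the first row
        acs.append(ac)
        ob, rew, done, _ = env.step(ac)
        steps += 1
        next_obs.append(ob)
        rewards.append(rew)
        # as noted later in this post, I only cut the rollout on max_path_length, not on `done`
        rollout_done = 1 if steps >= max_path_length else 0
        terminals.append(rollout_done)
        if rollout_done:
            break
    return Path(obs, image_obs, acs, rewards, next_obs, terminals)  # Path comes from utils.py
```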
Back in the main trajectory-collection code there is one more TODO: sample_n_trajectories, which simply collects ntraj paths:
```python
def sample_n_trajectories(env, policy, ntraj, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect ntraj rollouts.

    TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    """
    paths = []
    for i in range(ntraj):
        path = sample_trajectory(env, policy, max_path_length)
        paths.append(path)
    return paths
```
do_relabel_with_expert
After collecting, back in RL_Trainer the next step is to relabel the collected observations with the expert policy. Hmm, I got lazy about writing the rest up this carefully... just go read the code, haha (a minimal sketch of the relabel step is below anyway).
Finished code: https://gitee.com/kin_zhang/drl-hwprogramm/tree/kin/hw1/hw1
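A minimal sketch of the relabel step, assuming the path dicts use the 'observation' and 'action' keys produced by utils.Path (check the real code in the repo):

```python
def do_relabel_with_expert(self, expert_policy, paths):
    # DAgger: replace the actions our policy took with what the expert would have done
    print("\nRelabelling collected observations with labels from an expert policy...")
    for path in paths:
        path["action"] = expert_policy.get_action(path["observation"])
    return paths
```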
Analysis of results after completing the code
BC Behaviour cloning
First, as I mentioned in solution.md, ep_len defaults to 1000, but when I ran the sweep below, my ep_len started from 100:
```python
###################
### RUN TRAINING
###################

# To run repeatedly, uncomment this block and comment out the two lines below it
add_item = 100
for step in range(25):
    params['ep_len'] = add_item * (step + 1)
    trainer = BC_Trainer(params)
    trainer.run_training_loop()

# trainer = BC_Trainer(params)
# trainer.run_training_loop()
```
Then I exported the results to csv and plotted them:
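A minimal plotting sketch; the csv filename and column names here (bc_ant_results.csv, ep_len, Eval_AverageReturn) are hypothetical placeholders for whatever your TensorBoard export produces:

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical file and column names; adjust to your own export
df = pd.read_csv('bc_ant_results.csv')
plt.plot(df['ep_len'], df['Eval_AverageReturn'], marker='o')
plt.xlabel('ep_len')
plt.ylabel('Eval_AverageReturn')
plt.title('BC return vs. ep_len (Ant-v2)')
plt.show()
```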
Hmm, it looks about right: at ep_len = 1000 the return is exactly in that range. This reminds me of something from the lectures: if BC makes a mistake early on, it is very hard to recover later. Is that why the expert data is loaded, but only once? The return at the start stays the highest (with the default ep_len = 1000).
Question 2 analyses the number of iterations, but I don't think n_iter is really "iterations" in the sense I expected; shouldn't iterations mean repeatedly learning on the same observations? Oh, I think I remember now: with ep_len=1000 and eval_batch_size=5000, that is exactly collecting 5 trajectories; that is what controls how many rollouts are evaluated. By default both are 1000, so it is just one. (How did I manage to finish this without understanding the parameters... mostly because the HINTs basically tell you what to write.)
At first I thought that the larger n_iter is, the farther the robots walk... but no, that was wrong: running with n_iter=300 still gives the same few steps, so n_iter is definitely not what controls the rollout length.
Maybe because I wrote solution.md a day after writing the code, I had already forgotten. Going back to the code, it is the path length that matters:
```python
def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect rollouts until we have collected min_timesteps_per_batch steps.

    DONE TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    Hint2: use get_pathlength to count the timesteps collected in each path
    """
    timesteps_this_batch = 0
    paths = []
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length, render)
        paths.append(path)
        timesteps_this_batch += get_pathlength(path)
    return paths, timesteps_this_batch
```
This is the spot. The training call passes in:

```python
paths, envsteps_this_batch = utils.sample_trajectories(
    self.env, collect_policy, batch_size, self.params['ep_len'],
    render=False, render_mode=('rgb_array'))
```

so max_path_length is self.params['ep_len']. For logging, the eval call is:

```python
eval_paths, eval_envsteps_this_batch = utils.sample_trajectories(
    self.env, eval_policy, self.params['eval_batch_size'], self.params['ep_len'])
```
But when recording videos, the length actually switches to MAX_VIDEO_LEN. Here the original starter code seems to have a small mistake: overwriting MAX_VIDEO_LEN in __init__ fails unless you declare it global (a small demo of the scoping issue follows below):
train_video_paths = utils.sample_n_trajectories(self.env, collect_policy, MAX_NVIDEO, MAX_VIDEO_LEN, True)
- May 18, 2021: fixed the global issue and increased the length.
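Here is a tiny demo of that scoping issue (plain Python, not the homework file): assigning inside a function without global just creates a local variable, so the module-level MAX_VIDEO_LEN never changes.

```python
MAX_VIDEO_LEN = 40  # module-level default, like in rl_trainer.py

def overwrite_without_global(ep_len):
    MAX_VIDEO_LEN = ep_len          # only creates a local variable

def overwrite_with_global(ep_len):
    global MAX_VIDEO_LEN
    MAX_VIDEO_LEN = ep_len          # actually updates the module-level value

overwrite_without_global(1000)
print(MAX_VIDEO_LEN)                # 40 -> the "overwrite" silently failed
overwrite_with_global(1000)
print(MAX_VIDEO_LEN)                # 1000
```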
To summarize the parameters above:
- n_iter: the number of iterations, i.e. how many times the overall training loop runs. Even the first iteration can produce a policy that walks ep_len steps, but the train loss may still be large and the behaviour poor.
- ep_len: the max length of episodes. In rl_trainer.py you can see it determines MAX_VIDEO_LEN, and together with eval_batch_size it determines how many trajectory sequences are collected: self.params['ep_len'] = self.params['ep_len'] or self.env.spec.max_episode_steps
- eval_batch_size: the amount of eval data collected (in the env) for logging metrics, i.e. how much data we sample from the environment for evaluation. Roughly, the rollout length is decided by ep_len, while the amount of eval data is decided by this; by default both are 1000.

```python
# rl_trainer.py: collecting training data
print("\nCollecting data to be used for training...")
paths, envsteps_this_batch = utils.sample_trajectories(
    self.env, collect_policy, batch_size, self.params['ep_len'])

# rl_trainer.py: collecting eval trajectories, for logging
print("\nCollecting data for eval...")
eval_paths, eval_envsteps_this_batch = utils.sample_trajectories(
    self.env, eval_policy, self.params['eval_batch_size'], self.params['ep_len'])

# function definition
def sample_trajectories(env, policy, min_timesteps_per_batch, max_path_length, render=False, render_mode=('rgb_array')):
    """
    Collect rollouts until we have collected min_timesteps_per_batch steps.

    DONE TODO implement this function
    Hint1: use sample_trajectory to get each path (i.e. rollout) that goes into paths
    Hint2: use get_pathlength to count the timesteps collected in each path
    """
    timesteps_this_batch = 0
    paths = []
    while timesteps_this_batch < min_timesteps_per_batch:
        path = sample_trajectory(env, policy, max_path_length, render)
        paths.append(path)
        timesteps_this_batch += get_pathlength(path)
    return paths, timesteps_this_batch

# inside sample_trajectory
while True:
    # ...
    rollout_done = 1 if steps >= max_path_length else 0  # HINT: this is either 0 or 1
    if rollout_done:
        break
    # ...
```
From this you can see: min_timesteps_per_batch is the minimum number of timesteps per batch (default 1000), and max_path_length is the length of each path, i.e. how far you want the agent to walk (default also 1000). [In the gifs, only 40 steps are shown by default to keep the folder size down.]
As hw1.pdf describes, if ep_len=1000 and eval_batch_size=5000, the while loop above runs about five times, i.e. we get five trajectories (five rollout sequences); the rewards are summed within each rollout, and the mean and std over the five rollouts are used to judge how well the policy performs.
So does that solve the Q1.2 mystery? I swept ep_len from 100 to 2500, so the number of rollouts goes from many to few; with many rollouts some very bad ones might appear, while with few it follows the expert relabels exactly?
But I checked: every path gets relabelled, so the gap shouldn't be that large...
After rethinking it, here is the actual explanation:
The earlier observation is right that as ep_len goes from 100 to 2500 the number of rollouts goes from many to few, but the reported return is the average over rollouts of length ep_len. If I walk 100 steps and run 10 rollouts, I average ten 100-step returns; if I walk 1000 steps in a single rollout, I get the sum over 1000 steps. So obviously the more steps you walk, the larger the return (as long as rewards are non-negative).
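In symbols (my own summary, not from the handout): with N eval rollouts of length T and a roughly constant, non-negative per-step reward $\bar r$,

$$\text{Eval\_AverageReturn} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} r_{i,t} \approx T\,\bar r,$$

so the reported return grows roughly linearly with ep_len, which matches the Q1.2 curve.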
Q1.3 analysis
First, even though run_hw1.py has many tunable parameters, remember this is BC with itr=0, so it never reaches the "collecting data to be used for training" step. So the network-related parameters train_batch_size, learning_rate, size, n_layers and batch_size are not interesting for this question (they matter in Q2). ep_len was analysed above: changing it just changes the total amount, and unless you keep the total fixed while varying the trajectories, there is not much to analyse.
Concretely, in collect_training_trajectories the first iteration directly returns the initial expert data, and in the BC case there is only that one iteration:
```python
if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    return loaded_paths, 0, None
```
But then hw1.pdf says the tunable parameters include the training data set and the amount of data drawn from the expert dataset, which confuses me... The latter I can understand, but the former has no effect at all, because the "collecting data to be used for training" step is never reached... And for the latter there is no command-line parameter for how much expert data to use either, so you would have to modify the source code (see the sketch below).
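For the record, a minimal sketch of how I would vary the amount of expert data by editing collect_training_trajectories; num_expert_paths is a hypothetical knob you would add yourself, not an existing parameter:

```python
# my own (hypothetical) modification inside collect_training_trajectories
num_expert_paths = 5  # hypothetical knob: how many expert rollouts to keep

if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    loaded_paths = loaded_paths[:num_expert_paths]  # use only part of the expert dataset
    return loaded_paths, 0, None
```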
Training result comparison
These were all run with ep_len = 200. I tried 1000 once and the agent walked out of bounds; the physics was still there, just the floor disappeared visually, and it kept running on an invisible floor, haha.
First, the comparison in the Ant-v2 environment:
In early iterations, e.g. n=5, the one on the right flips over and stops, while by iteration 95 it walks the whole way without flipping.
The Ant-v2 comparison above is not that striking because the task is fairly easy, so here is a clearer one in the Humanoid-v2 environment. [When I wrote this I did not terminate on done, so even after falling it keeps struggling on the ground, haha.]
n=5 ep_len=100
n=95 ep_len=100
OK, that's about it for the analysis.
Problem collection
numpy.core._exceptions.MemoryError: Unable to allocate 1.40 GiB for an array with shape (1000, 500, 1000, 3) and data type uint8
This error appears when ep_len is too large while saving videos.
Reference: https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type
```bash
# check whether overcommit_memory is allowed
cat /proc/sys/vm/overcommit_memory

# switch to root
sudo su

# change it to allow overcommit
echo 1 > /proc/sys/vm/overcommit_memory
```
Images never showing up
The pdf says that if you want to save images and videos, remove --video_log_freq -1. Actually... you also need to make sure render ends up True in sample_trajectory(env, policy, max_path_length, render=False, render_mode=('rgb_array')) in the code, otherwise it errors out. Printing things shows that p['image_obs'] is empty, and following that thread leads straight to the problem.
pickle.load
```python
if itr == 0:
    with open(load_initial_expertdata, 'rb') as f:
        loaded_paths = pickle.loads(f.read())
    return loaded_paths, 0, None
```
Here the first iteration opens the expert-data file and reads it in. At first I wrote pickle.loads(f), which fails because loads expects bytes rather than a file object (pickle.load(f) would work just as well).
ptu.to_numpy() strips grad_fn
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I ran into this because at first I kept writing:
loss = self.loss(self.get_action(observations), ptu.from_numpy(actions))
```python
class MLPPolicySL(MLPPolicy):
    def __init__(self, ac_dim, ob_dim, n_layers, size, **kwargs):
        super().__init__(ac_dim, ob_dim, n_layers, size, **kwargs)
        self.loss = nn.MSELoss()

    def update(self, observations, actions,
               adv_n=None, acs_labels_na=None, qvals=None):
        # DONE TODO: update the policy and return the loss
        self.optimizer.zero_grad()  # zeroes the gradient buffers of all parameters
        loss = self.loss(self.forward(ptu.from_numpy(observations)), ptu.from_numpy(actions))
        # the reason why we cannot use get_action: to_numpy will remove grad_fn
        loss.backward()             # backprop
        self.optimizer.step()       # does the update
        return {
            # You can add extra logging information here, but keep this line
            'Training Loss': ptu.to_numpy(loss),
        }
```
And get_action was:
```python
def get_action(self, obs: np.ndarray) -> np.ndarray:
    if len(obs.shape) > 1:
        observation = obs
    else:
        observation = obs[None]
    # DONE TODO return the action that the policy prescribes
    return ptu.to_numpy(self.forward(ptu.from_numpy(observation)))
```
Since the final output needs to be an np.ndarray anyway, I did not notice anything wrong until the error above appeared. I then thought the loss had no data and even printed loss.shape, which misled me further, because a scalar tensor prints as torch.Size([]); so I mistakenly believed there was no data and stayed confused. In the end Jiancong came over to look at my code; printing the loss showed that grad_fn was gone, and then it clicked.
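To make the failure easy to reproduce, here is a tiny self-contained PyTorch demo (not the homework code) of why a numpy round-trip kills the graph:

```python
import torch
import torch.nn as nn

net = nn.Linear(3, 2)
x = torch.randn(4, 3)
target = torch.randn(4, 2)
criterion = nn.MSELoss()

# correct: keep everything as tensors, the graph is preserved
loss_ok = criterion(net(x), target)
print(loss_ok.requires_grad, loss_ok.grad_fn)    # True, <MseLossBackward...>

# broken: a numpy round-trip (what get_action effectively does) detaches the graph
pred_np = net(x).detach().numpy()
loss_bad = criterion(torch.from_numpy(pred_np), target)
print(loss_bad.requires_grad, loss_bad.grad_fn)  # False, None
# loss_bad.backward()  # -> RuntimeError: element 0 of tensors does not require grad ...
```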
References
A few GitHub repos I consulted while writing the code:
- https://github.com/cww97/cs285_fall2020_cww/tree/main/hw1
- https://github.com/vincentkslim/cs285_homework_fall2020/tree/master
- https://github.com/mdeib/berkeley-deep-RL-pytorch-solutions
Installing Mujoco on Ubuntu
Introduction
Mujoco: MuJoCo is owned by Roboti LLC and was initially used by the Movement Control Laboratory at the University of Washington. MuJoCo stands for Multi-Joint dynamics with Contact and is a physics engine that provides simulation environments for research in several areas related to robotics, biomechanics, and graphics. A minimal installation of OpenAI Gym doesn't include MuJoCo because MuJoCo needs to be properly licensed, which can cost up to $2000 unless you are a student (the student license in turn has some very strict clauses regarding publication). Either way, you can still install it and use it for personal projects as much as you want with a student license or a 30-day trial license.
Prerequisites
sudo apt-get install libosmesa6-dev
conda install anaconda patchelf
License application
Application page: https://www.roboti.us/license.html; you can apply with your university (.edu) email.
Setting paths
These examples are in the Windows (cmd) form; adjust the paths to your own setup:

```
> set MUJOCO_PY_MJKEY_PATH=C:\path\to\.mujoco\mjkey.txt
> set MUJOCO_PY_MUJOCO_PATH=C:\Users\zhangqingwen\Downloads\AProgramm\drl-hwprogramm\mujoco200_win64\bin
> set PATH=C:\Users\zhangqingwen\Downloads\AProgramm\drl-hwprogramm\mujoco200_win64\bin
```
If you are on the trial license: download the key tool, then on Ubuntu run chmod +x getid_linux followed by ./getid_linux.