'''Actor-Critic with Experience Replay''' ('''ACER''') is an algorithm proposed by the DeepMind team in 2017, with the paper published at {{le|ICLR|International Conference on Learning Representations}}. It is a deep reinforcement learning actor-critic algorithm with experience replay that achieves good results across a range of environments, including 57 Atari games and several continuous control problems.[1]
In reinforcement learning, interacting with the environment is costly: unlike ordinary classification or regression problems, it consumes a great deal of time and system resources. Sample-efficient methods allow an algorithm to obtain good results with relatively little environment interaction, and experience replay is an effective way to improve sample efficiency. When the policy used to collect samples differs from the policy being learned to select actions, the setting is called off-policy control.
ACER is an off-policy actor-critic algorithm.
In the off-policy setting, the sampled trajectories come from a policy other than the one being learned, whereas in the on-policy setting the sampling policy and the action-selection policy coincide. Importance sampling is therefore needed to correct for this mismatch. With the importance sampling weights included, the policy gradient can be written as
<math display="block">\hat{g}=\left(\prod_{t=0}^{k}\rho_{t}\right)\sum_{t=0}^{k}\left(\sum_{i=0}^{k-t}\gamma^{i}r_{t+i}\right)\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})</math>
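As a concrete illustration, the following minimal NumPy sketch evaluates this estimator for one sampled trajectory; it assumes the per-step score functions grad_log_pi, the rewards and the importance ratios rho have already been computed (all names are illustrative, not from the paper).
<syntaxhighlight lang="python">
import numpy as np

def importance_weighted_pg(grad_log_pi, rewards, rho, gamma=0.99):
    """g_hat = (prod_t rho_t) * sum_t (sum_i gamma^i r_{t+i}) * grad_theta log pi(a_t|x_t)."""
    k = len(rewards)
    # Discounted return-to-go at every step t: sum_{i=0}^{k-t} gamma^i * r_{t+i}
    returns = np.zeros(k)
    running = 0.0
    for t in reversed(range(k)):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Product of the importance ratios over the whole trajectory
    traj_weight = np.prod(rho)
    # Importance-weighted sum of the per-step score functions
    return traj_weight * np.sum(returns[:, None] * grad_log_pi, axis=0)

# Toy usage with random placeholder data (5 steps, 3 policy parameters)
rng = np.random.default_rng(0)
g_hat = importance_weighted_pg(rng.normal(size=(5, 3)), rng.random(5), rng.random(5) + 0.5)
</syntaxhighlight>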
According to ''Off-Policy Actor-Critic'',[2] the off-policy policy gradient can be decomposed as
<math display="block">g=\mathbb{E}_{\beta}\left[\rho_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\,Q^{\pi}(x_{t},a_{t})\right]</math>
Because the importance sampling ratio
<math display="block">\rho_{t}={\frac{\pi(a_{t}|x_{t})}{\mu(a_{t}|x_{t})}}</math>
can be extremely large or extremely small, it can severely destabilize the algorithm, so ACER uses truncated importance weights with bias correction, so that
<math display="block">\mathbb{E}_{\mu}\left[\rho_{t}\,\cdots\right]=\mathbb{E}_{\mu}\left[\bar{\rho}_{t}\,\cdots\right]+\mathbb{E}_{a\sim\pi}\left[\left[\frac{\rho_{t}(a)-c}{\rho_{t}(a)}\right]_{+}\cdots\right]</math>
where <math>\bar{\rho}_{t}=\min(c,\rho_{t})</math>. This transformation introduces no additional bias, and each of the two resulting terms is bounded: in the first term the truncated weight satisfies <math>\bar{\rho}_{t}\leq c</math>, and in the second term the correction weight satisfies <math>\left[\frac{\rho_{t}(a)-c}{\rho_{t}(a)}\right]_{+}\leq 1</math>.
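For a discrete action space the decomposition can be checked numerically. The sketch below uses made-up behaviour and target distributions mu and pi and an arbitrary per-action quantity f (illustrative names only) and verifies that the truncated term plus the bias-correction term recover the untruncated expectation.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_actions, c = 4, 2.0
mu = rng.random(n_actions); mu /= mu.sum()      # behaviour policy mu(.|x)
pi = rng.random(n_actions); pi /= pi.sum()      # target policy pi(.|x)
f = rng.normal(size=n_actions)                  # arbitrary per-action quantity

rho = pi / mu                                   # importance ratios rho(a)
rho_bar = np.minimum(c, rho)                    # truncated weights min(c, rho)

lhs = np.sum(mu * rho * f)                                        # E_mu[rho * f]
truncated = np.sum(mu * rho_bar * f)                              # E_mu[rho_bar * f]
correction = np.sum(pi * np.maximum(0.0, (rho - c) / rho) * f)    # E_{a~pi}[[(rho(a)-c)/rho(a)]_+ * f]

assert np.isclose(lhs, truncated + correction)
</syntaxhighlight>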
The action-value function <math>Q^{\pi}(x_{t},a_{t})</math> is estimated with the Retrace technique:
<math display="block">Q^{ret}(x_{t},a_{t})=r_{t}+\gamma\bar{\rho}_{t+1}\left[Q^{ret}(x_{t+1},a_{t+1})-Q(x_{t+1},a_{t+1})\right]+\gamma V(x_{t+1})</math>
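A backward-recursion sketch of this Retrace estimate over one trajectory; it assumes per-step arrays of rewards, truncated ratios rho_bar and critic values q and v (illustrative names), with the value at the final state used to bootstrap <math>Q^{ret}</math> (0 for a terminal state).
<syntaxhighlight lang="python">
import numpy as np

def retrace_targets(rewards, rho_bar, q, v, gamma=0.99):
    """Backward recursion for Q^ret(x_t, a_t).

    rewards has length k (steps 0..k-1); rho_bar, q and v have length k+1
    (steps 0..k) so that the t+1 terms are available. Q^ret at step k is
    bootstrapped with v[k] (pass 0 there if x_k is terminal).
    """
    k = len(rewards)
    q_ret = np.zeros(k + 1)
    q_ret[k] = v[k]  # bootstrap at the end of the trajectory
    for t in reversed(range(k)):
        # Q^ret_t = r_t + gamma * rho_bar_{t+1} * (Q^ret_{t+1} - Q_{t+1}) + gamma * V_{t+1}
        q_ret[t] = (rewards[t]
                    + gamma * rho_bar[t + 1] * (q_ret[t + 1] - q[t + 1])
                    + gamma * v[t + 1])
    return q_ret[:k]
</syntaxhighlight>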
The Q and V estimates above use the dueling network architecture, with the Q estimate computed by sampling:
<math display="block">{\tilde{Q}}_{\theta_{v}}(x_{t},a_{t})\sim V_{\theta_{v}}(x_{t})+A_{\theta_{v}}(x_{t},a_{t})-{\frac{1}{n}}\sum_{i=1}^{n}A_{\theta_{v}}(x_{t},u_{i}),\quad u_{i}\sim\pi_{\theta}(\cdot|x_{t})</math>
so that the network outputs are <math>Q_{\theta_{v}}</math> and <math>A_{\theta_{v}}</math>.
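A sketch of this sampled estimate for a single state of a discrete-action policy, where v_x, adv_x and pi_x stand in for the dueling network's value output, its advantage outputs and the current policy probabilities (illustrative names, not the paper's code):
<syntaxhighlight lang="python">
import numpy as np

def q_tilde(v_x, adv_x, a_t, pi_x, n=5, rng=None):
    """Q~(x_t,a_t) = V(x_t) + A(x_t,a_t) - (1/n) * sum_i A(x_t,u_i), with u_i ~ pi(.|x_t)."""
    rng = rng or np.random.default_rng()
    u = rng.choice(len(pi_x), size=n, p=pi_x)   # sample u_1..u_n from the current policy
    return v_x + adv_x[a_t] - adv_x[u].mean()
</syntaxhighlight>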
Combining the three ingredients above yields the ACER off-policy policy gradient:
<math display="block">\hat{g}_{t}^{acer}=\bar{\rho}_{t}\nabla_{\phi_{\theta}(x_{t})}\log f(a_{t}|\phi_{\theta}(x_{t}))\left[Q^{ret}(x_{t},a_{t})-V_{\theta_{v}}(x_{t})\right]+\mathbb{E}_{a\sim\pi}\left(\left[\frac{\rho_{t}(a)-c}{\rho_{t}(a)}\right]_{+}\nabla_{\phi_{\theta}(x_{t})}\log f(a|\phi_{\theta}(x_{t}))\left[Q_{\theta_{v}}(x_{t},a)-V_{\theta_{v}}(x_{t})\right]\right)</math>
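For discrete actions the expectation over <math>a\sim\pi</math> can be evaluated exactly as a sum over actions, which is also how Algorithm 2 below handles the bias-correction term. The following sketch computes the gradient with respect to softmax statistics <math>\phi_{\theta}(x_{t})</math> for a single time step; the array names and the default truncation threshold are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def acer_policy_gradient(f, q, q_ret, a_t, rho, rho_bar, c=10.0):
    """ACER off-policy gradient w.r.t. the softmax statistics phi_theta(x_t).

    f: action probabilities f(.|phi_theta(x_t)); q: critic values Q_theta_v(x_t, .);
    q_ret: Retrace target for (x_t, a_t); rho: per-action importance ratios;
    rho_bar: truncated ratio min(c, rho(a_t)) for the taken action.
    """
    v = np.dot(f, q)                              # V(x_t) = sum_a f(a) Q(x_t, a)
    # For a softmax policy, grad_phi log f(a|phi) = onehot(a) - f (one row per action a)
    grad_log_f = np.eye(len(f)) - f
    # Truncated term for the action actually taken
    g = rho_bar * grad_log_f[a_t] * (q_ret - v)
    # Bias-correction term: exact expectation over a ~ pi (= f)
    correction_w = np.maximum(0.0, (rho - c) / rho)
    g += np.sum((f * correction_w * (q - v))[:, None] * grad_log_f, axis=0)
    return g
</syntaxhighlight>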
The trust region step is formulated as the optimization problem
<math display="block">\underset{z}{\text{minimize}}\quad{\frac{1}{2}}\left\Vert\hat{g}_{t}^{acer}-z\right\Vert_{2}^{2}</math>
<math display="block">\text{subject to}\quad\nabla_{\phi_{\theta}(x_{t})}D_{KL}\left[f(\cdot|\phi_{\theta_{a}}(x_{t}))\,\|\,f(\cdot|\phi_{\theta}(x_{t}))\right]^{T}z\leq\delta</math>
which has the closed-form solution
<math display="block">z^{*}=\hat{g}_{t}^{acer}-\max\left\{0,\ \frac{k^{T}\hat{g}_{t}^{acer}-\delta}{\|k\|_{2}^{2}}\right\}k</math>
where <math>k</math> denotes the gradient of the KL term in the constraint.
This gives the parameter update
<math display="block">\theta\leftarrow\theta+{\frac{\partial\phi_{\theta}(x)}{\partial\theta}}z^{*}</math>
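A sketch of this closed-form trust region step: the ACER gradient is projected so that its component along the KL gradient <math>k</math> stays within <math>\delta</math> (illustrative names; the small constant only guards against a zero denominator).
<syntaxhighlight lang="python">
import numpy as np

def trust_region_step(g_acer, k, delta=1.0):
    """z* = g_acer - max{0, (k^T g_acer - delta) / ||k||_2^2} * k."""
    scale = max(0.0, (np.dot(k, g_acer) - delta) / (np.dot(k, k) + 1e-8))
    return g_acer - scale * k
</syntaxhighlight>
The resulting <math>z^{*}</math> then replaces the raw gradient of the statistics <math>\phi_{\theta}(x_{t})</math> and is back-propagated through <math>\partial\phi_{\theta}(x)/\partial\theta</math> to update <math>\theta</math>, as in the update formula above.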
Algorithm 1: ACER master algorithm for discrete actions
: Initialize the global shared parameter vectors <math>\theta</math> and <math>\theta_{v}</math>
: Set the replay ratio <math>r</math>
: While the maximum number of iterations has not been reached and time remains:
:: Call the on-policy ACER of Algorithm 2
:: <math>n\leftarrow\text{Poisson}(r)</math>
:: For <math>i\in\{1,\cdots,n\}</math> do:
::: Call the off-policy ACER of Algorithm 2
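A sketch of this master loop, assuming a hypothetical helper acer_step(on_policy=...) that runs one call of Algorithm 2 below; the default replay ratio is only illustrative.
<syntaxhighlight lang="python">
import numpy as np

def acer_master(acer_step, replay_ratio=4, max_iterations=10_000, seed=0):
    """Interleave one on-policy call with Poisson(replay_ratio) off-policy replay calls."""
    rng = np.random.default_rng(seed)
    for _ in range(max_iterations):
        acer_step(on_policy=True)          # on-policy ACER (Algorithm 2)
        n = rng.poisson(replay_ratio)      # n ~ Poisson(r)
        for _ in range(n):
            acer_step(on_policy=False)     # off-policy ACER from replay (Algorithm 2)
</syntaxhighlight>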
Algorithm 2: ACER for discrete actions
: Reset the gradients: <math>d\theta\leftarrow 0</math> and <math>d\theta_{v}\leftarrow 0</math>
: Initialize the parameters: <math>\theta'\leftarrow\theta</math> and <math>\theta'_{v}\leftarrow\theta_{v}</math>
: If not on-policy, sample a trajectory <math>\{x_{0},a_{0},r_{0},\mu(\cdot|x_{0}),\cdots,x_{k},a_{k},r_{k},\mu(\cdot|x_{k})\}</math> from the experience replay memory
: Otherwise, get the state <math>x_{0}</math>
: For <math>i\in\{0,\cdots,k\}</math> do:
:: Compute <math>f(\cdot|\phi_{\theta'}(x_{i}))</math>, <math>Q_{\theta'_{v}}(x_{i},\cdot)</math> and <math>f(\cdot|\phi_{\theta_{a}}(x_{i}))</math>
:: If on-policy:
::: Perform action <math>a_{i}</math> according to <math>f(\cdot|\phi_{\theta'}(x_{i}))</math>, receive the reward <math>r_{i}</math> and the new state <math>x_{i+1}</math>
::: <math>\mu(\cdot|x_{i})\leftarrow f(\cdot|\phi_{\theta'}(x_{i}))</math>
:: <math>\bar{\rho}_{i}\leftarrow\min\left\{1,\frac{f(a_{i}|\phi_{\theta'}(x_{i}))}{\mu(a_{i}|x_{i})}\right\}</math>
: <math>Q^{ret}\leftarrow{\begin{cases}0&{\text{for terminal }}x_{k}\\\sum_{a}Q_{\theta'_{v}}(x_{k},a)f(a|\phi_{\theta'}(x_{k}))&{\text{otherwise}}\end{cases}}</math>
: For <math>i\in\{k-1,\cdots,0\}</math> do:
:: <math>Q^{ret}\leftarrow r_{i}+\gamma Q^{ret}</math>
:: <math>V_{i}\leftarrow\sum_{a}Q_{\theta'_{v}}(x_{i},a)f(a|\phi_{\theta'}(x_{i}))</math>
:: Compute the quantities needed for the trust region update:
:: <math>g\leftarrow\min\{c,\rho_{i}(a_{i})\}\nabla_{\phi_{\theta'}(x_{i})}\log f(a_{i}|\phi_{\theta'}(x_{i}))\,(Q^{ret}-V_{i})+\sum_{a}\left[1-\frac{c}{\rho_{i}(a)}\right]_{+}f(a|\phi_{\theta'}(x_{i}))\,\nabla_{\phi_{\theta'}(x_{i})}\log f(a|\phi_{\theta'}(x_{i}))\,(Q_{\theta'_{v}}(x_{i},a)-V_{i})</math>
:: <math>k\leftarrow\nabla_{\phi_{\theta'}(x_{i})}D_{KL}\left[f(\cdot|\phi_{\theta_{a}}(x_{i}))\,\|\,f(\cdot|\phi_{\theta'}(x_{i}))\right]</math>
:: Accumulate gradients with respect to <math>\theta'</math>: <math>d\theta\leftarrow d\theta+\frac{\partial\phi_{\theta'}(x_{i})}{\partial\theta'}\left(g-\max\left\{0,\frac{k^{T}g-\delta}{\|k\|_{2}^{2}}\right\}k\right)</math>
:: Accumulate gradients with respect to <math>\theta'_{v}</math>: <math>d\theta_{v}\leftarrow d\theta_{v}+\nabla_{\theta'_{v}}\left(Q^{ret}-Q_{\theta'_{v}}(x_{i},a_{i})\right)^{2}</math>
:: Update the Retrace target: <math>Q^{ret}\leftarrow\bar{\rho}_{i}\left(Q^{ret}-Q_{\theta'_{v}}(x_{i},a_{i})\right)+V_{i}</math>
: Asynchronously update <math>\theta</math> with <math>d\theta</math> and <math>\theta_{v}</math> with <math>d\theta_{v}</math>
: Update the average policy network: <math>\theta_{a}\leftarrow\alpha\theta_{a}+(1-\alpha)\theta</math>
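The average policy network update at the end of Algorithm 2 is a simple exponential moving average of the parameters; a minimal sketch over lists of NumPy parameter arrays (the smoothing value is illustrative):
<syntaxhighlight lang="python">
def update_average_policy(theta_a, theta, alpha=0.99):
    """theta_a <- alpha * theta_a + (1 - alpha) * theta, applied parameter-wise."""
    return [alpha * pa + (1.0 - alpha) * p for pa, p in zip(theta_a, theta)]
</syntaxhighlight>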
{{reflist}}
Category:算法
1. {{Cite journal |last=Wang |first=Ziyu |date=2017 |title=Sample Efficient Actor-Critic with Experience Replay |url=https://arxiv.org/pdf/1611.01224.pdf |journal=ICLR}}
2. {{cite web |title=Off-Policy Actor-Critic |url=https://arxiv.org/pdf/1205.4839.pdf |website=arXiv |accessdate=2022-05-28}}