awesome-Causal-RL-papers

This is a curated list of papers on causal reinforcement learning. If you notice a relevant paper that is missing, please submit it via an issue.

Blog

Survey

  • [1] Zeng Y, Cai R, Sun F, et al. A Survey on Causal Reinforcement Learning[J]. arXiv preprint arXiv:2302.05209, 2023.
  • [2] Deng Z, Jiang J, Long G, et al. Causal Reinforcement Learning: A Survey[J]. arXiv preprint arXiv:2307.01452, 2023.

Single-Agent RL

  • [1] Li M, Yang M, Liu F, et al. Causal world models by unsupervised deconfounding of physical dynamics[J]. arXiv preprint arXiv:2012.14228, 2020.
  • [2] Liu Y R, Huang B, Zhu Z, et al. Learning World Models with Identifiable Factorization[J]. arXiv preprint arXiv:2306.06561, 2023.
  • [3] Zhu D, Li L E, Elhoseiny M. CausalDyna: Improving Generalization of Dyna-style Reinforcement Learning via Counterfactual-Based Data Augmentation[J]. 2021.
  • [4] Zholus A, Ivchenkov Y, Panov A. Factorized world models for learning causal relationships[C]//ICLR2022 Workshop on the Elements of Reasoning: Objects, Structure and Causality. 2022.
  • [5] Lu C, Huang B, Wang K, et al. Sample-efficient reinforcement learning via counterfactual-based data augmentation[J]. arXiv preprint arXiv:2012.09092, 2020.
  • [6] Wang Z, Xiao X, Xu Z, et al. Causal dynamics learning for task-independent state abstraction[J]. arXiv preprint arXiv:2206.13452, 2022.
  • [7] Pitis S, Creager E, Mandlekar A, et al. Mocoda: Model-based counterfactual data augmentation[J]. Advances in Neural Information Processing Systems, 2022, 35: 18143-18156.
  • [8] Huang B, Lu C, Leqi L, et al. Action-sufficient state representation learning for control with structural constraints[C]//International Conference on Machine Learning. PMLR, 2022: 9260-9279.
  • [9] Huang B, Feng F, Lu C, et al. AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning[C]//International Conference on Learning Representations. 2021.
  • [10] Feng F, Huang B, Zhang K, et al. Factored adaptation for non-stationary reinforcement learning[J]. Advances in Neural Information Processing Systems, 2022, 35: 31957-31971.
  • [11] Lee T E, Zhao J A, Sawhney A S, et al. Causal reasoning in simulation for structure and transfer learning of robot manipulation policies[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 4776-4782.
  • [12] Zhang A, Lyle C, Sodhani S, et al. Invariant causal prediction for block mdps[C]//International Conference on Machine Learning. PMLR, 2020: 11214-11224.
  • [13] Seitzer M, Schölkopf B, Martius G. Causal influence detection for improving efficiency in reinforcement learning[J]. Advances in Neural Information Processing Systems, 2021, 34: 22905-22918.
  • [14] Wang X, Liu Y, Song X, et al. CaMP: Causal Multi-policy Planning for Interactive Navigation in Multi-room Scenes[C]//Thirty-seventh Conference on Neural Information Processing Systems. 2023.
  • [15] Ding W, Lin H, Li B, et al. Generalizing goal-conditioned reinforcement learning with variational causal reasoning[J]. Advances in Neural Information Processing Systems, 2022, 35: 26532-26548.
  • [16] Oberst M, Sontag D. Counterfactual off-policy evaluation with gumbel-max structural causal models[C]//International Conference on Machine Learning. PMLR, 2019: 4881-4890.
  • [17] Park J, Seo Y, Liu C, et al. Object-aware regularization for addressing causal confusion in imitation learning[J]. Advances in Neural Information Processing Systems, 2021, 34: 3029-3042.
  • [18] Bica I, Jarrett D, van der Schaar M. Invariant causal imitation learning for generalizable policies[J]. Advances in Neural Information Processing Systems, 2021, 34: 3952-3964.
  • [19] Zhang P, Liu F, Chen Z, et al. Deep Reinforcement Learning with Causality-based Intrinsic Reward[J]. 2020.
  • [20] Mesnard T, Weber T, Viola F, et al. Counterfactual credit assignment in model-free reinforcement learning[J]. arXiv preprint arXiv:2011.09464, 2020.
  • [21] Li H, Principe J. Speeding Up Reinforcement Learning by Exploiting Causality in Reward Sequences[C]//2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021: 1-6.
  • [22] Pitis S, Creager E, Garg A. Counterfactual data augmentation using locally factored dynamics[J]. Advances in Neural Information Processing Systems, 2020, 33: 3976-3990.
  • [23] Eghbal-zadeh H, Henkel F, Widmer G. Learning to infer unseen contexts in causal contextual reinforcement learning[J]. Proceedings of the Self-Supervision for Reinforcement Learning, 2021.
  • [24] Herlau T, Larsen R. Reinforcement learning of causal variables using mediation analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(6): 6910-6917.
  • [25] Lu C, Hernández-Lobato J M, Schölkopf B. Invariant causal representation learning for generalization in imitation and reinforcement learning[C]//ICLR2022 Workshop on the Elements of Reasoning: Objects, Structure and Causality. 2022.
  • [26] Li Y, Zhang D, Yin F, et al. Cleaning robot operation decision based on causal reasoning and attribute learning[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 6878-6885.
  • [27] Cao Y, Li B, Li Q, et al. Reasoning Operational Decisions for Robots via Time Series Causal Inference[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 6124-6131.
  • [28] Sun H, Wang T. Toward Causal-Aware RL: State-Wise Action-Refined Temporal Difference[J]. arXiv preprint arXiv:2201.00354, 2022.
  • [29] Mutti M, De Santi R, Rossi E, et al. Provably efficient causal model-based reinforcement learning for systematic generalization[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 37(8): 9251-9259.
  • [30] Gasse M, Grasset D, Gaudron G, et al. Causal reinforcement learning using observational and interventional data[J]. arXiv preprint arXiv:2106.14421, 2021.
  • [31] Lee T E, Zhao J A, Sawhney A S, et al. Causal reasoning in simulation for structure and transfer learning of robot manipulation policies[C]//2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021: 4776-4782.
  • [32] Lee T E, Vats S, Girdhar S, et al. SCALE: Causal Learning and Discovery of Robot Manipulation Skills using Simulation[C]//7th Annual Conference on Robot Learning. 2023.
  • [33] Liang J, Boularias A. Learning Transition Models with Time-delayed Causal Relations[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 8087-8093.
  • [34] Buesing L, Weber T, Zwols Y, et al. Woulda, coulda, shoulda: Counterfactually-guided policy search[J]. arXiv preprint arXiv:1811.06272, 2018.
  • [35] Yang C H H, Danny I, Hung T, et al. Causal inference q-network: Toward resilient reinforcement learning[C]//Self-Supervision for Reinforcement Learning Workshop-ICLR 2021. 2021.
  • [36] Rezende D J, Danihelka I, Papamakarios G, et al. Causally correct partial models for reinforcement learning[J]. arXiv preprint arXiv:2002.02836, 2020.
  • [37] Learning Causal Dynamics Models in Object-Oriented Environments
    • key: Object-Oriented MDP (FMDP), Object-Oriented Causal Graph, Model-based
    • summary: Extends the setting from MDPs to FMDPs by assuming the environment contains objects of different types, so the state the agent receives is composed of the attributes of every object in the environment. Traditional CRL methods perform causal discovery directly at the attribute level, which works well when there are few variables but whose cost grows exponentially as the number of variables (objects) increases. The authors therefore propose doing causal discovery at the class level before learning the dynamics model, turning the causal dynamics model into an Object-Oriented Causal Dynamics Model; this greatly reduces the cost of causal discovery and scales to larger numbers of objects (note that if the number of object types grows, the time cost still grows). The Object-Oriented Causal Graph is still discovered with conditional independence tests that carry theoretical guarantees, implemented via conditional mutual information (CMI); a minimal sketch of such a CMI edge test appears after this list. With two object types and 100 objects of each type, only intra-class (same-type) and inter-class (different-type) attribute-level causal discovery is required.
  • [38] Zhang Y, Du Y, Huang B, et al. Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach[C]//Thirty-seventh Conference on Neural Information Processing Systems. 2023.
    • key: delay reward
  • [39] Pan H R, Gürtler N, Neitz A, et al. Direct Advantage Estimation[J]. arXiv preprint arXiv:2109.06093, 2021.
    • key: causal effect, advantage function, credit assignment
    • summary: This paper shows that, under certain assumptions, the advantage function can be viewed as the causal effect of an action on the expected return, and proposes a method that estimates the advantage function directly from trajectory data; in most environments it improves policy performance more than GAE. The identities behind this view are sketched after this list.
  • [40] Corcoll O, Vicente R. Disentangling causal effects for hierarchical reinforcement learning[J]. arXiv preprint arXiv:2010.01351, 2020.
    • key: causal-effect, hierarchical RL
    • summary: This paper disentangles the environment changes caused by the agent's actions (the control effect) from the total change (the total effect), and uses the disentangled control effect for exploration and learning. In essence, the control effect describes the changes the agent brings about by acting in the environment; these changes are composable and temporally abstract, which makes them well suited for describing tasks. For example, "pick up a ball and walk to a specific position" is a task described by the two effects of picking up the ball and moving. Based on control effects, the paper designs a hierarchical reinforcement learning framework (CEHRL) and validates its performance in the MiniGrid environment.
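
The class-level causal discovery described in [37] rests on conditional independence tests implemented through conditional mutual information (CMI). Below is a minimal sketch of such a CMI edge test on discrete samples; the function names, the entropy-based estimator, and the fixed threshold are illustrative assumptions, not the paper's estimator.

```python
import numpy as np
from collections import Counter

def entropy(rows):
    """Empirical Shannon entropy (in nats) of discrete joint samples, one sample per row."""
    counts = np.array(list(Counter(map(tuple, rows)).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def cmi(x, y, z):
    """Estimate I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) from discrete samples."""
    x, y, z = (np.asarray(v).reshape(len(v), -1) for v in (x, y, z))
    return (entropy(np.hstack([x, z])) + entropy(np.hstack([y, z]))
            - entropy(np.hstack([x, y, z])) - entropy(z))

def has_edge(parent_attr_t, child_attr_next, other_attrs_t, threshold=1e-2):
    """Keep an edge parent -> child in the causal graph when the parent attribute at
    time t still carries information about the child attribute at time t+1 after
    conditioning on the remaining attributes at time t."""
    return cmi(parent_attr_t, child_attr_next, other_attrs_t) > threshold
```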
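
For [39], the connection between the advantage function and causal effects can be made concrete with the standard identities below; reading the advantage as the effect of intervening on the action is one way to phrase the paper's claim, and it holds only under the paper's assumptions.

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s), \qquad \sum_a \pi(a \mid s)\, A^\pi(s,a) = 0,$$

$$A^\pi(s,a) = \mathbb{E}\big[G_t \mid s_t = s,\ do(a_t = a)\big] - \mathbb{E}_{a \sim \pi}\big[G_t \mid s_t = s\big],$$

i.e. the advantage measures how much forcing the action to be $a$, instead of sampling it from $\pi$, changes the expected return. Direct Advantage Estimation estimates $\hat{A}$ from trajectory returns directly rather than estimating $Q$ and $V$ separately and subtracting.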

Offline RL

  • [1] Sun Z, He B, Liu J, et al. Offline Imitation Learning with Variational Counterfactual Reasoning[C]//Thirty-seventh Conference on Neural Information Processing Systems. 2023.
    • key: counterfactual, data augmentation
    • summary: This paper shows how, in the offline setting, counterfactual reasoning can generate counterfactual expert data from unlabeled data, thereby augmenting the original expert dataset.

Multi-Agent RL

  • [1] Grimbly S J, Shock J, Pretorius A. Causal multi-agent reinforcement learning: Review and open problems[J]. arXiv preprint arXiv:2111.06721, 2021.

Non-stationarity

  • [1] Yang S, Yang B, Zeng Z, et al. Causal inference multi-agent reinforcement learning for traffic signal control[J]. Information Fusion, 2023, 94: 243-256. https://doi.org/10.1016/j.inffus.2023.02.009
    • key: traffic signal control, probabilistic inference, latent variable
    • summary: This paper addresses traffic-signal control at intersections, with the lights at each intersection controlled by a separate agent. In a multi-agent setting, partial observability of the environment and ignorance of the other agents' policies make the dynamics non-stationary from each agent's perspective, so a policy learned from an agent's own observations alone is likely to be suboptimal or even useless. The authors argue that the non-stationarity each agent faces can be captured by a latent causal variable, and that the latent causal variables of different agents influence one another (restricted to neighbouring agents), so the whole set of latent causal variables characterises the environment's non-stationarity probabilistically. Each agent applies variational inference to the state-action histories of itself and its neighbours to infer the distribution of its latent causal variable. From my perspective, the paper does not actually use causal-inference techniques, nor does it explain why such a latent causal variable should exist or where the causality lies.

Credit Assignment

  • [1] Pina R, De Silva V, Artaud C. Discovering Causality for Efficient Cooperation in Multi-Agent Environments[J]. arXiv preprint arXiv:2306.11846, 2023.
    • key: Reward Assignment, Non-linear Granger Causality, ACD
    • summary: This paper addresses the credit-assignment problem in multi-agent RL: several agents act and the environment returns a single reward for the joint action, so an individual agent cannot tell whether its own behaviour was good. If the team reward is used directly as every agent's reward, agents that did not actually contribute end up with a lazy policy instead of a good one. The authors use ACD to discover the causal relations between each agent's observation o and the reward r, and redistribute the reward according to these relations, thereby penalising lazy agents.
  • [2] Wang Z, Du Y, Zhang Y, et al. MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment[J]. arXiv preprint arXiv:2312.03644, 2023.
    • key: Reward predictor, Dynamic Bayesian Network
    • summary: This paper also addresses credit assignment in multi-agent RL. By learning the causal relations between each agent's states, actions, and the (team) reward, it generates an individual reward for each agent to guide its learning. The authors use a Dynamic Bayesian Network (DBN) to describe the latent generative process of the reward. Since the causal structure of a multi-agent system may evolve over time, a causal-structure predictor produces a structure estimate at every time step, and individual rewards are then predicted from this structure and the observed team reward.
  • [3] Foerster J, Farquhar G, Afouras T, et al. Counterfactual multi-agent policy gradients[C]//Proceedings of the AAAI conference on artificial intelligence. 2018, 32(1).
    • key: policy gradient, counterfactual, baseline function
    • summary: In short, this paper adds a counterfactual baseline on top of the policy gradient, implicitly mitigating the credit-assignment problem. Concretely, the method follows the centralised-training, decentralised-execution paradigm: when computing each agent's advantage, a centralised Q-network estimates the "counterfactual" advantage obtained by keeping the other agents' actions fixed in the current joint state while varying the current agent's action (from a to a'); this advantage can be read as the agent's expected contribution to the team reward. A minimal sketch of this baseline follows this list.
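
To make the counterfactual baseline of [3] concrete, here is a minimal sketch of the per-agent advantage used by COMA; the function name and tensor shapes are illustrative, only the formula follows the paper.

```python
import numpy as np

def coma_advantage(q_alt, pi_i, a_i):
    """Counterfactual advantage for one agent, as in COMA (Foerster et al., 2018).

    q_alt: Q(s, (a_{-i}, a'_i)) for every alternative action a'_i of agent i, with the
           other agents' actions fixed at the chosen joint action; shape [n_actions_i],
           produced by a centralised critic.
    pi_i:  agent i's policy over its own actions in this state, shape [n_actions_i].
    a_i:   index of the action agent i actually took.
    """
    baseline = float(np.dot(pi_i, q_alt))  # marginalise agent i's own action: b(s, a_{-i})
    return q_alt[a_i] - baseline           # A_i(s, a) = Q(s, a) - b(s, a_{-i})

# Example: three actions, agent i took action 2.
print(coma_advantage(np.array([1.0, 0.5, 2.0]), np.array([0.2, 0.3, 0.5]), a_i=2))  # 0.65
```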

Scalability

  • [1] Ma H, Pu Z, Pan Y, et al. Causal Mean Field Multi-Agent Reinforcement Learning[C]//2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023: 1-8.
    • key: mean-field, intervention, causal inference
    • summary: This paper proposes Causal Mean-Field Q-learning (CMFQ), which builds a structural causal model (SCM) and uses interventions to quantify how important each of an agent's interactions is, then designs a causality-aware compact representation on this basis, improving performance over previous MFRL methods.

Other Directions

  • [1] Baradel F, Neverova N, Mille J, et al. Cophy: Counterfactual learning of physical dynamics[J]. arXiv preprint arXiv:1909.12000, 2019.
  • [2] Sancaktar C, Blaes S, Martius G. Curious exploration via structured world models yields zero-shot object manipulation[J]. Advances in Neural Information Processing Systems, 2022, 35: 24170-24183.
  • [3] Li Z, Zhu X, Lei Z, et al. Deconfounding Physical Dynamics with Global Causal Relation and Confounder Transmission for Counterfactual Prediction[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 36(2): 1536-1545.
  • [4] What If You Were Not There? Learning Causally-Aware Representations of Multi-Agent Interactions (ICLR 2024 open review)
    • key: Multi-agent forecasting, Causal effect, sim-to-real
    • link: https://openreview.net/forum?id=viJlKbTfbb
    • summary: In one sentence, the core idea is that the learned representation should capture not only the causal effect that directly causal agents have on the ego agent but also the effect of indirectly causal agents (for example, when driving you may only see the bus ahead of you, yet a bicycle in front of that bus also matters for predicting your trajectory). Only such a representation is robust to perturbations, supports long-horizon rollouts, and adapts quickly to new environments (provided the causal mechanisms it captures are invariant across environments). The authors first construct counterfactual datasets according to fixed rules and, based on the causal effect (Eq. 2 in the paper), classify every agent other than the ego agent as non-causal, directly causal, or indirectly causal. They assume that if agents $i$ and $j$ have causal effects on the ego agent with $\epsilon_i$ < $\epsilon_j$ in scene A, then a similar ordering holds in the counterfactual scenes. Building on this intuition they propose two training schemes, Causal Contrastive Learning and Causal Ranking Learning, to learn the desired representation, and validate the method in both in-distribution and OOD experiments. A generic ranking-loss sketch follows this list.
  • [5] Robust agents learn causal world models (ICLR 2024 accept-oral 886)
    • key: causality, distributional shift, world model
    • summary: This paper establishes the result that "any agent able to adapt to a sufficiently large set of distributional shifts must have learned a causal model of the data-generating process". It shows that, in the setting of its Figure 1, a policy that is optimal across different distribution-shift scenarios must have learned something close to the true causal model of the data-generating mechanism; in other words, if an agent performs well across contexts, it must have identified the causal relation between its actions and the intended outcomes.
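
The Causal Ranking Learning idea in [4] can be illustrated with a generic margin ranking loss over predicted per-agent effect scores; this sketch only conveys the idea, and the scoring inputs and margin are assumptions rather than the paper's exact objective.

```python
import torch

def causal_ranking_loss(score_i, score_j, margin=0.1):
    """Push the predicted effect score of agent j above that of agent i whenever the
    counterfactually measured causal effects satisfy eps_i < eps_j (a standard margin
    ranking loss; the paper's actual objective may differ)."""
    return torch.clamp(margin + score_i - score_j, min=0.0).mean()

# Hypothetical usage: scores predicted from the ego-centric representation for pairs of
# agents whose ground-truth causal effects satisfy eps_i < eps_j.
score_i = torch.tensor([0.3, 0.10])
score_j = torch.tensor([0.5, 0.05])
print(causal_ranking_loss(score_i, score_j))  # tensor(0.0750)
```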

Benchmark

  • [1] Ahmed O, Träuble F, Goyal A, et al. Causalworld: A robotic manipulation benchmark for causal structure and transfer learning[J]. arXiv preprint arXiv:2010.04296, 2020.

Book

  • [1] Luczkow V. Structural Causal Models for Reinforcement Learning[M]. McGill University (Canada), 2021.