Three co-authored research papers from OMRON SINIC X presented in International Conference on Machine Learning 2023
"One of the three papers accepted with the Outstanding Paper Award"

  • August 09, 2023

OMRON SINIC X Corporation (HQ: Bunkyo-ku, Tokyo; President and CEO: Masaki Suwa; hereinafter "OSX") is pleased to announce that three research papers co-authored by OSX senior researcher Tadashi Kozuno and external collaborators have been accepted for presentation at the International Conference on Machine Learning 2023 (hereinafter "ICML 2023"), held in Honolulu starting July 23.

Along with NeurIPS*1, ICML is one of the premier international conferences in the field of machine learning and related areas. More than 5,000 research papers were submitted to the conference this year, and just 27.9% were accepted.
*1: Neural Information Processing Systems

These papers carry out and summarize mathematical analyses aimed at improving the efficiency and performance of reinforcement learning algorithms and of algorithms for imperfect-information games.

In particular, "Adapting to game trees in zero-sum imperfect information games" won an Outstanding Paper Award, given to only six papers out of all submissions.

OSX continues to create value via technological innovation through collaboration with universities and external research institutions.

Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice

Toshinori Kitamura*1, Tadashi Kozuno*2, Yunhao Tang*3, Nino Vieillard*4, Michal Valko*3, Wenhao Yang*5, Jincheng Mei*4, Pierre Ménard*6, Mohammad Gheshlaghi Azar*3, Rémi Munos*3, Olivier Pietquin*4, Matthieu Geist*4, Csaba Szepesvári*7,3, Wataru Kumagai*1, Yutaka Matsuo*1
*1: The University of Tokyo, *2: OSX, *3: Google DeepMind, *4: Google Research, Brain team, *5: Peking University, *6: Otto von Guericke University Magdeburg, *7: University of Alberta


Kullback-Leibler (KL) divergence and entropy regularization play an important role in recent reinforcement learning algorithms: for example, they promote exploration, make algorithms robust to value-estimation errors, and enable monotonic performance improvement guarantees. Mirror Descent Value Iteration (MDVI for short) is a method that incorporates both KL divergence and entropy regularization into policy evaluation and policy update, and is known to achieve optimal sample efficiency in reinforcement learning without function approximation.
In this study, we extended MDVI and proposed a method that can achieve optimal sample efficiency even in situations where function approximation is required.
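As an illustration, the KL- and entropy-regularized policy update at the heart of MDVI has a well-known closed form: maximizing ⟨π, Q⟩ − κ·KL(π‖π_old) + τ·H(π) over the probability simplex yields π ∝ π_old^(κ/(κ+τ)) · exp(Q/(κ+τ)). The following is a minimal NumPy sketch of that update (the function name and coefficient values are our own illustration, not the paper's implementation):

```python
import numpy as np

def mdvi_policy_update(q, pi_old, kappa=1.0, tau=0.1):
    """One KL/entropy-regularized policy update (schematic).

    Maximizes <pi, q> - kappa * KL(pi || pi_old) + tau * H(pi),
    whose closed form is pi ∝ pi_old^(kappa/(kappa+tau)) * exp(q/(kappa+tau)).
    """
    beta = kappa / (kappa + tau)
    logits = beta * np.log(pi_old) + q / (kappa + tau)
    logits -= logits.max()            # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

q = np.array([1.0, 0.5, 0.0])         # toy action values for one state
pi = np.ones(3) / 3                   # start from the uniform policy
for _ in range(10):
    pi = mdvi_policy_update(q, pi)
# Probability mass gradually concentrates on the highest-value action,
# with the KL term slowing the shift away from the previous policy.
```

The KL coefficient κ keeps successive policies close (robustness to estimation errors), while the entropy coefficient τ keeps the policy stochastic (exploration).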


In situations where function approximation is required, the least-squares method is commonly used. However, while achieving optimal sample complexity in reinforcement learning requires accounting for the variance of the value-function estimation error, ordinary least squares ignores this variance. We therefore proposed Variance-Weighted Least-Squares MDVI (VWLS-MDVI for short), which combines MDVI with a variance-weighted least-squares method and makes it possible to achieve optimal sample efficiency even with function approximation. Furthermore, we implemented VWLS-MDVI with deep learning, realizing a more practical algorithm called Deep Variance Weighting.
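The core idea of weighting least squares by inverse variance can be sketched in a toy setting with a known per-sample noise variance (the setup and names below are our own illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear model y = Phi @ theta_true + noise, with heteroscedastic noise:
# some samples are far noisier than others.
n, d = 2000, 3
theta_true = np.array([1.0, -2.0, 0.5])
Phi = rng.normal(size=(n, d))
var = rng.uniform(0.01, 4.0, size=n)             # per-sample noise variance
y = Phi @ theta_true + rng.normal(size=n) * np.sqrt(var)

def variance_weighted_lstsq(Phi, y, var):
    """Weighted least squares: down-weight high-variance samples by 1/var."""
    w = 1.0 / var
    A = Phi.T @ (w[:, None] * Phi)
    b = Phi.T @ (w * y)
    return np.linalg.solve(A, b)

theta_ols = np.linalg.lstsq(Phi, y, rcond=None)[0]
theta_vw = variance_weighted_lstsq(Phi, y, var)
# The variance-weighted estimate typically tracks theta_true more closely
# than ordinary least squares when the noise is heteroscedastic.
```

In VWLS-MDVI the variance is not given but must itself be estimated from value-function errors; this toy treats it as known purely to show the weighting mechanism.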


In this study, we assume that the environment is represented by a linear Markov decision process. This is a common assumption used in theoretical analysis, but it is a relatively strong assumption. We will conduct further research on whether optimal learning efficiency can be achieved under weaker assumptions.

DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm

Yunhao Tang*1, Tadashi Kozuno*2, Mark Rowland*1, Anna Harutyunyan*1, Rémi Munos*1, Bernardo Ávila Pires*1, Michal Valko*1
*1: Google DeepMind, *2: OSX


Reinforcement learning methods alternate between policy evaluation and policy improvement based on the value function. Methods that can only use data collected under the current policy during evaluation and improvement are called on-policy learning, while methods without this restriction are called off-policy learning. Off-policy learning is considered important for improving learning efficiency because it can learn from previously accumulated data.

In addition to these classifications, there is also a classification between single-step learning and multi-step learning. Single-step learning refers to the use of only the action and its immediate result at each time step during policy evaluation and improvement, while multi-step learning refers to the use of actions and their results over multiple consecutive time steps. Empirically, it is known that multi-step learning is effective in improving performance.
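The difference can be made concrete with the discounted n-step return used as a learning target in multi-step learning (a generic textbook formulation, not code from the paper):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step return: r_0 + gamma*r_1 + ... + gamma^n * V(s_n).

    Single-step learning corresponds to len(rewards) == 1; multi-step
    learning uses several consecutive rewards before bootstrapping on
    the estimated value V(s_n) of the state reached after n steps.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Single-step vs. three-step target built from the same trajectory:
one_step = n_step_return([1.0], 0.5, gamma=0.9)          # 1.0 + 0.9 * 0.5
three_step = n_step_return([1.0, 0.0, 2.0], 0.5, gamma=0.9)
```

The multi-step target propagates real reward information over several time steps at once, which is one intuition for why multi-step learning tends to improve performance in practice.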

However, previous methods that used multi-step learning for policy improvement were limited to on-policy learning. In this study, we aimed to further improve reinforcement learning methods by proposing DoMo-VI and DoMo-AC, which are off-policy and multi-step learning methods for both policy evaluation and policy improvement.


The proposed DoMo-VI method guarantees faster convergence to the optimal policy than existing methods by combining multi-step policy improvement and multi-step policy evaluation in a dynamic programming algorithm. DoMo-AC is an implementation of DoMo-VI that is more suitable for deep learning.
We implemented DoMo-AC on top of IMPALA, a distributed deep reinforcement learning algorithm, and tested it on the Atari-57 benchmark task*1. DoMo-AC achieved stable performance improvements compared to IMPALA. It also showed low sensitivity to the parameters that adjust the degree of multi-step learning in policy evaluation and policy improvement, indicating that it is easy to use in practice.
*1: Atari-57 is a benchmark task consisting of 57 different games from the Atari 2600, and it is often used to evaluate the performance of reinforcement learning methods.


As future work, we will investigate whether DoMo-AC can achieve high performance in settings with continuous actions, such as precise control of robots.

Adapting to game trees in zero-sum imperfect information games

Côme Fiegel*1, Pierre Ménard*2, Tadashi Kozuno*3, Rémi Munos*4, Vianney Perchet*1, 5, Michal Valko*5
*1: CREST, ENSAE, IP Paris, *2: ENS Lyon, *3: OSX, *4: Google DeepMind, *5: CRITEO AI Lab


This paper focuses on learning in two-player zero-sum imperfect information games (IIGs). IIGs are games where each player only partially observes the current game state, and they can model complex strategic behavior such as bluffing.

There are two goals in IIGs. The first one is to adapt and choose the optimal strategy against the opponent. However, this is not easy as the opponent also changes their strategy. The other one is to compute a Nash equilibrium. There are methods to compute Nash equilibria when the game structure, transition probabilities, and reward function are known beforehand, but they are computationally expensive and not practical.

We proposed a method that achieves both goals with low computational cost.


In this paper, we proposed two methods based on the well-known Follow the Regularized Leader (FTRL) algorithm.

The first one is Balanced FTRL. Balanced FTRL can achieve optimal sample efficiency and performance by using prior knowledge of the game structure. The second one is Adaptive FTRL, which learns while estimating the game structure knowledge required for Balanced FTRL. It can achieve near-optimal sample efficiency and performance without requiring knowledge of the game structure.
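As a reference point, FTRL with an entropy regularizer over a probability simplex has the well-known closed form p ∝ exp(−η · cumulative loss). The sketch below shows only this generic core; it omits the game-tree structure and the balanced or adaptive weighting that distinguish the paper's methods:

```python
import numpy as np

def ftrl_entropy(cumulative_loss, eta=0.5):
    """FTRL with an entropy regularizer over the simplex (generic core).

    argmin_p <p, L> + (1/eta) * sum_i p_i * log(p_i)
    has the closed-form solution softmax(-eta * L).
    """
    z = -eta * np.asarray(cumulative_loss, dtype=float)
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Actions with lower cumulative loss receive more probability mass,
# while the entropy term keeps every action's probability positive.
p = ftrl_entropy([3.0, 1.0, 2.0])
```

In the imperfect-information-game setting, Balanced FTRL chooses the regularizer using prior knowledge of the game tree, while Adaptive FTRL estimates that knowledge during learning.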

We verified their performance empirically and confirmed that Adaptive FTRL performs almost as well as Balanced FTRL while being more practical, since it does not require knowledge of the game structure.


We will continue the study of algorithms that do not require game knowledge while having optimal sample efficiency and performance, algorithms that directly output Nash equilibria, and algorithms that have optimal performance even when using function approximation.

About OMRON SINIC X Corporation
OMRON SINIC X Corporation is a strategic subsidiary seeking to realize the "near future designs" that OMRON forecasts. OSX brings together researchers with cutting-edge knowledge and experience across many technological domains, including AI, robotics, IoT, and sensing. With the aim of solving social issues, they work to create near future designs by integrating innovative technologies with business models and strategies in technology and IP. The company will also accelerate the creation of near future designs through joint research with universities and external research institutions. For more details, please refer to

For press inquiries related to this release, please contact the following:
Tech Communications and Collaboration Promo Dept.
Strategy Division
Technology and Intellectual Property H.Q.
OMRON Corporation
TEL: +81-774-74-2010
