OMRON SINIC X's Latest Research Paper Accepted to ACL 2026, a Top-tier Conference in the Field of Natural Language Processing

OMRON SINIC X Corporation (HQ: Bunkyo-ku, Tokyo; President and CEO: Masaki Suwa; hereinafter “OSX”) is pleased to announce that its latest research paper has been accepted to the 64th Annual Meeting of the Association for Computational Linguistics (hereinafter “ACL 2026”).

ACL 2026 is one of the top-tier international conferences in the field of natural language processing. The conference will be held from July 2 to July 7, 2026, in San Diego, United States (local time).

The research paper to be presented by OSX is as follows.

ACL 2026 presentations

[Main Conference]

■ Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Rikuto Kotoge (The University of Osaka, OSX intern), Mai Nishimura (OSX), Jiaxin Ma (OSX)

Agentic RAG has emerged as a dominant paradigm, enabling models to autonomously decide when to retrieve, how to formulate queries, and how to synthesize multiple documents into an answer. To date, however, such capabilities have been realized only in large models with billions of parameters, and achieving sufficient performance with compact models in resource-constrained environments remains largely underexplored.
This study aims to elicit agentic search capabilities from compact models (0.5–1B parameters). Post-training methods such as reinforcement learning (RL) can unlock such capabilities, yet compact models start from low initial performance, which yields almost no reward signal and causes training to collapse. To address this, we propose Distillation-Guided Policy Optimization (DGPO), an RL framework built on a simple principle: reward if right, mimic the teacher if wrong. When the student generates a correct answer, RL rewards its own reasoning; when it fails, the student learns from teacher demonstrations instead. Teacher distillation also stabilizes the policy during early training, preventing the cold-start collapse that pure RL suffers from. This combination enables compact models to acquire agentic search capabilities that distillation alone cannot achieve.
Across seven question-answering benchmarks, the average performance of a 0.5B model improved from near-zero (0.006) to 0.329, reaching a level comparable to its 3B teacher model (0.353). On several datasets, the student even surpassed the teacher. These results demonstrate the potential of deploying capable search agents not only on resource-constrained devices but also as lightweight sub-agents within agent orchestration.
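The "reward if right, mimic the teacher if wrong" principle described above can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked below for that). The function name `selective_loss`, the REINFORCE-style reward term, and the per-token KL divergence used as the distillation term are all assumptions chosen for illustration, since the announcement does not specify the exact losses.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed distillation term)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def selective_loss(is_correct, student_logits, teacher_logits, sampled_tokens, reward=1.0):
    """Hypothetical per-rollout loss that switches on answer correctness.

    student_logits / teacher_logits: one list of logits per generated token.
    sampled_tokens: token ids the student actually generated.
    """
    n = len(sampled_tokens)
    if is_correct:
        # Correct answer: reinforce the student's own tokens
        # (REINFORCE-style surrogate, averaged over tokens).
        logp = sum(
            log_softmax(logits)[tok]
            for logits, tok in zip(student_logits, sampled_tokens)
        )
        return -reward * logp / n
    # Wrong answer: pull the student's distribution toward the teacher's.
    return sum(
        kl_teacher_student(t, s) for t, s in zip(teacher_logits, student_logits)
    ) / n
```

In this sketch a failed rollout still produces a learning signal through the teacher term, which is the intuition behind avoiding the near-zero-reward cold start that the text attributes to pure RL.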

https://arxiv.org/abs/2508.20324
https://github.com/omron-sinicx/dgpo

 

※Author information is current as of the date of writing or submission. Please be advised that the information may become outdated after that point.

 



For any inquiries about OSX, please contact us here.
