OpenTinker

Democratizing Agentic Reinforcement Learning as a Service

Siqi Zhu, Jiaxuan You
University of Illinois Urbana-Champaign

OpenTinker empowers users to perform RL training and inference without local GPU resources through a seamless, distributed architecture.

Design Philosophy

Core principles driving OpenTinker's architecture

Programming & Execution Disaggregation

Users program training and inference workflows locally and execute them remotely, with no local GPU resources required. Built-in distributed training and job scheduling manage resources transparently.

Environment & Training Disaggregation

Decoupling environments from the training loop simplifies the design of agentic task environments, with support for arbitrary single-turn and multi-turn agentic tasks.

Training to Inference

Environments and agentic workflows can be seamlessly connected to inference, allowing trained models to be directly applied without code changes.

System Architecture & Protocol

OpenTinker's architecture consists of three core components with a streamlined three-phase communication protocol:

Architecture Components
  • Client: Lightweight local interface for defining environments and submitting training jobs. Requires no local GPU.
  • Scheduler & Worker Pool: Central coordinator that manages GPU resource allocation and maintains a pool of available Workers.
  • Training Server (GPU Worker): Dedicated GPU-powered worker that executes model training and rollout generation.
Protocol Flow
  1. Job Submit: Client sends job request (model config, training/inference args) to the Scheduler.
  2. Allocation: Scheduler allocates GPU Worker(s) from the pool and spawns a Training/Inference Server instance.
  3. Data Streaming: Client establishes a link with the Training/Inference Server for real-time metrics.
Figures: OpenTinker System Architecture; OpenTinker System Protocol
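
To make the three phases concrete, the payloads exchanged in each phase might look roughly like the sketch below. This is illustrative only: the field names and values are assumptions inferred from the workflow examples later on this page, not OpenTinker's actual wire format.

Protocol Sketch (illustrative):
# Illustrative payload shapes for the three protocol phases.
# All field names here are assumptions, not the exact OpenTinker schema.

# 1. Job Submit: Client -> Scheduler
job_request = {
    "model_config": {"model_path": "Qwen/Qwen2.5-7B-Instruct"},  # hypothetical example
    "training_args": {"num_epochs": 10, "lr": 1e-6},
    "num_gpus": 8,
}

# 2. Allocation: Scheduler -> Client (after spawning a Training Server)
job_response = {
    "job_id": "job-0042",
    "server_url": "http://gpu-worker-3:8000",
}

# 3. Data Streaming: Training Server -> Client (per-step metrics)
metrics_update = {
    "job_id": "job-0042",
    "step": 120,
    "metrics": {"reward_mean": 0.42, "loss": 1.35},
}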

Programming Interfaces

Unified APIs for Training and Inference

OpenTinker provides a high-level Python API that abstracts away the complexity of distributed systems. Key components include:

Environment Abstraction

A unified wrapper that encapsulates your custom game logic. It handles data loading—supporting both static datasets and dynamic generation—and integrates seamlessly with the distributed training loop.

Key features:

  • Standardized reset() and step() interface
  • Built-in prompt engineering integration
  • Automatic data generation hooks
Interface Example:
from opentinker.environment.base_game import AbstractGame, StepResult

class CustomGame(AbstractGame):
    def get_system_prompt(self) -> str:
        return "You are an agent playing a custom game..."

    def get_initial_user_message(self) -> str:
        return "Game starts. Your turn."

    def reset(self, **kwargs) -> str:
        # Initialize game state
        self.state = "start"
        return "Initial observation"

    def step(self, action: str) -> StepResult:
        # 1. Parse and validate the action (is_valid is a user-defined helper)
        if not self.is_valid(action):
            return StepResult(observation="Invalid", reward=-1.0, done=False)

        # 2. Update state and check the win condition via your own game engine
        new_obs, reward, done = self.engine.step(action)

        # 3. Return result
        return StepResult(observation=new_obs, reward=reward, done=done)
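
For the dynamic-generation path, a game can synthesize its own prompts instead of reading a static dataset. The sketch below is purely illustrative; the hook name generate_prompts and its signature are assumptions for intuition, not the actual OpenTinker API.

Data Generation Sketch (illustrative):
import random

class GeneratedArithmeticGame(CustomGame):
    """Illustrative game that creates its own training prompts on the fly
    (the generate_prompts hook is hypothetical)."""

    def generate_prompts(self, num_samples: int) -> list[str]:
        # Produce fresh arithmetic questions for each rollout batch instead
        # of loading them from a static dataset.
        prompts = []
        for _ in range(num_samples):
            a, b = random.randint(1, 99), random.randint(1, 99)
            prompts.append(f"Compute {a} + {b}. Reply with the number only.")
        return prompts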

Client Design

OpenTinker provides three specialized clients to manage the full lifecycle: Job Scheduler, Training Service, and Inference Manager.

Training Workflow:
# 1. Submit Job via Scheduler
scheduler = SchedulerClient(url=scheduler_url)
job_result = scheduler.submit_job(config=args, num_gpus=8)
job_id, server_url = job_result["job_id"], job_result["server_url"]

# 2. Initialize Environment
env = GameEnvironment(
    game_class=GomokuGame,
    config=args,
    job_id=job_id
)

# 3. Connect & Train
client = ServiceClient(server_url=server_url)
client.set_config(args, env)
client.fit(env=env, num_epochs=args.num_epochs)
Inference Workflow:
# 1. Submit Inference Job
scheduler = InferenceSchedulerClient(url=scheduler_url)
job_result = scheduler.submit_inference_job(
    model_path=args.model_path,
    num_gpus=args.num_gpus
)
job_id, vllm_server_url = job_result["job_id"], job_result["vllm_server_url"]

# 2. Run Inference via Pipeline
run_inference(
    vllm_server_url=vllm_server_url,
    data_path=args.data_path,
    game_class=MathGame,
    job_id=job_id,
    output_path=args.output_path
)

Internal API: The RL fit() Interface

The fit() method acts as the central Reinforcement Learning orchestrator, transforming the complex distributed training backend into a clean, synchronous procedure.
Note: While encapsulated within fit(), users are fully empowered to modify this logic for custom training schedules.

Initialization & Handshake

Establishes secure connections, verifies server health, and provisions worker groups.

State Synchronization

Aligns experiment context, environment configs, and hyperparameters with the remote service.

RL Execution Loop

Orchestrates the RL loop: serializes data, executes training steps on GPU clusters, and streams metrics.

Lifecycle Management

Handles periodic policy evaluation and ensures checkpoints are safely persisted.
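
Put together, the control flow inside fit() can be pictured roughly as follows. This is a simplified sketch of the four phases above; the helper names (connect_and_handshake, run_training_step, and so on) are illustrative placeholders, not the actual internal methods.

Control Flow Sketch (illustrative):
def fit(self, env, num_epochs):
    # Initialization & handshake: verify server health, provision worker groups
    self.connect_and_handshake()

    # State synchronization: push experiment context, env config, hyperparameters
    self.sync_state(env)

    # RL execution loop
    for epoch in range(num_epochs):
        batch = self.collect_rollouts(env)          # serialize rollout data
        metrics = self.run_training_step(batch)     # executed on the GPU cluster
        self.stream_metrics(metrics)                # real-time metrics to the client

        # Lifecycle management: periodic evaluation and checkpointing
        if self.should_evaluate(epoch):
            self.evaluate(env)
        if self.should_checkpoint(epoch):
            self.save_checkpoint()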

Agent Loop Design

Based on GenericAgentLoop

We extend the Verl agent loop to support flexible, multi-turn interactions. The `GenericAgentLoop` implements a robust state machine:

  1. PENDING: Tokenize initial prompt.
  2. GENERATING: LLM generates response tokens (mask=1 for loss).
  3. INTERACTING: System executes the action in the environment and receives an observation (mask=0, excluded from loss).
  4. TERMINATED: Episode complete.

This design supports both single-turn reasoning (Math) and multi-turn sequential decision making (Gomoku, Math Tool Calling) within a unified codebase, and can be easily extended to similar agentic environments.

State Machine
PENDING
  ↓
GENERATING ◄──┐
  ↓           │
INTERACTING ──┘
  ↓
TERMINATED
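
A condensed sketch of one episode under this state machine is shown below, assuming the AbstractGame interface from earlier and a generate() callable that takes the current token sequence and returns newly generated token ids (the callable and its signature are illustrative). It mirrors the masking convention above: policy tokens get mask 1, environment observations get mask 0.

Agent Loop Sketch (illustrative):
def run_episode(game, generate, tokenizer, max_turns=16):
    # PENDING: reset the environment and tokenize the initial prompt (mask = 0).
    observation = game.reset()
    prompt = game.get_system_prompt() + "\n" + observation
    token_ids = tokenizer.encode(prompt)
    loss_mask = [0] * len(token_ids)
    total_reward = 0.0

    for _ in range(max_turns):
        # GENERATING: policy tokens contribute to the RL loss (mask = 1).
        action_ids = generate(token_ids)
        token_ids += action_ids
        loss_mask += [1] * len(action_ids)

        # INTERACTING: environment feedback is context only (mask = 0).
        result = game.step(tokenizer.decode(action_ids))
        obs_ids = tokenizer.encode(result.observation)
        token_ids += obs_ids
        loss_mask += [0] * len(obs_ids)

        total_reward += result.reward
        if result.done:
            break  # TERMINATED: episode complete.

    return token_ids, loss_mask, total_reward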

Citation

@misc{opentinker2025,
  title        = {OpenTinker: Democratizing Agentic Reinforcement Learning as a Service},
  author       = {Siqi Zhu and Jiaxuan You},
  year         = {2025},
  howpublished = {\url{https://github.com/open-tinker/OpenTinker}},
  note         = {GitHub repository}
}