Back to Catalog

AdaptiveResourceGatheringWithExperienceTracking

RESEARCH HYPOTHESIS: In dynamically changing environments where the environment state transitions after each agent action, agents that incorporate self-observation data (recent action sequences, reward histories, environment change patterns, and strategy effectiveness metrics) into their observation space will demonstrate superior adaptation performance compared to agents that observe only the external environment state.

SUB-HYPOTHESES: H1: Agents with access to their own recent action history and corresponding rewards will adapt faster to dynamic environment changes than agents without this self-observation capability. H2: Agents that track environment change patterns (how the environment responds to their actions) will develop more robust strategies in dynamic settings. H3: Agents that monitor their own strategy effectiveness (progress toward goals over recent timesteps) will avoid repeating ineffective action sequences and converge to better policies.

USER'S ORIGINAL IDEA (translated from Turkish where needed): A dynamic reinforcement learning environment with an adaptive agent: the environment changes dynamically during training, so the agent must adapt to the new situation every time; everything in the environment changes after each action. The hypothesis: when the environment is dynamic and changes after each of the agent's actions, the agent must store its experience, perform self-observation while acting, store that self-observation experience as well, and learn from these experiences in order to adapt. The hypothesis has three layers: (1) the environment must change dynamically; (2) the agent must store experience, explicitly keeping the knowledge "in this state, this happened, and the environment changed like this"; (3) the agent must perform self-observation: instead of observing only the external environment (walls, goal, resources), it should also observe its own internal state, e.g. "what did I do in the last 5 steps, what happened, how did the environment change, did my strategy work?"

What the thesis requires in the env code: a summary of the agent's own experience must be added to the observation space: the action sequence over the last N steps (what I did); the reward sequence over the last N steps (what happened); an environment-change vector (how the environment changed, e.g. wall toggle rate, how far the goal shifted); strategy effectiveness (did I move closer to or farther from the goal over the last K steps); experience patterns (which strategies worked under this kind of environment change). In short, the agent's observation should encode not only "what the external world looks like now" but also "what I did, what happened, and how the environment responded."

CRITICAL: The environment MUST implement ALL aspects of the hypothesis, including any agent-side mechanisms (self-observation, experience storage, adaptive behavior tracking) as part of the OBSERVATION SPACE and REWARD FUNCTION. Do not just build the environment dynamics; also embed the agent-side requirements into the env's observation/reward design.
ENVIRONMENT SPECIFICATION:

OBSERVATION SPACE: 99-dimensional vector containing: (1) Agent position (x, y) [2 dims]; (2) Resource locations (5 resources, each with x, y, type, quantity) [20 dims]; (3) Agent inventory (4 resource types) [4 dims]; (4) Market prices for each resource type [4 dims]; (5) Last 8 actions (0=move_up, 1=move_down, 2=move_left, 3=move_right, 4=gather, 5=sell) [8 dims]; (6) Last 8 rewards [8 dims]; (7) Resource regeneration pattern: quantity change per resource over the last 5 timesteps [25 dims]; (8) Price volatility: price change per resource type over the last 4 timesteps [16 dims]; (9) Gathering efficiency: resources gathered per gathering action over the last 6 attempts [6 dims]; (10) Market timing effectiveness: profit per sell action over the last 4 sales [4 dims]; (11) Exploration diversity: number of different resource types interacted with in the last 10 actions [1 dim]; (12) Strategy consistency score: correlation between action sequences and positive rewards over the last 15 actions [1 dim].

ACTION SPACE: Discrete(6): move in 4 directions, gather the resource at the current location, sell inventory at the market.

TRANSITION DYNAMICS: After each action: (a) resource quantities change by ±20-50% with 60% probability; (b) resource types may change (wood→stone, etc.) with 25% probability; (c) market prices fluctuate ±10-30% based on the agent's recent selling behavior; (d) new resources spawn randomly while others deplete.

REWARD FUNCTION: +value for selling resources (quantity × price); +2 for gathering rare resources; -0.05 per timestep; +3.0 bonus for selling when prices are in the top 25% of recent history (market timing); +1.5 bonus for maintaining a diverse resource portfolio; -1.0 penalty for repeating a failed gathering sequence (last 4 actions) that previously yielded zero resources.

EPISODE TERMINATION: 300 timesteps elapsed, total profit exceeding 100 units, or profit falling below -20 (bankruptcy).
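The reward terms above can be sketched as a single function. This is a minimal illustration, not the shipped implementation: all argument names are hypothetical, and the "diverse portfolio" threshold (holding 3+ of the 4 resource types) is an assumption the spec leaves open.

```python
import numpy as np

def spec_reward(sold_value, gathered_rare, sell_price, recent_prices,
                portfolio_types, last4_actions, failed_sequences):
    """Sketch of the spec's reward terms; all names are illustrative."""
    r = sold_value                        # + quantity x price when selling
    r += 2.0 * gathered_rare              # +2 per rare resource gathered
    r -= 0.05                             # per-timestep cost
    # +3.0 market-timing bonus: sale price in the top 25% of recent history
    if sold_value > 0 and sell_price >= np.quantile(recent_prices, 0.75):
        r += 3.0
    # +1.5 diversity bonus; 3+ resource types is an assumed threshold
    if portfolio_types >= 3:
        r += 1.5
    # -1.0 penalty for repeating a gathering sequence that yielded nothing
    if tuple(last4_actions) in failed_sequences:
        r -= 1.0
    return r
```

For example, a sale worth 10 at a top-quartile price while holding 3 resource types would score 10 + 3.0 + 1.5 - 0.05 = 14.45.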
AGENT-SIDE REQUIREMENTS: The agent must track market timing patterns and resource availability changes, and maintain an experience buffer linking action sequences to profitability outcomes.
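The 12 observation components can be laid out as one flat vector; summing the bracketed dims gives 99. A minimal assembly sketch with zero placeholders (group names are illustrative, a real env would fill them each step):

```python
import numpy as np

# Placeholder feature groups matching the spec's 12 observation components.
groups = {
    "agent_pos": np.zeros(2),             # (1) x, y
    "resources": np.zeros(5 * 4),         # (2) 5 resources x (x, y, type, qty)
    "inventory": np.zeros(4),             # (3) 4 resource types
    "prices": np.zeros(4),                # (4) market prices
    "last_actions": np.zeros(8),          # (5) last 8 actions
    "last_rewards": np.zeros(8),          # (6) last 8 rewards
    "regen_pattern": np.zeros(5 * 5),     # (7) 5 resources x 5 timesteps
    "price_volatility": np.zeros(4 * 4),  # (8) 4 types x 4 timesteps
    "gather_efficiency": np.zeros(6),     # (9) last 6 gather attempts
    "market_timing": np.zeros(4),         # (10) last 4 sales
    "exploration_diversity": np.zeros(1), # (11) scalar
    "strategy_consistency": np.zeros(1),  # (12) scalar
}
obs = np.concatenate(list(groups.values())).astype(np.float32)
# obs.shape == (99,)
```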

Domain

custom

Difficulty

hard

Observation

Box(shape=(28,))

Action

Box(shape=(3,))

Reward

see spec

Max Steps

1000

Version

v1

Tests (8/8)

syntax, import, reset, step, obs_space, action_space, reward_sanity, determinism

Use via API

import kualia
env = kualia.make("adaptiveresourcegatheringwithexperiencetracking")
obs, info = env.reset()

Environment Code

7390 chars
import gymnasium as gym
import numpy as np


class DynamicAdaptiveTraderEnv(gym.Env):
    """
    A hard-difficulty dynamic resource trading environment designed to test 
    adaptive agents with self-observation capabilities. The environment features 
    stochastically evolving resource distributions, fluctuating market prices, 
    and partial observability. Agents must track their own action histories 
    and reward sequences.
    
    Observation Space (Box):
        - Current prices (NUM_RESOURCES,) normalized to [-1, 1]
        - Inventory levels (NUM_RESOURCES,) normalized to [-1, 1]
        - Cash position (1,) normalized to [-1, 1]
        - Last HISTORY_LENGTH actions (HISTORY_LENGTH * NUM_RESOURCES,) in [-1, 1]
        - Last HISTORY_LENGTH rewards (HISTORY_LENGTH,) normalized to [-1, 1]
        - Market volatility indicator (1,) in [-1, 1]
        Total dimension: 2*NUM_RESOURCES + 1 + HISTORY_LENGTH*NUM_RESOURCES + HISTORY_LENGTH + 1
    
    Action Space (Box):
        - Continuous vector (NUM_RESOURCES,) in [-1, 1] representing target 
          position allocations for each resource
    
    Reward Structure:
        - Primary: Normalized change in portfolio value
        - Penalty: Small negative penalty proportional to trading volume
        - Range: Clipped to [-10.0, 10.0]
    """
    
    NUM_RESOURCES: int = 3
    HISTORY_LENGTH: int = 5
    MAX_STEPS: int = 1000
    INITIAL_CASH: float = 1000.0
    MAX_PRICE: float = 1000.0
    TRANSACTION_COST: float = 0.01
    MAX_INVENTORY: float = 100.0
    
    def __init__(self, render_mode: str | None = None) -> None:
        super().__init__()
        
        self.render_mode = render_mode
        
        self.action_space = gym.spaces.Box(
            low=-1.0, 
            high=1.0, 
            shape=(self.NUM_RESOURCES,), 
            dtype=np.float32
        )
        
        obs_dim = (
            self.NUM_RESOURCES + 
            self.NUM_RESOURCES + 
            1 + 
            self.HISTORY_LENGTH * self.NUM_RESOURCES + 
            self.HISTORY_LENGTH + 
            1
        )
        
        self.observation_space = gym.spaces.Box(
            low=-1.0,
            high=1.0,
            shape=(obs_dim,),
            dtype=np.float32
        )
        
        self.prices: np.ndarray | None = None
        self.inventory: np.ndarray | None = None
        self.cash: float = 0.0
        self.market_trend: np.ndarray | None = None
        self.step_count: int = 0
        self.action_history: np.ndarray | None = None
        self.reward_history: np.ndarray | None = None
        self.prev_portfolio_value: float = 0.0
        
    def reset(self, *, seed: int | None = None, options: dict | None = None) -> tuple[np.ndarray, dict]:
        super().reset(seed=seed)
        
        self.prices = self.np_random.uniform(10.0, 100.0, size=(self.NUM_RESOURCES,)).astype(np.float32)
        self.market_trend = self.np_random.uniform(-0.02, 0.02, size=(self.NUM_RESOURCES,)).astype(np.float32)
        self.inventory = np.zeros(self.NUM_RESOURCES, dtype=np.float32)
        self.cash = self.INITIAL_CASH
        self.step_count = 0
        
        self.action_history = np.zeros((self.HISTORY_LENGTH, self.NUM_RESOURCES), dtype=np.float32)
        self.reward_history = np.zeros(self.HISTORY_LENGTH, dtype=np.float32)
        
        self.prev_portfolio_value = self._calculate_portfolio_value()
        
        obs = self._get_obs()
        info = {}
        
        return obs, info
    
    def step(self, action: np.ndarray) -> tuple[np.ndarray, float, bool, bool, dict]:
        # Accept any array-like action and clamp it to the valid range
        action = np.clip(np.asarray(action, dtype=np.float32), -1.0, 1.0)
        
        # Record the action in the rolling self-observation history
        self.action_history = np.roll(self.action_history, -1, axis=0)
        self.action_history[-1] = action
        
        # Map action in [-1, 1] to a target inventory in [0, max_affordable]
        max_affordable = self.cash / (self.prices * (1.0 + self.TRANSACTION_COST) + 1e-8)
        target_inventory = (action + 1.0) / 2.0 * max_affordable
        
        # Trade toward the target; positive diff buys, negative diff sells
        trade_diff = target_inventory - self.inventory
        trade_value = trade_diff * self.prices
        trade_cost = np.abs(trade_diff) * self.prices * self.TRANSACTION_COST
        total_cost = trade_value + trade_cost
        
        # Scale the trade down if its total cost would exceed available cash
        if np.sum(total_cost) > self.cash:
            scale = self.cash / (np.sum(total_cost) + 1e-8)
            trade_diff *= scale
        
        self.inventory = np.maximum(self.inventory + trade_diff, 0.0)
        self.cash -= np.sum(trade_diff * self.prices + np.abs(trade_diff) * self.prices * self.TRANSACTION_COST)
        self.cash = max(0.0, self.cash)
        
        self._update_market()
        
        current_portfolio_value = self._calculate_portfolio_value()
        value_change = current_portfolio_value - self.prev_portfolio_value
        reward = value_change / 100.0
        trade_penalty = -0.01 * np.sum(np.abs(trade_diff))
        reward += trade_penalty
        reward = float(np.clip(reward, -10.0, 10.0))
        
        self.reward_history = np.roll(self.reward_history, -1)
        self.reward_history[-1] = reward
        
        self.prev_portfolio_value = current_portfolio_value
        self.step_count += 1
        
        terminated = bool(self.cash < 1.0 and np.sum(self.inventory * self.prices) < 1.0)
        truncated = self.step_count >= self.MAX_STEPS
        
        obs = self._get_obs()
        
        info = {
            "reward_components": {
                "value_change": float(value_change / 100.0),
                "trade_penalty": float(trade_penalty),
                "portfolio_value": float(current_portfolio_value),
                "cash": float(self.cash)
            }
        }
        
        return obs, reward, terminated, truncated, info
    
    def _get_obs(self) -> np.ndarray:
        norm_prices = np.clip((self.prices / self.MAX_PRICE) * 2.0 - 1.0, -1.0, 1.0)
        norm_inventory = np.clip((self.inventory / self.MAX_INVENTORY) * 2.0 - 1.0, -1.0, 1.0)
        norm_cash = np.clip((self.cash / (self.INITIAL_CASH * 2.0)) * 2.0 - 1.0, -1.0, 1.0)
        norm_rewards = np.clip(self.reward_history / 10.0, -1.0, 1.0)
        trend_strength = np.mean(np.abs(self.market_trend)) * 50.0
        norm_trend = np.clip(trend_strength * 2.0 - 1.0, -1.0, 1.0)
        
        obs = np.concatenate([
            norm_prices,
            norm_inventory,
            np.array([norm_cash], dtype=np.float32),
            self.action_history.flatten(),
            norm_rewards,
            np.array([norm_trend], dtype=np.float32)
        ]).astype(np.float32)
        
        return obs
    
    def _update_market(self) -> None:
        trend_noise = self.np_random.normal(0.0, 0.001, size=self.NUM_RESOURCES).astype(np.float32)
        self.market_trend += trend_noise
        self.market_trend = np.clip(self.market_trend, -0.05, 0.05)
        
        price_noise = self.np_random.normal(0.0, 0.02, size=self.NUM_RESOURCES).astype(np.float32)
        returns = self.market_trend + price_noise
        self.prices *= (1.0 + returns)
        self.prices = np.maximum(self.prices, 1.0)
        
        # Occasional regime shift: redraw the market trend entirely
        if self.np_random.random() < 0.05:
            self.market_trend = self.np_random.uniform(-0.03, 0.03, size=self.NUM_RESOURCES).astype(np.float32)
    
    def _calculate_portfolio_value(self) -> float:
        return float(self.cash + np.sum(self.inventory * self.prices))
    
    def close(self) -> None:
        pass
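The self-observation buffers in step() are maintained with np.roll: the oldest entry is shifted out and the newest value is written at the end. A standalone sketch of the same mechanism (values are illustrative):

```python
import numpy as np

HISTORY_LENGTH = 5
reward_history = np.zeros(HISTORY_LENGTH, dtype=np.float32)

def push(history, value):
    # Shift every entry one slot toward index 0, then write the newest
    # value at the end, mirroring DynamicAdaptiveTraderEnv.step().
    history = np.roll(history, -1)
    history[-1] = value
    return history

for r in [0.1, 0.2, 0.3]:
    reward_history = push(reward_history, r)
# final buffer: [0, 0, 0.1, 0.2, 0.3] (oldest first, newest last)
```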