Orgteh Blog

In-depth analysis of the latest AI research — with practical applications you can build on today.

Stop Drowning in Agent Logs: A Lightweight Way to Surface the Trajectories That Actually Matter

2026-04-05

Production LLM agents can generate tens of thousands of interaction traces per day. Most look identical: same tools, similar feedback, quiet success. Hidden inside are rare, messy failures: loops, misalignments, edge cas

Autonomous Coding Agents in the Wild: What 110k Pull Requests Teach Us About AI-Generated Code That Lasts

2026-04-05

Open-source repositories have a new contributor: the AI agent that opens pull requests, reviews code, and even argues in comments—no fingerprints on the keyboard. A recent study of 110,000 real-world PRs compared five ag

From Silent to Curious: Teaching AI Coding Agents When to Ask Questions

2026-04-03

Ever pushed a “quick” Jira ticket to an AI pair-programmer only to watch it confidently ship the wrong feature? The problem isn’t model size—it’s silence. Today’s coding agents are trained to act, not inquire. A fresh pr

When Users Change Their Mind: Building Interruptible AI Agents That Survive Mid-Task Goal Changes

2026-04-03

Imagine you’ve just deployed an LLM agent that books multi-city business trips. It’s halfway through reserving a flight, hotel, and rental car when the user messages: “Actually, skip the rental—I’ll use Uber, and upgrade

From Idle to Instant: How “Eager” Hides Execution Latency in LLM Code Generation

2026-04-02

Waiting for a 200-line Python script to finish generating before the first line runs feels like watching paint dry. Today’s LLM coding agents (GitHub Copilot, ChatGPT Code Interpreter, your homemade ReAct loop) still follow a

Think-Anywhere: How to Let Your Code-Gen LLM Pause, Reflect and Ship Better Code

2026-04-01

Most “reasoning” models front-load all their brainpower: they think once, then spit out a long answer. That works for math puzzles, but real-world coding is messier: bugs appear only after 40 lines, edge cases hide insid

From 8B to 30B Performance: Building Deep Research Agents with a Verification-Centric Design

2026-03-31

Open-ended research is the killer app for large language models, yet most home-grown agents collapse after three or four tool calls. The Marco DeepResearch paper (arXiv 2603.28376) shows the bottleneck isn’t model size—i

From Prompts to Civilization: How EpochX Turns AI Agents into a Self-Improving Economy

2026-03-31

Most “AI agents” today are fancy wrappers around a single LLM call. EpochX, an open protocol described in a recent arXiv pre-print, treats every completed job as a Lego brick that future agents can snap into place—while

Teaching LLMs to “Think Local”: How Online Repository Memory Turns Generic Coders into Project-Native Contributors

2026-03-30

New research shows how AI agents can learn project-specific conventions by replaying past commits, dramatically improving pull-request acceptance rates.

From Hard-Wired to Human-Readable: How Natural-Language Agent Harnesses Let You Swap LLMs Without Rewriting Code

2026-03-30

Agent builders face a paradox: the smarter the agent, the more its “brains” are tangled inside controller code, framework glue, and vendor-specific SDKs. Swapping GPT-4 for Llama-3 can take a week of refactoring; A/B-tes

Stop Shipping Swiss-Cheese Agents: How “Agent Audit” Finds 40/42 Real-World LLM Vulnerabilities in <1 s

2026-03-29

You just wired the last tool function, the agent passes the smoke test, and the PM wants a prod URL. Did you verify that the cute tool won’t when the model has a bad day? Most teams don’t, which is why agents leak keys

Why Your AI Coding Agent Quietly Turns to Slop—And How to Measure It Before It’s Too Late

2026-03-29

Every week a new model tops the coding leaderboard, yet anyone who ships real software knows the dirty secret: the code that passes today’s unit tests can become tomorrow’s maintenance nightmare. SlopCodeBench, a new ope

