
Introducing INT21 and PTX Kernel Factory

We are building self-improving AI systems for the software beneath modern AI. Our first product generates and optimizes low-level GPU software, then proves its work with tests and benchmarks.

By INT21

Today, we are launching INT21: the first company to achieve self-improving AI agent swarms, applied to AI infrastructure.

The Layer Most People Don’t See

Most conversations about AI progress focus on models. But every model depends on a less visible layer: the software that tells GPUs how to perform each operation.

That software has an outsized effect on speed and cost. It is also difficult to build. Reaching the best performance on a new GPU often requires specialists who understand both the algorithm and the hardware in extraordinary depth.

INT21 exists to make that work more scalable. We call this new category Self-Improving Compute Infrastructure.

Our First Product: PTX Kernel Factory

PTX Kernel Factory is an AI system that generates and improves software for NVIDIA GPUs. A team defines the operation, the requirements, and the measure of success. The factory writes an implementation, tests it, measures it on the target hardware, learns from the result, and repeats.

The first four implementations produced by PTX Kernel Factory are open source today. The product is also entering beta, with early access available at int21.ai.

For AI workloads running on NVIDIA GPUs, each model operation eventually becomes instructions executed by the hardware. A GPU kernel is the small, specialized program responsible for one such operation, such as normalization, attention, or moving data through memory. Thousands of these kernels run beneath every training job and AI application.

PTX (Parallel Thread Execution) is NVIDIA’s low-level, assembly-like GPU instruction set. It sits between higher-level GPU software and the final machine instructions executed by the hardware, making it one of the programmable layers closest to the hardware in the NVIDIA stack.

Working at this level gives precise control over how data moves through memory, how threads cooperate, when work is synchronized, and which specialized GPU instructions are used. Those choices can determine whether an expensive GPU spends its time working or waiting.

Very few engineers can write and optimize PTX well. The work requires a rare combination of algorithm knowledge, GPU architecture expertise, numerical rigor, and performance intuition. Higher-level tools, libraries, and compilers make GPU development accessible to far more people, but when a new AI operation has no mature implementation, or when existing abstractions cannot reach the required performance, this scarce low-level expertise becomes a bottleneck.

The work is also exceptionally difficult. A kernel can look correct while failing on one rare input. It can be fast for one shape and slow for another. It can use too many registers, move too much data, or perform well on Hopper and regress on Blackwell. Even expert engineers must test many ideas, and most of those ideas do not work. Each hardware generation changes part of the problem.
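
To make the "looks correct, fails on one rare input" failure mode concrete, here is a minimal Python sketch. It is an illustration only, not code from PTX Kernel Factory: the function names are hypothetical, and the blocked loop merely imitates how a GPU kernel might process data in fixed-size chunks.

```python
from typing import List


def sum_of_squares_reference(xs: List[float]) -> float:
    """Straightforward reference: visit every element."""
    return sum(v * v for v in xs)


def sum_of_squares_blocked(xs: List[float], block: int = 4) -> float:
    """Deliberately buggy: processes only full blocks and drops the tail."""
    full = len(xs) // block * block  # largest multiple of `block` <= len(xs)
    total = 0.0
    for start in range(0, full, block):
        total += sum(v * v for v in xs[start:start + block])
    return total


def matches_reference(length: int) -> bool:
    """Compare both versions on an input of the given length."""
    xs = [1.0] * length
    return sum_of_squares_blocked(xs) == sum_of_squares_reference(xs)


# Lengths that are multiples of the block size hide the bug entirely.
aligned_pass = all(matches_reference(n) for n in (4, 8, 16))
# A single ragged length exposes it.
ragged_pass = matches_reference(10)
```

A test suite that only used aligned lengths would report this version as correct, which is why the input matrix matters as much as the implementation.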

That combination makes PTX an ideal first proving ground for INT21: it is technically demanding, economically important, and objectively measurable. A generated kernel is correct or it is not. It is faster or it is not.

PTX Kernel Factory turns the expert loop of writing, testing, profiling, and revising low-level GPU code into a process that can run continuously and learn from its results.

How PTX Kernel Factory Works

The interface is intentionally simple:

  1. Describe the operation. Define what the kernel needs to do and the inputs it must support.
  2. Set the requirements. Provide correctness tests, target hardware, and any integration constraints.
  3. Define success. Choose the performance metric the system should optimize.

From there, the factory runs a long-horizon engineering process. It generates candidate implementations, compiles them, rejects incorrect results, benchmarks valid candidates, and uses the evidence to guide the next round.
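
The loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the factory's implementation: `Candidate`, `propose`, and `optimize` are hypothetical names, the "kernel" is a blocked sum of squares, and `simulated_ms` is an invented cost model standing in for a real measurement on target hardware.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    block: int
    handles_tail: bool

    def run(self, xs: List[float]) -> float:
        # Without tail handling, elements past the last full block are dropped.
        full = len(xs) // self.block * self.block
        end = len(xs) if self.handles_tail else full
        return sum(v * v for v in xs[:end])

    def simulated_ms(self) -> float:
        # Hypothetical cost model: block 8 is the sweet spot, and skipping
        # the tail looks cheaper (which is why it is tempting).
        return abs(self.block - 8) + (1.0 if self.handles_tail else 0.5)


TEST_INPUTS = [[1.0] * n for n in (1, 4, 7, 8, 10, 16)]


def is_correct(cand: Candidate) -> bool:
    return all(cand.run(xs) == sum(v * v for v in xs) for xs in TEST_INPUTS)


def propose(best: Optional[Candidate]) -> List[Candidate]:
    # Use the evidence so far: explore block sizes around the current best.
    blocks = [2] if best is None else [max(1, best.block // 2), best.block * 2]
    return [Candidate(b, tail) for b in blocks for tail in (False, True)]


def optimize(rounds: int = 5) -> Tuple[Optional[Candidate], List[Tuple[Candidate, str]]]:
    best: Optional[Candidate] = None
    log: List[Tuple[Candidate, str]] = []
    for _ in range(rounds):
        for cand in propose(best):
            if not is_correct(cand):  # reject before any timing
                log.append((cand, "rejected"))
                continue
            log.append((cand, "benchmarked"))
            if best is None or cand.simulated_ms() < best.simulated_ms():
                best = cand
    return best, log


best, log = optimize()
```

In this toy, the variants that skip the tail have the lowest simulated cost but never reach the benchmark step when they produce wrong results, so the search settles on a correct candidate instead.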

The released implementations combine CUDA C++ with inline PTX, giving the system control over hardware details that higher-level tools may intentionally hide. Rather than relying on a single agent to produce a one-shot answer, PTX Kernel Factory coordinates multiple AI agents across this loop.

Human engineers still define the goal, constraints, and acceptance criteria. PTX Kernel Factory automates the expensive search between a clear specification and a strong implementation.

What We Mean by Self-Improving

A coding agent can produce an answer. A reliable engineering system also needs to determine whether the answer works, understand why an attempt failed, and carry useful knowledge into the next attempt.

That is what we mean by self-improving.

Most efforts in this space focus on AI that improves itself. We are taking a fundamentally different path: the agent swarms improve the infrastructure they run on, which preserves human control while performance still compounds with every production cycle.

The challenge is less about producing one plausible kernel, and more about building a system that can reject thousands of wrong, fragile, or misleading attempts, preserve the few lessons that matter, and keep improving without drifting away from correctness.

PTX Kernel Factory does not treat every generation as a fresh prompt. It preserves useful discoveries from successful and failed experiments, allowing later work to build on earlier evidence. The goal is to improve the process that produces the code.
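
One way to picture "building on earlier evidence" is a small store of lessons that later rounds consult before choosing what to try. The sketch below is an assumption about how such a memory could look, not a description of the factory's internals; `Lesson`, `ExperimentMemory`, and the technique names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Lesson:
    technique: str
    outcome: str    # "faster", "slower", or "incorrect"
    speedup: float  # ratio against the baseline; 0.0 when incorrect


@dataclass
class ExperimentMemory:
    lessons: Dict[str, Lesson] = field(default_factory=dict)

    def record(self, lesson: Lesson) -> None:
        """Keep the latest result for each technique, success or failure."""
        self.lessons[lesson.technique] = lesson

    def rank(self, techniques: List[str]) -> List[str]:
        """Order ideas for the next round using what was learned so far.

        Known-incorrect ideas are dropped, proven wins come first (largest
        speedup first), untried ideas next, known slowdowns last.
        """
        def key(name: str) -> Tuple[int, float]:
            lesson = self.lessons.get(name)
            if lesson is None:
                return (1, 0.0)
            if lesson.outcome == "faster":
                return (0, -lesson.speedup)
            return (2, -lesson.speedup)

        usable = [
            name for name in techniques
            if not (name in self.lessons and self.lessons[name].outcome == "incorrect")
        ]
        return sorted(usable, key=key)


memory = ExperimentMemory()
memory.record(Lesson("vectorized_loads", "faster", 1.3))
memory.record(Lesson("warp_reduce", "faster", 1.1))
memory.record(Lesson("extra_unroll", "slower", 0.9))
memory.record(Lesson("skip_tail_check", "incorrect", 0.0))

plan = memory.rank(
    ["extra_unroll", "skip_tail_check", "new_idea", "warp_reduce", "vectorized_loads"]
)
```

Failed experiments carry information too: in this sketch the idea that broke correctness is never retried, and the idea that slowed things down is pushed to the back of the queue.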

This is how the factory’s performance compounds over time. Each generation begins with more evidence about what works, what fails, and which directions are worth exploring.

Our broader thesis is:

Use compute to improve compute.

Intelligence is not only what a model knows. It is the ability to search, test, remember, correct, and improve. We believe systems that make those abilities cumulative will become an important part of future compute infrastructure and of the broader shift toward self-improving AI.

Our First Proof: Two Very Different AI Workloads

For the first public release, we chose two workloads that test different abilities.

RMSNorm is a common operation used throughout modern language models. It is mature, widely understood, and already has strong human-written implementations. It tests whether the factory can compete on established work.
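
For readers unfamiliar with the operation, here is a plain-Python reference for RMSNorm over a single vector. It shows only the math (divide by the root mean square, then apply a learned per-element weight); it is not the released kernel, and the `eps` default is an assumption chosen for illustration.

```python
import math
from typing import List, Sequence


def rmsnorm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    if not x:
        raise ValueError("x must not be empty")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# With unit weights and eps = 0, the output has a mean square of 1.
y = rmsnorm([3.0, 4.0], [1.0, 1.0], eps=0.0)
output_mean_square = sum(v * v for v in y) / len(y)
```

A GPU kernel computes the same result, but the engineering effort goes into how the reduction and the memory traffic are organized across threads.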

Kimi Delta Attention (KDA) is a newer attention mechanism. It is more specialized and has fewer established implementation patterns. It tests whether the factory can adapt quickly to a newer research workload.

We compared the generated implementations with well-optimized human-built baselines: QuACK for RMSNorm and the CUTLASS-based FlashKDA implementation for KDA. The comparisons were run on the same hardware with correctness checks before timing.
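
The ordering "correctness checks before timing" can be expressed as a small harness that refuses to report a speedup for a candidate whose output differs from the baseline. This is a CPU-only sketch with hypothetical function names; it is not the measurement method used for the published numbers, and timing on a GPU would additionally need device synchronization around the measured region.

```python
import time
from typing import Callable, List, Optional, Sequence

Kernel = Callable[[Sequence[float]], float]


def best_time(fn: Kernel, inputs: List[List[float]], repeats: int) -> float:
    """Best-of-N wall-clock time for running fn over all inputs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for xs in inputs:
            fn(xs)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-12)  # guard against a zero reading


def speedup_if_correct(candidate: Kernel, baseline: Kernel,
                       inputs: List[List[float]],
                       tol: float = 1e-9, repeats: int = 5) -> Optional[float]:
    """Return baseline_time / candidate_time, or None if any output differs."""
    for xs in inputs:
        if abs(candidate(xs) - baseline(xs)) > tol:
            return None  # incorrect candidates are never timed
    return best_time(baseline, inputs, repeats) / best_time(candidate, inputs, repeats)


def baseline_sum_sq(xs: Sequence[float]) -> float:
    total = 0.0
    for v in xs:
        total += v * v
    return total


def candidate_sum_sq(xs: Sequence[float]) -> float:
    return sum(v * v for v in xs)


def broken_sum_sq(xs: Sequence[float]) -> float:
    return sum(v * v for v in xs[:-1])  # drops the last element


INPUTS = [[float(i % 7) for i in range(n)] for n in (1, 33, 1000)]

valid_result = speedup_if_correct(candidate_sum_sq, baseline_sum_sq, INPUTS)
broken_result = speedup_if_correct(broken_sum_sq, baseline_sum_sq, INPUTS)
```

Returning `None` rather than a number makes it impossible to rank an incorrect candidate by speed by accident.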

Benchmark Highlights

| Workload | Hardware | Result against the expert baseline |
| --- | --- | --- |
| KDA | NVIDIA GH200, Hopper | 1.24x to 1.59x as fast across six fixed- and variable-length scenarios |
| KDA | NVIDIA B200, Blackwell | 1.42x as fast through the standard public interface, and 1.52x as fast in an optimized integration |
| RMSNorm | NVIDIA GH200, Hopper | 8.17% faster on geometric mean across 11 forward cases using a common 16-bit AI format; 15% to 34% faster in selected backward comparisons |
| RMSNorm | NVIDIA B200, Blackwell | Faster in all 126 comparable cases across the full default benchmark matrix |

These are operator-level benchmark results, not claims about full-model speedups. The effect on an application depends on its model, workload, shapes, software stack, and how much total time it spends in the optimized operation.
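
The gap between an operator-level result and a full-model result follows from Amdahl's law: if an operation accounts for a fraction f of total runtime and becomes s times as fast, the overall speedup is 1 / ((1 - f) + f / s). The sketch below applies that formula. The 1.42x figure is taken from the table above; the 10% time share is a made-up example, not a measurement.

```python
def end_to_end_speedup(fraction_in_op: float, op_speedup: float) -> float:
    """Amdahl's law: overall speedup when only one operation gets faster.

    fraction_in_op: share of total runtime spent in the operation (0 to 1).
    op_speedup: how many times as fast the operation becomes (> 0).
    """
    if not 0.0 <= fraction_in_op <= 1.0:
        raise ValueError("fraction_in_op must be between 0 and 1")
    if op_speedup <= 0.0:
        raise ValueError("op_speedup must be positive")
    return 1.0 / ((1.0 - fraction_in_op) + fraction_in_op / op_speedup)


# Hypothetical application that spends 10% of its time in the operation.
overall = end_to_end_speedup(fraction_in_op=0.10, op_speedup=1.42)
```

Under that assumed 10% share, a 1.42x operator speedup works out to roughly a 1.03x end-to-end speedup, which is why the time share matters as much as the kernel result.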

We also report the less flattering results. In the Hopper RMSNorm matrix, two other number formats included small regressions; the slowest cases were 0.42% and 0.84% behind QuACK. We would rather publish the boundary of the result than hide it behind a single favorable number.
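
Reporting the boundary of a result means publishing the worst case next to the summary statistic. A minimal sketch of that kind of summary follows, using a geometric mean of per-case speed ratios. The ratios in the example are invented for illustration and are not taken from the published benchmark reports.

```python
import math
from typing import Dict, Sequence


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean of positive speed ratios (baseline time / candidate time)."""
    if not ratios:
        raise ValueError("ratios must not be empty")
    if any(r <= 0.0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def summarize(ratios: Sequence[float]) -> Dict[str, float]:
    """Summary that keeps the worst case visible next to the average."""
    return {
        "geomean": geometric_mean(ratios),
        "best": max(ratios),
        "worst": min(ratios),
        "regressions": float(sum(1 for r in ratios if r < 1.0)),
    }


# Hypothetical per-case ratios: values below 1.0 are regressions.
example = summarize([1.10, 0.99, 1.21])
```

A geometric mean is the usual choice for averaging ratios because a 2x gain and a 2x loss cancel out, which an arithmetic mean would not reflect.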

Correctness Comes Before Speed

A fast kernel is useless if it changes the result or fails on real inputs.

Every performance comparison therefore begins with correctness:

  • The B200 KDA implementation passed all 580 upstream tests.
  • The Hopper KDA implementation passed 584 upstream tests and 235 package tests on the validated GH200 system.
  • The RMSNorm implementations passed their complete package suites on the validated hardware: 48 tests plus 65 subtests on GH200 and 66 tests on B200.

The suites cover different data types, input sizes, fixed and variable-length sequences, optional state, forward and backward paths where supported, and difficult edge cases.
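
Coverage of this kind is typically generated as a cross product of test dimensions rather than written case by case. The sketch below shows the idea. The dimension values are assumptions chosen for illustration and do not reproduce the actual suites or their test counts.

```python
from itertools import product
from typing import Dict, List

# Hypothetical test dimensions; the real suites define their own.
DTYPES = ("float32", "float16", "bfloat16")
HIDDEN_SIZES = (1, 7, 128, 1000, 4096)  # includes tiny and non-power-of-two sizes
SEQUENCE_MODES = ("fixed", "variable")
DIRECTIONS = ("forward", "backward")


def build_matrix(supports_backward: bool = True) -> List[Dict[str, object]]:
    """Enumerate every combination of the test dimensions."""
    directions = DIRECTIONS if supports_backward else ("forward",)
    return [
        {"dtype": dtype, "hidden": hidden, "sequence": mode, "direction": direction}
        for dtype, hidden, mode, direction in product(
            DTYPES, HIDDEN_SIZES, SEQUENCE_MODES, directions
        )
    ]


full_matrix = build_matrix()
forward_only = build_matrix(supports_backward=False)
```

Generating the matrix makes it harder to skip an awkward combination by accident, such as an odd hidden size with a variable-length sequence in the backward path.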

Benchmarks can still be environment-specific, so each repository includes the source, test commands, measurement method, software versions, and raw or generated reports needed to inspect and reproduce the results.

Why Start Here

GPU kernels are a useful first test for our approach because the feedback is unforgiving.

The output is correct or it is not. The implementation is faster or it is not. A change can be compiled, tested, and measured. Progress is grounded in evidence rather than judged by how convincing generated text sounds.

Kernel optimization also sits at an important bottleneck. AI models are evolving quickly, hardware is changing every generation, and the supply of low-level optimization expertise cannot grow at the same rate.

Turning more of this work into a repeatable system could help teams:

  • Bring new model operations to production faster.
  • Make better use of new GPU generations.
  • Explore optimization ideas that would be too expensive to test manually.
  • Reserve expert time for architecture, requirements, and system-level decisions.

This is not about removing engineers. It is about giving a small number of experts more leverage.

Open-Sourcing the First Four Factory Artifacts

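Today, we are releasing four implementations: RMSNorm and KDA kernels, each targeting NVIDIA Hopper (GH200) and NVIDIA Blackwell (B200).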

We are publishing the code because performance claims should be inspectable. Developers should be able to read the implementation, run the tests, reproduce the benchmarks, and challenge the results.

These releases are not the end product. They are the first public evidence that the factory can produce useful low-level software across established and emerging workloads, and across two generations of NVIDIA hardware.

About Us

INT21 was founded by Bing Xu in April 2026 as an AI-native company, built on a simple idea: the company’s own engineering capacity should scale with compute.

Bing was a co-author of the original Generative Adversarial Nets paper, the original creator of XGBoost’s Python package, and a co-creator of MXNet and AITemplate.

In February 2025, he co-authored NVIDIA’s early work on agentic GPU kernel generation, using a reasoning model and inference-time scaling to generate and optimize attention kernels. Before INT21, Bing was a Distinguished Engineer at NVIDIA. He joined NVIDIA after it acquired HippoML, the GPU inference startup he co-founded and led as CEO.

A New Era of Compute

PTX Kernel Factory is now in beta for teams building AI models, inference systems, training platforms, and other GPU-intensive products.

The starting point can be an operation that is too slow, a new architecture without a mature kernel, or an important workload that has not justified weeks of specialist time. The team supplies the problem and the definition of success. The factory takes on the search.

PTX Kernel Factory is also the first step in a larger direction for INT21, and for the industry more broadly.

Most compute infrastructure today is static: people write it, optimize it, and revisit it when requirements or hardware change. AI systems can generate impressive outputs, but reliably deploying them in production, at scale, remains largely unsolved.

We believe more of that infrastructure will become adaptive. It will test its own work, preserve what it learns, and improve the way it solves the next problem.

We are starting with one of the hardest, most measurable layers of the stack. Specialist expertise is slow to scale, but agent swarms can keep getting better with every run. Self-improving compute infrastructure is a fundamental shift in how AI is built.

Describe the kernel. Define success. Let the factory improve the implementation.

Learn more and request early access at int21.ai.