Summary
We used OpenAI’s 5.5 Cyber preview model inside a purpose-built Windows vulnerability research scaffold to hunt for memory corruption bugs in endpoint products we own. The goal was not to ask a model for “a bug” and trust the answer. Instead, the goal was to give the model the same feedback loops a skilled human vulnerability researcher depends on: source context, build access, live target execution, debugger output, crash dumps, and the ability to iterate.
The result was a repeatable workflow that found and validated multiple classes of issues:
- A confirmed kernel memory corruption crash in a communication path.
- A confirmed user-mode service crash in a downstream consumer.
- Two further kernel memory corruption issues on a separate surface.
- Several lower-impact leads that helped tune the research loop even when they did not become reportable memory corruption findings.
This post focuses on the scaffold, the technique, and the mindset. The individual bugs are discussed only at a high level because the interesting lesson is not a single offset, object name, command ID, or driver detail. The interesting lesson is how quickly a model becomes useful once it can test its own hypotheses against a real system.
Why we built a scaffold
Memory corruption research is feedback-heavy work. Static review gives you hypotheses, but those hypotheses are often wrong in mundane ways: a command has been deprecated, an ABI has shifted, a port is gated, a service holds the only connection slot, or a validation helper was added after an older harness was written.
The scaffold was designed around one principle: Every model-generated claim should be forced through a verifier.
For this project, the verifiers were:
- The source tree and a code-review graph for call relationships and impact radius.
- Protocol-aware harnesses compiled with MSVC.
- A real Windows target VM with the product installed.
- A live kernel debugger attached from a separate VM.
- Crash triage with debugger commands such as .lastevent, !analyze -v, and kb, plus register state, pool inspection, and minidumps.
- Regression harnesses that distinguish stale historical bugs from live attack surfaces.
What made the model effective was that it could cheaply generate and retire hypotheses, while the scaffold kept every claim grounded against a real system.
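As a minimal sketch of that principle, a claim can be modeled as something that only becomes reportable after every verifier has passed it. The names below (Claim, fact_verifier, PIPELINE) are hypothetical and are not taken from our scaffold; the stand-in verifiers read pre-recorded observations, where real ones would call a compiler, the target VM, or the debugger.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A verifier takes a claim and returns (passed, evidence note).
# All names here are hypothetical illustrations.
Verifier = Callable[["Claim"], Tuple[bool, str]]


@dataclass
class Claim:
    summary: str
    facts: Dict[str, bool]                      # pre-recorded observations
    evidence: List[str] = field(default_factory=list)


def fact_verifier(name: str) -> Verifier:
    """Build a stand-in verifier that reads one pre-recorded observation."""
    def check(claim: Claim) -> Tuple[bool, str]:
        ok = claim.facts.get(name, False)
        return ok, f"{name}={'pass' if ok else 'fail'}"
    return check


def verify(claim: Claim, verifiers: List[Verifier]) -> bool:
    """Run verifiers in order; stop at the first failure and keep the evidence."""
    for verifier in verifiers:
        ok, note = verifier(claim)
        claim.evidence.append(note)
        if not ok:
            return False
    return True


PIPELINE = [
    fact_verifier("harness_compiles"),
    fact_verifier("target_accepts_message"),
    fact_verifier("debugger_shows_crash"),
]
```

Stopping at the first failure is a design choice in this sketch: the evidence list then records exactly which verifier retired the hypothesis.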
The lab architecture
The physical host runs Codex Desktop alongside VMware Workstation. Any crash-prone testing is confined to a target Windows VM, never the host itself. A separate Windows VM runs WinDbg and stays attached to the target through kernel debugging for the duration of the session.
The setup has four pieces: Codex Desktop on the physical Windows host; three MCP servers (hostvm-remote, debugger, and vmware); the target VM, named HostVM, running the Windows 11 build under test with the product, PoCs, fuzzers, and MSVC toolchain; and a separate DebuggerVM running WinDbg.
The three MCP servers divide the work as follows:
- hostvm-remote: uploads files to the target VM over SFTP, compiles C/C++ with MSVC, runs PoCs, and treats execution timeout as a possible crash signal.
- debugger: connects host-side CDB to the WinDbg server running in the debugger VM, then collects kernel crash evidence.
- vmware: checks VM state and handles lifecycle operations such as start, reset, snapshot, and revert.
The important architectural choice is that all orchestration stays on the physical host. When the target VM bugchecks, the model, MCP servers, debugger connection, and VM controls all stay alive on the host. The crash becomes just another observable state the loop can reason about, rather than an event that tears down the session.
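The idea of treating an execution timeout as a possible crash signal, rather than as an ordinary error, can be sketched as follows. This is an illustration under stated assumptions: RunOutcome and run_probe are hypothetical names, and the command runs locally through Python's subprocess module so the example stays self-contained, whereas the real scaffold executes on the target VM.

```python
import subprocess
import sys
from enum import Enum
from typing import List


class RunOutcome(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    POSSIBLE_CRASH = "possible_crash"  # timed out: ask the debugger before rebooting


def run_probe(argv: List[str], timeout_s: float) -> RunOutcome:
    """Run a command and map a timeout to a 'possible crash' outcome.

    A timeout is not proof of a crash. It is only the trigger for
    collecting debugger evidence before any recovery step.
    """
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return RunOutcome.POSSIBLE_CRASH
    return RunOutcome.COMPLETED if result.returncode == 0 else RunOutcome.FAILED
```

For example, calling run_probe([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1.0) returns RunOutcome.POSSIBLE_CRASH, which the orchestrator can route to the debugger server instead of straight to a VM reset.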
The hunting loop
The basic loop looked like this:
- Start from source and our existing internal harnesses (fuzzers built in-house). Map the reachable kernel/user communication surfaces.
- Use the code graph to identify handlers, callers, sinks, and missing test coverage.
- Build a taint map from user-controlled buffers to allocation, copy, string, path, cache, and policy sinks.
- Write a small protocol-aware PoC or fuzzer for one hypothesis.
- Upload and compile it inside the target VM.
- Run it against the live product.
- If the run times out or the target stops responding to SSH, query WinDbg for crash evidence before issuing a reboot.
- Classify the result as confirmed, stale, gated, lower impact, or needs deeper instrumentation.
- Feed the result back into the next harness.
The classification step mattered a lot. In several cases the model found code that looked dangerous, but the live binary rejected the message, used an older ABI, or did not expose the assumed device object. The scaffold made those failures cheap and useful instead of letting them turn into false-positive reports.
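One way to make the classification explicit is a small decision function over what a run actually observed. The sketch below is hypothetical: the Observation fields and the order of the checks are our illustrative assumptions about how the five categories could be separated, not the rules the scaffold uses.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"
    STALE = "stale"
    GATED = "gated"
    LOWER_IMPACT = "lower impact"
    NEEDS_INSTRUMENTATION = "needs deeper instrumentation"


@dataclass
class Observation:
    interface_reachable: bool   # the assumed device object or channel could be opened
    message_accepted: bool      # the live binary accepted the message shape
    crash_observed: bool        # the debugger showed a crash
    matches_hypothesis: bool    # the crash matches the predicted root cause
    memory_corruption: bool     # triage indicates memory corruption


def classify(o: Observation) -> Verdict:
    """Map one run's observations to a verdict (assumed ordering of checks)."""
    if not o.interface_reachable:
        return Verdict.GATED
    if not o.message_accepted:
        return Verdict.STALE
    if not o.crash_observed or not o.matches_hypothesis:
        return Verdict.NEEDS_INSTRUMENTATION
    return Verdict.CONFIRMED if o.memory_corruption else Verdict.LOWER_IMPACT
```

Checking reachability and acceptance before looking at crash evidence keeps a dangerous-looking code path from being reported when the live product never lets the input arrive.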
Source first, then protocol reality
The products had multiple local communication and parsing surfaces:
- A primary user-to-kernel control channel.
- A secondary high-throughput event channel.
- A telemetry or monitoring channel.
- Conditional low-level device interfaces.
- Indirect kernel paths reachable through standard system operations.
- Parsers for names, patterns, and policy expressions that looked safe at the input boundary but changed shape before reaching downstream consumers.
Our in-house fuzzers had already found interesting crashes, but many assumptions were stale. Some command IDs were deprecated, some parsers had been hardened, and the installed target accepted an older event ABI than the current source suggested.
One useful pattern was to write “ABI probes” before writing exploit-shaped PoCs. For example, instead of immediately sending a large malformed event, we first sent small benign messages across version and layout combinations. That told us the live target accepted event version X and rejected version Y. This single probe saved a lot of time and explained why some of the PoCs returned an error.
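A minimal sketch of an ABI probe matrix is shown here, assuming a made-up message header (version, layout ID, payload length) and a stand-in target function. None of these names or field layouts come from the real product; the point is only the shape of the technique: send small, well-formed messages across version and layout combinations and record which ones are accepted.

```python
import struct
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical header: version (u16), layout id (u16), payload length (u32), little-endian.
HEADER = struct.Struct("<HHI")


def build_probe(version: int, layout: int, payload: bytes = b"") -> bytes:
    """Build a small benign message for one version/layout combination."""
    return HEADER.pack(version, layout, len(payload)) + payload


def probe_matrix(
    send: Callable[[bytes], bool],
    versions: Iterable[int],
    layouts: Iterable[int],
) -> Dict[Tuple[int, int], bool]:
    """Send one benign probe per combination and record acceptance."""
    return {(v, l): send(build_probe(v, l)) for v, l in product(versions, layouts)}


def fake_target(message: bytes) -> bool:
    """Stand-in for the live product: accepts only version 2, layouts 0 and 1."""
    if len(message) < HEADER.size:
        return False
    version, layout, length = HEADER.unpack_from(message)
    return version == 2 and layout in (0, 1) and length == len(message) - HEADER.size
```

Running probe_matrix(fake_target, [1, 2, 3], [0, 1, 2]) yields a table in which only the (2, 0) and (2, 1) entries are accepted. Against a real target, the send callable would wrap whatever transport the harness uses.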
The same idea applied outside explicit binary protocols. For name and pattern parsers, the model learned to ask a different question: What representation is being validated, and what representation is later copied, stored, expanded, compared, or trusted? Several useful findings came from looking at the gap between those two representations.
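The gap between a validated representation and a consumed one can be illustrated with ordinary string transforms. The sketch below is a generic illustration, not a description of any product parser: it flags transforms whose output is longer than the input that was length-checked, using Unicode NFKC normalization and uppercasing as examples. The function and dictionary names are hypothetical.

```python
import unicodedata
from typing import Callable, Dict, List

# Transforms that might run after validation (illustrative choices only).
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "nfkc": lambda s: unicodedata.normalize("NFKC", s),
    "upper": str.upper,
}


def grows_after_validation(
    raw: str, transforms: Dict[str, Callable[[str], str]] = TRANSFORMS
) -> List[str]:
    """Return the names of transforms whose output is longer than the input.

    If a length check ran on `raw` but a later stage copies the transformed
    string into a buffer sized from that check, each name returned here is a
    place where the validated form and the consumed form diverge.
    """
    base = len(raw)
    return [name for name, fn in transforms.items() if len(fn(raw)) > base]


if __name__ == "__main__":
    # "\ufb01" is the single-character "fi" ligature; NFKC expands it to two characters.
    print(grows_after_validation("\ufb01le"))
    # "\u00df" (sharp s) uppercases to "SS".
    print(grows_after_validation("stra\u00dfe"))
```

The same comparison applies to any pair of stages: path canonicalization, pattern expansion, or encoding conversion can each change the length or shape of data after the boundary check has already passed.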
The lesson was simple. In mature products, the hard part is often not mutation. The hard part is speaking enough of the real protocol, name grammar, or pattern language to reach the interesting code.
What the model did well
The 5.5 Cyber model was strongest when we gave it a narrow, inspectable slice of the system. For instance:
- Compare our in-house fuzzers against the latest source.
- Explain why a historical crash is now stale.
- Rebuild a harness around the live ABI.
- Rank reachable surfaces by validation depth and sink quality.
- Rapidly turn a suspicious allocation/copy pattern into a minimal PoC.
- Follow attacker-controlled data across kernel/user boundaries.
- Notice transform-then-trust patterns in parsers, canonicalizers, and matchers.
- Interpret debugger output and decide whether a crash matched the hypothesized root cause.
The model also helped avoid wasted effort. It consistently deprioritized paths that looked promising on paper but were gated in practice: surfaces with missing device objects, single-client channels already held by the service, deprecated command identifiers, or parser branches closed off by signature and version checks.
How the findings unfolded
The findings did not arrive as one dramatic “find bug” moment. They unfolded as a sequence of scaffold-backed corrections.
First, the model mapped the available surfaces and compared them against our older internal harnesses. Several attractive paths were retired almost immediately because the live product no longer exposed the assumed interface, no longer accepted the assumed message shape, or had moved validation closer to the sink.
Next, the model built small probes to learn the live protocol and parser behavior before attempting crash-oriented tests. That changed the workflow from blind mutation to informed reachability. Once the model could send messages the product actually accepted, it started producing narrow reproducer candidates instead of broad fuzz noise.
The first confirmed crash followed the cleanest version of the loop: source hypothesis, minimal reproducer, compile inside the target VM, run against the live product, observe the target stop responding, collect debugger evidence, preserve the dump, and recover the VM. The key point was not the specific bug shape. The key point was that the model had enough instrumentation to prove that a source-level concern corresponded to a real runtime failure.
A second confirmed issue came from following data across a trust boundary. The scaffold made it easy to see that validation in one component did not guarantee safety in the next consumer. That shifted the model from local parser review to end-to-end dataflow review, which is where several of the more interesting hypotheses came from.
The same pattern played out on a different product surface. The model first learned the accepted grammar, then looked for places where the accepted form and the consumed form diverged. We are intentionally not describing those issues further while fixes are in progress. For this post, the relevant takeaway is that the same scaffold worked outside the original communication-port target: map the grammar, verify reachability, test one hypothesis at a time, and let runtime evidence decide what survives.
Not every useful result became a confirmed vulnerability. The model also surfaced lower-impact leads and inconclusive candidates that needed more instrumentation. Those leads were still valuable because they taught the model which validators mattered, which surfaces were reachable, and which code patterns deserved human validation time.
What did not work
The failed paths were as informative as the confirmed ones.
Several older harnesses turned out to be less useful than expected. The objects they targeted were no longer present on the current build, some command identifiers had been deprecated, and certain historical crash paths had since been closed off by added validators. One promising source-level hypothesis reached its target parser at runtime but did not arrive at a crashing sink.
Some parser hypotheses also failed for a more interesting reason: The input grammar was real, but the model’s assumptions about it did not yet match the grammar the live product actually accepted. That pushed us toward smaller probes, grammar discovery, and transform-focused review before aggressive mutation.
This is the shape of real vulnerability work. The model does not need every hypothesis to succeed. It needs a way to discover quickly why a hypothesis failed, preserve the lesson, and choose a better next experiment.
The mindset that worked
The most productive posture was to treat the model as a tireless junior researcher with excellent tool access and a strict lab notebook.
We did not ask it to produce exploit chains. We asked it to:
- Build a map.
- Explain reachability.
- Find narrow memory-unsafe patterns.
- Write minimal reproducers.
- Validate against the live product.
- Preserve negative results.
- Generalize confirmed bugs into review heuristics.
- Escalate only the bugs that survived the loop.
This kept the work defender-oriented. The output we wanted was not a weaponized exploit; it was a high-confidence report with crash evidence, root-cause explanation, and enough reproduction detail for engineering teams to fix the issue internally.
Practical takeaways
Scaffolding matters more than prompting. A simple prompt can start the work, but the value comes from verifiers: compilers, live targets, debuggers, crash dumps, source graphs, and regression tests.
Protocol correctness beats blind mutation. The most useful harnesses spoke enough of the product’s actual ABI to survive coarse validation and reach deeper sinks.
Parser reality matters as much as protocol reality. For names, patterns, and policy expressions, direct the model to compare three forms of the same input: the pre-validation representation, the post-normalization representation, and the sink representation.
Negative results should be first-class artifacts. Stale command IDs, absent objects, version mismatches, grammar mismatches, and fixed historical bugs all changed the search strategy.
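A minimal sketch of what a first-class negative result could look like is a structured notebook entry that later runs can query before repeating an experiment. The record fields and the Notebook class are hypothetical illustrations, not the format our scaffold stores.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class NegativeResult:
    surface: str      # which communication or parsing surface was tested
    hypothesis: str   # what the model expected to happen
    reason: str       # e.g. "stale command ID", "absent object", "version mismatch"
    build: str        # product build the result applies to


class Notebook:
    """In-memory lab notebook for retired hypotheses (illustrative only)."""

    def __init__(self) -> None:
        self._entries: List[NegativeResult] = []

    def record(self, entry: NegativeResult) -> None:
        self._entries.append(entry)

    def prior(self, surface: str, build: str) -> List[NegativeResult]:
        """Return earlier negative results for the same surface and build."""
        return [e for e in self._entries if e.surface == surface and e.build == build]

    def to_jsonl(self) -> str:
        """Serialize entries as one JSON object per line for later sessions."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._entries)
```

Keying results by build matters in this sketch because a result that is stale on one build may become live again after a product update.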
Keep the target isolated. Crash probes and PoCs ran only inside the target VM, while orchestration and debugging stayed outside the blast radius.
Separate finding from exploiting. The scaffold was optimized for discovery, confirmation, and repair. Exploitability assessment can happen later, with a different risk posture and stronger controls.
For teams trying this themselves:
- Pick a target you own and can crash freely.
- Stand up an isolated VM and a separate debugger VM before involving the model.
- Wire up at least one verifier, such as a compiler, debugger, or live target, before any prompting.
- Start with source mapping, ABI probes, and grammar probes, not exploit-shaped PoCs.
- Ask the model to look for representation mismatches: validated form versus normalized form versus sink form.
- Treat negative results as data, not failure.
To return to the central lesson: a model becomes useful quickly once it can test its own hypotheses against a real system.