TL;DR:
- After 80+ tracked releases of a WordPress plugin with 234 source files, six refactoring bug classes appeared often enough to become a taxonomy: extraction residue, phantom method calls, unguarded state transitions, dead features, missing cascades, and split-brain presentation.
- They are not exotic defects but structural leftovers, stale call sites, implicit state rules, incomplete feature wiring, missing data cleanup, and divergent UI queries.
- In our current gate architecture, five of the six are caught before release through PHPStan, build checks, and structured audits. The sixth still requires runtime smoke testing and human comparison across UI surfaces.
Context: WordPress plugin refactoring across a self-hosted, PHP-based plugin portfolio, including 80+ tracked releases and 2,281 tests.
Problem: Refactoring failures do not spread randomly; they recur as a small number of structural defect classes.
Intervention: We catalogued the recurring bug classes, mapped each one to a detection tier, and converted repeated findings into pre-push, build, audit, and smoke-test gates.
Controls: PHPStan, PHPUnit, build scripts, repository-layer conventions, post-extraction checklists, convention adoption sweeps, and pre-release audit skills.
Outcome: Five of six recurring classes now have pre-release detection paths; split-brain presentation remains the hardest because it requires cross-surface runtime comparison.
Entities: WordPress, PHP, PHPStan, PHPUnit, wpdb, JSON-LD, CTS-EMEIA Labs engineering showcase, CTS Data Solutions.
The Problem with “It Works on My Machine”
Refactoring is supposed to improve code without changing behavior. That is the textbook definition. The reality, across 370 development sessions and 2,200+ unit tests within CTS-EMEIA Labs, is that refactoring introduces bugs at a predictable rate, in predictable shapes, through predictable mechanisms.
The interesting part is not that bugs exist. Every developer knows that. The interesting part is that they cluster. Six patterns account for roughly 85% of all refactoring-introduced defects across our WordPress plugin portfolio. The remaining 15% are genuinely novel. The 85% are mechanical failures that a checklist or a static analysis tool could have caught before the commit was pushed.
We did not arrive at this taxonomy by reading a book. We arrived at it by shipping broken code, fixing it, shipping the same class of broken code three releases later, fixing it again, and eventually getting angry enough to formalize what we were seeing. The anger was productive. It produced gates.
Why WordPress Plugins Are a Perfect Refactoring Stress Test
WordPress plugins operate under constraints that amplify refactoring risk. The hook system means plugin code is executed by a framework that calls callbacks at specific points in the request lifecycle; each callback receives only the arguments the hook passes, capped at the count the plugin declared through $accepted_args when it registered the callback. That gives WordPress its extension power, but it also creates runtime surfaces where a stale callback, renamed method, or wrong argument assumption can remain invisible until a specific admin action, AJAX endpoint, cron event, or front-end path triggers it.
Add to this the WordPress convention of backward compatibility at all costs. You cannot simply delete an old function and let consumers adapt. You need deprecation notices, compatibility shims, and a sunset timeline. Every extraction, every rename, every architectural improvement carries the weight of every previous decision that led to the code you are now improving.
This makes WordPress plugins an excellent laboratory for studying refactoring failure modes. If a bug pattern can survive in a codebase with 2,200 tests, PHPStan at level 6, and three layers of pre-push gates, it is a resilient bug pattern. Worth cataloging. Worth killing systematically.
Pattern 1: Extraction Residue
What it is: You copy methods from ClassA to ClassB as part of a refactoring. ClassB works. Tests pass. The original methods in ClassA are now dead code, but nobody deletes them.
Why it persists: Because the extraction feels complete the moment ClassB works. The developer’s attention has moved forward. Going back to ClassA to clean up feels like busywork. It is not busywork. It is the most important part of the extraction.
Real numbers: In version 2.77.0, we extracted a 900-line admin page class into four specialized Renderer classes plus a refactored TemplateEditor. The extraction was clean. The new classes worked. Seven methods in the original class were now dead. Nobody noticed for two releases. PHPStan finally flagged them as method.unused when we added it to the pre-push hook.
Seven methods. In a single extraction. Each one a potential source of confusion for the next developer who reads the class and thinks “this method exists, so it must be used somewhere.” Each one skewing coverage and complexity metrics. Each one a false signal.
Dead methods are not harmless. They mislead future developers, inflate complexity metrics, and mask real bugs where the live implementation diverged from the dead copy.
Detection tier: Tool-detectable. PHPStan’s method.unused rule catches private dead methods automatically. Public and protected methods require manual verification against WordPress hook registrations, which PHPStan cannot trace. Our post-extraction checklist covers both cases.
Prevention: We added a mandatory post-extraction checklist to our development process. For each method moved to the new class: delete it from the old class. If it was private, safe to delete (no external callers). If it was public or protected, grep all callers first. Then run PHPStan. Then run the test suite. The checklist takes four minutes. Skipping it costs two releases of confusion.
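The first checklist step can be partly mechanized. The following is a minimal sketch, not our actual tooling: it uses reflection to list method names declared in both the original class and the extracted class, which are the candidates for deletion. The class names LegacyAdminPage and HeaderRenderer and the helper findExtractionResidue() are hypothetical stand-ins.

```php
<?php
// Hypothetical stand-ins for the class being split and the extracted class.
class LegacyAdminPage
{
    public function render(): string { return 'page'; }
    private function renderHeader(): string { return 'header'; } // residue
    private function renderFooter(): string { return 'footer'; } // residue
}

class HeaderRenderer
{
    public function renderHeader(): string { return 'header'; }
    public function renderFooter(): string { return 'footer'; }
}

/**
 * Returns method names declared directly in both classes, i.e. candidates
 * for deletion from the original class after an extraction.
 *
 * @return string[]
 */
function findExtractionResidue(string $oldClass, string $newClass): array
{
    $declaredIn = static function (string $class): array {
        $names = [];
        foreach ((new ReflectionClass($class))->getMethods() as $method) {
            // Skip inherited methods; only code written in this class counts.
            if ($method->getDeclaringClass()->getName() === $class) {
                $names[] = $method->getName();
            }
        }
        return $names;
    };

    $shared = array_values(array_intersect($declaredIn($oldClass), $declaredIn($newClass)));
    sort($shared);
    return $shared;
}

print_r(findExtractionResidue(LegacyAdminPage::class, HeaderRenderer::class));
```

A shared name is only a candidate. A method the old class deliberately keeps as a thin delegating wrapper will also show up, so each hit still needs the caller check described above.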
Pattern 2: Phantom Method Calls
What it is: Code calls a method that does not exist on the target class. In a compiled language, this is a build error. In PHP, it is a runtime fatal that waits patiently for someone to trigger the specific code path.
Why it persists: Because PHP is dynamically typed, because IDE autocompletion sometimes suggests methods from the wrong class, and because test coverage rarely hits every code path. The method existed on the old class. You refactored. The call site still references the old location. Everything passes because the test suite does not exercise that particular button click.
Real numbers: Our logic audit at version 2.65.0 found a critical bug classified as BUG-PREWARM-PHANTOM-001. A cache rebuild process called a method on a class that had been refactored. The method no longer existed at that location. The result was a fatal error, but only when the cache was cold and the rebuild process kicked in. During normal operation, the cache was warm. During testing, the cache was seeded. The fatal hid for an entire release cycle.
Detection tier: Tool-detectable. PHPStan’s method.notFound rule is the single most valuable static analysis check for this pattern. It catches calls to methods that do not exist on the declared type. This is why we promote PHPStan from advisory mode to blocking mode in the pre-push hook as soon as the baseline is manageable.
Prevention: PHPStan in the pre-push hook, blocking mode. If the baseline has more than five existing errors, start in advisory mode (prints warnings but does not block the push). Graduate to blocking mode once the baseline has shrunk to fewer than five errors. We run PHPStan at level 6 with a memory limit of 2GB. The analysis adds about 8 seconds to every push. Those 8 seconds have prevented at least three production fatals that we know of.
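To make the failure mode concrete, here is a small illustration of a stale call site that only fails on the cold path. The class CacheStore, the function readThroughCache(), and the method name rebuildEntry() are hypothetical; this is not the code behind BUG-PREWARM-PHANTOM-001, only the same shape.

```php
<?php
// Hypothetical classes illustrating a call site left behind by a refactoring.
class CacheStore
{
    /** @var array<string, string> */
    private array $items = [];

    public function get(string $key): ?string { return $this->items[$key] ?? null; }
    public function set(string $key, string $value): void { $this->items[$key] = $value; }
    // rebuildEntry() used to live here; the refactoring moved it elsewhere.
}

function readThroughCache(CacheStore $cache, string $key): string
{
    $hit = $cache->get($key);
    if ($hit !== null) {
        return $hit; // warm path: the stale call below never runs
    }
    // Cold path: stale call site. PHP only fails here, at call time.
    return $cache->rebuildEntry($key);
}

// Warm cache: works, so a test that seeds the cache passes.
$warm = new CacheStore();
$warm->set('home', 'cached-html');
echo readThroughCache($warm, 'home'), PHP_EOL;

// Cold cache: the undefined method surfaces as a thrown Error.
try {
    readThroughCache(new CacheStore(), 'home');
} catch (Error $e) {
    echo get_class($e), ': ', $e->getMessage(), PHP_EOL;
}
```

Because $cache is declared as CacheStore, a static analyser can see that rebuildEntry() does not exist on that type without ever executing the cold path, which is exactly what the runtime cannot do.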
Pattern 3: Unguarded State Transitions
What it is: Code changes a status field without checking whether the current state allows that transition. A queue item goes from “sent” to “pending” because nothing verified it was in a state where resending makes sense. A campaign gets marked “active” while it is mid-deletion.
Why it persists: Because state machines are implicit. Nobody draws the state diagram. The allowed transitions live in the developer’s head during implementation, and they evaporate after the commit. New code that touches the same status field does not know about the constraints the original developer intended.
Real numbers: Our logic audit at version 2.79.2 found eight medium-severity findings related to unguarded state transitions. Two methods in the queue processing pipeline (scheduleRetry() and markDuplicatePrevented()) could be called regardless of the item’s current state. The convention of adding status precondition guards had been introduced in version 2.78.0 for new code, but the existing methods predating the convention were never updated.
This is the pattern within the pattern. A convention is introduced. New code follows it. Old code does not get swept. The old code is where the bugs hide. A convention that covers 80% of call sites is worse than no convention at all, because the 80% creates a false sense of security.
Detection tier: Audit-detectable. Static analysis cannot determine whether a state transition is valid without understanding the business rules. Our /audit-logic skill examines each method that changes a status field and checks for a precondition guard in the first five lines. No precondition, no guard, flagged as a finding.
Prevention: Convention adoption sweeps. When you introduce a new pattern (status guards, permission checks, input validation), you do not get to stop at the new code. You grep the entire codebase for the old pattern and update every instance. We formalized this as a rule: every new convention gets a sweep ticket that migrates all existing code. The convention is not “adopted” until the sweep is complete.
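A status precondition guard can be as small as one lookup. The sketch below is illustrative only: the QueueItem class, its status names, and the ALLOWED transition map are assumptions, and only the method name scheduleRetry() comes from the audit finding above.

```php
<?php
// Hypothetical queue item; status names and the transition map are illustrative.
final class QueueItem
{
    public const STATUS_PENDING = 'pending';
    public const STATUS_SENT    = 'sent';
    public const STATUS_FAILED  = 'failed';

    /** Allowed moves: current status => statuses it may change to. */
    private const ALLOWED = [
        self::STATUS_PENDING => [self::STATUS_SENT, self::STATUS_FAILED],
        self::STATUS_FAILED  => [self::STATUS_PENDING],
        self::STATUS_SENT    => [],
    ];

    private string $status;

    public function __construct(string $status = self::STATUS_PENDING)
    {
        $this->status = $status;
    }

    public function status(): string
    {
        return $this->status;
    }

    public function scheduleRetry(): void
    {
        $this->transitionTo(self::STATUS_PENDING);
    }

    private function transitionTo(string $next): void
    {
        // Precondition guard: refuse any move the map does not list.
        if (!in_array($next, self::ALLOWED[$this->status] ?? [], true)) {
            throw new LogicException("Cannot move queue item from {$this->status} to {$next}");
        }
        $this->status = $next;
    }
}
```

Routing every status change through one private method means the state diagram exists in exactly one place, and a sweep only has to find writes to the status field that bypass it.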
Pattern 4: Dead Features
What it is: A settings field is wired to the admin UI. The user can toggle it on and off. The value is saved to the database. But no business logic anywhere in the codebase reads that value. The feature exists in the interface and in storage, but it does nothing.
Why it persists: Because the UI works. The setting saves. The test for “does the setting save” passes. Nobody writes a test for “does the setting actually affect behavior.” The implementation was planned in two phases: Phase 1 was the UI and storage. Phase 2 was the business logic. Phase 2 never happened because something more urgent came along.
A setting that saves but does nothing is worse than a missing feature. The missing feature gets reported. The dead feature gets trusted.
Real numbers: We have found dead features in every audit we have run. The detection method is straightforward but tedious: for each user-facing setting, trace the value from UI to storage to consumption. If the consumption call site does not exist, the feature is dead. The tracing cannot be automated because “consumption” might mean a conditional branch, a filter parameter, a template variable, or a REST response field.
Detection tier: Audit-detectable. Specifically, our /audit-logic Phase L4 (feature wiring verification). This phase traces each setting from registration through sanitization through storage through retrieval through consumption. A broken chain at any point is a finding. The phase is manual because the “consumption” step requires understanding what the setting is supposed to do.
Prevention: Do not split feature implementation across releases unless both phases are ticketed and the second phase has a version target. If Phase 1 ships in version 2.80.0, Phase 2 must have a ticket targeting version 2.81.0 at the latest. An untargeted Phase 2 is an abandoned Phase 2.
Pattern 5: Missing Cascades
What it is: A DELETE operation removes a row from one table without checking whether other tables have rows that reference it. The parent is gone. The children are orphaned. The orphaned rows accumulate silently, consuming storage, appearing in aggregate queries, and confusing any operation that expects referential integrity.
Why it persists: Because WordPress core does not declare foreign keys on its tables, and most plugin tables follow suit. The wpdb class is a thin query wrapper, not an ORM. It has no cascade rules and raises no constraint violations of its own. When you delete a campaign, the queue items referencing that campaign continue to exist. They reference a campaign ID that no longer resolves to anything. They are ghosts.
Real numbers: Version 2.81.0 shipped a transaction safety bundle that addressed three cascade failures in the disconnect, campaign delete, and account reassignment flows. Each one had been individually harmless for months because the orphaned rows did not cause errors. They caused data inconsistency. Queue items for deleted campaigns would appear in aggregate counts. Engagement metrics for disconnected accounts would inflate totals. The bugs were invisible until someone looked at the numbers and asked why they did not add up.
Detection tier: Audit-detectable. Our /audit-plugin Phase A1 (architecture review) checks every DELETE and ->delete() call against all tables with potential references. Missing cascades are flagged as findings. The check is mechanical but requires knowledge of the schema relationships, which static analysis tools do not have.
Prevention: Centralized cascade methods plus transaction wrappers. Every delete operation that touches a parent table must go through one Repository method that deletes dependent rows inside a transaction when the database engine supports it. In MySQL/InnoDB, database-level foreign keys can support ON DELETE CASCADE, but many WordPress plugins still manage custom tables through $wpdb and explicit SQL, so the plugin must own its cascade policy. The cascade logic lives in the Repository layer, not in the Service layer. Scattering delete logic across services is how orphans become invisible.
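A repository-owned cascade might look like the following sketch. It assumes a wpdb-like object exposing query() and delete(); the table names plugin_queue_items and plugin_campaigns and the class CampaignRepository are hypothetical, and a real plugin would build table names from the site prefix.

```php
<?php
// Sketch of a repository-owned cascade. Table names are hypothetical; $db is
// assumed to be a wpdb-like object exposing query() and delete().
final class CampaignRepository
{
    private object $db;

    public function __construct(object $db)
    {
        $this->db = $db;
    }

    public function deleteWithDependents(int $campaignId): bool
    {
        $this->db->query('START TRANSACTION');

        // Children first, parent last.
        $steps = [
            ['plugin_queue_items', ['campaign_id' => $campaignId]],
            ['plugin_campaigns',   ['id' => $campaignId]],
        ];

        foreach ($steps as [$table, $where]) {
            // delete() is expected to return false on failure, as wpdb does.
            if ($this->db->delete($table, $where) === false) {
                $this->db->query('ROLLBACK');
                return false;
            }
        }

        $this->db->query('COMMIT');
        return true;
    }
}
```

The transaction only protects tables whose storage engine supports it, which is why the prose above qualifies the rule with "when the database engine supports it".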
Pattern 6: Split-Brain Presentation
What it is: The same data point shows different values in different parts of the admin interface. The dashboard says 47 items in the queue. The queue page says 43. The health tab says 51. All three are querying the same table, but with different WHERE clauses, different caching strategies, or different counting methods.
Why it persists: Because each UI component was built independently. The dashboard widget was built in version 2.10.0. The queue page was built in version 2.15.0. The health tab was built in version 2.69.0. Each developer wrote their own count query. None of them knew about the others. The queries diverged when new status values were added, when site-scoping was introduced, when soft-delete was implemented.
Three components showing three different numbers for the same metric is not a bug trifecta. It is an architecture failure that metastasized into the presentation layer.
Real numbers: The “sent today” metric was the canonical example. The dashboard counted all items with status “sent” and a timestamp within the current day. The queue page counted items with status “sent” or “delivered” within the current day. The health tab counted items where the last-modified timestamp fell within the current day, regardless of status. Three queries, three numbers, one confused administrator.
Detection tier: Runtime-detectable. No static analysis tool can compare the semantics of two SQL queries and determine that they should return the same number but do not. This pattern is caught by users, by smoke tests that compare values across pages, or by auditors who systematically check every place a metric appears.
Prevention: Canonical query methods in the Repository layer. If “sent today count” is a metric, there is exactly one method that computes it: QueueRepository::getSentTodayCount(). Every UI component calls that method. No component writes its own query. When the counting logic changes (new status values, site scoping, timezone handling), it changes in one place. This is the Repository pattern doing what it was designed to do.
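As a rough sketch of what such a canonical method could look like: the class below assumes a wpdb-like object exposing prepare() and get_var(). The table name plugin_queue, the sent_at column, and the choice of which statuses count as "sent" are all assumptions made for illustration; the point is only that the definition lives in one constant and one query.

```php
<?php
// Sketch only: table name, column names, and status values are assumptions.
final class QueueRepository
{
    /** The one place that defines which statuses count as "sent". */
    private const SENT_STATUSES = ['sent', 'delivered'];

    private object $db; // assumed wpdb-like: prepare() and get_var()

    public function __construct(object $db)
    {
        $this->db = $db;
    }

    /** Counts items sent in the half-open window [$dayStart, $dayEnd). */
    public function getSentTodayCount(string $dayStart, string $dayEnd): int
    {
        $placeholders = implode(', ', array_fill(0, count(self::SENT_STATUSES), '%s'));
        $sql = $this->db->prepare(
            "SELECT COUNT(*) FROM plugin_queue
             WHERE status IN ({$placeholders}) AND sent_at >= %s AND sent_at < %s",
            ...array_merge(self::SENT_STATUSES, [$dayStart, $dayEnd])
        );
        return (int) $this->db->get_var($sql);
    }
}
```

Passing the day boundaries in as arguments keeps timezone handling with the caller's single source of "today" instead of hiding it in SQL, so the dashboard, queue page, and health tab cannot disagree about where the day starts.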
The Detection Matrix
Not all bugs are created equal in terms of when they can be caught. The earlier you catch them, the cheaper they are. Here is the matrix:
| Pattern | Detection Tier | Tool | When | Cost to Fix |
|---|---|---|---|---|
| Extraction residue | Tool-detectable | PHPStan method.unused | Pre-push | Minutes |
| Phantom method calls | Tool-detectable | PHPStan method.notFound | Pre-push | Minutes |
| Unguarded state transitions | Audit-detectable | /audit-logic L1 | Pre-release | Hours |
| Dead features | Audit-detectable | /audit-logic L4 | Pre-release | Hours |
| Missing cascades | Audit-detectable | /audit-plugin A1 | Pre-release | Hours to days |
| Split-brain presentation | Runtime-detectable | Cross-page smoke tests / human review / user reports | Pre-release if smoke tests compare surfaces; otherwise post-release | Hours to days |
The progression is clear: the first two are cheap because tools catch them mechanically. The middle three require reasoning about business logic, feature wiring, or data relationships, which is why they need structured audits. The last one requires comparing runtime behavior across UI surfaces, which is why it escapes everything except cross-surface smoke tests and actual usage.
The Gate Architecture
Knowing the patterns is necessary. Preventing them is the actual work. We use a four-tier progressive gate architecture:
Tier 0: Pre-commit. PHP syntax linting and version coherence checks. These run on every commit. They catch typos and obvious parse errors. They do not catch any of the six patterns above.
Tier 1: Pre-push. PHPUnit test suite plus PHPStan analysis. This is where Patterns 1 and 2 die. PHPStan at level 6 with baseline management catches dead methods and phantom calls before the code leaves the developer’s machine. The analysis adds 8 seconds to every push. We also run grep-based gates here: silent catch blocks, SQL outside the Repository layer, debug functions in source code, raw exception messages in user-facing responses.
Tier 2: Build gates. The build script (build.ps1) runs 21 checks before producing a ZIP artifact. This includes everything from Tier 1 plus ABSPATH guards on all PHP files, response safety checks, model serialization warnings, AJAX nonce coverage, test file coverage metrics, and ZIP integrity verification. Patterns 3 through 5 are partially caught here through convention-enforcement greps, but the full detection requires Tier 3.
Tier 3: Audit skills. Nine specialized audit skills covering security, quality, engineering health, operational safety, API compliance, dependencies, performance, logic correctness, and UX. These run pre-release and catch Patterns 3, 4, and 5 through structured multi-phase analysis. A full audit runs 33 phases across four domains. Each finding gets a severity classification, a concrete fix, and a ticket reference.
Pattern 6 has no static gate. It needs runtime comparison: a smoke check that opens the dashboard, queue page, health tab, and API response and compares the numbers that claim to represent the same metric. We run 54 smoke checks before every release, but smoke coverage only protects the metrics it explicitly compares. This class of bug is not a coding failure alone; it is a semantic consistency failure across UI surfaces.
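The comparison step of such a smoke check is simple once the values have been collected. The sketch below covers only that step, under the assumption that each surface's number has already been scraped or fetched; the function findSplitBrainMetrics() and the metric keys are hypothetical, and the failed_today row is invented purely to show a metric that agrees.

```php
<?php
/**
 * Compares the values several surfaces report for the same metric.
 *
 * @param array<string, array<string, int>> $readings metric => [surface => value]
 * @return array<string, array<string, int>> only the metrics whose surfaces disagree
 */
function findSplitBrainMetrics(array $readings): array
{
    $mismatches = [];
    foreach ($readings as $metric => $bySurface) {
        // More than one distinct value means the surfaces disagree.
        if (count(array_unique($bySurface)) > 1) {
            $mismatches[$metric] = $bySurface;
        }
    }
    return $mismatches;
}

// In a real smoke run these values would come from rendered pages or API
// responses. queue_size reuses the 47/43/51 example from Pattern 6.
$readings = [
    'queue_size'   => ['dashboard' => 47, 'queue_page' => 43, 'health_tab' => 51],
    'failed_today' => ['dashboard' => 2, 'queue_page' => 2],
];
print_r(array_keys(findSplitBrainMetrics($readings)));
```

The limitation stated above still applies: a metric that is never added to $readings is never compared.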
What This Taxonomy Does Not Catch
This taxonomy is deliberately narrow. It does not cover security vulnerabilities, performance regressions, browser compatibility, WordPress version compatibility, localization issues, accessibility regressions, or product-design mistakes. Those have separate audit gates.
The taxonomy covers defects introduced by refactoring: code moves, responsibility shifts, status logic changes, feature wiring, persistence cleanup, and presentation consistency.
That boundary matters because a taxonomy that tries to cover every bug becomes a slogan instead of an operating tool.
Convention Adoption Sweeps: The Missing Discipline
The single highest-impact process change we made was formalizing convention adoption sweeps. The concept is simple: when you introduce a new pattern, you do not get to apply it only to new code. You must sweep the entire codebase for the old pattern and migrate every instance.
Consider the introduction of model status constants. Before version 2.79.0, status values were raw strings scattered across the codebase. 'active', 'pending', 'sent', 'failed'. Twenty-three raw status strings across six files. When we introduced Template::STATUS_DRAFT and friends, the new code used the constants correctly. The old code continued using raw strings. Both worked identically at runtime. They would diverge the moment someone changed a constant value, or the moment an audit flagged raw strings as a code quality issue.
The sweep ticket for that single constant introduction touched six files and replaced 23 instances. It took about 40 minutes. Without the sweep, those 23 instances would have persisted for months, each one a potential source of drift, each one a signal that “we use raw strings here” is acceptable practice.
The rule we enforce now: when you add a constant, enum, utility method, or canonical method that replaces inline values, you grep the entire codebase for the raw value and update all call sites in the same commit. If the sweep is too large for one commit, you create a sweep ticket with a version target. But the sweep must happen. A convention that covers 80% of call sites creates a false sense of security while the remaining 20% harbors the bugs.
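The grep step of a sweep can be scripted so it is repeatable. This is a minimal sketch under stated assumptions: the function findRawStatusStrings() is hypothetical, it works on file contents passed in as strings, and it assumes constants are named with a STATUS_ prefix so their own definitions can be skipped.

```php
<?php
/**
 * Finds quoted raw status strings that should be replaced by constants.
 *
 * @param array<string, string> $sources   file name => file contents
 * @param string[]              $rawValues status strings to look for
 * @return array<int, array{file: string, line: int, value: string}>
 */
function findRawStatusStrings(array $sources, array $rawValues): array
{
    $alternatives = implode('|', array_map(
        static fn (string $v): string => preg_quote($v, '/'),
        $rawValues
    ));
    // Match the value only when wrapped in a matching pair of quotes.
    $pattern = '/([\'"])(' . $alternatives . ')\1/';

    $hits = [];
    foreach ($sources as $file => $contents) {
        foreach (explode("\n", $contents) as $index => $line) {
            // Skip the constant definitions themselves (assumed STATUS_ prefix).
            if (preg_match('/\bconst\s+STATUS_/', $line) === 1) {
                continue;
            }
            if (preg_match_all($pattern, $line, $matches) > 0) {
                foreach ($matches[2] as $value) {
                    $hits[] = ['file' => $file, 'line' => $index + 1, 'value' => $value];
                }
            }
        }
    }
    return $hits;
}
```

Expect false positives: a string such as 'active' may be a CSS class or an unrelated option value. The output is a worklist for the sweep ticket, not an automatic rewrite.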
The Numbers That Matter
Across our primary WordPress plugin (BunnyPoster, 234 files, 80+ releases tracked):
- 7 dead methods found after a single class extraction in v2.77.0. All caught by PHPStan after we added it to the pre-push hook. Before PHPStan, these would have persisted indefinitely.
- 23 raw status strings remaining after the model constant introduction, found during the v2.79.2 logic audit. All swept in one commit.
- 8 unguarded state transitions found in the v2.79.2 logic audit. Two in core queue processing methods that had been live for 15+ releases without incident, because the specific state combination that would trigger incorrect behavior was rare but not impossible.
- 3 cascade failures fixed in the v2.81.0 transaction safety bundle. Each had been silently orphaning rows for multiple releases. Total orphaned rows at time of discovery: unknown, because the orphans looked like legitimate data.
- 2,281 tests in the current suite, with 5,344 assertions. The test suite catches regressions in existing behavior. It does not catch the six patterns above.
- 21 build gates in the current pipeline. Five specifically target patterns from this taxonomy. The remaining 16 cover other defect classes (security, compatibility, release integrity).
Tests verify that code does what it did before. They do not verify that the architecture is sound after a refactoring. Those are different problems requiring different tools.
Why This Taxonomy Is Stable Enough to Use
Six patterns, for now. This is an empirical taxonomy from our own release history, not a claim about every WordPress plugin. It covers the bug classes we have shipped, debugged, audited, and converted into gates. If a seventh pattern recurs across three or more releases, it becomes part of the taxonomy. Until then, forcing every incident into a larger theoretical framework would add noise without improving detection.
The taxonomy is also ordered by detection cost. Patterns 1 and 2 are cheap because PHPStan can catch them mechanically. Patterns 3, 4, and 5 cost audit time because they require reasoning about business logic, data relationships, or product wiring. Pattern 6 costs the most because it requires runtime comparison across surfaces. Investing in earlier detection tiers is the obvious strategy, and yet many WordPress plugin codebases still ship without static analysis, baseline control, or structured pre-release audits.
The PHP type system is getting better. Scalar and class type declarations, union types, readonly properties, and enums reduce ambiguous values and strengthen static analysis surfaces. Enums are especially useful for status fields because they prevent arbitrary status strings from entering typed code paths. They do not, by themselves, prove that a transition from one valid status to another is allowed. That still requires an explicit transition method, guard, or state-machine rule. Our v3.0.0 roadmap includes enum-backed statuses precisely because enums reduce invalid values while guards continue to enforce valid movement.
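The division of labour between enums and guards can be shown in a few lines. This sketch requires PHP 8.1 or later; the QueueStatus enum, its cases, and the transition rules are illustrative assumptions, not the v3.0.0 design.

```php
<?php
// Requires PHP 8.1+. Case names and the transition rules are illustrative.
enum QueueStatus: string
{
    case Pending = 'pending';
    case Sent    = 'sent';
    case Failed  = 'failed';

    /** The enum rules out invalid values; this method rules out invalid moves. */
    public function canTransitionTo(self $next): bool
    {
        return match ($this) {
            self::Pending => in_array($next, [self::Sent, self::Failed], true),
            self::Failed  => $next === self::Pending,
            self::Sent    => false,
        };
    }
}

var_dump(QueueStatus::tryFrom('archived'));                         // NULL: not a valid status
var_dump(QueueStatus::Sent->canTransitionTo(QueueStatus::Pending)); // bool(false): valid value, invalid move
```

Both Sent and Pending are perfectly valid values, so the type system alone would accept the second move. Only the explicit rule rejects it.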
Applying This to Your Own Plugin
You do not need 2,200 tests and 21 build gates to start. You need two things:
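First, add PHPStan to your pre-push hook. Start in advisory mode (prints warnings, does not block). Run composer require --dev phpstan/phpstan and add vendor/bin/phpstan analyse to .git/hooks/pre-push. Because PHPStan exits non-zero when it finds errors, advisory mode means the hook must ignore that exit code. Level 5 is a reasonable starting point. This catches Pattern 2 and the private-method cases of Pattern 1 automatically; public and protected leftovers still need the manual caller check. Promote to blocking mode once your baseline has fewer than five errors. The PHPStan documentation covers the setup in about ten minutes.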
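Git hooks can be any executable, so the advisory/blocking switch can be written in PHP as well as shell. The following is a sketch under assumptions: the function names hookExitCode() and runPhpstan() and the PHPSTAN_HOOK_MODE environment variable are hypothetical, and the analysed paths are assumed to be defined in your PHPStan configuration file.

```php
<?php
/**
 * Maps a PHPStan exit code to the exit code the pre-push hook should return.
 * Advisory mode never blocks the push; blocking mode passes failures through.
 */
function hookExitCode(int $phpstanExitCode, string $mode): int
{
    if ($mode === 'advisory') {
        return 0;
    }
    return $phpstanExitCode === 0 ? 0 : 1;
}

/** Runs PHPStan and returns its exit code; output streams to the terminal. */
function runPhpstan(string $level = '5'): int
{
    $command = 'vendor/bin/phpstan analyse --no-progress --level=' . escapeshellarg($level);
    passthru($command, $exitCode);
    return $exitCode;
}

// Hypothetical wiring inside the hook file, with the mode read from an
// environment variable and defaulting to advisory:
// exit(hookExitCode(runPhpstan(), getenv('PHPSTAN_HOOK_MODE') ?: 'advisory'));
```

Keeping the decision in a pure function makes the promotion from advisory to blocking a one-word change, and makes the hook's behaviour testable without running the analyser.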
Second, run a post-extraction checklist after every refactoring commit. For each method moved: delete the original. For each constant introduced: grep for the raw value and update all instances. For each class renamed: grep for the old class name. This is a five-minute checklist that prevents the most expensive class of bugs in the taxonomy.
If you want to go further, structured audits with formal finding reports and ticket formalization will catch Patterns 3 through 5. But the first two steps alone will eliminate the cheapest and most frequent refactoring bugs in your codebase.
The bugs are predictable. The patterns are documented. The tools exist. The only variable is whether you decide to use them before or after shipping the same class of defect for the third time.
About CTS-EMEIA Labs
CTS-EMEIA Labs is the engineering division behind CTS Data Solutions, building modular, self-hosted analytics, security, schema-governance, and data pipeline tools for execution under governance constraints. Its product work includes WordPress-native behavior analytics, WAF and bot-mitigation patterns, schema and metadata governance, and technical reference material for high-governance operating environments.
This article is part of the Labs engineering-notes corpus: field-tested technical lessons converted into reusable controls, checklists, and audit patterns.

