Debugging CI Failures
This guide explains how to diagnose failing v2 CI jobs by tracing test failures to their source clusters and reading diagnostic artifacts.
Finding Test Results
Each TestGroup produces a JUnit XML file named by its JUnitFile field. These files are written to ARTIFACT_DIR and appear in the Prow job artifacts.
For example, the Azure self-managed job produces:
junit_self_managed_azure_public.xml
junit_self_managed_azure_private.xml
junit_self_managed_azure_oauth_lb.xml
junit_nodepool_autoscaling.xml
junit_lifecycle_upgrade.xml
junit_lifecycle_etcd_chaos.xml
Additionally, create-guests emits junit_hosted_cluster_{name}.xml for each cluster that reaches Phase 4 (version rollout wait), recording either success or failure. On failure, the JUnit file contains the HostedCluster and NodePool conditions at the time of failure. On success, it records a passing test case confirming the rollout completed.
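For a quick triage pass over any of these files, a short script can list which test cases passed, failed, or were skipped. The following is a minimal sketch that assumes the files follow the common JUnit XML layout (testcase elements with optional failure, error, or skipped children); the function name and the sample document are illustrative and not part of the CI tooling.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Group test case names by outcome: passed, failed, or skipped."""
    root = ET.fromstring(xml_text)
    summary = {"passed": [], "failed": [], "skipped": []}
    # iter() finds <testcase> at any depth, so both <testsuites> and
    # <testsuite> root elements are handled.
    for case in root.iter("testcase"):
        name = case.get("name", "")
        if case.find("failure") is not None or case.find("error") is not None:
            summary["failed"].append(name)
        elif case.find("skipped") is not None:
            summary["skipped"].append(name)
        else:
            summary["passed"].append(name)
    return summary


# Invented sample document, used only to exercise the function.
SAMPLE = """<testsuites>
  <testsuite name="example">
    <testcase name="rollout completes"/>
    <testcase name="nodes become ready">
      <failure message="timed out">condition details</failure>
    </testcase>
    <testcase name="optional check"><skipped message="skipped"/></testcase>
  </testsuite>
</testsuites>"""

if __name__ == "__main__":
    print(summarize_junit(SAMPLE))
```

To run it against a downloaded artifact, read the file contents and pass them in, for example summarize_junit(open(path).read()).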
Mapping Failures to Clusters
To find which cluster a failing test ran against, trace the path:
- JUnit file name → TestGroup.Name (e.g., junit_self_managed_azure_public.xml → "public")
- TestGroup.Name → TestGroup.ClusterFile (e.g., "public" → "cluster-name-public")
- ClusterFile → cluster name derived from PROW_JOB_ID + variant (e.g., public-a1b2c3d4e5)
The run-tests step log shows the mapping explicitly:
Running public tests against public-a1b2c3d4e5...
Running private tests against private-f6e7d8c9b0...
To find the cluster name for a failing test, search the run-tests step log for the line matching the test group name.
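The log search can be scripted when many groups are involved. This sketch assumes the log lines have the form shown above ("Running <group> tests against <cluster>...") and that group names contain no spaces; the function name is hypothetical, and real logs may carry timestamps or other prefixes, so the pattern is not anchored to the start of the line.

```python
import re

# Pattern modeled on the example log lines; adjust it if the real
# run-tests output differs.
RUNNING_LINE = re.compile(r"Running (\S+) tests against (\S+?)\.\.\.")


def cluster_for_group(log_text, group):
    """Return the cluster name the given test group ran against, or None."""
    for match in RUNNING_LINE.finditer(log_text):
        if match.group(1) == group:
            return match.group(2)
    return None


SAMPLE_LOG = """Running public tests against public-a1b2c3d4e5...
Running private tests against private-f6e7d8c9b0...
"""

if __name__ == "__main__":
    print(cluster_for_group(SAMPLE_LOG, "private"))
```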
Reading Ginkgo Verbose Output
The --ginkgo.v flag produces verbose output that maps test failures to their source code structure.
When a test is running, Ginkgo prints the full test description. The exact format depends on the Ginkgo version; with Ginkgo v2 it looks approximately like:
Control Plane Workloads Deployment generation kube-apiserver should not indicate rapid rollouts
This maps to the nested test block structure:
Describe("Control Plane Workloads") →
Context("Deployment generation") →
Context("kube-apiserver") →
It("should not indicate rapid rollouts")
On failure, Ginkgo prints:
- The exact It block description
- The assertion that failed (e.g., Expected <5> to be <= <3>)
- File path and line number (e.g., control_plane_workloads_test.go:42)
- Full label set (e.g., [control-plane-workloads])
Use this information to locate the failing test in the codebase and understand what condition was violated.
create-guests Failures
The most common failure point in v2 jobs is Phase 4 (version rollout wait) in create-guests. When this happens:
- Check for JUnit XML: Look for junit_hosted_cluster_*.xml in artifacts
- Read conditions: The JUnit file contains HostedCluster and NodePool conditions at the time of failure
- No JUnit file? If no JUnit file exists, the failure happened before Phase 4; check the create-guests step log for earlier phases
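The first check in the list above can be automated against a locally downloaded copy of the job artifacts. This is a sketch under stated assumptions: the helper names are hypothetical, and because the exact subdirectory holding the files is not specified here, the search is recursive.

```python
from pathlib import Path

PREFIX = "junit_hosted_cluster_"


def hosted_cluster_junit_files(artifact_dir):
    """Map cluster name to path for each junit_hosted_cluster_{name}.xml found."""
    found = {}
    # rglob also searches subdirectories, since the location of the files
    # inside the downloaded artifacts is an assumption here.
    for path in sorted(Path(artifact_dir).rglob(PREFIX + "*.xml")):
        found[path.stem[len(PREFIX):]] = path
    return found


def reached_phase_4(artifact_dir, cluster_name):
    """True if a JUnit file exists for the cluster, meaning it reached Phase 4."""
    return cluster_name in hosted_cluster_junit_files(artifact_dir)
```

A cluster for which reached_phase_4 returns False is a candidate for an earlier-phase failure, so the create-guests step log is the next place to look.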
Common failures in other phases:
- Phase 1 (cluster creation): hypershift create cluster command failure; check for invalid flags or missing credentials
- Phase 2 (post-create hooks): Platform-specific hook failure; check for API errors when patching resources
- Phase 3 (wait Available): Timeout waiting for HostedClusterAvailable condition; indicates control plane startup failure
- Phase 5 (write cluster names): Failure writing cluster names to SHARED_DIR; rare, typically caused by filesystem or permissions errors. Because this phase runs after Phase 4, the JUnit file is present in this case
dump-guests Artifacts
hypershift dump cluster collects must-gather, events, pod logs, and other diagnostics. These appear in the Prow artifact directory under artifacts/<step-name>/.
The dump-guests binary exits non-zero if required environment variables (PROW_JOB_ID, ARTIFACT_DIR) are missing. Once environment setup succeeds, it always exits 0 even if individual cluster dumps fail, so missing artifacts indicate a dump failure for that cluster. To diagnose:
- Check the dump-guests step log for WARNING messages
- Look for timeout or API errors when fetching resources
- Verify the cluster still existed when dump ran (it may have been deleted by an earlier cleanup step)
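Scanning a long step log by eye is error-prone, so a small filter can pull out the relevant lines with their line numbers. This sketch matches only the literal marker WARNING by default; the exact wording of timeout and API error messages is not documented here, so any extra markers you pass are your own assumption. The function name and sample log are invented for illustration.

```python
def flagged_lines(log_text, markers=("WARNING",)):
    """Return (line_number, text) pairs for log lines containing any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.strip()))
    return hits


# Invented sample text, not real dump-guests output.
SAMPLE_LOG = """dumping cluster public-a1b2c3d4e5
WARNING: dump failed for cluster private-f6e7d8c9b0
done"""

if __name__ == "__main__":
    for number, line in flagged_lines(SAMPLE_LOG):
        print(f"{number}: {line}")
```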
Common artifacts to check:
- namespaces/<control-plane-namespace>/pods/: pod logs for control plane components
- cluster-scoped-resources/events/: cluster events showing resource creation/deletion
- namespaces/<control-plane-namespace>/core/events.yaml: control plane events
- namespaces/openshift-*/: hosted cluster namespace logs
Understanding Informing Test Skips
When a test labeled Informing fails, the custom fail handler converts it to a skip. It appears as "skipped" (not "failed") in JUnit reports.
To see the actual failure reason, look at the Ginkgo verbose output in the run-tests step log. The skip message includes the original failure:
[SKIP] informing test failure: Expected <false> to be true
The test failure does not block CI or appear in the JUnit summary. This is by design — Informing tests are informational only.
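Because these failures are reported as skips, they are easy to overlook. A short script can collect them from the run-tests step log; this sketch assumes the skip message appears on a single line in the form shown above, and the function name is hypothetical.

```python
import re

# Modeled on the example skip message; "." does not match newlines,
# so the capture stops at the end of the line.
SKIP_LINE = re.compile(r"\[SKIP\] informing test failure: (.*)")


def informing_failures(log_text):
    """Return the original failure messages of Informing tests that were skipped."""
    return [match.group(1).strip() for match in SKIP_LINE.finditer(log_text)]


SAMPLE_LOG = """some earlier output
[SKIP] informing test failure: Expected <false> to be true
some later output
"""

if __name__ == "__main__":
    for reason in informing_failures(SAMPLE_LOG):
        print(reason)
```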
Informing tests are invisible to Sippy
Informing test failures do not appear in Sippy or Component Readiness dashboards. If you need a test to be tracked by these tools, do not use the Informing label.