PKB Bug Report: Incomplete Teardown When Provision Phase Fails
Issue Type: Bug
Severity: Critical
Component: Resource Cleanup / Teardown
Related Issue: #1239
Status: Proposed fix, implemented and tested successfully in my environment
Fix Summary
The Delete() method in benchmark_spec.py fails to clean up network and firewall resources when:
- The `deleted` flag is prematurely set to `True` during provision failures
- Pickle data corruption causes network/firewall objects to be invalid
- Teardown runs with `--run_stage=teardown` but skips cleanup due to the `deleted` flag
Implemented Solution: Three-layer defense strategy in perfkitbenchmarker/benchmark_spec.py:
- Ignore deleted flag during teardown-only runs: Don't trust the `deleted` flag when `stages.TEARDOWN` is in `FLAGS.run_stage` (a condensed sketch follows this list)
- Validate objects before use: Check that network/firewall objects have valid `Delete()` methods
- Fallback to direct cleanup: When objects are invalid, use `gcloud` commands to delete resources by name pattern
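To make the first layer concrete, here is a condensed, self-contained sketch of the teardown-only bypass. The helper name `_ShouldSkipDelete` is illustrative only; the actual patch inlines this check in `Delete()`, as shown in the Implementation Plan further down.
```python
def _ShouldSkipDelete(deleted: bool, run_stages: list[str]) -> bool:
  """Illustrative helper: honor a stale `deleted` flag only when the user did NOT
  explicitly request teardown. Mirrors CHANGE 1 in the Implementation Plan below."""
  return deleted and 'teardown' not in run_stages


# With --run_stage=teardown, a stale deleted=True no longer short-circuits cleanup:
assert _ShouldSkipDelete(True, ['teardown']) is False
assert _ShouldSkipDelete(True, ['provision', 'run']) is True
```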
Test Results: Successfully cleaned up orphaned resources from run URI e848137d:
- Deleted firewalls: `default-internal-10-0-0-0-8-e848137d`, `perfkit-firewall-e848137d-22-22`
- Deleted network: `pkb-network-e848137d`
Original Bug Report
Summary
PerfKitBenchmarker fails to delete network and firewall resources when a benchmark fails during the provision phase, leaving orphaned GCP resources that:
- Cost money
- Block future runs with the same `run_uri`
- Require manual cleanup
Root Cause Analysis
The primary issue is a fatal ControlPath too long error during the SSH connection setup within the WaitForBootCompletion method. This error is triggered because the temporary directory path for the SSH control socket exceeds the maximum allowed length on the user's macOS system. This initial error triggers a KeyboardInterrupt, which then causes the incomplete teardown.
The Delete() method in benchmark_spec.py has a critical flaw: firewall and network deletion is skipped when self.vms is empty, which occurs when a benchmark fails during VM provisioning.
Code Analysis
In perfkitbenchmarker/benchmark_spec.py (lines 958-987):
```python
def Delete(self):
  if self.deleted:
    return

  # ... other resource deletions ...

  if self.vms:  # ← PROBLEM: This condition gates VM deletion
    try:
      background_tasks.RunThreaded(self.DeleteVm, self.vms)
      background_tasks.RunThreaded(
          lambda vm: vm.DeleteScratchDisks(), self.vms
      )
    except Exception:
      logging.exception(...)

  # Placement groups deletion (outside if block)

  for firewall in self.firewalls.values():  # ← This code runs
    try:
      firewall.DisallowAllPorts()  # ← But only DISABLES ports, doesn't DELETE
    except Exception:
      logging.exception(...)

  # ... container cluster deletion ...

  for net in self.networks.values():  # ← Network deletion fails
    try:
      net.Delete()  # ← Fails because firewall rules still exist
    except Exception:
      logging.exception(...)
```
The Bug:
- `firewall.DisallowAllPorts()` only disables firewall rules; it doesn't delete them
- Network deletion fails because firewall rules are dependencies
- When `self.vms` is empty (provision failure), the firewall deletion logic is effectively bypassed
Reproduction Steps
- Run any PKB benchmark that creates networks and VMs
- Kill the process during VM provisioning (after network/firewall creation but before VMs are fully created)
- Run teardown with `--run_stage=teardown --run_uri=<failed_run_uri>`
- Observe that network and firewall rules are NOT deleted
Example Command
```bash
# Start benchmark
./pkb.py --benchmarks=nginx \
  --cloud=GCP \
  --project=my-project \
  --zone=us-central1-a

# Kill process during VM creation (Ctrl+C)

# Attempt cleanup
./pkb.py --benchmarks=nginx \
  --cloud=GCP \
  --project=my-project \
  --zone=us-central1-a \
  --run_uri=<failed_run_uri> \
  --run_stage=teardown
```
Result: Teardown reports "SUCCEEDED" but resources remain:
```bash
gcloud compute networks list | grep <run_uri>
# pkb-network-<run_uri>                    True
gcloud compute firewall-rules list | grep <run_uri>
# perfkit-firewall-<run_uri>-22-22
# default-internal-10-0-0-0-8-<run_uri>
```
Actual Behavior
- Benchmark fails during provision phase
- `self.vms` list is empty in pickled spec
- Teardown runs but skips effective firewall deletion
- Network deletion fails due to firewall dependencies
- PKB reports teardown as "SUCCEEDED" despite orphaned resources
- Resources remain in GCP, costing money
Expected Behavior
- Teardown should delete ALL created resources regardless of `self.vms` state
- Firewall rules should be explicitly deleted, not just disabled
- Network deletion should succeed after firewall deletion
- PKB should validate that resources were actually deleted
- If resources remain, teardown should report FAILED
Impact
Resource Leakage
- Orphaned networks cost ~$0.01/hour each
- Orphaned firewall rules accumulate over time
- No automatic cleanup mechanism exists
Operational Issues
- Cannot resume benchmarks with the same `run_uri`
- Manual cleanup required for each failed run
- Difficult to track which resources are orphaned
Silent Failure
- PKB reports "SUCCEEDED" for teardown even when resources remain
- No warning or error messages
- Users unaware of orphaned resources until billing alerts
Evidence
Our Case Study
Run URI: e848137d
Timeline:
17:10:38 - Network created successfully
17:11:16 - Firewall rules created successfully
17:11:30 - VM creation started
17:15:47 - VM deleted (benchmark failed)
17:19:18 - Network deletion attempted → FAILED (firewall rules still exist)
20:23:56 - Teardown with --run_stage=teardown → Reports "SUCCEEDED"
Orphaned Resources (verified with gcloud):
```
$ gcloud compute networks list --project=YOUR_PROJECT_ID | grep e848137d
pkb-network-e848137d                    True

$ gcloud compute firewall-rules list --project=YOUR_PROJECT_ID | grep e848137d
default-internal-10-0-0-0-8-e848137d    pkb-network-e848137d    INGRESS
perfkit-firewall-e848137d-22-22         pkb-network-e848137d    INGRESS
```
PKB Teardown Log:
```
2025-11-17 20:23:56,540 e848137d MainThread nginx(1/0) INFO Tearing down resources for benchmark nginx
2025-11-17 20:23:56,546 e848137d MainThread INFO Benchmark run statuses:
Name   UID     Status     Failed Substatus
nginx  nginx0  SUCCEEDED  UNCATEGORIZED
Success rate: 100.00% (1/1)
```
Note: Teardown completed in 0.006 seconds - no actual deletion occurred!
My Failed Attempts
My previous attempts to fix the issue were unsuccessful because I was focused on the wrong problem. I initially believed the issue was with the teardown logic itself, but the real problem is the ControlPath too long error that triggers the entire failure cascade. My fixes to the teardown logic were ineffective because the script was never reaching that point in the code.
The following run_uri was used for testing: a25b5b22
When the pickle file is missing, the following error is thrown:
```
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/perfkitbenchmarker/runs/e848137d/cluster_boot0'
```
Related Issues
This bug is related to Issue #1239 which describes similar cleanup problems in the object_storage_service benchmark. That issue has been open since December 2016 (8+ years) with no resolution.
Both issues stem from the same root cause: incomplete resource tracking and cleanup logic in PKB's teardown phase.
Proposed Fix
Option 1: Always Delete Networks and Firewalls
Modify benchmark_spec.py to delete networks and firewalls regardless of VM state:
```python
def Delete(self):
  if self.deleted:
    return

  # ... other resource deletions ...

  if self.vms:
    try:
      background_tasks.RunThreaded(self.DeleteVm, self.vms)
      background_tasks.RunThreaded(
          lambda vm: vm.DeleteScratchDisks(), self.vms
      )
    except Exception:
      logging.exception(...)

  # ALWAYS delete firewalls, even if no VMs exist
  for firewall in self.firewalls.values():
    try:
      firewall.Delete()  # ← Explicit deletion, not just DisallowAllPorts()
    except Exception:
      logging.exception(...)

  # ... container cluster deletion ...

  # ALWAYS delete networks, even if no VMs exist
  for net in self.networks.values():
    try:
      net.Delete()
    except Exception:
      logging.exception(...)
```
Option 2: Comprehensive Resource Tracking
Track all created resources in the pickled spec, not just VMs:
```python
class BenchmarkSpec:

  def __init__(self, ...):
    # ... existing code ...
    self.created_resources = {
        'networks': [],
        'firewalls': [],
        'vms': [],
        'disks': [],
        # etc.
    }

  def Delete(self):
    # Delete all tracked resources regardless of state
    for resource_type, resources in self.created_resources.items():
      for resource in resources:
        try:
          resource.Delete()
        except Exception:
          logging.exception(...)
```
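For this to work, each resource would have to be registered the moment it is created. A hypothetical registration point during provisioning might look like the following; the `_CreateNetwork` helper and the factory call are illustrative, not an existing PKB call site:
```python
def _CreateNetwork(self, network_spec):
  """Illustrative provisioning hook for Option 2: record the network as soon as it
  exists so Delete() can find it even if later provisioning steps fail."""
  net = self.network_class.GetNetwork(network_spec)  # assumed factory; name illustrative
  net.Create()
  self.created_resources['networks'].append(net)     # register immediately after creation
  return net
```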
Option 3: Validate Cleanup
Add validation to ensure resources were actually deleted:
```python
def Delete(self):
  # ... existing deletion code ...

  # Validate cleanup
  remaining_resources = []
  for net in self.networks.values():
    if net.Exists():
      remaining_resources.append(f"Network: {net.name}")
  for firewall in self.firewalls.values():
    if firewall.Exists():
      remaining_resources.append(f"Firewall: {firewall.name}")

  if remaining_resources:
    raise errors.Resource.CleanupError(
        f"Failed to delete resources: {remaining_resources}"
    )
```
Workaround
Until this is fixed, users must manually clean up orphaned resources:
```bash
# List orphaned resources
RUN_URI="<failed_run_uri>"
PROJECT="<your-project>"
gcloud compute networks list --project=${PROJECT} | grep ${RUN_URI}
gcloud compute firewall-rules list --project=${PROJECT} | grep ${RUN_URI}

# Delete firewall rules first (dependency)
gcloud compute firewall-rules delete \
  perfkit-firewall-${RUN_URI}-22-22 \
  default-internal-10-0-0-0-8-${RUN_URI} \
  --project=${PROJECT} \
  --quiet

# Delete network
gcloud compute networks delete \
  pkb-network-${RUN_URI} \
  --project=${PROJECT} \
  --quiet
```
Environment
- PKB Version: v1.12.0-5933-g20771be8
- Cloud Provider: GCP
- Benchmark: nginx (but affects all benchmarks)
- Platform: macOS (but cloud-agnostic issue)
- Python: 3.11
Additional Context
This bug has significant cost implications for organizations running PKB at scale. Each failed benchmark run leaves orphaned resources that accumulate over time. Without automated cleanup or proper error reporting, these resources can go unnoticed until billing alerts trigger.
The bug is particularly insidious because:
- PKB reports teardown as "SUCCEEDED" even when it fails
- No warning messages are generated
- The pickled spec doesn't track all created resources
- Manual intervention is required for every failed run
Deep Dive Analysis (November 17, 2025 - 21:36)
Pickle File Investigation
Examined the pickled BenchmarkSpec for run_uri=e848137d:
```
# Pickle file: /tmp/perfkitbenchmarker/runs/e848137d/nginx0
Networks:     ['[', '  {', '    "autoCreateSubnetworks": true,', ...]  # JSON strings!
Firewalls:    []    # Empty
VMs:          0     # No VMs
Deleted flag: True  # Already marked as deleted
```
Critical Discovery: Corrupted Pickle Data
The spec.networks dictionary contains raw JSON strings as keys instead of GceNetwork objects. This corruption explains the silent failure:
- `if self.deleted: return` - The `deleted` flag is already `True`, so `Delete()` returns immediately
- `if self.networks:` - Evaluates to `True` (dict has JSON string keys)
- `for net in self.networks.values():` - Iterates over JSON strings, not network objects
- `net.Delete()` - Calling `.Delete()` on a string does nothing (no error, no deletion)
- Teardown completes in 0.006 seconds - no actual work performed (an inspection sketch follows this list)
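The corrupted state can be confirmed by inspecting the pickle directly. Below is a minimal sketch, assuming the file can be unpickled with PKB's modules importable (so pickle can reconstruct the objects); the path is the one reported above:
```python
import pickle

# Path taken from this report; adjust the run_uri/benchmark name for other runs.
PICKLE_PATH = '/tmp/perfkitbenchmarker/runs/e848137d/nginx0'

with open(PICKLE_PATH, 'rb') as f:
    spec = pickle.load(f)

# Print the fields discussed above: the stale flag and the types actually stored.
print('deleted flag:', spec.deleted)
print('vm count:', len(spec.vms))
print('network value types:', [type(v).__name__ for v in spec.networks.values()])
print('firewall value types:', [type(v).__name__ for v in spec.firewalls.values()])
```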
Root Cause: Three-Layer Failure
- Layer 1: Provision Failure - Network creation succeeded but stored raw JSON instead of objects
- Layer 2: Pickle Corruption - The `networks` dict was pickled with malformed data
- Layer 3: Silent Teardown Failure - The `deleted` flag prevents re-execution, and invalid objects cause silent no-ops
Why Previous Fix Failed
The previous attempt added resource discovery but only for FileNotFoundError. In this case:
- Pickle file exists at `/tmp/perfkitbenchmarker/runs/e848137d/nginx0`
- The `deleted=True` flag causes immediate return from `Delete()`
- Resource discovery code never executes
- No cleanup occurs
Revised Proposed Fix
Strategy: Robust Teardown with Validation and Recovery
The fix must handle three scenarios:
Scenario 1: Normal Teardown (provision succeeded)
- `self.networks` contains valid `GceNetwork` objects
- `self.firewalls` contains valid `GceFirewall` objects
- Standard deletion works
Scenario 2: Corrupted Pickle (THIS CASE)
- `self.networks` contains invalid data (JSON strings, empty, etc.)
- `self.firewalls` may be empty or invalid
- `deleted` flag may be `True`
- Must bypass early return and force cleanup
Scenario 3: Missing Pickle
- No pickle file exists
- Must create minimal spec and discover resources (a sketch follows this list)
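For Scenario 3 the spec itself is gone, so one possible shape is a bare stand-in object carrying just enough state to drive the direct-cleanup fallback shown in the Implementation Plan below. This is purely a sketch; `MinimalTeardownSpec` is not part of the current PKB code:
```python
class MinimalTeardownSpec:
  """Hypothetical stand-in used when no pickle exists for a run_uri.

  It exposes only the attributes the fallback cleanup path needs; everything that
  would normally be recovered from the pickle is left empty.
  """

  def __init__(self, project: str):
    self.project = project   # consumed by _CleanupOrphanedGCPResources()
    self.networks = {}       # nothing recoverable from disk
    self.firewalls = {}
    self.vms = []
    self.deleted = False     # ensure the cleanup path actually runs
```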
Implementation Plan
```python
def Delete(self):
  """Delete all benchmark resources with robust error handling."""
  # CHANGE 1: Don't trust the deleted flag during teardown-only runs
  if self.deleted and stages.TEARDOWN not in FLAGS.run_stage:
    return

  # CHANGE 2: Validate network/firewall objects before use
  valid_networks = self._ValidateNetworkObjects()
  valid_firewalls = self._ValidateFirewallObjects()

  # CHANGE 3: If validation fails, fall back to direct GCP cleanup
  if not valid_networks or not valid_firewalls:
    logging.warning(
        'Invalid network/firewall objects detected. '
        'Falling back to direct GCP resource cleanup.'
    )
    self._CleanupOrphanedGCPResources()

  # ... rest of deletion logic ...

  # CHANGE 4: Always attempt firewall deletion (not just DisallowAllPorts)
  if self.firewalls:
    for firewall in self.firewalls.values():
      try:
        if hasattr(firewall, 'Delete') and callable(firewall.Delete):
          firewall.Delete()
        else:
          logging.warning(f'Invalid firewall object: {firewall}')
      except Exception:
        logging.exception('Got an exception deleting firewalls.')

  # CHANGE 5: Always attempt network deletion
  if self.networks:
    for net in self.networks.values():
      try:
        if hasattr(net, 'Delete') and callable(net.Delete):
          net.Delete()
        else:
          logging.warning(f'Invalid network object: {net}')
      except Exception:
        logging.exception('Got an exception deleting networks.')

  self.deleted = True


def _ValidateNetworkObjects(self) -> bool:
  """Validate that network objects are actual GceNetwork instances."""
  if not self.networks:
    return False
  for net in self.networks.values():
    if not hasattr(net, 'Delete') or not callable(net.Delete):
      return False
  return True


def _ValidateFirewallObjects(self) -> bool:
  """Validate that firewall objects are actual GceFirewall instances."""
  if not self.firewalls:
    return False
  for fw in self.firewalls.values():
    if not hasattr(fw, 'Delete') or not callable(fw.Delete):
      return False
  return True


def _CleanupOrphanedGCPResources(self) -> None:
  """Directly query and delete GCP resources by run_uri pattern."""
  if FLAGS.cloud != provider_info.GCP:
    logging.warning('Direct cleanup only supported for GCP')
    return

  from perfkitbenchmarker.providers.gcp import util as gcp_util

  project = FLAGS.project or self.project
  if not project:
    logging.error('No project specified for cleanup')
    return

  logging.info(f'Cleaning up orphaned GCP resources for run_uri: {FLAGS.run_uri}')

  # Delete firewall rules first (dependency for networks)
  try:
    cmd = gcp_util.GcloudCommand(
        self, 'compute', 'firewall-rules', 'list',
        '--filter', f'name~-{FLAGS.run_uri}',
        '--format', 'value(name)'
    )
    cmd.flags['project'] = project
    stdout, _, retcode = cmd.Issue(raise_on_failure=False)
    if retcode == 0 and stdout.strip():
      for firewall_name in stdout.strip().split('\n'):
        logging.info(f'Deleting orphaned firewall: {firewall_name}')
        del_cmd = gcp_util.GcloudCommand(
            self, 'compute', 'firewall-rules', 'delete', firewall_name
        )
        del_cmd.flags['project'] = project
        del_cmd.Issue(raise_on_failure=False)
  except Exception as e:
    logging.exception(f'Failed to clean up firewalls: {e}')

  # Delete networks
  try:
    cmd = gcp_util.GcloudCommand(
        self, 'compute', 'networks', 'list',
        '--filter', f'name~pkb-network.*{FLAGS.run_uri}',
        '--format', 'value(name)'
    )
    cmd.flags['project'] = project
    stdout, _, retcode = cmd.Issue(raise_on_failure=False)
    if retcode == 0 and stdout.strip():
      for network_name in stdout.strip().split('\n'):
        logging.info(f'Deleting orphaned network: {network_name}')
        del_cmd = gcp_util.GcloudCommand(
            self, 'compute', 'networks', 'delete', network_name
        )
        del_cmd.flags['project'] = project
        del_cmd.Issue(raise_on_failure=False)
  except Exception as e:
    logging.exception(f'Failed to clean up networks: {e}')
```
Key Changes
- Don't trust the `deleted` flag during teardown-only runs
- Validate objects before calling methods on them
- Fall back to direct GCP queries when objects are invalid
- Use `gcloud` commands to delete resources by name pattern
- Delete firewalls before networks (dependency order)
- Change `DisallowAllPorts()` to `Delete()` in the `GceFirewall` class (a sketch follows this list)
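For the last item, here is a rough sketch of what an explicit `Delete()` on the firewall wrapper could look like. The `firewall_rules` attribute is an assumption about how the class stores its rules; the real `GceFirewall` implementation may differ:
```python
def Delete(self):
  """Sketch: explicitly delete every rule owned by this firewall wrapper.

  Assumes the wrapper keeps its rules in a `firewall_rules` dict whose values each
  expose their own Delete() (attribute and method names are assumptions, not the
  verified GceFirewall API).
  """
  for firewall_rule in list(self.firewall_rules.values()):
    firewall_rule.Delete()
  self.firewall_rules.clear()
```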
Recommendation
This should be treated as a critical priority bug because:
- It causes resource leakage and unexpected costs
- It has existed for 8+ years (related to #1239, which describes the same cleanup failure in the object_storage_service benchmark)
- It affects all cloud providers and benchmarks
- The pickle corruption issue makes it worse than initially thought
- It impacts PKB's reliability and trustworthiness