
Alert Engine Guide

Adan edited this page Oct 6, 2025 · 1 revision

The Alert Engine transforms raw container signals into actionable alerts and safe remediation steps. This guide explains the building blocks, workflows, and best practices for crafting reliable rules.

Core concepts

| Concept | Description |
| --- | --- |
| Rule | A saved definition that listens for events (logs, status changes, performance thresholds) and executes actions. |
| Trigger | One condition per rule (keyword match, metric threshold, or container event) that determines when actions fire. |
| Scope | Determines which containers a rule inspects (all, specific labels/groups, or explicit container IDs). |
| Action | What happens when the trigger fires: notify, restart, stop, kill, start, or run a script. |
| Advanced Settings | The "Advanced Settings" panel in the UI (Gatekeeper & keyword tabs) where cooldowns, verification delays, rate limits, and keyword behaviour are configured. |

Rule lifecycle

  1. Create: Start from a template or blank rule within the Alert Engine UI.
  2. Scope: Select containers/groups. Use include/exclude lists for precision.
  3. Trigger: Choose a trigger type (keywords, container events, metrics) and configure thresholds.
  4. Actions: Add one or more actions with optional delays between steps.
  5. Advanced Settings: Configure cooldowns, max executions, verification delays, backoff, and keyword behaviour.
  6. Activate: Enable the rule. Evaluations begin immediately.
  7. Review: Inspect alert history, acknowledgements, and audit logs to tune behavior.
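
To make the Scope step more concrete, here is a minimal Python sketch of how include/exclude lists could narrow a rule to a set of containers. It is not LogForge's implementation: the `Container` and `Scope` classes, their field names, and the "excludes always win" precedence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    id: str
    groups: set[str] = field(default_factory=set)


@dataclass
class Scope:
    # Hypothetical fields; the real rule schema may differ.
    include_groups: set[str] = field(default_factory=set)
    include_ids: set[str] = field(default_factory=set)
    exclude_ids: set[str] = field(default_factory=set)

    def matches(self, c: Container) -> bool:
        if c.id in self.exclude_ids:
            return False  # assumption: an exclude entry always wins
        if not self.include_groups and not self.include_ids:
            return True   # no include filter: the rule inspects every container
        return c.id in self.include_ids or bool(c.groups & self.include_groups)
```

With `Scope(include_groups={"workers"}, exclude_ids={"w2"})`, a container `w1` in group `workers` is in scope, `w2` is skipped even though it shares the group, and a container in group `db` is ignored.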

Trigger types

| Trigger | Description | Example |
| --- | --- | --- |
| Log keyword | Matches one or many substrings (ANY/ALL) in container logs. Optional timeline settings require N matches within M minutes before firing. | Alert when OutOfMemoryError appears 3 times in 2 minutes for backend-* containers. |
| Performance metric (LogForge Pro) | Evaluates CPU, memory, or restart counters against a threshold, with optional sustained-time windows. | Trigger when memory usage stays above 85% for 5 minutes. |
| Container event | Reacts to lifecycle events emitted by the LogForge backend (start, stop, die, oom, etc.), with optional "N events in M minutes" thresholds. | Notify when a database container restarts twice within 10 minutes. |

Each rule supports only one trigger type; create additional rules if you need to combine different signal types.
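
The "N matches within M minutes" behaviour can be pictured as a sliding window over match timestamps. The sketch below is one plausible way to model it, not the engine's actual code; `line_matches` and `WindowedTrigger` are hypothetical names, and the ANY/ALL handling is assumed to be plain substring matching as described in the table.

```python
from collections import deque
from datetime import datetime, timedelta


def line_matches(line: str, keywords: list[str], mode: str = "ANY") -> bool:
    """Substring match: ANY needs one keyword present, ALL needs every keyword."""
    hits = (kw in line for kw in keywords)
    return all(hits) if mode == "ALL" else any(hits)


class WindowedTrigger:
    """Fires once `threshold` matches fall within the trailing `window`."""

    def __init__(self, threshold: int, window: timedelta) -> None:
        self.threshold = threshold
        self.window = window
        self._hits: deque[datetime] = deque()

    def record(self, when: datetime) -> bool:
        """Register one match and report whether the trigger should fire."""
        self._hits.append(when)
        # Drop matches that have aged out of the window.
        while self._hits and when - self._hits[0] > self.window:
            self._hits.popleft()
        return len(self._hits) >= self.threshold
```

For the OutOfMemoryError example, `WindowedTrigger(3, timedelta(minutes=2))` stays quiet for the first two matching lines and fires on the third only if all three arrived within two minutes of each other. The same shape would apply to "N events in M minutes" for container events.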

Actions

| Action | Details |
| --- | --- |
| Notify | Sends the alert payload to one or more channels configured in the Notifier service. Supports templated bodies and includes context (container, rule, timestamps). |
| Restart / Stop / Start / Kill | Executes Docker lifecycle operations via the backend. Guardrails stop repeated restarts if the container fails health checks. |
| Run script | Executes the first executable .sh script found under /logforge-scripts/ inside the container. Ensure the directory exists, scripts are executable, and a shell (/bin/sh) is present. |
| Delay | Chain actions with delays to stage responses (e.g., notify immediately, restart after 30 seconds if not acknowledged). |

Each action has additional safeguards:

  • Verification delay: Wait for a steady state before confirming success.
  • Max executions: Cap the number of times the action runs within a cooldown window.
  • Cooldown: Minimum wait before the rule can fire again.
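
As a rough illustration of how a cooldown and an execution cap interact, consider the sketch below. It is an assumption-laden model rather than the engine's logic: the `Guardrail` class is hypothetical, and it treats the cap as applying over its own rolling window (`cap_window`), which may not match how LogForge counts executions.

```python
from datetime import datetime, timedelta


class Guardrail:
    """Allows an action only if the cooldown has elapsed and the cap is not reached."""

    def __init__(self, cooldown: timedelta, max_executions: int,
                 cap_window: timedelta) -> None:
        self.cooldown = cooldown
        self.max_executions = max_executions
        self.cap_window = cap_window  # assumed rolling window for the cap
        self._runs: list[datetime] = []

    def allow(self, now: datetime) -> bool:
        # Forget executions that are older than the cap window.
        self._runs = [t for t in self._runs if now - t < self.cap_window]
        if self._runs and now - self._runs[-1] < self.cooldown:
            return False  # still cooling down since the last execution
        if len(self._runs) >= self.max_executions:
            return False  # cap reached for this window
        self._runs.append(now)
        return True
```

With a 10-minute cooldown and a cap of 2 per hour, a second trigger 5 minutes after the first is suppressed by the cooldown, while a third trigger 25 minutes in is suppressed by the cap even though the cooldown has passed.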

Templates

The UI includes templates covering common reliability and security cases:

  • Crash loop detection
  • High memory or CPU usage
  • Log spike / noisy errors
  • TLS certificate renewal reminder
  • Security keyword detection
  • Container start/stop notifications

Templates are editable after import. Use them to ensure guardrails are pre-populated.

Building a rule: example

Goal: Restart a worker if it throws repeated queue errors and notify Slack.

  1. Scope: Containers tagged with group workers.
  2. Trigger: Log keyword Failed to fetch job with frequency 3 times in 60 seconds.
  3. Actions:
    • Notify Slack channel #on-call (immediate).
    • Delay 30 seconds.
    • Restart container. Verification delay 45 seconds.
  4. Guardrails:
    • Cooldown: 10 minutes.
    • Max executions per hour: 2.
    • Abort if the container was restarted manually in the last 5 minutes.

This pattern avoids restart storms while keeping operators informed.
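
Rules are built in the UI, but it can help to see the same example written out as data. The sketch below restates it with hypothetical `Rule` and `Action` dataclasses and walks the action chain in order; none of these names or fields come from LogForge, and the runner only records what it would do rather than calling Docker or Slack. The manual-restart abort condition is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    kind: str          # "notify", "delay", or "restart"
    target: str = ""   # e.g. a channel name; hypothetical field
    seconds: int = 0   # used by "delay" steps only


@dataclass
class Rule:
    name: str
    scope_group: str
    keyword: str
    matches: int
    window_seconds: int
    actions: list[Action] = field(default_factory=list)
    cooldown_seconds: int = 600
    max_executions_per_hour: int = 2


def run_actions(rule: Rule, sleep: Callable[[float], None]) -> list[str]:
    """Walk the action chain in order and return a log of what was done."""
    log = []
    for action in rule.actions:
        if action.kind == "delay":
            sleep(action.seconds)
            log.append(f"waited {action.seconds}s")
        else:
            log.append(f"{action.kind} {action.target}".strip())
    return log


worker_rule = Rule(
    name="restart-stuck-worker",
    scope_group="workers",
    keyword="Failed to fetch job",
    matches=3,
    window_seconds=60,
    actions=[
        Action("notify", target="#on-call"),
        Action("delay", seconds=30),
        Action("restart"),
    ],
)
```

Passing `time.sleep` as the `sleep` argument would pause for real; passing a no-op such as `lambda s: None` makes the chain easy to dry-run.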

Alert history & insights

  • The Alerts dashboard shows the latest events, total alert count, and a rolling view of recent triggers.
  • Switch to the Stats sub-tab to explore trend charts, rule and container breakdowns, and timeline analytics.
  • The free edition retains only the most recent alerts (displayed at the top of the page); upgrading lifts that limit for deeper history.
  • Use the built-in filters (rule, container, timeframe) to narrow the view to the signals that matter. Export the data manually if needed.

Troubleshooting rules

  • Verify rule definitions in the Alert Engine UI and confirm the trigger preview matches your intent.
  • Review backend logs (docker compose logs alert-engine-backend) for evaluation errors or guardrail messages.
  • Ensure the Notifier service is reachable if notifications fail; inspect the Notifier dashboard (Logs tab) for recent delivery attempts and response codes.
  • For script actions, confirm the container has /logforge-scripts/ with an executable .sh script and that /bin/sh is available.
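
The last check can be automated wherever Python is available inside the container or against a mounted copy of its filesystem. This is a small sketch, not a LogForge tool: `find_runnable_script` is a hypothetical helper, and picking the first match in sorted filename order is an assumption, since the guide does not say how the engine orders scripts.

```python
import os
from pathlib import Path
from typing import Optional


def find_runnable_script(script_dir: str = "/logforge-scripts") -> Optional[Path]:
    """Return the first executable .sh file in script_dir, or None."""
    directory = Path(script_dir)
    if not directory.is_dir():
        return None  # directory missing: the Run script action has nothing to run
    # Assumption: "first" means first in sorted filename order.
    for candidate in sorted(directory.glob("*.sh")):
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return candidate
    return None
```

A `None` result points to one of the documented causes: the directory is absent, it holds no `.sh` files, or none of them has the executable bit set. The presence of `/bin/sh` still needs to be checked separately.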

Continue to the Automation Playbooks for high-level strategies that combine multiple rules.
