Alert Engine Guide
Adan edited this page Oct 6, 2025 · 1 revision
The Alert Engine transforms raw container signals into actionable alerts and safe remediation steps. This guide explains the building blocks, workflows, and best practices for crafting reliable rules.
| Concept | Description |
|---|---|
| Rule | A saved definition that listens for events (logs, status changes, performance thresholds) and executes actions. |
| Trigger | One condition per rule (keyword match, metric threshold, or container event) that determines when actions fire. |
| Scope | Determines which containers a rule inspects (all, specific labels/groups, or explicit container IDs). |
| Action | What happens when the trigger fires: notify, restart, stop, kill, start, or run a script. |
| Advanced Settings | The "Advanced Settings" panel in the UI (Gatekeeper & keyword tabs) where cooldowns, verification delays, rate limits, and keyword behaviour are configured. |
- Create: Start from a template or blank rule within the Alert Engine UI.
- Scope: Select containers/groups. Use include/exclude lists for precision.
- Trigger: Choose a trigger type (keywords, container events, metrics) and configure thresholds.
- Actions: Add one or more actions with optional delays between steps.
- Advanced Settings: Configure cooldowns, max executions, verification delays, backoff, and keyword behaviour.
- Activate: Enable the rule. Evaluations begin immediately.
- Review: Inspect alert history, acknowledgements, and audit logs to tune behavior.
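The include/exclude scoping in the Scope step can be pictured as a simple filter. The following is a minimal sketch, assuming glob-style name patterns such as `backend-*`; the function name `resolve_scope` and the choice that excludes take precedence over includes are illustrative assumptions, not documented Alert Engine behaviour.

```python
from fnmatch import fnmatchcase


def resolve_scope(containers, include=("*",), exclude=()):
    """Return container names matched by any include pattern and no exclude pattern.

    Assumption: patterns are shell-style globs and excludes win over includes.
    """
    selected = []
    for name in containers:
        included = any(fnmatchcase(name, pattern) for pattern in include)
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
        if included and not excluded:
            selected.append(name)
    return selected


# Hypothetical container names, for illustration only.
names = ["backend-api", "backend-worker", "db-primary", "backend-canary"]
print(resolve_scope(names, include=["backend-*"], exclude=["*-canary"]))
```

Previewing a scope this way before activating a rule makes it easier to spot containers that an overly broad include pattern would pull in.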
| Trigger | Description | Example |
|---|---|---|
| Log keyword | Matches one or many substrings (ANY/ALL) in container logs. Optional timeline settings require N matches within M minutes before firing. | Alert when OutOfMemoryError appears 3 times in 2 minutes for backend-* containers. |
| Performance metric (LogForge Pro) | Evaluates CPU, memory, or restart counters against a threshold, with optional sustained-time windows. | Trigger when memory usage stays above 85% for 5 minutes. |
| Container event | Reacts to lifecycle events emitted by the LogForge backend (start, stop, die, oom, etc.), with optional "N events in M minutes" thresholds. | Notify when a database container restarts twice within 10 minutes. |
Each rule supports only one trigger type; create additional rules if you need to combine different signal types.
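To make the "N matches within M minutes" behaviour of the log keyword trigger concrete, here is a minimal sliding-window sketch. The class name `KeywordWindow`, the per-line interpretation of ALL, and the reset after firing are assumptions made for illustration; the engine's actual evaluation logic may differ.

```python
from collections import deque


class KeywordWindow:
    """Fire when `threshold` matching log lines occur within `window_seconds`."""

    def __init__(self, keywords, mode="ANY", threshold=3, window_seconds=120):
        if mode not in ("ANY", "ALL"):
            raise ValueError("mode must be 'ANY' or 'ALL'")
        self.keywords = list(keywords)
        self.mode = mode
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps (seconds) of matching lines

    def _matches(self, line):
        # Assumption: ALL means every keyword appears in the same line.
        found = (keyword in line for keyword in self.keywords)
        return any(found) if self.mode == "ANY" else all(found)

    def observe(self, timestamp, line):
        """Record one log line; return True if the trigger should fire."""
        if not self._matches(line):
            return False
        self._hits.append(timestamp)
        # Drop matches that have fallen out of the window.
        while self._hits and timestamp - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        if len(self._hits) >= self.threshold:
            self._hits.clear()  # assumption: start counting afresh after firing
            return True
        return False


# Three OutOfMemoryError lines inside two minutes, as in the table's example.
trigger = KeywordWindow(["OutOfMemoryError"], threshold=3, window_seconds=120)
events = [
    (0, "java.lang.OutOfMemoryError: heap"),
    (50, "request served"),
    (70, "OutOfMemoryError"),
    (110, "OutOfMemoryError"),
]
fired = [trigger.observe(ts, line) for ts, line in events]
print(fired)  # [False, False, False, True]
```

A deque keeps the window update cheap: each timestamp is appended once and removed at most once.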
| Action | Details |
|---|---|
| Notify | Sends the alert payload to one or more channels configured in the Notifier service. Supports templated bodies and includes context (container, rule, timestamps). |
| Restart / Stop / Start / Kill | Executes Docker lifecycle operations via the backend. Guardrails stop repeated restarts if the container fails health checks. |
| Run script | Executes the first executable .sh script found under /logforge-scripts/ inside the container. Ensure the directory exists, scripts are executable, and a shell (/bin/sh) is present. |
| Delay | Chain actions with delays to stage responses (e.g., notify immediately, restart after 30 seconds if not acknowledged). |
Each action has additional safeguards:
- Verification delay: Wait for a steady state before confirming success.
- Max executions: Cap the number of times the action runs within a cooldown window.
- Cooldown: Minimum wait before the rule can fire again.
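The interaction between a cooldown and an execution cap can be sketched as a small gate that every action passes through. This is an illustrative model only: the class name `Guardrail`, the separate `period_seconds` counting period, and the exact ordering of the checks are assumptions, not the engine's implementation.

```python
class Guardrail:
    """Gate an action behind a cooldown and a max-executions cap."""

    def __init__(self, cooldown_seconds, max_executions, period_seconds):
        self.cooldown_seconds = cooldown_seconds
        self.max_executions = max_executions
        self.period_seconds = period_seconds  # assumed counting period for the cap
        self._runs = []  # timestamps of executions that were allowed

    def allow(self, now):
        """Return True and record the run if the action may execute at `now`."""
        # Keep only executions that still fall inside the counting period.
        self._runs = [t for t in self._runs if now - t < self.period_seconds]
        if self._runs and now - self._runs[-1] < self.cooldown_seconds:
            return False  # still cooling down since the last run
        if len(self._runs) >= self.max_executions:
            return False  # cap reached for this period
        self._runs.append(now)
        return True


# 10-minute cooldown, at most 2 executions per hour.
gate = Guardrail(cooldown_seconds=600, max_executions=2, period_seconds=3600)
decisions = [gate.allow(t) for t in (0, 300, 700, 1400, 3700)]
print(decisions)  # [True, False, True, False, True]
```

In the run above, the attempt at 300 s is blocked by the cooldown, the attempt at 1400 s by the cap, and the attempt at 3700 s succeeds because the first execution has aged out of the hour.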
The UI includes templates covering common reliability and security cases:
- Crash loop detection
- High memory or CPU usage
- Log spike / noisy errors
- TLS certificate renewal reminder
- Security keyword detection
- Container start/stop notifications
Templates are editable after import. Use them to ensure guardrails are pre-populated.
Goal: Restart a worker if it throws repeated queue errors and notify Slack.
- Scope: Containers tagged with group `workers`.
- Trigger: Log keyword `Failed to fetch job` with frequency 3 times in 60 seconds.
- Actions:
  - Notify Slack channel `#on-call` (immediate).
  - Delay 30 seconds.
  - Restart container. Verification delay 45 seconds.
  - Notify Slack channel
- Guardrails:
  - Cooldown: 10 minutes.
  - Max executions per hour: 2.
  - Abort if the container was restarted manually in the last 5 minutes.
This pattern avoids restart storms while keeping operators informed.
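To see how the staged actions in this example play out over time, the rule can be written down as plain data and its action chain turned into a timeline. The `Rule` and `Step` types and the `schedule` helper below are hypothetical and do not reflect the Alert Engine's storage format; only the notify, delay, and restart steps are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str          # "notify", "delay", or "restart"
    target: str = ""   # channel name for notify steps
    seconds: int = 0   # delay length, or verification delay for restart


@dataclass
class Rule:
    scope_group: str
    keyword: str
    matches: int
    window_seconds: int
    steps: list = field(default_factory=list)
    cooldown_seconds: int = 0
    max_executions_per_hour: int = 0


def schedule(rule):
    """Return (offset_seconds, description) pairs for the rule's action chain."""
    offset, plan = 0, []
    for step in rule.steps:
        if step.kind == "delay":
            offset += step.seconds
        elif step.kind == "notify":
            plan.append((offset, f"notify {step.target}"))
        elif step.kind == "restart":
            plan.append((offset, "restart container"))
            offset += step.seconds  # wait out the verification delay
            plan.append((offset, "verify restart"))
        else:
            raise ValueError(f"unknown step kind: {step.kind}")
    return plan


worker_rule = Rule(
    scope_group="workers",
    keyword="Failed to fetch job",
    matches=3,
    window_seconds=60,
    steps=[
        Step("notify", target="#on-call"),
        Step("delay", seconds=30),
        Step("restart", seconds=45),
    ],
    cooldown_seconds=600,
    max_executions_per_hour=2,
)

for offset, description in schedule(worker_rule):
    print(f"t+{offset:>3}s  {description}")
```

Under these assumptions the operator is notified at once, the restart happens 30 seconds after the trigger, and the restart is verified 75 seconds after the trigger.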
- The Alerts dashboard shows the latest events, total alert count, and a rolling view of recent triggers.
- Switch to the Stats sub-tab to explore trend charts, rule and container breakdowns, and timeline analytics.
- Free edition retains the most recent alerts (displayed at the top of the page); upgrading lifts that limit for deeper history.
- Use the built-in filters (rule, container, timeframe) to focus on the signals that matter before exporting data manually if needed.
- Verify rule definitions in the Alert Engine UI and confirm the trigger preview matches your intent.
- Review backend logs (`docker compose logs alert-engine-backend`) for evaluation errors or guardrail messages.
- Ensure the Notifier service is reachable if notifications fail; inspect the Notifier dashboard (Logs tab) for recent delivery attempts and response codes.
- For script actions, confirm the container has `/logforge-scripts/` with an executable `.sh` script and that `/bin/sh` is available.
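The script-action check can be automated against any directory you can read, for example the host side of a bind mount. The sketch below is a hedged illustration: the helper name `first_executable_script` is made up, it only looks at the top level of the directory, and treating "first" as first in name order is an assumption, since the engine's real ordering is not specified here.

```python
import os
import tempfile
from pathlib import Path


def first_executable_script(directory):
    """Return the first executable *.sh file in `directory`, or None.

    Assumption: "first" means first in name order, top level only.
    """
    root = Path(directory)
    if not root.is_dir():
        return None
    for path in sorted(root.glob("*.sh")):
        if path.is_file() and os.access(path, os.X_OK):
            return path
    return None


# Demonstration in a throwaway directory standing in for /logforge-scripts/.
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp, "a-notes.sh")
    plain.write_text("#!/bin/sh\necho skipped\n")
    plain.chmod(0o644)  # not executable, so it is passed over

    runnable = Path(tmp, "b-cleanup.sh")
    runnable.write_text("#!/bin/sh\necho ok\n")
    runnable.chmod(0o755)

    found = first_executable_script(tmp)
    print(found.name if found else "no executable script found")
```

A `None` result points to one of the two common causes named above: the directory is missing, or no `.sh` file in it has its execute bit set.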
Advance to the Automation Playbooks for high-level strategies that combine multiple rules.