New Feature

Smart Performance Insights for Apache Spark

ILUM now surfaces actionable runtime insights for Spark jobs, so you can spot regressions, memory pressure, and planner optimizations without digging through logs.

What’s new
  • Job health summary
    • Job Running Efficiently when no critical issues are detected.
    • Slow Job Detected when a run is N× slower than its rolling baseline (e.g., 4.0× longer than average), with a pointer to investigate data skew, resource constraints, or inefficient operations.
  • Stage-level memory hotspots
    Flags memory-intensive stages (e.g., “4 stages used significant memory; peak 1640 MB in stage 73”) so you can tune executor memory or operators before spills/OOM.
  • Planner/Optimizer signal
    Stage Optimization Detected when Spark (e.g., Adaptive Query Execution) skips stages or coalesces partitions (“42 stages skipped”), evidence that your query plan is being optimized.
  • Inline recommendations
    Each finding ships with a short, concrete recommendation (monitor, check skew, consider memory tuning). No fluff.
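To make the stage-level hotspot check concrete, here is a minimal sketch of flagging memory-intensive stages. The field names (`stage_id`, `peak_memory_mb`), the helper name, and the 1024 MB threshold are illustrative, not ILUM's actual implementation; in practice the per-stage peaks would come from Spark's metrics (e.g., the `peakExecutionMemory` task metric exposed by the History Server REST API).

```python
def flag_memory_hotspots(stages, threshold_mb=1024):
    """Return stages whose peak memory meets or exceeds threshold_mb,
    sorted so the biggest hotspot comes first. Illustrative sketch."""
    hot = [s for s in stages if s["peak_memory_mb"] >= threshold_mb]
    hot.sort(key=lambda s: s["peak_memory_mb"], reverse=True)
    return hot

# Hypothetical per-stage records, mirroring the example in the release note
stages = [
    {"stage_id": 73, "peak_memory_mb": 1640},
    {"stage_id": 12, "peak_memory_mb": 1200},
    {"stage_id": 5,  "peak_memory_mb": 300},
]
hot = flag_memory_hotspots(stages)
# hot[0] is stage 73 with the 1640 MB peak, matching the sample finding above
```

A surfaced finding would then summarize `hot` ("2 stages used significant memory; peak 1640 MB in stage 73") instead of dumping raw metrics.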

How it works (brief)
  • Builds a per-job baseline from recent runs and flags deviations using robust statistics (e.g., medians rather than single-run variance).
  • Reads stage/task metrics to locate peaks (memory, CPU time) and to count skipped/coalesced stages from the optimizer.
  • Keeps noise low with thresholds and minimum observation windows.
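The baseline logic above can be sketched as follows. This is a minimal illustration under stated assumptions, not ILUM's actual code: durations are in seconds, the robust statistic is a rolling median (so one outlier run cannot skew the baseline), and the 4.0× ratio and 5-run minimum window are example thresholds.

```python
from statistics import median

def is_slow(run_seconds, history, ratio_threshold=4.0, min_runs=5):
    """Flag a run as slow when it takes ratio_threshold x the rolling
    median of recent runs. Returns (flagged, ratio). Illustrative sketch."""
    if len(history) < min_runs:      # minimum observation window keeps noise low
        return False, None
    baseline = median(history)       # robust to a single outlier run
    ratio = run_seconds / baseline
    return ratio >= ratio_threshold, round(ratio, 1)

history = [100, 110, 95, 105, 102]   # recent run durations, seconds
print(is_slow(420, history))         # (True, 4.1): 4.1x the 102 s median
```

Gating on a minimum window and a ratio threshold is what keeps the insight from firing on ordinary run-to-run variance.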

Where to find it
Job → Analytics (appears after the first complete run with metrics).

Why it helps
  • Faster triage of Spark performance regressions (latency spikes vs. baseline).
  • Early warning on memory-intensive stages that precede spills/OOM.
  • Confirmation that Adaptive Query Execution is producing real wins (skipped stages), not just toggled on.