Summary
Spark jobs failed to launch
Root Cause
The job execution platform used by our service failed to scale up with increased load, which caused our execution environment to transition into a standby state.
Timeline (UTC)
20:47 Automated alarms notified Trifacta engineers of an increase in the job execution failure rate
21:30 Trifacta engineers freed up resources on the execution environment and transitioned the state back to active
21:52 Trifacta engineers confirmed that jobs were completing as expected
22:00 Service returned to normal operation
Mitigation & Next Steps
To address the immediate issue, the team freed up resources on the execution environment so that jobs could complete successfully. We plan to adjust the scale-out thresholds so that the platform adds capacity more aggressively as demand increases.
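
As a rough illustration of what a more aggressive scale-out threshold means, the sketch below uses a simple utilization-based rule. It is not the platform's actual autoscaling logic: the function name, its parameters, and the numbers in the usage lines are all hypothetical and chosen only for illustration.

```python
import math


def desired_workers(current_workers, utilization, scale_out_threshold, max_workers):
    """Return a target worker count for a hypothetical execution cluster.

    utilization and scale_out_threshold are fractions between 0 and 1.
    All names here are illustrative assumptions, not the real platform's API.
    """
    # At or below the threshold, keep the current capacity.
    if utilization <= scale_out_threshold:
        return current_workers
    # Above the threshold, add enough workers to bring utilization
    # back down to the threshold, without exceeding the configured cap.
    target = math.ceil(current_workers * utilization / scale_out_threshold)
    return min(target, max_workers)


# Same load, two thresholds (hypothetical values).
conservative = desired_workers(10, 0.75, 0.9, 50)  # threshold not yet reached
aggressive = desired_workers(10, 0.75, 0.6, 50)    # threshold already exceeded
print(conservative, aggressive)
```

With the same load, the lower threshold triggers a scale-out while the higher one does not, which is the behavior change the planned adjustment aims for.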