Spark jobs do not run properly

Incident Report for Alteryx

Postmortem

Summary
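Spark jobs failed to launch on Jan 04, 2021, from roughly 20:47 UTC (first automated alarm) until service returned to normal at 22:00 UTC. Photon jobs continued to run normally.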

Root Cause
The job execution platform used by our service failed to scale up with increased load, which caused our execution environment to transition into a standby state.

Timeline (UTC)
20:47 Automated alarms notified Trifacta engineers of an increase in the job execution failure rate
21:30 Trifacta engineers freed up resources on the execution environment and transitioned its state back to active
21:52 Trifacta engineers confirmed that job execution was completing as expected
22:00 Service returned to normal operation

Mitigation & Next Steps
To address the immediate issue, the team freed up resources on the execution environment so that jobs could complete successfully. We plan to adjust the scale-out thresholds so that the platform adds capacity more aggressively when demand increases.
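
As a rough illustration of what a more aggressive scale-out threshold means, the sketch below compares two threshold-based policies. It is not the configuration or code of the job execution platform. The ScaleOutPolicy class, the desired_workers function, and all numbers are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class ScaleOutPolicy:
    """Hypothetical policy: add workers once utilization crosses a threshold."""
    utilization_threshold: float  # fraction of capacity in use that triggers scale-out
    step: int                     # workers added per scale-out decision
    max_workers: int              # hard upper bound on cluster size


def desired_workers(policy, current_workers, running_jobs, jobs_per_worker):
    """Return the worker count the policy asks for, given the current load."""
    capacity = current_workers * jobs_per_worker
    # With no capacity at all, treat the cluster as fully utilized.
    utilization = running_jobs / capacity if capacity else 1.0
    if utilization >= policy.utilization_threshold:
        return min(current_workers + policy.step, policy.max_workers)
    return current_workers


# Illustrative values only (assumptions, not production settings).
conservative = ScaleOutPolicy(utilization_threshold=0.9, step=1, max_workers=20)
aggressive = ScaleOutPolicy(utilization_threshold=0.7, step=2, max_workers=20)

# 4 workers with 4 job slots each give 16 slots; 12 running jobs is 75% utilization.
print(desired_workers(conservative, 4, 12, 4))  # 4: below the 90% threshold, no scale-out
print(desired_workers(aggressive, 4, 12, 4))    # 6: above the 70% threshold, adds 2 workers
```

With the same load, the lower threshold and larger step add capacity earlier, which is the general effect a more aggressive scale-out setting is meant to have.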

Posted Jan 08, 2021 - 23:15 UTC

Resolved

This incident has been resolved.

Posted Jan 04, 2021 - 22:00 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 04, 2021 - 21:34 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 04, 2021 - 21:20 UTC

Investigating

We are currently investigating this issue. Photon jobs continue to run normally.

Posted Jan 04, 2021 - 21:19 UTC