GitLab CI/CD – Retry

This article focuses on the retry keyword in GitLab CI/CD, explaining its utility in enhancing pipeline robustness by automatically re-executing failed jobs. We’ll cover how to configure retry attempts, target specific failure conditions, and apply retries to individual jobs, providing practical examples and best practices for creating more resilient and reliable continuous integration and continuous delivery workflows.

Understanding the retry Keyword in GitLab CI/CD

The retry keyword in GitLab CI/CD allows you to automatically re-execute a job a specified number of times if it fails. This is an incredibly useful feature for dealing with transient failures – issues that are not caused by a fundamental problem in your code or configuration but rather by temporary external factors, such as network glitches, flaky test environments, or brief service outages. By configuring retries, you can make your CI/CD pipelines more resilient and reduce the need for manual intervention.

Why Use the retry Keyword?

The primary reasons to implement the retry keyword include:

  • Handling Transient Failures: Automatically recover from intermittent issues without human intervention.
  • Improving Pipeline Reliability: Increase the success rate of your pipelines by giving jobs multiple chances to pass.
  • Reducing Manual Reruns: Save time and effort by eliminating the need to manually retry jobs that failed due to temporary problems.
  • Stabilizing Flaky Tests: While not a substitute for fixing truly flaky tests, retries can help mitigate their impact on pipeline stability during the diagnosis and resolution phase.

Configuring retry in .gitlab-ci.yml

The retry keyword is defined at the job level within your .gitlab-ci.yml file. It accepts a dictionary of configuration options or a simple integer.

Basic retry Configuration

The simplest way to use retry is to specify an integer, which defines the number of times the job should be retried. The maximum number of retries is 2.

stages:
  - build
  - test

build_job:
  stage: build
  script:
    - echo "Building..."
    - exit 0 # This job will always succeed, no retry needed

flaky_test_job:
  stage: test
  script:
    - echo "Running flaky test..."
    - |
      if (( $RANDOM % 2 == 0 )); then
        echo "Test failed temporarily."
        exit 1 # Simulate a temporary failure
      else
        echo "Test passed."
        exit 0
      fi
  retry: 1 # If this job fails, retry it 1 time (total of 2 attempts)

In the example above, flaky_test_job will attempt to run. If it fails, GitLab CI/CD will automatically rerun it once more. If the second attempt also fails, the job will be marked as failed.

Advanced retry Configuration with max and when

For more granular control, you can use a dictionary with max and when to define the retry behavior.

retry:
  max: <integer> # Maximum number of retry attempts (0 to 2)
  when:          # Condition(s) under which to retry (optional)
    - always
    - unknown_failure
    - script_failure
    - api_failure
    - stuck_or_timeout_failure
    - runner_system_failure
    - missing_dependency_failure
    - runner_unsupported
    - stale_schedule
    - job_execution_timeout
    - archived_failure
    - unmet_prerequisites
    - scheduler_failure
    - data_integrity_failure
  • max: The maximum number of retry attempts. The value must be between 0 and 2 (inclusive). If set to 0, no automatic retries occur.
  • when: (Optional) A single failure condition, or a list of failure conditions, under which the job should be retried. If not specified, the job is retried on any failure (equivalent to always). This is a powerful feature for targeting specific types of transient issues.
    • always: Retry on any failure. (Default if when is omitted.)
    • unknown_failure: The job failed for an unknown reason.
    • script_failure: The job’s script exited with a non-zero status.
    • api_failure: A GitLab API call failed (e.g., during artifact upload/download).
    • stuck_or_timeout_failure: The job got stuck or timed out.
    • runner_system_failure: The runner itself experienced a system error (e.g., job setup failed).
    • missing_dependency_failure: A required artifact or dependency was not found.
    • runner_unsupported: The runner is outdated or does not support the job’s requirements.
    • stale_schedule: A delayed job could not be executed in time.
    • job_execution_timeout: The script exceeded the maximum execution time set for the job.
    • archived_failure: The job is archived and can no longer be run.
    • unmet_prerequisites: The job failed to complete prerequisite tasks.
    • scheduler_failure: The scheduler failed to assign the job to a runner.
    • data_integrity_failure: A data integrity problem was detected with the job.

Note that retry does not provide a built-in delay between attempts: a failed job is simply queued again and starts as soon as a runner picks it up. If you need a backoff for transient issues such as network congestion, implement it inside the script itself.
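
One common workaround is to move the retry-and-backoff logic into the script itself. The sketch below is a minimal illustration of this pattern; the URL, file name, and timings are placeholders:

retry_with_backoff_job:
  stage: build
  script:
    - |
      # Try the flaky download up to 3 times, sleeping 10 seconds between attempts
      for attempt in 1 2 3; do
        if curl --fail --silent --output artifact.tar.gz https://example.com/artifact.tar.gz; then
          echo "Succeeded on attempt $attempt"
          exit 0
        fi
        echo "Attempt $attempt failed; sleeping before the next try..."
        sleep 10
      done
      exit 1 # All attempts failed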

Example with Advanced retry Configuration

stages:
  - test
  - deploy

deploy_to_dev:
  stage: deploy
  script:
    - echo "Attempting to deploy to development environment..."
    - |
      if (( RANDOM % 2 == 0 )); then
        echo "Simulating a temporary network issue on this attempt."
        exit 1
      else
        echo "Deployment successful."
        exit 0
      fi
  retry:
    max: 2 # Total of 3 attempts (initial run + 2 retries)
    when:
      - script_failure # Only retry if the script itself fails
      - runner_system_failure # Or if there's a runner system issue

network_check_job:
  stage: test
  script:
    - ping -c 1 example.com # Simple network check (ping must be available in the job image)
  retry:
    max: 1
    when:
      - script_failure # Only retry if the ping command exits non-zero

In the deploy_to_dev job:

  • It will run up to 3 times (initial run + 2 retries).
  • Retries happen only if the job fails due to a script_failure or a runner_system_failure.

Because the script fails at random, some pipeline runs will exercise the retry path while others pass on the first attempt. Note that, at the time of writing, GitLab does not expose the current attempt number as a predefined CI/CD variable, so a script cannot easily detect whether it is running as a retry.

Important Considerations and Best Practices

  • Do Not Retry Persistent Failures: The retry keyword is not a substitute for fixing fundamental issues in your code, tests, or environment. If a job consistently fails for the same reason, retries will only waste resources. Identify and fix the root cause.
  • Use Script-Level Backoff for Transient Issues: The retry keyword has no built-in delay between attempts, so for issues like network congestion or brief external-service outages, build a short sleep-and-retry loop into the script itself, as sketched in the configuration section above.
  • Choose when Conditions Carefully: Be specific about when a job should be retried (see the example after this list). Retrying on script_failure is common, but you might want to exclude it if your scripts are expected to be robust and failures indicate a deeper problem.
  • Limit max Attempts: Keep the max retry attempts low (usually 1 or 2). Too many retries can significantly slow down your pipeline without providing much benefit if the failure is not transient.
  • Monitor Retried Jobs: Even with retries, keep an eye on jobs that frequently get retried. Frequent retries might indicate an underlying flakiness or instability that needs to be addressed.
  • Consider Impact on Downstream Jobs: If a job eventually succeeds after several retries, ensure that downstream jobs are robust enough to handle any slight delays or inconsistent states that might arise from the retries.
  • Visibility in UI: GitLab’s UI clearly indicates when a job has been retried, showing multiple attempts for a single job and which attempt ultimately passed or failed. This helps in debugging.
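
For example, the following job (names are illustrative) retries only on infrastructure-level failures, so a genuine test failure still surfaces immediately:

integration_tests:
  stage: test
  script:
    - ./run-integration-tests.sh
  retry:
    max: 2
    when:
      - runner_system_failure    # Runner-side error
      - stuck_or_timeout_failure # Job got stuck or timed out
      - api_failure              # GitLab API hiccup (e.g., artifact upload)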

FAQs – GitLab CI/CD retry


What is the retry keyword in GitLab CI/CD?
The retry keyword in GitLab CI/CD allows you to automatically rerun a job if it fails. This is helpful for handling intermittent errors, such as network timeouts, flaky tests, or resource constraints during job execution.


How do I use the retry keyword in a job?
You can add retry to any job definition by specifying the number of retry attempts:

test-job:
  script:
    - run_tests.sh
  retry: 2  # Retry the job up to 2 times if it fails

This means GitLab will attempt to run the job a total of up to 3 times (initial + 2 retries) before marking it as failed.


What is the maximum number of retries allowed in GitLab?
The retry keyword can be set to an integer from 0 to 2. That means a job can be retried a maximum of 2 times automatically.

If you specify a number greater than 2, GitLab will throw a validation error during pipeline compilation.
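
For example, a value outside the allowed range is rejected when the pipeline is created (the exact error wording varies by GitLab version):

invalid-job:
  script:
    - echo "This never runs"
  retry: 5 # Invalid: retry must be 0, 1, or 2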


Can I retry a job only for specific types of failures?
Yes, you can specify conditions for retrying by using the when option along with retry. The available values are:

  • always (default): Retry on any failure.
  • unknown_failure
  • script_failure
  • api_failure
  • stuck_or_timeout_failure
  • runner_system_failure
  • missing_dependency_failure
  • runner_unsupported
  • stale_schedule
  • job_execution_timeout
  • archived_failure
  • unmet_prerequisites
  • scheduler_failure
  • data_integrity_failure

Example:

deploy-job:
  script:
    - ./deploy.sh
  retry:
    max: 1
    when:
      - script_failure
      - api_failure

This will retry the job only if it fails due to a script or API-related error.


How does retry interact with manual job retries in the UI?
The retry keyword defines automatic retries. Even after GitLab has exhausted the automatic retries, you can still manually retry the job from the GitLab UI by clicking the Retry button.


Does retry apply to failed manual or allow_failure jobs?
No. The retry keyword does not apply to:

  • Jobs with when: manual
  • Jobs with allow_failure: true

Manual jobs must be triggered by hand, and allow_failure jobs do not fail the pipeline, so automatic retries are not applied to them.


Can I use retry for jobs in any stage?
Yes, you can use retry for jobs in any stage of the pipeline: build, test, deploy, etc. However, use it thoughtfully—especially for deployment jobs—to avoid unintended side effects like multiple deploys.
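
For example, a deployment job can restrict retries to runner-level failures so that a deploy script that already ran partway is not blindly re-executed (the job name and script are illustrative):

deploy-prod:
  stage: deploy
  script:
    - ./deploy.sh
  retry:
    max: 1
    when:
      - runner_system_failure # Retry only when the runner failed, not the deploy script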


Can I see how many times a job was retried in the GitLab UI?
Yes, each retry attempt is logged in the Job details view in GitLab. You’ll see an indication like “Retry #1” or “Retry #2”, along with the status of each attempt.


Is retry useful for flaky tests or unstable runners?
Absolutely. If your CI runners are occasionally unstable or your test suite has non-deterministic failures (i.e., flaky tests), using retry can reduce spurious failures in your pipelines.

However, it’s better to investigate root causes rather than relying too heavily on retries.


Does retrying a job affect pipeline duration?
Yes. Retries can increase total pipeline time, especially if failures happen frequently. GitLab waits for each retry to complete before moving to the next stage, which can delay overall execution.

Use retry only where it’s necessary to ensure pipeline efficiency.


How do I disable retry for a job?
To explicitly prevent a job from retrying, set retry: 0, although this is redundant because 0 is the default value.

build-job:
  script:
    - make build
  retry: 0

Author

Debjeet Bhowmik

Experienced Cloud & DevOps Engineer with hands-on experience in AWS, GCP, Terraform, Ansible, ELK, Docker, Git, GitLab, Python, PowerShell, and Shell, plus theoretical knowledge of Azure, Kubernetes, and Jenkins. In my free time, I write blogs on ckdbtech.com.
