LLD Hub

Job Scheduler Low Level Design — Distributed Cron LLD Interview Guide

Design a job scheduling system with cron support, priority queues, retry with backoff, and distributed workers. A senior-level LLD problem asked at Amazon and Flipkart.

16 April 2025·10 min read

Job Scheduler (Distributed Cron) is a senior-level Low Level Design problem asked at Amazon, Flipkart, and Uber. It requires priority queue scheduling, cron expression parsing, retry with exponential backoff, and distributed worker coordination. This guide covers the complete Job Scheduler LLD with Java code, class diagram, and interview FAQ.

Why Interviewers Ask Job Scheduler LLD

Job scheduling combines data structures, concurrency, and distributed systems thinking. Interviewers want to see:

  • Do you use PriorityQueue to efficiently find the next job to run?
  • Can you design a Job with cron expression, retry config, and execution history?
  • Do you use Command pattern to decouple job definition from execution?
  • Can you prevent duplicate execution when multiple workers compete for the same job?
  • Do you implement exponential backoff for failed jobs without blocking other jobs?

Functional Requirements

  • Schedule a one-time job to run at a specific time
  • Schedule a recurring job using a cron expression (e.g., every 5 minutes)
  • Jobs have a priority — higher priority jobs run first when multiple are due simultaneously
  • Failed jobs retry with exponential backoff (max 3 attempts)
  • Cancel a scheduled job
  • View job status: PENDING, RUNNING, SUCCESS, FAILED, CANCELLED
  • Workers are distributed — multiple machines can pick up jobs

Non-Functional Requirements

  • No two workers may execute the same job instance at the same time
  • Missed jobs (scheduler was down) must be caught up or skipped based on policy
  • Adding a new job type must not change the scheduler core (OCP)
  • Scheduler must handle 10,000 jobs per minute

Core Entities — Job Scheduler LLD Class Design

  • Job — id, name, type, priority, schedule (cron or one-time), retryConfig, status
  • JobInstance — id, jobId, scheduledAt, startedAt, completedAt, attempt, status
  • JobCommand — interface; execute() — Command pattern for job logic
  • JobScheduler — main loop; picks next due job from priority queue
  • WorkerPool — thread pool that executes job commands
  • DistributedLock — ensures one worker per job instance (Redis SETNX)
  • RetryPolicy — maxAttempts, backoffMultiplier
  • CronExpression — parses cron string, computes next run time

Text-Based Class Diagram

Job
+-- id, name: String
+-- cronExpression: String  (null for one-time)
+-- nextRunTime: LocalDateTime
+-- priority: int (higher = more urgent)
+-- retryPolicy: RetryPolicy
+-- status: JobStatus (ACTIVE/PAUSED/CANCELLED)
+-- commandClass: String  (class name to instantiate)

JobInstance
+-- id, jobId: String
+-- scheduledAt, startedAt, completedAt: LocalDateTime
+-- attempt: int
+-- status: InstanceStatus (PENDING/RUNNING/SUCCESS/FAILED)
+-- errorMessage: String

JobCommand (interface)
+-- execute(JobContext): void

RetryPolicy
+-- maxAttempts: int
+-- initialDelaySeconds: long
+-- backoffMultiplier: double

CronExpression
+-- expression: String
+-- getNextFireTime(from: LocalDateTime): LocalDateTime

JobScheduler
+-- queue: PriorityQueue<Job>  (ordered by nextRunTime, then priority)
+-- start(): void  (main loop)
+-- schedule(job): void
+-- cancel(jobId): void

Command Pattern — Job Execution

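import org.springframework.context.ApplicationContext; // assumes a Spring container

public interface JobCommand {
    void execute(JobContext context) throws Exception;
}

// Example: email report job. ReportService and EmailService are application
// services supplied through constructor injection.
public class SendDailyReportCommand implements JobCommand {
    private final ReportService reportService;
    private final EmailService emailService;

    public SendDailyReportCommand(ReportService reportService, EmailService emailService) {
        this.reportService = reportService;
        this.emailService = emailService;
    }

    @Override
    public void execute(JobContext context) throws Exception {
        String reportDate = context.getParam("reportDate");
        Report report = reportService.generateDailyReport(reportDate);
        emailService.send(report);
    }
}

// Command factory resolves a stored class name to a container-managed instance
public class JobCommandFactory {
    private final ApplicationContext applicationContext;

    public JobCommandFactory(ApplicationContext applicationContext) {
        this.applicationContext = applicationContext;
    }

    public JobCommand create(String commandClass) {
        try {
            // asSubclass rejects class names that do not implement JobCommand
            Class<? extends JobCommand> clazz =
                Class.forName(commandClass).asSubclass(JobCommand.class);
            return applicationContext.getBean(clazz);
        } catch (ClassNotFoundException | ClassCastException e) {
            throw new UnknownJobTypeException(commandClass);
        }
    }
}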

JobScheduler — Priority Queue Main Loop

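import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.UUID;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;

public class JobScheduler {
    // Ordered by nextRunTime, then higher priority first. PriorityBlockingQueue is
    // used because worker threads add retries while the scheduler thread polls.
    private final PriorityBlockingQueue<Job> queue = new PriorityBlockingQueue<>(
        11,
        Comparator.comparing(Job::getNextRunTime)
                  .thenComparing(Comparator.comparingInt(Job::getPriority).reversed())
    );
    private final WorkerPool workerPool;
    private final DistributedLock distributedLock;
    private final JobRepository jobRepo;
    private final JobInstanceRepository instanceRepo; // assumed repository type
    private final JobCommandFactory commandFactory;
    private volatile boolean running = true;

    public JobScheduler(WorkerPool workerPool, DistributedLock distributedLock,
                        JobRepository jobRepo, JobInstanceRepository instanceRepo,
                        JobCommandFactory commandFactory) {
        this.workerPool = workerPool;
        this.distributedLock = distributedLock;
        this.jobRepo = jobRepo;
        this.instanceRepo = instanceRepo;
        this.commandFactory = commandFactory;
    }

    public void start() {
        // Load all active jobs from DB into queue
        jobRepo.findByStatus(JobStatus.ACTIVE).forEach(queue::add);

        while (running) {
            Job head = queue.peek();
            if (head == null || head.getNextRunTime().isAfter(LocalDateTime.now())) {
                try {
                    Thread.sleep(1000); // check every second
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag and stop
                    return;
                }
                continue;
            }

            Job job = queue.poll();
            if (job == null) continue;
            LocalDateTime scheduledAt = job.getNextRunTime();

            // Re-queue a recurring job before trying the lock, so that losing the
            // lock does not drop the job from this node. A copy is queued so the
            // worker thread never sees the job mutate underneath it.
            if (job.getCronExpression() != null) {
                Job nextRun = job.copy();
                nextRun.setNextRunTime(CronExpression.parse(job.getCronExpression())
                    .getNextFireTime(LocalDateTime.now()));
                queue.add(nextRun);
            }

            // Distributed lock keyed on the instance (job id + scheduled time).
            // The TTL should exceed the longest expected run time of the job.
            String lockKey = "job-lock:" + job.getId() + ":" + scheduledAt;
            if (!distributedLock.tryAcquire(lockKey, 30, TimeUnit.SECONDS)) {
                continue; // another worker got it
            }

            workerPool.submit(() -> executeJob(job, scheduledAt, lockKey));
        }
    }

    private void executeJob(Job job, LocalDateTime scheduledAt, String lockKey) {
        // job.getAttempt() is an assumed field: 1 for a normal run, higher for retries
        JobInstance instance = new JobInstance(UUID.randomUUID().toString(), job.getId(),
            scheduledAt, LocalDateTime.now(), null, job.getAttempt(), InstanceStatus.RUNNING);
        instanceRepo.save(instance);

        try {
            JobCommand command = commandFactory.create(job.getCommandClass());
            command.execute(new JobContext(job.getParams()));
            instance.setStatus(InstanceStatus.SUCCESS);
        } catch (Exception e) {
            instance.setStatus(InstanceStatus.FAILED);
            instance.setErrorMessage(e.getMessage());
            scheduleRetry(job, instance);
        } finally {
            instance.setCompletedAt(LocalDateTime.now());
            instanceRepo.save(instance);
            distributedLock.release(lockKey);
        }
    }

    private void scheduleRetry(Job job, JobInstance instance) {
        RetryPolicy policy = job.getRetryPolicy();
        if (instance.getAttempt() >= policy.getMaxAttempts()) return;

        long delaySeconds = (long) (policy.getInitialDelaySeconds()
            * Math.pow(policy.getBackoffMultiplier(), instance.getAttempt() - 1));

        Job retryJob = job.copy();
        retryJob.setCronExpression(null);               // one-shot: the recurring series is already queued
        retryJob.setAttempt(instance.getAttempt() + 1); // assumed setter; carries the attempt count forward
        retryJob.setNextRunTime(LocalDateTime.now().plusSeconds(delaySeconds));
        queue.add(retryJob);
    }
}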

Key Design Decisions

  • PriorityQueue ordered by nextRunTime then priority: The scheduler only needs to check one element — the head of the queue — on each tick. If the head's nextRunTime is in the future, all others are too. O(log n) insertion and O(1) peek.
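  • Distributed lock prevents double execution: In a multi-worker environment, multiple schedulers may run simultaneously. A Redis lock taken with set-if-absent semantics and a TTL (SETNX, or SET with the NX and PX options so the key and its expiry are written atomically) ensures only one worker executes a job instance. The lock expires automatically if the worker crashes mid-job, so the TTL must be longer than the job's expected run time; otherwise the lock can lapse while the job is still running and a second worker can start it.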
  • Command pattern for extensibility: The scheduler stores a class name (String) for each job, not a closure or lambda. New job types are added by implementing JobCommand — no changes to JobScheduler. The factory resolves the class at runtime.
  • Retry as a new job instance: Failed jobs are re-added to the priority queue as a new entry with a delayed nextRunTime. This avoids blocking the queue while waiting for the retry delay and keeps retry state clean in JobInstance.
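
To illustrate the extensibility point, here is a minimal sketch of a factory that never changes when a job type is added. It swaps the reflection-based lookup for an explicit registry keyed by a type name; CommandRegistry and SimpleJobCommand are hypothetical names, and the command takes a plain parameter map instead of the guide's JobContext to keep the example self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Simplified stand-in for JobCommand (hypothetical; uses a Map instead of JobContext)
interface SimpleJobCommand {
    void execute(Map<String, String> params) throws Exception;
}

// Registry-based factory: new job types register themselves, the scheduler core is untouched
final class CommandRegistry {
    private final Map<String, Supplier<SimpleJobCommand>> suppliers = new ConcurrentHashMap<>();

    void register(String type, Supplier<SimpleJobCommand> supplier) {
        // putIfAbsent returns the previous mapping, so non-null means a duplicate type
        if (suppliers.putIfAbsent(type, supplier) != null) {
            throw new IllegalArgumentException("duplicate job type: " + type);
        }
    }

    SimpleJobCommand create(String type) {
        Supplier<SimpleJobCommand> supplier = suppliers.get(type);
        if (supplier == null) {
            throw new IllegalArgumentException("unknown job type: " + type);
        }
        return supplier.get(); // a fresh command per execution
    }
}
```

Compared with storing a class name, a registry trades some flexibility (types must be registered at startup) for the guarantee that only known commands can ever be instantiated.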

Common Follow-Up Questions

  • "What happens to jobs if the scheduler crashes?": On restart, load all ACTIVE jobs from the DB. Jobs whose nextRunTime is in the past are either run immediately (catch-up mode) or skipped and rescheduled to the next future fire time (skip-missed mode), configurable per job.
  • "How do you implement cron expression parsing?": Use a library such as Quartz's CronExpression in production; note that Quartz expressions add a leading seconds field. In an interview, describe the five standard cron fields (minute, hour, day of month, month, day of week) and show how to compute the next fire time by advancing a candidate time until every field matches.
  • "How do you scale the scheduler to handle 100,000 jobs?": Partition jobs by a consistent hash of jobId across multiple scheduler nodes, so that each node owns a subset of jobs. Coordinator election through ZooKeeper or etcd handles node failures and rebalancing.
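
The cron answer can be sketched as follows. This is a deliberately simplified matcher, not a full cron implementation: SimpleCron is a hypothetical class that supports only "*", "*/n", and a single number per field, requires all five fields to match (standard cron treats day of month and day of week as an either-or when both are restricted), and finds the next fire time by stepping one minute at a time rather than field by field.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Simplified five-field cron: minute, hour, day of month, month, day of week (0 = Sunday)
final class SimpleCron {
    private static final int[] FIELD_MIN = {0, 0, 1, 1, 0}; // lowest legal value per field
    private final String[] fields;

    SimpleCron(String expression) {
        fields = expression.trim().split("\\s+");
        if (fields.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + expression);
        }
    }

    private static boolean fieldMatches(String field, int value, int min) {
        if (field.equals("*")) return true;
        if (field.startsWith("*/")) {
            // A step counts from the field's lowest value, e.g. */5 on minutes is 0, 5, 10, ...
            return (value - min) % Integer.parseInt(field.substring(2)) == 0;
        }
        return Integer.parseInt(field) == value;
    }

    boolean matches(LocalDateTime t) {
        int[] values = {
            t.getMinute(), t.getHour(), t.getDayOfMonth(), t.getMonthValue(),
            t.getDayOfWeek().getValue() % 7 // java.time: Monday=1..Sunday=7, cron: Sunday=0
        };
        for (int i = 0; i < 5; i++) {
            if (!fieldMatches(fields[i], values[i], FIELD_MIN[i])) return false;
        }
        return true;
    }

    // First matching minute strictly after 'from'; brute force, capped at five years
    LocalDateTime nextFireTime(LocalDateTime from) {
        LocalDateTime candidate = from.truncatedTo(ChronoUnit.MINUTES).plusMinutes(1);
        LocalDateTime limit = candidate.plusYears(5);
        while (candidate.isBefore(limit)) {
            if (matches(candidate)) return candidate;
            candidate = candidate.plusMinutes(1);
        }
        throw new IllegalStateException("no fire time within 5 years");
    }
}
```

For example, new SimpleCron("*/5 * * * *").nextFireTime(LocalDateTime.of(2025, 4, 16, 10, 2, 30)) returns 2025-04-16T10:05. A production parser would also handle lists, ranges, and names, and would jump directly to the next candidate value of each field instead of scanning minutes.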

FAQ — Job Scheduler Low Level Design

What data structure should you use for a job scheduler?

A min-heap (Java PriorityQueue) ordered by nextRunTime. Peek returns the job due soonest in O(1). Insertion and removal are O(log n). For a distributed scheduler, a database table with an index on nextRunTime and a row-level lock serves as a persistent priority queue.
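
A small sketch of that ordering, assuming a hypothetical QueuedJob class with just a name, a nextRunTime, and a priority. It shows that jobs come out of the heap by due time first and that, among jobs due at the same moment, the higher priority wins.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical minimal job type, for illustration only
final class QueuedJob {
    private final String name;
    private final LocalDateTime nextRunTime;
    private final int priority;

    QueuedJob(String name, LocalDateTime nextRunTime, int priority) {
        this.name = name;
        this.nextRunTime = nextRunTime;
        this.priority = priority;
    }

    String getName() { return name; }
    LocalDateTime getNextRunTime() { return nextRunTime; }
    int getPriority() { return priority; }
}

final class JobQueueDemo {
    // Earliest nextRunTime first; ties broken by higher priority.
    // reversed() avoids negating the priority, which overflows for Integer.MIN_VALUE.
    static final Comparator<QueuedJob> ORDER =
        Comparator.comparing(QueuedJob::getNextRunTime)
                  .thenComparing(Comparator.comparingInt(QueuedJob::getPriority).reversed());

    // Returns job names in the order the scheduler would dequeue them
    static List<String> drainOrder(List<QueuedJob> jobs) {
        PriorityQueue<QueuedJob> queue = new PriorityQueue<>(ORDER);
        queue.addAll(jobs);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().getName());
        }
        return order;
    }
}
```

Note that java.util.PriorityQueue is not thread-safe; if worker threads add to the queue while the scheduler thread polls it, PriorityBlockingQueue accepts the same comparator.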

How do you prevent duplicate job execution in a distributed scheduler?

Use a distributed lock (Redis SETNX with TTL, or a database SELECT FOR UPDATE) keyed on the job instance ID. The first worker to acquire the lock executes the job. Others see the lock is taken and skip. The TTL prevents deadlock if the worker crashes.
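
The acquire-or-skip behaviour can be sketched with an in-process stand-in. InMemoryTtlLock is a hypothetical class, not a Redis client: it only mimics the semantics of "set if absent, with expiry" on a ConcurrentHashMap, and the caller passes the current time explicitly so the expiry logic is easy to follow and test.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-process illustration of a set-if-absent lock with a TTL (not a distributed lock)
final class InMemoryTtlLock {
    private static final class Entry {
        final String owner;
        final long expiresAtMillis;

        Entry(String owner, long expiresAtMillis) {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentMap<String, Entry> locks = new ConcurrentHashMap<>();

    // Succeeds only if the key is free or its previous holder's TTL has elapsed
    boolean tryAcquire(String key, String owner, long ttlMillis, long nowMillis) {
        Entry fresh = new Entry(owner, nowMillis + ttlMillis);
        // compute() is atomic per key, so two callers cannot both install their entry
        Entry winner = locks.compute(key, (k, current) ->
            (current == null || current.expiresAtMillis <= nowMillis) ? fresh : current);
        return winner == fresh;
    }

    // Only the current owner may release; a stale owner must not free someone else's lock
    boolean release(String key, String owner) {
        Entry current = locks.get(key);
        return current != null
            && current.owner.equals(owner)
            && locks.remove(key, current); // removes only if the entry is unchanged
    }
}
```

The owner check on release matters in the real distributed case too: if a worker's lock expires while it is still running and another worker acquires it, the first worker must not delete the second worker's lock when it finishes.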

What is exponential backoff in job retry?

After each failure, wait longer before retrying: delay = initialDelay * backoffMultiplier^(attempt - 1), where attempt is the 1-based number of the attempt that just failed. Example: with initialDelay=30s and backoffMultiplier=2, successive retries wait 30s, 60s, 120s. This prevents retry storms when a downstream service is degraded: spreading retries over time gives the service time to recover.
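
A minimal sketch of the delay calculation, assuming the exponent starts at zero for the first retry so that the sequence begins at initialDelay (30s, 60s, 120s). BackoffCalculator is a hypothetical helper whose fields mirror the RetryPolicy entity in this guide.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; field names follow RetryPolicy (maxAttempts, initialDelaySeconds, backoffMultiplier)
final class BackoffCalculator {
    private final int maxAttempts;
    private final long initialDelaySeconds;
    private final double backoffMultiplier;

    BackoffCalculator(int maxAttempts, long initialDelaySeconds, double backoffMultiplier) {
        this.maxAttempts = maxAttempts;
        this.initialDelaySeconds = initialDelaySeconds;
        this.backoffMultiplier = backoffMultiplier;
    }

    // Delay before the retry that follows the given failed attempt (1-based)
    Duration delayAfterAttempt(int failedAttempt) {
        if (failedAttempt < 1) {
            throw new IllegalArgumentException("attempt numbers start at 1");
        }
        double factor = Math.pow(backoffMultiplier, failedAttempt - 1);
        return Duration.ofSeconds((long) (initialDelaySeconds * factor));
    }

    // Every retry delay the policy allows: maxAttempts runs means maxAttempts - 1 retries
    List<Duration> retryDelays() {
        List<Duration> delays = new ArrayList<>();
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            delays.add(delayAfterAttempt(attempt));
        }
        return delays;
    }
}
```

With maxAttempts=4, initialDelaySeconds=30, and backoffMultiplier=2, retryDelays() yields 30s, 60s, and 120s; with a limit of 3 attempts only the first two delays are ever used. Production systems often add random jitter to each delay so that many failed jobs do not retry in lockstep.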

What design patterns are used in Job Scheduler LLD?

The primary patterns are Command (JobCommand encapsulates job logic), Strategy (retry policy), and Observer (notify callers of job completion). The priority queue is a standard data structure, not a design pattern, but it is central to the scheduler's correctness.
