# Overview of Durable Job Processing #flashcard
- When a job in a system is expected to run for an extended period of time (hours to days), there is a risk of the job failing and losing progress. A Durable Job Processing architecture addresses this by using a log (like [[Kafka]]) to store the jobs. Workers process jobs off the log while periodically checkpointing their progress back into it. If a worker crashes, another worker can pick up the job from the last checkpoint.
- You can implement this with an [[Async Job Worker Pool]], which is composed of a [[Message Queue]] and workers.
- Uber's [[Cadence]] and the open source [[Temporal]] offer alternatives to Kafka while achieving a similar result: a durable job.
<!--ID: 1751507776427-->

## Step-by-step Approach For Managing Long Running Tasks #flashcard
1. Web server validates the request and creates a job record in your database with status "pending"
2. Web server pushes a message to the queue containing the job ID (not the full job data)
3. Web server returns the job ID to the client immediately
4. Worker pulls the message from the queue and fetches job details from the database
5. Worker updates the job status to "processing"
6. Worker performs the actual work
7. Worker stores results (in S3 for files, the database for metadata, etc.)
8. Worker updates the job status to "completed" or "failed" (this flow is sketched below)
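A minimal sketch of the eight steps above, under stated assumptions: an in-memory `queue.Queue` and a dict stand in for the real message queue (SQS, RabbitMQ, a Kafka topic) and the jobs table, and the names `submit_job`, `worker_loop`, and `do_work` are illustrative, not from the original note.

```python
import queue
import time
import uuid

job_queue = queue.Queue()  # stand-in for SQS / RabbitMQ / a Kafka topic
jobs = {}                  # stand-in for the jobs table in your database


def submit_job(payload: dict) -> str:
    """Web server path (steps 1-3): create a 'pending' record, enqueue only the job ID."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "payload": payload, "result": None}
    job_queue.put(job_id)      # the message carries the ID, not the full job data
    return job_id              # returned to the client immediately


def do_work(payload: dict) -> str:
    time.sleep(1)              # placeholder for transcoding, export generation, etc.
    return f"processed {payload}"


def worker_loop() -> None:
    """Worker path (steps 4-8): pull an ID, load the record, do the work, persist status."""
    while True:
        job_id = job_queue.get()
        job = jobs[job_id]                           # fetch job details from the database
        job["status"] = "processing"
        try:
            job["result"] = do_work(job["payload"])  # store results (S3, database, ...)
            job["status"] = "completed"
        except Exception:
            job["status"] = "failed"                 # retry / dead-letter policy hooks in here
        finally:
            job_queue.task_done()
```

In a real deployment `submit_job` runs in the web tier and `worker_loop` runs in a separate worker fleet, which is what lets the two sides scale independently.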
## Decision Tree for Long Running Tasks #flashcard
![[image-6.png]]

# Implementation Details
## Handling Failures in Managing Long Running Jobs
- How do you know if a worker crashed? #flashcard
- Use a [[heartbeat]] mechanism that checks in with the queue to show the worker is still alive.
- The heartbeat interval is a key decision. Too long and recovery of the job will be significantly delayed. Too short and you send a lot of unnecessary messages to the queue, or worse, you may mark jobs as failed when in fact they're still running.
- For most systems, 10-30 seconds is a good starting point.
- What about repeated failures? If you don't do anything, the job will be retried forever. #flashcard
- Use a [[Dead Letter Queue (DLQ)]]. After a job fails a certain number of times (typically 3-5), move it to a separate queue instead of retrying again.

## Preventing Duplicate Work
To prevent duplicate work if a user submits a request multiple times... #flashcard
- The solution is idempotency keys. When accepting a job, require a unique identifier that represents the operation.
- For user-initiated actions, combine user ID + action + timestamp (rounded to the window over which you want to prevent duplicate work).
- For system-generated jobs, use deterministic IDs based on the input data. Before starting work, check if a job with this key already exists. If it does, return the existing job ID instead of creating a new one (see the sketch below).
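A small sketch of that idempotency-key check, assuming a user ID + action + rounded-timestamp key and a dict standing in for a unique-indexed column on the jobs table; `idempotency_key`, `submit_once`, and the 10-minute window are illustrative assumptions.

```python
import hashlib
import time
from typing import Callable, Optional

jobs_by_key: dict[str, str] = {}   # stand-in for a UNIQUE-indexed column on the jobs table
DEDUP_WINDOW_SECONDS = 600         # duplicate submissions within 10 minutes map to the same key


def idempotency_key(user_id: str, action: str, now: Optional[float] = None) -> str:
    """User-initiated actions: user ID + action + timestamp rounded to the dedup window."""
    bucket = int((now if now is not None else time.time()) // DEDUP_WINDOW_SECONDS)
    return hashlib.sha256(f"{user_id}:{action}:{bucket}".encode()).hexdigest()


def submit_once(user_id: str, action: str, create_job: Callable[[], str]) -> str:
    """Create the job only if this key hasn't been seen; otherwise return the existing job ID."""
    key = idempotency_key(user_id, action)
    if key in jobs_by_key:
        return jobs_by_key[key]    # duplicate submission: reuse the original job
    job_id = create_job()          # e.g. the submit_job() sketch earlier in the note
    jobs_by_key[key] = job_id
    return job_id
```

In production the lookup-then-insert has to be atomic (a unique constraint plus an upsert), otherwise two concurrent submissions can both pass the check and create duplicate jobs.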
## Managing Queue Length #flashcard
- Use [[Backpressure]] to slow down job acceptance when workers are overwhelmed. Set queue depth limits, reject new jobs when the queue is too deep, and return a "system busy" response immediately rather than accepting work you can't handle.
- Autoscale workers based on queue depth. When the queue grows beyond a threshold, spin up more workers; when it shrinks, scale down. CloudWatch alarms + Auto Scaling groups make this straightforward on AWS. The key metric is queue depth, not CPU usage. By the time CPU is high, your queue is already backed up.

## Handling Mixed Workloads #flashcard
- The solution is to separate queues by job type or expected duration. Quick reports go to a "fast" queue with many workers. Complex reports go to a "slow" queue with fewer, beefier workers. This prevents head-of-line blocking and lets you tune each queue independently.
- Alternatively, you can break large jobs into smaller chunks that use the same queue infrastructure, like splitting [news feed fanout](https://www.hellointerview.com/learn/system-design/problem-breakdowns/fb-news-feed) into batches of followers.

## Orchestrating Job Dependencies #flashcard
- For simple chains, have each worker queue the next job before marking itself complete. Include the full context in each job so steps can be retried independently.
- For complex workflows with branching or parallel steps, use a workflow orchestrator like [AWS Step Functions](https://aws.amazon.com/step-functions/), [Temporal](https://temporal.io/), or [Airflow](https://airflow.apache.org/). These tools let you define workflows as code, handle retries per step, and provide visibility into where workflows get stuck. The tradeoff is additional complexity - only reach for orchestration when job dependencies become truly complex.

# Key Considerations
## Pros #flashcard
- **Fast user response times** - API calls return in milliseconds instead of timing out after 30 seconds. Users get immediate acknowledgment that their request was received.
- **Independent scaling** - Web servers and workers scale separately. Add more workers during peak processing times without paying for idle web servers.
- **Fault isolation** - A worker crash processing one video doesn't bring down your entire API. Failed jobs can be retried without affecting user-facing services.
- **Better resource utilization** - CPU-intensive workers run on compute-optimized instances. Memory-heavy tasks get high-memory machines. Web servers use cheap, general-purpose instances.
<!--ID: 1752429514716-->

## Cons #flashcard
- **System complexity** - You now have queues, workers, and job status tracking to manage. More moving parts means more things that can break.
- **Eventual consistency** - The work isn't done when the API returns. Users might see stale data until background processing completes.
- **Job status tracking** - You need infrastructure to store job states, handle retries, and expose status endpoints. This adds database load and API complexity.
- **Monitoring overhead** - Queue depth, worker health, job failure rates, processing latency - you're monitoring a distributed system instead of simple request/response cycles.
- **New failure modes** - What happens when the queue fills up? How do you handle poison messages that crash workers repeatedly? When do you give up retrying a failed job? These problems aren't impossible to solve, but they need planning that synchronous systems don't.
<!--ID: 1752429514730-->

# Use Cases
- To be used if the following non-functional requirements are a priority:
    - [[Durability]]
    - [[Fault Tolerance]]
- [[Design a Video Hosting Platform]]
- [[Design a Photo Sharing System]]
- [[Design a Rideshare Service]]
- [[Design a File Storage Service]]

## When to Use in System Design Interviews: #flashcard
1. **When they mention specific slow operations** - The moment you hear "video transcoding", "image processing", "PDF generation", "sending bulk emails", or "data exports", that's your cue. These operations take seconds to minutes. Jump in immediately: "Video transcoding will take several minutes, so I'll return a job ID right away and process it asynchronously."
2. **When the math doesn't work** - If they say "we process 1 million images per day" and you know image processing takes 10 seconds, do the quick calculation out loud: "That's about 12 images per second, which means 120 seconds of processing time per second. We'd need 120+ servers just for image processing. I'll use async workers instead."
3. **When different operations need different hardware** - If the problem involves both simple API requests and GPU-heavy work (like ML inference or video processing), that's a clear async signal. "We shouldn't run GPU workloads on the same servers handling login requests. I'll separate these with async workers on GPU instances."
4. **When they ask about scale or failures** - Questions like "what if a server crashes during processing?" or "how do you handle 10x traffic?" are perfect openings to introduce async workers. "With async workers, if one crashes mid-job, another worker picks it up from the queue. No user requests are lost."

# Related Topics
- [[Redis]] (with Bull or BullMQ)
- [[AWS SQS]]
- [[RabbitMQ]]
- [[Kafka]]