Multi-Tenant CI/CD Workflow System

System Design | Onsite | Phone | Software Engineer | Reported Mar 2026 | High Frequency

Problem Statement

Design a scalable, fault-tolerant CI/CD system for a multi-tenant environment that schedules and executes user-defined workflows in response to git pushes. The system must handle workflow execution, job scheduling, and real-time status updates, and it must ensure exactly-once execution semantics.


Real Interview Experiences

Here are actual experiences shared by engineers who were asked this question in interviews:

Experience 1: Focus on Exactly-Once Execution

"The interviewer seemed to focus heavily on exactly-once execution. They kept emphasizing to just treat each job as a single task, not even requiring it to be linear. They also asked about how to design for multi-tenant scenarios."

"It's actually more focused on job scheduler content. I successfully passed this round."

Key Insight: The interviewer heavily emphasized exactly-once execution and kept the scope simple: each job is treated as a single task, and workflows were not even required to be linear at first. Multi-tenant design was also a focus area.

Experience 2: Stateless Architecture with CDC

"The problem requires each action flow to be a simple linear multiple-step flow (not a DAG), so you need to handle starting the next step after one step completes. One approach is to have a DB entry for each step. At the beginning, your scheduler creates all step entries (all in pending status), but only puts the first step into a queue for workers to run. Then workers update the step entry status. After the status changes, you use CDC (Change Data Capture push notification) to notify your scheduler, then the scheduler queries the DB to find the next step and puts it in the queue. This way you achieve a completely stateless scheduler that can handle this task."

"Other follow-up questions I answered casually. I think the key point is the above - implement the high-level design where all servers are basically stateless and can horizontally scale at will."

Key Approach:

Create all step entries in database upfront (all PENDING status)

Only enqueue the first step to the queue

Workers update step status in DB after completion

Use CDC (Change Data Capture) to notify scheduler

Scheduler queries DB for next step and enqueues it

This creates a completely stateless scheduler that can horizontally scale
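The steps above can be sketched in a few functions. This is a minimal illustration, not the candidate's actual design: an in-memory SQLite table stands in for the step store, a deque stands in for the job queue, and the CDC notification is simulated by passing the worker's change event straight to the handler. The table layout and all function names are assumptions made for the example.

```python
import sqlite3
from collections import deque

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE steps (workflow_id TEXT, seq INTEGER, name TEXT, "
        "status TEXT, PRIMARY KEY (workflow_id, seq))"
    )
    return db

def create_workflow(db, queue, workflow_id, step_names):
    """Insert every step as PENDING, then enqueue only the first one."""
    db.executemany(
        "INSERT INTO steps (workflow_id, seq, name, status) "
        "VALUES (?, ?, ?, 'PENDING')",
        [(workflow_id, i, name) for i, name in enumerate(step_names)],
    )
    db.commit()
    queue.append((workflow_id, 0))

def worker_run(db, workflow_id, seq):
    """Stand-in for a worker: mark the step done and return the change event."""
    db.execute(
        "UPDATE steps SET status = 'SUCCEEDED' WHERE workflow_id = ? AND seq = ?",
        (workflow_id, seq),
    )
    db.commit()
    return {"workflow_id": workflow_id, "seq": seq, "status": "SUCCEEDED"}

def on_step_changed(db, queue, event):
    """Simulated CDC handler: find the next PENDING step in the DB, enqueue it."""
    if event["status"] != "SUCCEEDED":
        return
    row = db.execute(
        "SELECT seq FROM steps WHERE workflow_id = ? AND status = 'PENDING' "
        "ORDER BY seq LIMIT 1",
        (event["workflow_id"],),
    ).fetchone()
    if row is not None:
        queue.append((event["workflow_id"], row[0]))

def run_demo():
    """Drive one three-step workflow and return the order steps were run in."""
    db, queue = make_db(), deque()
    create_workflow(db, queue, "wf-1", ["build", "test", "deploy"])
    order = []
    while queue:
        workflow_id, seq = queue.popleft()
        order.append(seq)
        on_step_changed(db, queue, worker_run(db, workflow_id, seq))
    return order
```

The scheduler keeps nothing in memory between events: everything it needs is re-read from the table, which is what lets any replica handle any notification. A real deployment would also have to tolerate duplicate change events, for example by making the enqueue or the worker's status transition conditional.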

Experience 3: K8s and Docker Focus

"Design a CI/CD job scheduler that utilizes tech such as k8s and docker. Discuss how you would approach the design and the key components involved."

Tech Stack Focus: Kubernetes and Docker were explicitly mentioned as technologies to incorporate.
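One plausible way to bring Kubernetes and Docker into the design is to run each CI job as a Kubernetes Job that wraps a single container. The sketch below only builds the manifest as a plain dictionary; it does not talk to a cluster. The naming scheme, the namespace-per-tenant layout, and the resource limits are assumptions for illustration, not something the interview reports specify.

```python
import json

def build_job_manifest(tenant_id, job_id, image, command):
    """Return a batch/v1 Job manifest (as a dict) for one CI job.

    tenant_id and job_id are assumed to already be valid DNS labels.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"ci-{job_id}",
            "namespace": f"tenant-{tenant_id}",  # assumed: one namespace per tenant
            "labels": {"tenant": tenant_id, "ci-job-id": job_id},
        },
        "spec": {
            "backoffLimit": 0,  # no automatic retries; the scheduler decides
            "ttlSecondsAfterFinished": 3600,  # clean up finished Jobs after an hour
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "step",
                            "image": image,
                            "command": command,
                            "resources": {"limits": {"cpu": "1", "memory": "1Gi"}},
                        }
                    ],
                }
            },
        },
    }

manifest = build_job_manifest("acme", "42", "python:3.12", ["pytest", "-q"])
manifest_json = json.dumps(manifest, indent=2)
```

Setting backoffLimit to 0 keeps retry decisions in the CI scheduler rather than in Kubernetes, which makes it easier to reason about how many times a job actually ran.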

Experience 4: Front-End UI Integration

"Design a system for a multi-tenant CI/CD workflow triggered by a git push. The system should analyze the git push to obtain the GitHub repository ID using an internal API service. The workflow configuration is stored in the GitHub repository. After the workflow is created, it should schedule a sequence of jobs defined in the config file. Additionally, design a front-end UI to display the progress of the workflow after it has been created."

Additional Requirement: Front-end UI for real-time workflow progress display was part of the discussion.
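A progress view needs some way to fetch only the output it has not yet shown. One simple option is offset-based polling, sketched below with an in-memory stand-in for the log store. The class and field names are hypothetical, and a production UI might instead push updates over a streaming connection.

```python
class JobOutput:
    """In-memory stand-in for a per-job status and log store (hypothetical)."""

    def __init__(self):
        self.status = "PENDING"
        self._lines = []

    def append(self, line):
        self._lines.append(line)

    def read_from(self, offset):
        """Return lines at or after `offset` plus the cursor for the next poll."""
        new_lines = self._lines[offset:]
        return {
            "status": self.status,
            "lines": new_lines,
            "next_offset": offset + len(new_lines),
        }
```

The client stores next_offset from each response and sends it back on the next poll, so the server stays stateless with respect to individual viewers.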

Experience 5: Fault Tolerance Priority

"CI/CD question. Main focus was on how to implement exactly-once execution under the constraints of being fault-tolerant and scalable."

Core Focus: How to achieve exactly-once execution under the constraints of being fault-tolerant and scalable.
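A common way to approximate exactly-once execution is to combine at-least-once delivery with an atomic claim, so that only one worker ever wins a given job even if the queue delivers it twice. The sketch below uses a conditional UPDATE against an in-memory SQLite table. The schema and function names are assumptions for illustration, and it deliberately leaves out the lease or heartbeat a real system would need to recover jobs from crashed workers.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT, owner TEXT)"
    )
    return db

def add_job(db, job_id):
    db.execute(
        "INSERT INTO jobs (job_id, status, owner) VALUES (?, 'PENDING', NULL)",
        (job_id,),
    )
    db.commit()

def try_claim(db, job_id, worker_id):
    """Atomically move PENDING -> RUNNING; True only for the single winner."""
    cur = db.execute(
        "UPDATE jobs SET status = 'RUNNING', owner = ? "
        "WHERE job_id = ? AND status = 'PENDING'",
        (worker_id, job_id),
    )
    db.commit()
    return cur.rowcount == 1

def finish(db, job_id, worker_id):
    """Move RUNNING -> SUCCEEDED, but only for the worker that owns the claim."""
    cur = db.execute(
        "UPDATE jobs SET status = 'SUCCEEDED' "
        "WHERE job_id = ? AND status = 'RUNNING' AND owner = ?",
        (job_id, worker_id),
    )
    db.commit()
    return cur.rowcount == 1
```

A worker that receives a duplicate delivery calls try_claim, gets False, and simply acknowledges the message without running anything.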

Common Interview Pattern

Multiple candidates reported similar problem descriptions:

"Design a multi-tenant CI/CD system which schedules and executes user-defined workflows in response to git pushes. The system receives information about pushes via API calls from an internal service which contain the repository id and the current state of the repository (commit hash). Workflows are a sequence of jobs which are defined within a single YAML file in a static location for each repository. Users should be able to view the output and status of jobs as they are running."

Consistent Elements Across Interviews:

Multi-tenant requirement

Git push triggers

Repository ID + commit hash in API payload

YAML workflow configuration in repository

Sequential job execution (not DAG)

Real-time job status/output viewing

Exactly-once execution semantics (most heavily emphasized)

Fault tolerance and scalability

Stateless architecture for horizontal scaling
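The interview reports do not say what the multi-tenant requirement demands of the scheduler. One reasonable reading is fairness: a tenant that pushes hundreds of jobs should not starve the others. The sketch below shows round-robin dequeueing across per-tenant queues. It is an in-memory illustration with hypothetical names, not a distributed queue.

```python
from collections import OrderedDict, deque

class FairQueue:
    """Round-robin across tenants so one busy tenant cannot starve the rest."""

    def __init__(self):
        self._queues = OrderedDict()  # tenant_id -> deque of job ids

    def push(self, tenant_id, job_id):
        self._queues.setdefault(tenant_id, deque()).append(job_id)

    def pop(self):
        """Return (tenant_id, job_id) from the tenant at the front, or None."""
        if not self._queues:
            return None
        tenant_id, jobs = next(iter(self._queues.items()))
        job_id = jobs.popleft()
        if jobs:
            self._queues.move_to_end(tenant_id)  # back of the line
        else:
            del self._queues[tenant_id]
        return tenant_id, job_id
```

With tenant a holding three jobs and tenant b holding one, b's job is served second rather than fourth.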

Reference Solution (Solution Tab)

Approach

Sort shards by (start, end), then sweep from left to right. Maintain a min-heap of active shard ends plus a last_start guard so later shards cannot move back into a prefix that was already proven saturated. After the sweep, extend the previous kept shard whenever there is a gap so coverage stays continuous from the original minimum start to the original maximum end.

Why It Works

Sorting enforces the required priority order, so earlier shards always win. Whenever limit active shards already cover the candidate start, jumping to smallest + 1 (where smallest is the earliest end among the active shards) is the smallest shift that can reduce overlap. If that jump moves past the shard's end, the shard is fully eclipsed and must be dropped. Gap filling only extends the immediate predecessor, so it restores continuity without creating new overlap.

Complexity

Time: O(n log n)

Space: O(n)

Reference Implementation (Python)

import heapq
from typing import List

class Solution:
    def rebalance(self, limit: int, shards: List[str]) -> List[str]:
        if not shards:
            return []

        parsed = []
        max_end = -10**18

        for shard in shards:
            shard_id, start_raw, end_raw = shard.split(':')
            start = int(start_raw)
            end = int(end_raw)
            parsed.append([shard_id, start, end])
            if end > max_end:
                max_end = end

        parsed.sort(key=lambda shard: (shard[1], shard[2]))

        end_heap = []
        last_start = -10**18
        kept = []

        for shard_id, start, end in parsed:
            new_start = max(start, last_start)
            tentative = []

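            while True:
                # Drop ends that no longer reach new_start. Ends popped after a
                # jump were still active at the original start, so remember
                # them in case this shard is dropped and they must be restored.
                while end_heap and end_heap[0] < new_start:
                    expired = heapq.heappop(end_heap)
                    if tentative:
                        tentative.append(expired)

                if len(end_heap) < limit:
                    break

                # Saturated: jump just past the earliest active end.
                smallest = heapq.heappop(end_heap)
                tentative.append(smallest)
                new_start = smallest + 1

            if new_start > end:
                # Fully eclipsed: restore the heap and drop the shard.
                for value in tentative:
                    heapq.heappush(end_heap, value)
                continue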

            kept.append([shard_id, new_start, end])
            heapq.heappush(end_heap, end)
            last_start = new_start

        if kept:
            cover_end = kept[0][2]

            for i in range(1, len(kept)):
                prev = kept[i - 1]
                cur = kept[i]

                if cur[1] > cover_end + 1:
                    prev[2] = cur[1] - 1
                    cover_end = prev[2]

                if cur[2] > cover_end:
                    cover_end = cur[2]

            if kept[-1][2] < max_end:
                kept[-1][2] = max_end

        return [f"{shard_id}:{start}:{end}" for shard_id, start, end in kept]
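A quick way to sanity-check output from a sweep like this is to measure the maximum overlap directly. The helper below is not part of the reference solution. It assumes the same id:start:end string format and treats ranges as inclusive, which is what the + 1 step in the sweep suggests.

```python
def max_overlap(shards):
    """Return the largest number of shards covering any single integer key."""
    events = []
    for shard in shards:
        _, start_raw, end_raw = shard.split(":")
        start, end = int(start_raw), int(end_raw)
        if start > end:
            continue  # ignore empty ranges
        events.append((start, 1))     # shard becomes active at start
        events.append((end + 1, -1))  # inclusive end: inactive from end + 1
    # At equal positions (pos, -1) sorts before (pos, 1), so a shard ending
    # at end is closed before another that starts at end + 1 is counted.
    events.sort()
    active = best = 0
    for _, delta in events:
        active += delta
        best = max(best, active)
    return best
```

For a result to respect the limit, max_overlap of the returned list should be at most limit.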
