Architecture

This document provides an overview of the Kubedoop Data Platform architecture, including its internal framework, built-in Operators, component dependencies, design principles, and data flow patterns.

Platform Architecture Overview

Kubedoop is a Kubernetes-native DataOps platform that manages 15+ big data components through a unified Operator framework. The platform runs entirely on Kubernetes and uses Helm charts to install the Operators and manage their lifecycle.

graph TB
    subgraph Users["User Layer"]
        UI[Web UI / CLI]
        Apps[Data Applications]
    end

    subgraph Platform["Kubedoop Platform"]
        Helm[Helm Charts]
        subgraph Operators["Product Operators"]
            OP1[Spark Operator]
            OP2[Hive Operator]
            OP3[Trino Operator]
            OP4[Kafka Operator]
            OP5[HDFS Operator]
            OP6[... 8 more]
        end
        subgraph BuiltIn["Built-in Operators"]
            CO[Commons Operator]
            LO[Listener Operator]
            SO[Secret Operator]
        end
    end

    subgraph K8s["Kubernetes Cluster"]
        API[Kubernetes API Server]
        PV[Persistent Volumes]
        NET[Network Policies]
    end

    Users --> Platform
    Helm --> Operators
    Operators --> BuiltIn
    Operators --> K8s
    BuiltIn --> K8s

operator-go Framework

All Kubedoop Operators are built on top of the operator-go framework, an in-house library that provides a unified abstraction for managing stateful data infrastructure on Kubernetes.

Unified CRD Abstraction

The operator-go framework introduces a consistent CRD model across all Operators:

Cluster: The top-level resource representing a full component deployment
Roles: Logical groupings of processes with the same responsibility (e.g., NameNode, DataNode)
Role Groups: Named sets of instances within a role, each with its own replica count and optional overrides of the role-level configuration, used for high availability, resource isolation, or workload separation

apiVersion: {group}.kubedoop.dev/v1alpha1
kind: {ClusterKind}
metadata:
  name: my-cluster
spec:
  roleA:
    config:           # Role-level config
      resources:
        cpu: { min: "1" }
    roleGroups:
      group-1:        # Role group with default config
        replicas: 3
      group-2:        # Role group with overridden config
        replicas: 2
        config:
          resources:
            cpu: { min: "2" }
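
The comments in the manifest above suggest that a role group inherits the role-level config and overrides individual keys. The following is a minimal Python sketch of that layering, assuming a recursive merge in which role-group values win. The helper names deep_merge and effective_configs are hypothetical, and the actual merge rules in operator-go may differ.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`, recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def effective_configs(role_spec: dict) -> dict:
    """Compute the effective config of every role group in one role (assumed semantics)."""
    role_config = role_spec.get("config", {})
    return {
        name: deep_merge(role_config, group.get("config", {}))
        for name, group in role_spec.get("roleGroups", {}).items()
    }


# Mirrors the roleA section of the manifest above.
role_a = {
    "config": {"resources": {"cpu": {"min": "1"}}},
    "roleGroups": {
        "group-1": {"replicas": 3},
        "group-2": {"replicas": 2, "config": {"resources": {"cpu": {"min": "2"}}}},
    },
}

configs = effective_configs(role_a)
print(configs["group-1"])  # inherits the role-level cpu.min
print(configs["group-2"])  # uses its own cpu.min override
```

Under these assumptions, group-1 ends up with the role-level CPU minimum of "1" and group-2 with its override of "2", while the role-level config itself is left unmodified.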

Lifecycle Management

The operator-go framework handles the full lifecycle of component deployments:

Phase	Description
Creation	Deploys StatefulSets, Services, ConfigMaps, and Secrets based on CRD specs
Scaling	Adjusts replica counts for role groups without disrupting existing pods
Upgrading	Performs rolling upgrades across role groups with configurable maxUnavailable
Failure Recovery	Automatically restarts failed pods and reconciles desired vs. actual state
Configuration Updates	Applies config changes with graceful rolling restarts
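
To make the effect of maxUnavailable concrete, here is a simplified Python sketch that splits the pods of one role group into restart batches of at most maxUnavailable pods. It only illustrates the idea. The function name and pod names are hypothetical, and in a real cluster the rollout is driven by Kubernetes controllers rather than by a precomputed list.

```python
def rolling_batches(pods: list[str], max_unavailable: int) -> list[list[str]]:
    """Split pods into consecutive batches so that no batch exceeds max_unavailable."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    return [
        pods[start:start + max_unavailable]
        for start in range(0, len(pods), max_unavailable)
    ]


# Hypothetical pod names for a role group with three replicas.
pods = ["group-1-0", "group-1-1", "group-1-2"]
for batch in rolling_batches(pods, max_unavailable=2):
    print("restart together:", batch)
```

With max_unavailable=1 every pod is restarted on its own, which is the most conservative setting. Larger values shorten the rollout at the cost of temporarily reduced capacity.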

Source code: operator-go on GitHub

Built-in Operators

Kubedoop includes three built-in Operators that provide cross-cutting functionality shared by all product Operators:

graph LR
    subgraph ProductOps["Product Operators"]
        PO1[Spark Operator]
        PO2[Hive Operator]
        PO3[Trino Operator]
    end

    subgraph BuiltInOps["Built-in Operators"]
        CO["Commons Operator<br/>Environment variables<br/>JVM parameters<br/>Pod templates"]
        LO["Listener Operator<br/>Service / Ingress<br/>TLS certificates<br/>Service discovery"]
        SO["Secret Operator<br/>Password injection<br/>Certificate mounting<br/>Credential rotation"]
    end

    PO1 --> CO
    PO1 --> LO
    PO1 --> SO
    PO2 --> CO
    PO2 --> LO
    PO2 --> SO
    PO3 --> CO
    PO3 --> LO
    PO3 --> SO

Commons Operator

The Commons Operator manages shared configuration that applies across all product Operators:

Environment variables: Injects common environment variables into component pods
JVM parameters: Configures JVM heap size, GC settings, and other Java runtime options
Pod templates: Provides a base Pod template (annotations, labels, affinity) that product Operators extend

Listener Operator

The Listener Operator provides automated service discovery and network configuration:

Service / Ingress generation: Automatically creates Kubernetes Services and Ingress resources based on listener definitions
TLS certificate management: Provisions and rotates TLS certificates for encrypted communication
Service discovery: Enables components to discover each other through DNS and built-in service resolution
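
As an illustration of DNS-based discovery, the sketch below builds the standard Kubernetes DNS names for a Service and for a pod behind a headless Service. The naming scheme is standard Kubernetes behavior. The service, pod, and namespace names are hypothetical, and the cluster domain is assumed to be the default cluster.local.

```python
def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """DNS name under which a Kubernetes Service is resolvable inside the cluster."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pod_fqdn(pod: str, headless_service: str, namespace: str,
             cluster_domain: str = "cluster.local") -> str:
    """Stable DNS name of a StatefulSet pod governed by a headless Service."""
    return f"{pod}.{headless_service}.{namespace}.svc.{cluster_domain}"


# Hypothetical names, for illustration only.
print(service_fqdn("hive-metastore", "data"))
print(pod_fqdn("hdfs-namenode-0", "hdfs-namenode", "data"))
```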

Secret Operator

The Secret Operator handles secure credential management:

Password injection: Automatically generates and injects passwords into component pods as environment variables or files
Certificate mounting: Mounts TLS certificates and keys into pods from centralized Secret resources
Credential rotation: Supports periodic rotation of credentials without manual intervention
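
The two ideas behind password generation and rotation can be sketched in a few lines of Python. This is not the Secret Operator's implementation: the alphabet, the default length, and the age-based rotation rule are assumptions chosen for illustration.

```python
import secrets
import string
from datetime import datetime, timedelta, timezone

# Assumed character set; a real deployment may require a different policy.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 32) -> str:
    """Generate a random password from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def rotation_due(issued_at: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True once a credential has reached its maximum allowed age."""
    return now - issued_at >= max_age


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(len(generate_password()))
print(rotation_due(issued, timedelta(days=30), datetime(2024, 2, 1, tzinfo=timezone.utc)))
```

Passing the current time as an argument rather than reading the clock inside rotation_due keeps the check deterministic and easy to test.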

Component Dependencies

The following diagram shows the dependency relationships between Kubedoop product Operators:

graph TD
    ZK["Zookeeper Operator"]

    HDFS["HDFS Operator"]
    DB["Database<br/>(External)"]

    Hive["Hive Operator"]
    Trino["Trino Operator"]
    Spark["Spark Operator"]
    Kafka["Kafka Operator"]
    Superset["Superset Operator"]
    Doris["Doris Operator"]
    HBase["HBase Operator"]
    Kyuubi["Kyuubi Operator"]
    NiFi["NiFi Operator"]
    Airflow["Airflow Operator"]
    DS["DolphinScheduler Operator"]

    HDFS --> ZK
    Hive --> ZK
    Hive --> HDFS
    Hive --> DB
    Trino --> ZK
    Trino --> HDFS
    Trino --> Hive
    Spark --> HDFS
    Spark --> Hive
    Kafka --> ZK
    Superset --> DB
    Doris --> ZK
    HBase --> ZK
    HBase --> HDFS
    Kyuubi --> HDFS
    Kyuubi --> Hive
    NiFi --> ZK
    NiFi --> HDFS
    Airflow --> DB
    DS --> ZK
    DS --> DB

Operator	Dependencies
Zookeeper	None (foundational service)
HDFS	Zookeeper
Hive	Zookeeper, HDFS, Database
Trino	Zookeeper, HDFS, Hive
Spark	HDFS, Hive
Kafka	Zookeeper
Superset	Database
Doris	Zookeeper
HBase	Zookeeper, HDFS
Kyuubi	HDFS, Hive
NiFi	Zookeeper, HDFS
Airflow	Database
DolphinScheduler	Zookeeper, Database
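
One practical use of the table above is deriving a valid installation order, in which every component is deployed after its dependencies. The following Python sketch encodes the table as a graph and sorts it topologically with the standard-library graphlib module (Python 3.9+). The function name deployment_order is hypothetical, and the external Database appears in the result only as a prerequisite, not as something Kubedoop deploys.

```python
from graphlib import TopologicalSorter

# Each key depends on the components in its set (taken from the table above).
DEPENDENCIES = {
    "Zookeeper": set(),
    "HDFS": {"Zookeeper"},
    "Hive": {"Zookeeper", "HDFS", "Database"},
    "Trino": {"Zookeeper", "HDFS", "Hive"},
    "Spark": {"HDFS", "Hive"},
    "Kafka": {"Zookeeper"},
    "Superset": {"Database"},
    "Doris": {"Zookeeper"},
    "HBase": {"Zookeeper", "HDFS"},
    "Kyuubi": {"HDFS", "Hive"},
    "NiFi": {"Zookeeper", "HDFS"},
    "Airflow": {"Database"},
    "DolphinScheduler": {"Zookeeper", "Database"},
}


def deployment_order(dependencies: dict) -> list:
    """Return the components so that each one comes after all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = deployment_order(DEPENDENCIES)
print(order)
```

Several orders are valid for this graph, so the exact sequence printed may vary. What is guaranteed is that, for example, Zookeeper precedes HDFS, HDFS precedes Hive, and Hive precedes Trino. A cyclic dependency would raise graphlib.CycleError.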

Design Principles

Kubedoop is built on the following core design principles:

Kubernetes Native

All components are managed through Kubernetes Custom Resource Definitions (CRDs) and Operators. There are no custom orchestration layers: the platform relies entirely on the Kubernetes API for state management, scheduling, and self-healing.

Declarative Configuration

Users describe the desired state of their data infrastructure through YAML manifests. The Operators continuously reconcile the actual state with the desired state, ensuring consistency without manual intervention.
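
The reconciliation idea can be sketched as a pure function that compares desired and actual state and returns the actions needed to converge. This Python sketch is deliberately reduced to replica counts per role group. The function name and the action vocabulary are hypothetical, and a real Operator reconciles full Kubernetes objects through the API server.

```python
def plan_actions(desired: dict[str, int], actual: dict[str, int]) -> list[tuple[str, str, int]]:
    """Compare replica counts per role group and list the actions needed to converge."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name)
        have = actual.get(name)
        if have is None:
            actions.append(("create", name, want))   # declared but not running yet
        elif want is None:
            actions.append(("delete", name, 0))      # running but no longer declared
        elif want != have:
            actions.append(("scale", name, want))    # running with the wrong size
    return actions


desired = {"group-1": 3, "group-2": 2}
actual = {"group-1": 1, "group-3": 1}
for action in plan_actions(desired, actual):
    print(action)
```

When desired and actual state already match, the function returns an empty list, which corresponds to a reconcile pass that changes nothing.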

Pluggable Storage

Storage is abstracted through Kubernetes StorageClass, allowing users to choose the underlying storage backend (SSD, HDD, NFS, cloud storage) without changing their component configuration. This enables flexible deployment across different environments.
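
As a sketch of what this abstraction means in practice, the Python function below builds a PersistentVolumeClaim manifest in which only storageClassName changes between environments. The field names follow the standard Kubernetes PersistentVolumeClaim schema. The claim name and the storage class names are hypothetical, and how Kubedoop itself exposes this setting in its CRDs is not shown here.

```python
def build_pvc(name: str, size: str, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim manifest bound to the given StorageClass."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Same claim, two hypothetical storage classes for two environments.
prod_claim = build_pvc("datanode-data", "100Gi", "fast-ssd")
dev_claim = build_pvc("datanode-data", "100Gi", "standard-hdd")
print(prod_claim["spec"]["storageClassName"], dev_claim["spec"]["storageClassName"])
```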

Unified Security Model

All Operators share a consistent security model through the built-in Secret Operator and Listener Operator. TLS encryption, authentication, and credential management are handled uniformly across all components.

Observability

Kubedoop provides built-in observability for all managed components:

Logging: Centralized log collection and management
Metrics: Exposed through Prometheus-compatible endpoints
Alerting: Integration with alerting systems for proactive monitoring
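
For readers unfamiliar with what a Prometheus-compatible endpoint serves, the sketch below renders a single sample in the Prometheus text exposition format. The metric and label names are hypothetical and are not taken from Kubedoop. A production component would normally use a Prometheus client library instead of formatting lines by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, value, labels=None) -> str:
    """Render one sample line, e.g. name{label="value"} 3."""
    labels = labels or {}
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


# Hypothetical metric describing the replica count of one role group.
print(render_metric("example_role_group_replicas", 3, {"role": "datanode", "group": "group-1"}))
```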

Data Flow Example

The following sequence diagram illustrates the data flow when a user submits a SQL query through Trino to read data from Hive:

sequenceDiagram
    participant User
    participant Trino as Trino Coordinator
    participant TrinoW as Trino Worker
    participant Hive as Hive Metastore
    participant HDFS as HDFS NameNode
    participant HDFSd as HDFS DataNode

    User->>Trino: Submit SQL query (SELECT * FROM hive_table)
    Trino->>Hive: Fetch table metadata (schema, location, format)
    Hive-->>Trino: Return table metadata

    Trino->>HDFS: Request block locations from NameNode
    HDFS-->>Trino: Return block locations

    Trino->>TrinoW: Split query into tasks and assign to workers

    loop For each data block
        TrinoW->>HDFSd: Read data blocks
        HDFSd-->>TrinoW: Return data
    end

    TrinoW->>Trino: Return processed results
    Trino-->>User: Return query results

This flow demonstrates how the components managed by Kubedoop's Operators work together:

Trino receives the query and coordinates execution
Hive Metastore provides table schema and data location metadata
HDFS NameNode manages the file system namespace and block locations
HDFS DataNodes serve the actual data blocks to Trino Workers
Trino Workers process the data in parallel and return results
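
The step in which the coordinator splits the query into tasks and assigns them to workers can be pictured with a small Python sketch that distributes data blocks across workers round-robin. This is a deliberate simplification for illustration: Trino's actual scheduler is more sophisticated, and the function, block, and worker names are hypothetical.

```python
from itertools import cycle


def assign_splits(blocks: list[str], workers: list[str]) -> dict[str, list[str]]:
    """Distribute data blocks across workers in round-robin order."""
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = {worker: [] for worker in workers}
    for block, worker in zip(blocks, cycle(workers)):
        assignment[worker].append(block)
    return assignment


plan = assign_splits(["block-1", "block-2", "block-3"], ["worker-1", "worker-2"])
for worker, blocks in plan.items():
    print(worker, "reads", blocks)
```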
