🛠️ EGS Configuration Details
The script requires a YAML configuration file to define various parameters and settings for the installation process. Below is an example configuration file (egs-installer-config.yaml) with descriptions for each section.
⚠️ Warning
Do not copy the YAML configuration directly from this README. Hash characters (#) used for comments may not be interpreted correctly. Always refer to the actual egs-installer-config.yaml file available in the repository for accurate configuration.
YAML Configuration File
########################### MANDATORY PARAMETERS ####################################################################
# Kubeconfig settings
global_kubeconfig: "" # Relative path to the global kubeconfig file (must be in the script directory) - Mandatory
global_kubecontext: "" # Global kubecontext to use - Mandatory
use_global_context: true # If true, use the global kubecontext for all operations by default
# Enable or disable specific stages of the installation
enable_install_controller: true # Enable the installation of the Kubeslice controller
enable_install_ui: true # Enable the installation of the Kubeslice UI
enable_install_worker: true # Enable the installation of Kubeslice workers
# Enable or disable the installation of additional applications(prometheus, gpu-operator, postgresql)
enable_install_additional_apps: false # Set to true to enable additional apps installation
# Enable custom applications
# Set this to true if you want to allow custom applications to be deployed.
# This is specifically useful for enabling NVIDIA driver installation on your nodes.
enable_custom_apps: true
# Command execution settings
# Set this to true to allow the execution of commands for configuring NVIDIA MIG.
# This includes modifications to the NVIDIA ClusterPolicy and applying node labels
# based on the MIG strategy defined in the YAML (e.g., single or mixed strategy).
run_commands: false
#########################################################################################################################
########################### OPTIONAL CONFIGURATION PARAMETERS ###########################################################
# Project and cluster registration settings
enable_project_creation: true # Enable project creation in Kubeslice
enable_cluster_registration: true # Enable cluster registration in Kubeslice
enable_prepare_worker_values_file: true # Prepare the worker values file for Helm charts
enable_autofetch_egsagent_endpoint_and_token: true # if False then, skip update values of egsAgent token and endpoint in values file.
# Global image pull secret settings for AirGap Installations
global_image_pull_secret:
registry: "https://index.docker.io/v1/" # Docker registry URL
username: "" # Global Docker registry username
password: "" # Global Docker registry password
# Global monitoring endpoint settings
global_auto_fetch_endpoint: false # Enable automatic fetching of monitoring endpoints globally
global_grafana_namespace: egs-monitoring # Namespace where Grafana is globally deployed
global_grafana_service_type: ClusterIP # Service type for Grafana (accessible only within the cluster)
global_grafana_service_name: prometheus-grafana # Service name for accessing Grafana globally
global_prometheus_namespace: egs-monitoring # Namespace where Prometheus is globally deployed
global_prometheus_service_name: prometheus-kube-prometheus-prometheus # Service name for accessing Prometheus globally
global_prometheus_service_type: ClusterIP # Service type for Prometheus (accessible only within the cluster)
# Precheck options
precheck: true # Run general prechecks before starting the installation
kubeslice_precheck: true # Run specific prechecks for Kubeslice components
# Global installation verification settings
verify_install: false # Enable verification of installations globally
verify_install_timeout: 600 # Timeout for global installation verification (in seconds)
skip_on_verify_fail: true # If set to true, skip steps where verification fails, otherwise exit on failure
# Base path settings
base_path: "" # If left empty, the script will use the relative path to the script as the base path
# Helm repository settings
use_local_charts: true # Use local Helm charts instead of fetching them from a repository
local_charts_path: "charts" # Path to the directory containing local Helm charts
global_helm_repo_url: "" # URL for the global Helm repository (if not using local charts)
global_helm_username: "" # Username for accessing the global Helm repository
global_helm_password: "" # Password for accessing the global Helm repository
readd_helm_repos: true # Re-add Helm repositories even if they are already present
#### Kubeslice Controller Installation Settings ####
kubeslice_controller_egs:
skip_installation: false # Do not skip the installation of the controller
use_global_kubeconfig: true # Use global kubeconfig for the controller installation
specific_use_local_charts: true # Override to use local charts for the controller
kubeconfig: "" # Path to the kubeconfig file specific to the controller, if empty, uses the global kubeconfig
kubecontext: "" # Kubecontext specific to the controller; if empty, uses the global context
namespace: "kubeslice-controller" # Kubernetes namespace where the controller will be installed
release: "egs-controller" # Helm release name for the controller
chart: "kubeslice-controller-egs" # Helm chart name for the controller
#### Inline Helm Values for the Controller Chart ####
inline_values:
global:
imageRegistry: harbor.saas1.smart-scaler.io/avesha/aveshasystems # Docker registry for the images
namespaceConfig: # user can configure labels or annotations that EGS Controller namespaces should have
labels: {}
annotations: {}
kubeTally:
enabled: true # Enable KubeTally in the controller
#### Postgresql Connection Configuration for Kubetally ####
postgresSecretName: kubetally-db-credentials # Secret name in kubeslice-controller namespace for PostgreSQL credentials created by install, all the below values must be specified
# then a secret will be created with specified name.
# alternatively you can make all below values empty and provide a pre-created secret name with below connection details format
existingSecret: false # Set to true if secret is pre-created externally
postgresAddr: "kt-postgresql.kt-postgresql.svc.cluster.local" # Change this Address to your postgresql endpoint
postgresPort: 5432 # Change this Port for the PostgreSQL service to your values
postgresUser: "postgres" # Change this PostgreSQL username to your values
postgresPassword: "postgres" # Change this PostgreSQL password to your value
postgresDB: "postgres" # Change this PostgreSQL database name to your value
postgresSslmode: disable # Change this SSL mode for PostgreSQL connection to your value
prometheusUrl: http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090 # Prometheus URL for monitoring
kubeslice:
controller:
endpoint: "" # Endpoint of the controller API server; auto-fetched if left empty
replication:
minio:
install: "true" # Whether to install MinIO for replication
storage: 1Gi # Storage size for MinIO
username: minioadmin # Username for MinIO
password: minioadmin # Password for MinIO
service:
type: "LoadBalancer" # MinIO service type
serviceMonitor:
enabled: true # Enable ServiceMonitor for Prometheus monitoring
namespace: egs-monitoring # Namespace where ServiceMonitor will be deployed
#### Helm Flags and Verification Settings ####
helm_flags: "--wait --timeout 5m --debug" # Additional Helm flags for the installation
verify_install: false # Verify the installation of the controller
verify_install_timeout: 30 # Timeout for the controller installation verification (in seconds)
skip_on_verify_fail: true # If verification fails, do not skip the step
#### Troubleshooting Settings ####
enable_troubleshoot: false # Enable troubleshooting mode for additional logs and checks
#### Kubeslice UI Installation Settings ####
kubeslice_ui_egs:
skip_installation: false # Do not skip the installation of the UI
use_global_kubeconfig: true # Use global kubeconfig for the UI installation
kubeconfig: "" # Path to the kubeconfig file specific to the UI, if empty, uses the global kubeconfig
kubecontext: "" # Kubecontext specific to the UI; if empty, uses the global context
namespace: "kubeslice-controller" # Kubernetes namespace where the UI will be installed
release: "egs-ui" # Helm release name for the UI
chart: "kubeslice-ui-egs" # Helm chart name for the UI
#### Inline Helm Values for the UI Chart ####
inline_values:
global:
imageRegistry: harbor.saas1.smart-scaler.io/avesha/aveshasystems # Docker registry for the UI images
kubeslice:
prometheus:
url: http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090 # Prometheus URL for monitoring
uiproxy:
service:
type: LoadBalancer # Service type for the UI proxy
## if type selected to NodePort then set nodePort value if required
# nodePort:
# port: 443
# targetPort: 8443
labels:
app: kubeslice-ui-proxy
annotations: {}
ingress:
## If true, ui‑proxy Ingress will be created
enabled: false
## Port on the Service to route to
servicePort: 443
## Ingress class name (e.g. "nginx"), if you're using a custom ingress controller
className: ""
hosts:
- host: ui.kubeslice.com # replace with your FQDN
paths:
- path: / # base path
pathType: Prefix # Prefix | Exact
## TLS configuration (you must create these Secrets ahead of time)
tls: []
# - hosts:
# - ui.kubeslice.com
# secretName: uitlssecret
annotations: []
## Extra labels to add onto the Ingress object
extraLabels: {}
apigw:
env:
- name: DCGM_METRIC_JOB_VALUE
value: nvidia-dcgm-exporter
egsCoreApis:
enabled: true # Enable EGS core APIs for the UI
service:
type: ClusterIP # Service type for the EGS core APIs
#### Helm Flags and Verification Settings ####
helm_flags: "--wait --timeout 5m --debug" # Additional Helm flags for the UI installation
verify_install: false # Verify the installation of the UI
verify_install_timeout: 50 # Timeout for the UI installation verification (in seconds)
skip_on_verify_fail: true # If UI verification fails, do not skip the step
#### Chart Source Settings ####
specific_use_local_charts: true # Override to use local charts for the UI
#### Kubeslice Worker Installation Settings ####
kubeslice_worker_egs:
- name: "worker-1" # Worker name
use_global_kubeconfig: true # Use global kubeconfig for this worker
kubeconfig: "" # Path to the kubeconfig file specific to the worker, if empty, uses the global kubeconfig
kubecontext: "" # Kubecontext specific to the worker; if empty, uses the global context
skip_installation: false # Do not skip the installation of the worker
specific_use_local_charts: true # Override to use local charts for this worker
namespace: "kubeslice-system" # Kubernetes namespace for this worker
release: "egs-worker" # Helm release name for the worker
chart: "kubeslice-worker-egs" # Helm chart name for the worker
#### Inline Helm Values for the Worker Chart ####
inline_values:
global:
imageRegistry: harbor.saas1.smart-scaler.io/avesha/aveshasystems # Docker registry for worker images
kubesliceNetworking:
enabled: true # Enable/disable network component installation
operator:
env:
- name: DCGM_EXPORTER_JOB_NAME
value: gpu-metrics
egs:
prometheusEndpoint: "http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090" # Prometheus endpoint
grafanaDashboardBaseUrl: "http://<grafana-lb>/d/Oxed_c6Wz" # Grafana dashboard base URL
egsAgent:
secretName: egs-agent-access
agentSecret:
endpoint: ""
key: ""
metrics:
insecure: true # Allow insecure connections for metrics
kserve:
enabled: true # Enable KServe for the worker
kserve: # KServe chart options
controller:
gateway:
domain: kubeslice.com
ingressGateway:
className: "nginx" # Ingress class name for the KServe gateway
egsGpuAgent:
env:
- name: REMOTE_HE_INFO
value: "nvidia-dcgm.egs-gpu-operator.svc.cluster.local:5555"
- name: HEALTH_CHECK_INTERVAL
value: "15m"
monitoring:
podMonitor:
enabled: true # Enable PodMonitor for Prometheus monitoring
namespace: egs-monitoring # Namespace where PodMonitor will be deployed
#### Helm Flags and Verification Settings ####
helm_flags: "--wait --timeout 5m --debug" # Additional Helm flags for the worker installation
verify_install: true # Verify the installation of the worker
verify_install_timeout: 60 # Timeout for the worker installation verification (in seconds)
skip_on_verify_fail: false # Do not skip if worker verification fails
#### Troubleshooting Settings ####
enable_troubleshoot: false # Enable troubleshooting mode for additional logs and checks
#### Local Monitoring Endpoint Settings (Optional) ####
# local_auto_fetch_endpoint: true # Enable automatic fetching of monitoring endpoints
# local_grafana_namespace: egs-monitoring # Namespace where Grafana is deployed
# local_grafana_service_name: prometheus-grafana # Service name for accessing Grafana
# local_grafana_service_type: ClusterIP # Service type for Grafana (accessible only within the cluster)
# local_prometheus_namespace: egs-monitoring # Namespace where Prometheus is deployed
# local_prometheus_service_name: prometheus-kube-prometheus-prometheus # Service name for accessing Prometheus
# local_prometheus_service_type: ClusterIP # Service type for Prometheus (accessible only within the cluster)
#### Define Projects ####
projects:
- name: "avesha" # Name of the Kubeslice project
username: "admin" # Username for accessing the Kubeslice project
#### Define Cluster Registration ####
cluster_registration:
- cluster_name: "worker-1" # Name of the cluster to be registered
project_name: "avesha" # Name of the project to associate with the cluster
#### Telemetry Settings ####
telemetry:
enabled: true # Enable telemetry for this cluster
endpoint: "http://prometheus-kube-prometheus-prometheus.egs-monitoring.svc.cluster.local:9090" # Telemetry endpoint
telemetryProvider: "prometheus" # Telemetry provider (Prometheus in this case)
#### Geo-Location Settings ####
geoLocation:
cloudProvider: "" # Cloud provider for this cluster (e.g., GCP)
cloudRegion: "" # Cloud region for this cluster (e.g., us-central1)
#### Define Additional Applications to Install ####
additional_apps:
- name: "gpu-operator" # Name of the application
skip_installation: false # Do not skip the installation of the GPU operator
use_global_kubeconfig: true # Use global kubeconfig for this application
kubeconfig: "" # Path to the kubeconfig file specific to this application
kubecontext: "" # Kubecontext specific to this application; uses global context if empty
namespace: "egs-gpu-operator" # Namespace where the GPU operator will be installed
release: "gpu-operator" # Helm release name for the GPU operator
chart: "gpu-operator" # Helm chart name for the GPU operator
repo_url: "https://helm.ngc.nvidia.com/nvidia" # Helm repository URL for the GPU operator
version: "v25.3.4" # Version of the GPU operator to install
specific_use_local_charts: true # Use local charts for this application
#### Inline Helm Values for GPU Operator ####
inline_values:
hostPaths:
driverInstallDir: "/home/kubernetes/bin/nvidia"
toolkit:
installDir: "/home/kubernetes/bin/nvidia"
cdi:
enabled: true
default: true
# mig:
# strategy: "mixed"
# migManager: # Enable to ensure that the node reboots and can apply the MIG configuration.
# env:
# - name: WITH_REBOOT
# value: "true"
driver:
enabled: false
helm_flags: "--debug" # Additional Helm flags for this application's installation
verify_install: false # Verify the installation of the GPU operator
verify_install_timeout: 600 # Timeout for verification (in seconds)
skip_on_verify_fail: true # Skip the step if verification fails
enable_troubleshoot: false # Enable troubleshooting mode for additional logs and checks
- name: "prometheus" # Name of the application
skip_installation: false # Do not skip the installation of Prometheus
use_global_kubeconfig: true # Use global kubeconfig for Prometheus
kubeconfig: "" # Path to the kubeconfig file specific to this application
kubecontext: "" # Kubecontext specific to this application; uses global context if empty
namespace: "egs-monitoring" # Namespace where Prometheus will be installed
release: "prometheus" # Helm release name for Prometheus
chart: "kube-prometheus-stack" # Helm chart name for Prometheus
repo_url: "https://prometheus-community.github.io/helm-charts" # Helm repository URL for Prometheus
version: "v45.0.0" # Version of the Prometheus stack to install
specific_use_local_charts: true # Use local charts for this application
values_file: "" # Path to an external values file, if any
#### Inline Helm Values for Prometheus ####
inline_values:
prometheus:
service:
type: ClusterIP # Service type for Prometheus
prometheusSpec:
storageSpec: {} # Placeholder for storage configuration
additionalScrapeConfigs:
- job_name: gpu-metrics
scrape_interval: 1s
metrics_path: /metrics
scheme: http
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- egs-gpu-operator
relabel_configs:
- source_labels: [__meta_kubernetes_endpoints_name]
action: drop
regex: .*-node-feature-discovery-master
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: kubernetes_node
grafana:
enabled: true # Enable Grafana
grafana.ini:
auth:
disable_login_form: true
disable_signout_menu: true
auth.anonymous:
enabled: true
org_role: Viewer
service:
type: ClusterIP # Service type for Grafana
persistence:
enabled: false # Disable persistence
size: 1Gi # Default persistence size
helm_flags: "--debug" # Additional Helm flags for this application's installation
verify_install: false # Verify the installation of Prometheus
verify_install_timeout: 600 # Timeout for verification (in seconds)
skip_on_verify_fail: true # Skip the step if verification fails
enable_troubleshoot: false # Enable troubleshooting mode for additional logs and checks
- name: "postgresql" # Name of the application
skip_installation: false # Do not skip the installation of PostgreSQL
use_global_kubeconfig: true # Use global kubeconfig for PostgreSQL
kubeconfig: "" # Path to the kubeconfig file specific to this application
kubecontext: "" # Kubecontext specific to this application; uses global context if empty
namespace: "kt-postgresql" # Namespace where PostgreSQL will be installed
release: "kt-postgresql" # Helm release name for PostgreSQL
chart: "postgresql" # Helm chart name for PostgreSQL
repo_url: "oci://registry-1.docker.io/bitnamicharts/postgresql" # Helm repository URL for PostgreSQL
version: "16.7.27" # Version of the PostgreSQL chart to install
specific_use_local_charts: true # Use local charts for this application
values_file: "" # Path to an external values file, if any
#### Inline Helm Values for PostgreSQL ####
inline_values:
auth:
postgresPassword: "postgres" # Explicit password (use if not relying on `existingSecret`)
username: "postgres" # Explicit username (fallback if `existingSecret` is not used)
password: "postgres" # Password for PostgreSQL (optional)
database: "postgres" # Default database to create
primary:
persistence:
enabled: true # Enable persistent storage for PostgreSQL
size: 10Gi # Size of the Persistent Volume Claim
helm_flags: "--wait --debug" # Additional Helm flags for this application's installation
verify_install: true # Verify the installation of PostgreSQL
verify_install_timeout: 600 # Timeout for verification (in seconds)
skip_on_verify_fail: false # Do not skip if verification fails
#### Define Custom Applications and Associated Manifests ####
manifests:
- appname: gpu-operator-quota # Name of the custom application
manifest: "" # URL or path to the manifest file; if empty, inline YAML is used
overrides_yaml: "" # Path to an external YAML file with overrides, if any
inline_yaml: | # Inline YAML content for this custom application
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-operator-quota
spec:
hard:
pods: 100 # Maximum number of pods
scopeSelector:
matchExpressions:
- operator: In
scopeName: PriorityClass # Define scope for PriorityClass
values:
- system-node-critical
- system-cluster-critical
use_global_kubeconfig: true # Use global kubeconfig for this application
skip_installation: false # Do not skip the installation of this application
verify_install: false # Verify the installation of this application
verify_install_timeout: 30 # Timeout for verification (in seconds)
skip_on_verify_fail: true # Skip if verification fails
namespace: egs-gpu-operator # Namespace for this application
kubeconfig: "" # Path to the kubeconfig file specific to this application
kubecontext: "" # Kubecontext specific to this application; uses global context if empty
- appname: nvidia-driver-installer # Name of the custom application
manifest: "https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml"
# URL to the manifest file
overrides_yaml: "" # Path to an external YAML file with overrides, if any
inline_yaml: null # Inline YAML content for this application
use_global_kubeconfig: true # Use global kubeconfig for this application
kubeconfig: "" # Path to the kubeconfig file specific to this application
kubecontext: "" # Kubecontext specific to this application; uses global context if empty
skip_installation: false # Do not skip the installation of this application
verify_install: false # Verify the installation of this application
verify_install_timeout: 200 # Timeout for verification (in seconds)
skip_on_verify_fail: true # Skip if verification fails
namespace: kube-system # Namespace for this application
#### Define Commands to Execute ####
commands:
- use_global_kubeconfig: true # Use global kubeconfig for these commands
kubeconfig: "" # Path to the kubeconfig file specific to these commands
kubecontext: "" # Kubecontext specific to these commands; uses global context if empty
skip_installation: false # Do not skip the execution of these commands
verify_install: false # Verify the execution of these commands
verify_install_timeout: 200 # Timeout for verification (in seconds)
skip_on_verify_fail: true # Skip if command verification fails
namespace: kube-system # Namespace context for these commands
command_stream: | # Commands to execute
kubectl create namespace egs-gpu-operator --dry-run=client -o yaml | kubectl apply -f - || true
kubectl get nodes || true
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity["nvidia.com/gpu"] != null) | .metadata.name' | xargs -I {} kubectl label nodes {} gke-no-default-nvidia-gpu-device-plugin=true cloud.google.com/gke-accelerator=true --overwrite || true
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity["nvidia.com/gpu"] != null) | .metadata.name' | xargs -I {} sh -c "echo {}; kubectl get node {} -o=jsonpath='{.metadata.labels}' | jq ." || true
kubectl get clusterpolicies.nvidia.com/cluster-policy --no-headers || true
#### Troubleshooting Mode Settings ####
enable_troubleshoot:
enabled: false # Global enable troubleshooting mode for additional logs and checks
#### Resource Types to Troubleshoot ####
resource_types:
- pods
- deployments
- daemonsets
- statefulsets
- replicasets
- jobs
- configmaps
- secrets
- services
- serviceaccounts
- roles
- rolebindings
- crds
#### API Groups to Troubleshoot ####
api_groups:
- controller.kubeslice.io
- worker.kubeslice.io
- inventory.kubeslice.io
- aiops.kubeslice.io
- networking.kubeslice.io
- monitoring.coreos.com
#### Upload Log Settings ####
upload_logs:
enabled: false # Enable log upload functionality
command: | # Command to execute for log upload
#### List of Required Binaries ####
required_binaries:
- yq # YAML processor
- helm # Helm package manager
- jq # JSON processor
- kubectl # Kubernetes command-line tool
#### Node Labeling Settings ####
add_node_label: false # Enable node labeling during installation
# Version of the input configuration file
version: "1.15.4"
🆕 Key Updates in Version 1.15.4
New Features
MinIO Replication Configuration
- Purpose: Enables data replication for disaster recovery
- Location:
kubeslice_controller_egs.inline_values.kubeslice.controller.replication.minio - Options:
install: Whether to install MinIO ("true"or"false")storage: Storage size for MinIO (default:1Gi)username/password: MinIO credentialsservice.type: Service type (LoadBalancer,ClusterIP,NodePort)
ServiceMonitor for Controller
- Purpose: Enables Prometheus monitoring via ServiceMonitor CRD
- Location:
kubeslice_controller_egs.inline_values.serviceMonitor - Options:
enabled: Enable/disable ServiceMonitornamespace: Namespace for ServiceMonitor deployment
PodMonitor for Worker
- Purpose: Enables Prometheus monitoring via PodMonitor CRD
- Location:
kubeslice_worker_egs[].inline_values.monitoring.podMonitor - Options:
enabled: Enable/disable PodMonitornamespace: Namespace for PodMonitor deployment
EGS GPU Agent Configuration
- Purpose: Configures GPU agent for health checks
- Location:
kubeslice_worker_egs[].inline_values.egsGpuAgent - Options:
REMOTE_HE_INFO: DCGM endpoint for GPU health checksHEALTH_CHECK_INTERVAL: Interval for health checks
Updated Versions
| Component | Previous Version | Current Version |
|---|---|---|
| GPU Operator | v24.9.1 | v25.3.4 |
| PostgreSQL | 16.2.1 | 16.7.27 |
Configuration Changes
Global Image Pull Secret
- Field renamed from
repositorytoregistryfor clarity
📋 YAML Field Reference Guide
This section provides detailed explanations of all configuration fields available in the EGS installer configuration file.
Global Configuration Fields
| Field | Description | Default/Example |
|---|---|---|
base_path |
Base path to the root directory of the cloned repository. If empty, the script uses the relative path to the script as the base path. | "" (empty string) |
precheck |
Run prechecks before installation to validate the environment and required binaries. | true |
kubeslice_precheck |
Run specific prechecks for Kubeslice components, including cluster access validation and node label checks. | true |
verify_install |
Enable installation verification globally, ensuring that all installed components are running as expected. | false |
verify_install_timeout |
Global timeout for verification in seconds. Determines how long the script waits for all components to be verified as running. | 600 (10 minutes) |
skip_on_verify_fail |
Decide whether to skip further steps or exit the script if verification fails globally. | true |
global_helm_repo_url |
URL of the global Helm repository from which charts will be pulled. | "" (empty string) |
global_helm_username |
Username for accessing the global Helm repository, if required. | "" |
global_helm_password |
Password for accessing the global Helm repository, if required. | "" |
readd_helm_repos |
Re-add Helm repositories if they already exist to ensure the latest repository configuration is used. | true |
required_binaries |
List of binaries that are required for the installation process. The script will check for these binaries and exit if any are missing. | yq, helm, jq, kubectl |
global_image_pull_secret |
Global Docker registry credentials for pulling images. | registry: "https://index.docker.io/v1/", username: "", password: "" |
global_kubeconfig |
Relative path to the global kubeconfig file (must be in the script directory) - Mandatory. | "" (empty string) |
global_kubecontext |
Global kubecontext to use - Mandatory. | "" (empty string) |
use_global_context |
Use the global kubecontext for all operations by default. | true |
enable_install_controller |
Enable the installation of the Kubeslice controller. | true |
enable_install_ui |
Enable the installation of the Kubeslice UI. | true |
enable_install_worker |
Enable the installation of Kubeslice workers. | true |
enable_install_additional_apps |
Enable the installation of additional applications (prometheus, gpu-operator, postgresql). | false |
enable_custom_apps |
Enable custom applications deployment, useful for NVIDIA driver installation. | true |
run_commands |
Enable execution of commands for configuring NVIDIA MIG and node labeling. | false |
enable_project_creation |
Enable project creation in Kubeslice. | true |
enable_cluster_registration |
Enable cluster registration in Kubeslice. | true |
enable_prepare_worker_values_file |
Prepare the worker values file for Helm charts. | true |
enable_autofetch_egsagent_endpoint_and_token |
Auto-fetch egsAgent token and endpoint values. | true |
global_auto_fetch_endpoint |
Enable automatic fetching of monitoring endpoints globally. | false |
global_grafana_namespace |
Namespace where Grafana is globally deployed. | egs-monitoring |
global_grafana_service_type |
Service type for Grafana (accessible only within the cluster). | ClusterIP |
global_grafana_service_name |
Service name for accessing Grafana globally. | prometheus-grafana |
global_prometheus_namespace |
Namespace where Prometheus is globally deployed. | egs-monitoring |
global_prometheus_service_name |
Service name for accessing Prometheus globally. | prometheus-kube-prometheus-prometheus |
global_prometheus_service_type |
Service type for Prometheus (accessible only within the cluster). | ClusterIP |
use_local_charts |
Use local Helm charts instead of fetching them from a repository. | true |
local_charts_path |
Path to the directory containing local Helm charts. | "charts" |
add_node_label |
Enable node labeling during installation. | false |
version |
Version of the input configuration file. | "1.15.4" |
Kubeslice Controller Configuration (kubeslice_controller_egs)
| Subfield | Description | Default/Example |
|---|---|---|
skip_installation |
Skip the installation of the Kubeslice controller if it’s already installed or not needed. | false |
use_global_kubeconfig |
Use the global kubeconfig file for the controller installation. | true |
specific_use_local_charts |
Use local charts specifically for the controller installation, overriding the global use_local_charts setting. |
true |
kubeconfig |
Path to the kubeconfig file specific to the controller, if empty, uses the global kubeconfig. | "" (empty string) |
kubecontext |
Kubecontext specific to the controller; if empty, uses the global context. | "" (empty string) |
namespace |
Kubernetes namespace where the Kubeslice controller will be installed. | "kubeslice-controller" |
release |
Helm release name for the Kubeslice controller. | "egs-controller" |
chart |
Helm chart name used for installing the Kubeslice controller. | "kubeslice-controller-egs" |
inline_values |
Inline values passed to the Helm chart during installation. | See inline values section below |
helm_flags |
Additional Helm flags for the controller installation. | "--wait --timeout 5m --debug" |
verify_install |
Verify the installation of the Kubeslice controller after deployment. | false |
verify_install_timeout |
Timeout for verifying the installation of the controller, in seconds. | 30 (30 seconds) |
skip_on_verify_fail |
Skip further steps or exit if the controller verification fails. | true |
enable_troubleshoot |
Enable troubleshooting mode for additional logs and checks. | false |
Kubeslice UI Configuration (kubeslice_ui_egs)
| Subfield | Description | Default/Example |
|---|---|---|
skip_installation |
Skip the installation of the Kubeslice UI if it’s already installed or not needed. | false |
use_global_kubeconfig |
Use the global kubeconfig file for the UI installation. | true |
kubeconfig |
Path to the kubeconfig file specific to the UI, if empty, uses the global kubeconfig. | "" (empty string) |
kubecontext |
Kubecontext specific to the UI; if empty, uses the global context. | "" (empty string) |
namespace |
Kubernetes namespace where the Kubeslice UI will be installed. | "kubeslice-controller" |
release |
Helm release name for the Kubeslice UI. | "egs-ui" |
chart |
Helm chart name used for installing the Kubeslice UI. | "kubeslice-ui-egs" |
inline_values |
Inline values passed to the Helm chart during installation. | See inline values section below |
helm_flags |
Additional Helm flags for the UI installation. | "--wait --timeout 5m --debug" |
verify_install |
Verify the installation of the Kubeslice UI after deployment. | false |
verify_install_timeout |
Timeout for verifying the installation of the UI, in seconds. | 50 (50 seconds) |
skip_on_verify_fail |
Skip further steps or exit if the UI verification fails. | true |
specific_use_local_charts |
Use local charts specifically for the UI installation. | true |
Kubeslice Worker Configuration (kubeslice_worker_egs)
| Subfield | Description | Default/Example |
|---|---|---|
name |
Name of the worker node configuration. | "worker-1" |
use_global_kubeconfig |
Use the global kubeconfig file for the worker installation. | true |
kubeconfig |
Path to the kubeconfig file specific to the worker, if empty, uses the global kubeconfig. | "" (empty string) |
kubecontext |
Kubecontext specific to the worker; if empty, uses the global context. | "" (empty string) |
skip_installation |
Skip the installation of the worker if it’s already installed or not needed. | false |
specific_use_local_charts |
Use local charts specifically for the worker installation. | true |
namespace |
Kubernetes namespace where the worker will be installed. | "kubeslice-system" |
release |
Helm release name for the worker. | "egs-worker" |
chart |
Helm chart name used for installing the worker. | "kubeslice-worker-egs" |
inline_values |
Inline values passed to the Helm chart during installation. | See inline values section below |
helm_flags |
Additional Helm flags for the worker installation. | "--wait --timeout 5m --debug" |
verify_install |
Verify the installation of the worker after deployment. | true |
verify_install_timeout |
Timeout for verifying the installation of the worker, in seconds. | 60 (60 seconds) |
skip_on_verify_fail |
Skip further steps or exit if the worker verification fails. | false |
enable_troubleshoot |
Enable troubleshooting mode for additional logs and checks. | false |
Additional Applications Configuration (additional_apps)
| Subfield | Description | Default/Example |
|---|---|---|
name |
Name of the application to install (e.g., gpu-operator, prometheus, postgresql). | "gpu-operator" |
skip_installation |
Skip the installation of this application if it’s already installed or not needed. | false |
use_global_kubeconfig |
Use the global kubeconfig file for this application installation. | true |
kubeconfig |
Path to the kubeconfig file specific to this application. | "" (empty string) |
kubecontext |
Kubecontext specific to this application; uses global context if empty. | "" (empty string) |
namespace |
Kubernetes namespace where the application will be installed. | "egs-gpu-operator" |
release |
Helm release name for the application. | "gpu-operator" |
chart |
Helm chart name for the application. | "gpu-operator" |
repo_url |
Helm repository URL for the application. | "https://helm.ngc.nvidia.com/nvidia" |
version |
Version of the application to install. | "v25.3.4" |
specific_use_local_charts |
Use local charts for this application. | true |
values_file |
Path to an external values file, if any. | "" |
inline_values |
Inline values passed to the Helm chart during installation. | See inline values section below |
helm_flags |
Additional Helm flags for the application installation. | "--debug" |
verify_install |
Verify the installation of the application after deployment. | false |
verify_install_timeout |
Timeout for verifying the installation, in seconds. | 600 |
skip_on_verify_fail |
Skip the step if verification fails. | true |
enable_troubleshoot |
Enable troubleshooting mode for additional logs and checks. | false |
Custom Application Manifests (manifests)
| Field | Description | Type | Required | Example |
|---|---|---|---|---|
manifests |
A list of manifest configurations. Each entry defines how a specific Kubernetes manifest should be applied. | list |
Yes | See below for individual fields. |
manifests[].appname |
The name of the application or resource. Used for logging and identification purposes. | string |
Yes | gpu-operator-quota |
manifests[].manifest |
The path to the Kubernetes manifest file. Can be a local file or an HTTPS URL. | string |
No | manifests/gpu-quota.yaml or https://raw.githubusercontent.com/.../daemonset.yaml |
manifests[].overrides_yaml |
The path to a YAML file containing overrides for the base manifest. Merges with the base manifest before applying. | string |
No | overrides/gpu-quota-overrides.yaml |
manifests[].inline_yaml |
Inline YAML content to be merged with the base manifest. Allows for quick, in-line customization without separate files. | string (YAML) |
No | See inline YAML example below. |
manifests[].use_global_kubeconfig |
Determines whether the global kubeconfig and context should be used. If false, specific kubeconfig and context must be provided. |
boolean |
Yes | true |
manifests[].kubeconfig |
Path to a specific Kubernetes configuration file to be used instead of the global kubeconfig. | string |
No | /path/to/specific/kubeconfig |
manifests[].kubecontext |
The context name in the specific Kubernetes configuration file to be used for this manifest. | string |
No | specific-context |
manifests[].namespace |
The Kubernetes namespace where the manifest should be applied. | string |
Yes | egs-gpu-operator |
manifests[].skip_installation |
Whether to skip applying this manifest. | boolean |
No | false |
manifests[].verify_install |
Whether to verify the application of this manifest. | boolean |
No | false |
manifests[].verify_install_timeout |
Timeout for verification in seconds. | integer |
No | 30 |
manifests[].skip_on_verify_fail |
Whether to skip if verification fails. | boolean |
No | true |
Accessing Grafana Dashboard
After successful installation, you can access the Grafana dashboard:
# Port forward to Grafana
kubectl port-forward svc/prometheus-grafana 3000:80 -n egs-monitoring
# Access Grafana at http://localhost:3000
# Default credentials: admin / prom-operator
Note: The default credentials are set by the kube-prometheus-stack Helm chart. For production deployments, consider changing these credentials for security.