Big Monitoring, Small Budget: Powering Observability on Kubernetes with Prometheus, Grafana & Mimir
Overview
When I was working at a startup, our goal was to set up a monitoring solution to track infrastructure components like virtual machines and applications – all while staying within a limited budget and a short timeframe. To achieve this, I chose open-source tools such as Prometheus, Grafana, Mimir, and Nginx. Since we were hosted on Google Cloud, the easiest way to get started with infrastructure and application monitoring using these tools was by deploying them on Google Kubernetes Engine (GKE). However, this guide can easily be adapted to set up monitoring on any cloud platform.
The open-source monitoring stack I selected includes:
- Prometheus: A time-series database (TSDB) that collects and stores metrics from infrastructure and applications.
- Mimir: A scalable, long-term storage backend that extends Prometheus by handling large volumes of time-series data.
- Grafana: A rich visualization and monitoring tool that displays collected metrics in dashboards and supports alerting based on thresholds.
Component Descriptions and Flow:
- IoT Devices, Servers, and Applications: These are the data sources emitting metrics such as CPU usage, memory utilization, and custom application-specific metrics.
- Prometheus (TSDB): Collects and stores time-series metrics from IoT devices, servers, and applications.
- Grafana Mimir (Scaling Layer): Extends Prometheus by providing scalable, durable storage for large-scale metric workloads.
- Grafana (Visualization): Displays collected metrics in customizable dashboards and graphs and provides alerting capabilities.
- NGINX (Ingress Controller): Acts as a reverse proxy and secure access point to the Grafana and Prometheus user interfaces.
- Kubernetes: Orchestrates the entire monitoring stack as containerized services.
- Google Cloud Platform (GCP): Hosts the Kubernetes cluster and the supporting infrastructure.
Cluster Creation:
Below is the Terraform code to create a private Kubernetes cluster in GCP. A similar approach can be used to create private clusters in other cloud environments as well.
Note: In this setup, we are using a shared network from another project, so appropriate IAM permissions and network configurations must be applied.
GitHub code repo: https://github.com/pradeep-gaddamidi/Monitoring
Create a Kubernetes cluster using Terraform:
cluster.tf
# google_client_config and kubernetes provider must be explicitly specified like the following.
data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${module.gke.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(module.gke.ca_certificate)
}
# Use selected cluster configuration
module "gke" {
  source  = "terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster"
  version = "30.2.0"

  project_id                 = var.cluster_config[local.env].project_id
  name                       = var.cluster_config[local.env].name
  region                     = var.cluster_config[local.env].region
  zones                      = var.cluster_config[local.env].zones
  network                    = var.cluster_config[local.env].network
  network_project_id         = var.cluster_config[local.env].network_project_id
  subnetwork                 = var.cluster_config[local.env].subnetwork
  ip_range_pods              = "${var.cluster_config[local.env].subnetwork}-pods"
  ip_range_services          = "${var.cluster_config[local.env].subnetwork}-services"
  http_load_balancing        = true
  enable_l4_ilb_subsetting   = true
  network_policy             = false
  horizontal_pod_autoscaling = true
  filestore_csi_driver       = false
  enable_private_endpoint    = true
  enable_private_nodes       = true
  remove_default_node_pool   = true
  master_ipv4_cidr_block     = "172.16.0.0/28"

  node_pools = [
    {
      name               = "node-pool"
      machine_type       = var.cluster_config[local.env].machine_type
      node_locations     = join(",", var.cluster_config[local.env].zones)
      min_count          = 1
      max_count          = 1
      local_ssd_count    = 0
      spot               = false
      disk_size_gb       = var.cluster_config[local.env].disk_size_gb
      disk_type          = "pd-standard"
      image_type         = "COS_CONTAINERD"
      enable_gcfs        = false
      enable_gvnic       = false
      logging_variant    = "DEFAULT"
      auto_repair        = true
      auto_upgrade       = true
      service_account    = google_service_account.gke.email
      preemptible        = false
      initial_node_count = 1
      autoscaling        = false
    },
  ]

  node_pools_oauth_scopes = {
    all = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }

  cluster_resource_labels = {
    environment   = local.env
    project       = var.cluster_config[local.env].project_id
    resource_type = "gke"
    resource_name = var.cluster_config[local.env].name
    customer      = "all"
  }

  node_pools_labels = {
    all = {}
    default-node-pool = {
      default-node-pool = true
    }
  }

  node_pools_metadata = {
    all = {}
    default-node-pool = {
      node-pool-metadata-custom-value = "node-pool"
    }
  }

  node_pools_taints = {
    all = []
    default-node-pool = [
      {
        key    = "default-node-pool"
        value  = true
        effect = "PREFER_NO_SCHEDULE"
      },
    ]
  }

  node_pools_tags = {
    all = []
    default-node-pool = [
      "default-node-pool",
    ]
  }

  master_authorized_networks = [
    {
      cidr_block   = var.cluster_config[local.env].subnetwork_allow
      display_name = "VPC"
    }
  ]
}
resource "google_compute_subnetwork_iam_member" "network_user_service_account" {
for_each = { for user in var.cluster_config[local.env].network_user : user => user }
project = var.cluster_config[local.env].network_project_id
subnetwork = var.cluster_config[local.env].subnetwork
region = var.cluster_config[local.env].region
role = "roles/compute.networkUser"
member = "serviceAccount:${each.value}"
}
resource "google_project_iam_member" "hostServiceAgentUser_service_account" {
for_each = { for user in var.cluster_config[local.env].hostServiceAgent_user : user => user }
project = var.cluster_config[local.env].network_project_id
member = "serviceAccount:${each.value}"
role = "roles/container.hostServiceAgentUser"
}
resource "google_project_iam_member" "serviceAgent_service_account" {
for_each = { for user in var.cluster_config[local.env].serviceAgent_user : user => user }
project = var.cluster_config[local.env].network_project_id
member = "serviceAccount:${each.value}"
role = "roles/container.serviceAgent"
}
In the Terraform configuration above, we use the publicly available Google Terraform module `terraform-google-modules/kubernetes-engine/google//modules/beta-private-cluster`. This lets us leverage well-maintained, community-supported code instead of developing and maintaining complex infrastructure code from scratch.
The permissions required for the service accounts used in this Terraform configuration are detailed below:
| Role | Why it’s needed for GKE |
|---|---|
| roles/compute.networkUser | Allows nodes and load balancers to use the subnetwork. |
| roles/container.hostServiceAgentUser | Allows GKE to configure networking (firewalls, IPs, etc.) in the host/shared VPC. |
| roles/container.serviceAgent | Allows the GKE control plane to manage itself and use the necessary GCP APIs. |
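If you want to double-check these grants outside Terraform, a quick sketch with gcloud (the project ID is taken from the variables below; adjust the filter to your own robot accounts):

# List the container-engine-robot bindings in the shared-VPC host project.
gcloud projects get-iam-policy nonprod-networking \
  --flatten="bindings[].members" \
  --filter="bindings.members:container-engine-robot" \
  --format="table(bindings.role, bindings.members)"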
Terraform Variables:
Below are the variables I used in the Terraform code.
variables.tf
variable "cluster_config" {
description = "Cluster configuration per environment"
type = map(object({
project_id = string
name = string
description = string
regional = bool
region = string
zones = list(string)
network = string
subnetwork = string
network_project_id = string
machine_type = string
disk_size_gb = number
subnetwork_allow = string
bucket_names = list(string)
host_project = string
network_user = list(string)
hostServiceAgent_user = list(string)
serviceAgent_user = list(string)
static_ips = list(string)
# Add more attributes as needed
}))
default = {
nonprod-mon = {
project_id = "nonprod-monitoring"
name = "cluster-nonprod"
description = "nonprod cluster"
regional = true
region = "us-central1"
zones = ["us-central1-a", "us-central1-b", "us-central1-c"]
network = "nonprod-vpc"
subnetwork = "nonprod-us-central1-sb01"
subnetwork_allow = "10.226.0.0/22"
network_project_id = "nonprod-networking"
machine_type = "e2-custom-4-10240"
disk_size_gb = "50"
bucket_names = ["mon_blocks_storage", "mon_alertmanager_storage", "mon_ruler_storage"]
host_project = "nonprod-networking"
network_user = ["service-123456789123@container-engine-robot.iam.gserviceaccount.com", "[email protected]"]
hostServiceAgent_user = ["service-123456789123@container-engine-robot.iam.gserviceaccount.com"]
serviceAgent_user = ["service-123456789123@container-engine-robot.iam.gserviceaccount.com"]
static_ips = ["internal-ingress"]
}
prod-mon = {
project_id = "prod-monitoring"
name = "cluster-prod"
description = "prod cluster"
regional = true
region = "us-central1"
zones = ["us-central1-a", "us-central1-b", "us-central1-c"]
network = "prod-vpc"
subnetwork = "prod-us-central1-sb01"
subnetwork_allow = "10.227.0.0/22"
network_project_id = "prod-networking"
machine_type = "n2-custom-4-32768"
disk_size_gb = "100"
bucket_names = ["mon_blocks_storage", "mon_alertmanager_storage", "mon_ruler_storage"]
host_project = "prod-networking"
network_user = ["service-123456789012@container-engine-robot.iam.gserviceaccount.com", "[email protected]"]
hostServiceAgent_user = ["service-123456789012@container-engine-robot.iam.gserviceaccount.com"]
serviceAgent_user = ["service-123456789012@container-engine-robot.iam.gserviceaccount.com"]
static_ips = ["internal-ingress"]
}
}
}
Terraform state:
A GCS bucket is used to store the Terraform state.
backend.tf
terraform {
  backend "gcs" {
    bucket = "environments-state"
    prefix = "terraform/state/gke"
  }
}
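With the backend defined, run terraform init so state is written to the bucket (the bucket itself must already exist; this configuration does not create it):

terraform init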
Terraform workspace:
I am using Terraform workspaces, so make sure to select your workspace before running the Terraform code. For example, you can select the workspace with the following command:
terraform workspace select nonprod-mon
In the main.tf file, I defined the workspace like this:
main.tf
locals {
  env = terraform.workspace
}
This automatically sets the `env` local variable to match the current Terraform workspace (e.g., `nonprod-mon`, `prod-mon`), allowing the configuration to dynamically adjust based on the selected environment.
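Putting it together, a run against the nonprod environment might look like this (the workspace name matches the `cluster_config` map key above):

# Create the workspace once; select it on later runs.
terraform workspace new nonprod-mon || terraform workspace select nonprod-mon
terraform plan
terraform apply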
Static IPs:
We need static IP addresses to configure DNS records, allowing us to access services using domain names such as prometheus.company.com or grafana.company.com.
static_ips.tf
data "google_compute_subnetwork" "subnet" {
name = var.cluster_config[local.env].subnetwork
project = var.cluster_config[local.env].network_project_id
region = var.cluster_config[local.env].region
}
resource "google_compute_address" "static_ips" {
for_each = { for ip in var.cluster_config[local.env].static_ips : ip => ip }
name = each.value
address_type = "INTERNAL"
region = var.cluster_config[local.env].region
subnetwork = data.google_compute_subnetwork.subnet.self_link
project = var.cluster_config[local.env].project_id
}
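After applying, you can confirm the reservation with gcloud (names and projects follow the variables above):

# Show the reserved internal address for the ingress.
gcloud compute addresses describe internal-ingress \
  --region=us-central1 --project=prod-monitoring --format="value(address)"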
Kubernetes Service Account:
We are using a dedicated service account for the Kubernetes nodes to manage their permissions securely and follow best practices.
service_account.tf
resource "google_service_account" "gke" {
account_id = "gke-${local.env}"
project = var.cluster_config[local.env].project_id
display_name = "Service account for gke"
}
Mimir GCS buckets:
We need Google Cloud Storage (GCS) buckets for Mimir’s long-term metric storage, allowing us to efficiently scale and persist large volumes of time-series data.
gcs_buckets.tf
module "gcs_buckets" {
source = "terraform-google-modules/cloud-storage/google"
version = "~> 5.0"
project_id = var.cluster_config[local.env].project_id
location = "US"
storage_class = "STANDARD"
names = var.cluster_config[local.env].bucket_names
labels = {
environment = local.env
project = var.cluster_config[local.env].project_id
resource_type = "gcs"
customer = "all"
}
}
resource "google_storage_bucket_iam_binding" "buckets" {
for_each = { for bucket in var.cluster_config[local.env].bucket_names : bucket => bucket }
bucket = each.value
role = "roles/storage.objectAdmin"
members = [
"serviceAccount:${google_service_account.gke.email}"
]
depends_on = [module.gcs_buckets]
}
Namespaces (Kubernetes):
Once the cluster is set up, create the namespaces the Helm releases below are installed into (commands follow this list):
- monitoring (Prometheus and Grafana)
- mimir
- ingress (NGINX Ingress Controller)
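Creating them is a one-liner each:

kubectl create namespace monitoring
kubectl create namespace mimir
kubectl create namespace ingress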
Installation (Helm Charts):
Use Helm charts to install the various monitoring software in their respective namespaces. Below is an example for Prometheus, but you can apply this approach to install other software such as Grafana, Mimir, and the NGINX Ingress Controller.
Prometheus – https://github.com/prometheus-community/helm-charts
Grafana – https://github.com/grafana/helm-charts
Mimir – https://grafana.com/docs/helm-charts/mimir-distributed/latest/get-started-helm-charts/
NGINX controller – https://github.com/kubernetes/ingress-nginx/tree/main/charts/ingress-nginx
Helm Commands:
First, add the Prometheus Helm repository and update it:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Next, pull the Prometheus chart:
helm pull prometheus-community/prometheus --untar
This will create a `prometheus/` directory in your current working directory, containing the chart contents. You can modify the default `prometheus/values.yaml` file before installing to set custom configurations such as the admin password, persistence settings, and service type.
Now, you can install Prometheus with a custom `values_prod.yaml` file:
helm install prometheus ./prometheus -f prometheus/values_prod.yaml -n monitoring
Similarly, you can install the other components:
helm install grafana ./grafana -f grafana/values_prod.yaml -n monitoring
helm install mimir ./mimir -f mimir/values_prod.yaml -f mimir/capped-small.yaml -n mimir
helm install nginx-ingress ./nginx-ingress -f nginx/values_prod.yaml -n ingress
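Once the installs finish, a quick status check confirms the releases and pods are healthy:

helm list --all-namespaces
kubectl get pods -n monitoring
kubectl get pods -n mimir
kubectl get pods -n ingress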
Configuration (values.yaml files):
Ingress:
An Ingress controller is required to manage Ingress resources. Simply creating an Ingress resource will have no effect unless there is an Ingress controller in place. While there are many Ingress controllers available, including GKE’s built-in Ingress, I’ve chosen the NGINX Ingress Controller for various reasons.
Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined within the Ingress resource.
In this tutorial, we are using an internal IP to expose the services. Make sure to configure the following in your Helm `values.yaml` to ensure proper routing and access:
Helm values.yaml
controller:
  service:
    internal:
      enabled: true
      annotations:
        # Create internal LB. More information: https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing
        # For GKE versions 1.17 and later
        networking.gke.io/load-balancer-type: "Internal"
        # For earlier versions
        # cloud.google.com/load-balancer-type: "Internal"
        # Any other annotation can be declared here.
Also, provide the static internal IP you created earlier via Terraform in the `loadBalancerIP` field, like so:
Helm values.yaml
loadBalancerIP: "10.x.x.x"
Once the NGINX Ingress controller is installed, it will create a cloud load balancer with your cloud provider (e.g., GCP). Afterward, you need to create Ingress resources to route traffic to the appropriate destinations, such as Grafana and Prometheus.
The Ingress spec contains all the information needed to configure a load balancer or proxy server. To ensure traffic is routed correctly, include either the `ingressClassName: nginx` spec field or the `kubernetes.io/ingress.class: nginx` annotation in your Ingress resources.
Ingress resource:
my-ingress-prod.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: company-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 80
    - host: prometheus.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-server
                port:
                  number: 80
What it does:
- Ingress controller: The `nginx` Ingress class hands routing of external traffic to the NGINX Ingress Controller.
- Routing rules: Traffic for grafana.company.com is routed to the `grafana` service on port 80, and traffic for prometheus.company.com is routed to the `prometheus-server` service on port 80.
- Path handling: Both rules use `path: /` with `pathType: Prefix`, so any URL path is forwarded to the respective service (Grafana or Prometheus).
This configuration ensures that incoming traffic to the specified domains is directed to the correct service inside your Kubernetes cluster, based on the hostname and path.
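Apply the manifest and, from a host inside the VPC, verify the controller answers for each hostname (before DNS exists, point curl at the internal LB IP directly):

kubectl apply -f my-ingress-prod.yaml
# Substitute the static internal IP reserved earlier.
curl -H "Host: grafana.company.com" http://10.x.x.x/
curl -H "Host: prometheus.company.com" http://10.x.x.x/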
Prometheus:
If you’re using the pull model, Prometheus needs to collect metrics from your targets. To configure this, set up the scrape configuration as follows in your `values.yaml` file:
Helm values.yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
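Beyond self-scraping, each additional target gets its own job. As a sketch (this job is the community chart's common pod-discovery pattern, not something specific to this setup), pods annotated with prometheus.io/scrape: "true" can be discovered automatically:

  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via the scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"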
If Prometheus requires scaling, it needs to forward metrics to Mimir for long-term storage. You can configure Prometheus to send a copy of the metrics to Mimir using the `remoteWrite` section in the Prometheus Helm `values.yaml` file, like so:
Helm values.yaml
remoteWrite:
  - url: http://company-mimir-nginx.mimir.svc.cluster.local:80/api/v1/push
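If the default sender can’t keep up under heavy load, Prometheus’s remote-write queue can be tuned in the same section; the numbers below are illustrative assumptions, not recommendations from this setup:

remoteWrite:
  - url: http://company-mimir-nginx.mimir.svc.cluster.local:80/api/v1/push
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 2000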
The idea behind using Mimir is to provide long-term storage for Prometheus metrics. This setup allows you to scale Prometheus as needed while avoiding a single point of failure.
I recommend enabling persistent volumes (PVC) for Prometheus pods. This ensures your data is not lost when `remoteWrite` is enabled, and, if you run only a single Prometheus instance, gives you a fallback in the event Mimir encounters issues. Enabling persistent storage in the Helm `values.yaml` file ensures the data is retained through pod restarts.
Helm values.yaml
persistentVolume:
  ## If true, Prometheus server will create/use a Persistent Volume Claim
  ## If false, use emptyDir
  ##
  enabled: true
  accessModes:
    - ReadWriteOnce
  ## Prometheus server data Persistent Volume mount root path
  ##
  mountPath: /data
  ## Prometheus server data Persistent Volume size
  ##
  size: 500Gi
  # storageClass: "-"
  storageClass: "persistent-disk-rwo"
Set the retention time carefully (`--storage.tsdb.retention.time`) in the Helm `values.yaml`:
Helm values.yaml
## Prometheus data retention period (default if not specified is 15 days)
##
retention: "90d"
Adjust the values above to suit your needs.
Mimir:
Grafana Mimir is an open-source, horizontally scalable, multi-tenant time-series database and monitoring platform. Mimir is fully compatible with Prometheus: it supports the Prometheus data model, query language (PromQL), and scraping mechanism, and can serve as a backend for Prometheus metrics, enabling you to scale beyond what a single Prometheus server can handle. With efficient data storage and compression techniques, Mimir helps reduce the cost of storing long-term metric data. Mimir is useful when you need to:
- Store large volumes of time-series data long-term.
- Scale Prometheus beyond a single instance.
- Use isolated storage with multi-tenancy support.
- Ensure distributed, fault-tolerant metric storage.
Grafana Mimir’s architecture is based on the principles of distributed systems, using components such as:
- Distributor: Receives and writes data from Prometheus instances or any compatible scraper.
- Ingester: Stores and processes incoming data. Data is held temporarily in the ingester until it is flushed to long-term storage.
- Store Gateway: Handles retrieving data from persistent storage and serves queries.
- Query Frontend: Manages query execution and routing, ensuring that queries are distributed across the available Mimir instances.
- Storage Backend: In this tutorial, Mimir uses GCS as its storage backend.
The GCS storage backends used by Mimir are `mon_blocks_storage`, `mon_alertmanager_storage`, and `mon_ruler_storage`, which we configured in our Terraform code.
In the Helm `values.yaml` file, configure the GCS buckets for storage along with the credentials needed to access them. This allows Mimir to interact with Google Cloud Storage for long-term metric storage.
Helm values.yaml
# -- Additional structured values on top of the text based 'mimir.config'. Applied after the text based config is evaluated for templates. Enables adding and modifying YAML elements in the evaluated 'mimir.config'.
# To modify the resulting configuration, either copy and alter 'mimir.config' as a whole or use the 'mimir.structuredConfig' to add and modify certain YAML elements.
structuredConfig:
  limits:
    out_of_order_time_window: 1h
    max_label_names_per_series: 100
  common:
    storage:
      backend: gcs
      gcs:
        service_account: |
          {
            "type": "service_account",
            "project_id": "prod-monitoring",
            "private_key_id": "50885800",
            "private_key": "xxxxx-----PRIVATE KEY-----\n",
            "client_email": "[email protected]",
            "client_id": "108488885",
            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
            "token_uri": "https://oauth2.googleapis.com/token",
            "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
            "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/gke-prod%40prod-monitoring.iam.gserviceaccount.com",
            "universe_domain": "googleapis.com"
          }
  blocks_storage:
    backend: gcs
    gcs:
      bucket_name: mon_blocks_storage
  alertmanager_storage:
    gcs:
      bucket_name: mon_alertmanager_storage
  ruler_storage:
    gcs:
      bucket_name: mon_ruler_storage
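As an aside, embedding a long-lived JSON key in values works but is worth reconsidering: if your cluster has GKE Workload Identity enabled, the key block can be dropped entirely. A hedged sketch (verify the chart's serviceAccount annotation support for your chart version):

# Let Mimir pods authenticate as the GCP service account without a key file.
serviceAccount:
  create: true
  annotations:
    iam.gke.io/gcp-service-account: gke-prod@prod-monitoring.iam.gserviceaccount.com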
Based on your requirements, use either the `capped-small.yaml` or `capped-large.yaml` values file to assign compute resources to the Mimir components. These files configure the CPU and memory limits for Mimir according to the scale of your deployment.
Additionally, Mimir has an active community on Slack where you can seek help from other members while setting it up in your cluster.
Grafana:
In Grafana, add Mimir as a datasource for long-term metric storage. If you’re using only a single Prometheus instance, you can also add Prometheus as a datasource for backup purposes. Once the datasources are set up, you can visualize the metrics, configure dashboards, and create alerts in Grafana.
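Datasources can be added in the UI or provisioned through the Grafana chart's values. A minimal sketch, assuming Mimir's Prometheus-compatible API is exposed via its nginx service as in the remoteWrite URL above (verify the URLs against your release names):

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      # Mimir speaks PromQL, so it is registered as a Prometheus datasource.
      - name: Mimir
        type: prometheus
        url: http://company-mimir-nginx.mimir.svc.cluster.local:80/prometheus
        isDefault: true
      - name: Prometheus
        type: prometheus
        url: http://prometheus-server.monitoring.svc.cluster.local:80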
Additionally, enable Persistent Volume Claims (PVC) for Grafana to ensure that data is not lost if the pod restarts. This will help retain the configuration and data even through pod lifecycle changes.
Helm values.yaml
## Enable persistence using Persistent Volume Claims
## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
##
persistence:
  type: pvc
  enabled: true
  storageClassName: persistent-disk-rwo
  accessModes:
    - ReadWriteOnce
  size: 10Gi
DNS:
Once everything is installed and configured, configure the DNS records (e.g., prometheus.company.com and grafana.company.com) to point to the static internal IP (`10.x.x.x`) you created earlier with Terraform.
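If your zone lives in Cloud DNS, a hedged example of creating one record (the private zone name here is an assumption):

# Point grafana.company.com at the internal ingress IP.
gcloud dns record-sets create grafana.company.com. \
  --zone=company-internal --type=A --ttl=300 --rrdatas=10.x.x.x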
After completing this configuration, you should be able to access the metrics in Grafana. From there, you can visualize the data, create custom dashboards, and set up alerts.
For more details on creating dashboards and visualizing data in Grafana, refer to the Grafana documentation: https://grafana.com/docs/grafana/latest/dashboards/
Good luck! Feel free to connect with me on LinkedIn.