Module 1: User workload monitoring

Your organization needs to improve visibility into microservices application health and performance. The current infrastructure monitoring does not provide insight into application-level metrics or business-critical functionality.

You will implement user workload monitoring to track application performance, latency, and error rates. Custom metrics and alerts will enable proactive detection of application issues before they impact users.

In this module, you’ll learn how to configure Prometheus monitoring for custom application metrics and create alerts that notify you when applications deviate from expected behavior.

Table of Contents

Learning objectives
Understanding observability foundations
Exercise 1: Explore the monitoring stack
Exercise 2: Deploy a sample application with metrics
Exercise 3: Configure ServiceMonitor to collect metrics
Exercise 4: Create custom dashboards with COO
Exercise 5: Configure alerting rules
Learning outcomes
Module summary

Learning objectives

By the end of this module, you’ll be able to:

Understand the 3 pillars of observability (metrics, logs, traces) and when to use each
Configure ServiceMonitor resources to collect custom application metrics
Write PromQL queries to analyze application performance
Create declarative Perses dashboards for application visibility in the OpenShift console
Configure Alertmanager rules to proactively notify you of application issues

Understanding observability foundations

Before implementing monitoring, you need to understand the observability landscape.

The 3 pillars of observability

Use all 3 signal types together:

Metrics: Numeric trends over time (request rate, latency percentiles, CPU)
- Best for: Detection, thresholds, alerting
Logs: Timestamped events with context (errors, transaction IDs, user actions)
- Best for: Root cause details and audit trails
Traces: End-to-end request flow across services
- Best for: Finding latency bottlenecks and dependency issues

Practical workflow:

Start with metrics to detect a problem
Use logs to identify what failed
Use traces to locate where it failed

Observability methodologies

Use these common methods to decide which metrics matter most:

RED (services): Rate, Errors, Duration
USE (infrastructure): Utilization, Saturation, Errors
Golden Signals: Latency, Traffic, Errors, Saturation

In this workshop, you’ll primarily use the RED method for monitoring the sample application’s HTTP requests (rate, errors, duration). The Prometheus metrics http_requests_total and http_request_duration_seconds directly support this methodology.

OpenShift monitoring architecture

OpenShift provides 3 complementary monitoring options:

Platform monitoring (CMO)
- Namespace: openshift-monitoring
- Purpose: Cluster and control-plane health
- Managed by: Cluster admins
User workload monitoring (CMO)
- Namespace: openshift-user-workload-monitoring
- Purpose: Cluster-wide application monitoring using ServiceMonitor
- Trade-off: Easy to use, limited customization
Cluster Observability Operator (COO)
- Purpose: Independent, namespace-scoped monitoring stacks via MonitoringStack
- Discovery: Label-based selectors (for example, monitoring.rhobs/stack: observability-stack)
- Components: Prometheus plus optional Thanos Querier, Alertmanager, and UI plugins (including Perses dashboards)
- Best for: Multi-tenant teams, custom dashboards, and flexible retention/configuration

CMO and COO can run together without conflict.

In this workshop, you’ll use both user workload monitoring (Exercises 1-3) and COO (Exercise 4 onward) to understand the full range of monitoring options available in OpenShift.

Exercise 1: Explore the monitoring stack

You need to verify that user workload monitoring is enabled and understand the components deployed in your cluster.

The observability stack was pre-configured via GitOps, so Prometheus and Alertmanager should already be running.

Steps

Log into the OpenShift console at OpenShift Console

Use the credentials provided in your lab interface:
- Username: %OPENSHIFT_USERNAME%

Verify user workload monitoring pods are running:

oc get pods -n openshift-user-workload-monitoring

Expected output

NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-xxxxx              2/2     Running   0          1h
prometheus-user-workload-0             6/6     Running   0          1h
prometheus-user-workload-1             6/6     Running   0          1h
thanos-ruler-user-workload-0           3/3     Running   0          1h
thanos-ruler-user-workload-1           3/3     Running   0          1h

In the OpenShift console, navigate to the project openshift-user-workload-monitoring → Workloads and verify user workload monitoring is enabled:
Access the OpenShift console monitoring interface:
- Navigate to Observe → Metrics in the left navigation
- This opens the Prometheus query interface, make sure the namespace filter is set to openshift-user-workload-monitoring.
Run your first PromQL query to see cluster metrics:

In the query box, enter:
```
up{namespace="openshift-user-workload-monitoring"}
```
Click Run Queries

This shows all targets being scraped in the user workload monitoring namespace. Each target should show value: 1 (up and healthy).

User workload monitoring targets in Prometheus

Verify

Check that your monitoring stack is operational:

✓ All pods in openshift-user-workload-monitoring are Running
✓ User workload monitoring is enabled (enableUserWorkload: true)
✓ Prometheus query interface is accessible
✓ PromQL query returns results

What you learned: OpenShift includes CMO-managed platform and user workload monitoring, plus optional COO stacks for advanced use cases. In this exercise, you verified the CMO user workload stack.

Troubleshooting

Issue: PromQL query returns no data

Solution: Allow a few minutes for Prometheus to scrape targets. Prometheus scrapes metrics every 30 seconds.

Exercise 2: Deploy a sample application with metrics

To practice monitoring, you need an application that exposes metrics. You’ll deploy a sample application that provides Prometheus-compatible metrics.

Your workshop namespace has been pre-created with special permissions to create ServiceMonitor and PrometheusRule resources. These permissions are required for this workshop and are granted via a custom ClusterRole.

Steps

Create a new project for your sample application:
```
oc new-project %OPENSHIFT_USERNAME%-observability-demo
```
If the namespace already exists, you can switch to it with oc project %OPENSHIFT_USERNAME%-observability-demo. The namespace has been pre-configured with the necessary monitoring permissions.

Deploy a sample application that exposes metrics:

cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: %OPENSHIFT_USERNAME%-observability-demo
  labels:
    app: sample-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: app
        image: quay.io/brancz/prometheus-example-app:v0.3.0
        ports:
        - containerPort: 8080
          name: http
      - name: debug
        image: registry.access.redhat.com/ubi9/ubi-minimal:latest
        command: ["/bin/sh", "-c", "while true; do sleep 30; done"]
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "64Mi"
            cpu: "100m"
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  namespace: %OPENSHIFT_USERNAME%-observability-demo
  labels:
    app: sample-app
spec:
  selector:
    app: sample-app
  ports:
  - port: 8080
    targetPort: 8080
    name: http
EOF

This creates a deployment with 2 replicas and a service to expose the application.

The deployment includes a debug sidecar container with UBI minimal. This sidecar provides debugging tools like curl without bloating the main application container. This is a common pattern for production debugging.

Verify the pods are running:

oc get pods -n %OPENSHIFT_USERNAME%-observability-demo

Expected output

NAME                          READY   STATUS    RESTARTS   AGE
sample-app-xxxxx-xxxxx        2/2     Running   0          30s
sample-app-xxxxx-xxxxx        2/2     Running   0          30s

Note the 2/2 ready status indicates both the application and debug sidecar containers are running.

Check what metrics the application exposes using the debug sidecar:

First, make a request to the application to initialize the metrics:

oc exec -n %OPENSHIFT_USERNAME%-observability-demo deployment/sample-app -c debug -- curl -s localhost:8080

Expected output

Hello from example application.

Now check the metrics endpoint:

oc exec -n %OPENSHIFT_USERNAME%-observability-demo deployment/sample-app -c debug -- curl -s localhost:8080/metrics | head -20

The -c debug flag specifies which container to exec into (the debug sidecar). Since containers in the same pod share the network namespace, the sidecar can access localhost:8080.

Sample output

# HELP http_request_duration_seconds Duration of all HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.005"} 1
http_request_duration_seconds_bucket{code="200",handler="found",method="get",le="0.01"} 1
...
# HELP http_requests_total Count of all HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 1
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.3.0"} 1

The application exposes several metrics:

http_requests_total: Counter of all HTTP requests by status code and method
http_request_duration_seconds: Histogram of request latency
version: Application version information

Metrics are lazy-initialized and only appear after the first request. This is why we made an initial request before checking the metrics endpoint.

Verify

Check that your sample application is ready for monitoring:

✓ 2 pods running in %OPENSHIFT_USERNAME%-observability-demo namespace
✓ Service exposing port 8080
✓ Metrics endpoint returns Prometheus-formatted data
✓ Metrics include http_requests_total counter

Exercise 3: Configure ServiceMonitor to collect metrics

Now you’ll configure Prometheus to scrape metrics from your sample application using a ServiceMonitor resource.

A ServiceMonitor tells Prometheus which services to scrape for metrics. It uses label selectors to find services and defines scrape intervals and ports.

Steps

Create a ServiceMonitor resource:

cat <<EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sample-app-monitor
  namespace: %OPENSHIFT_USERNAME%-observability-demo
  labels:
    app: sample-app
spec:
  selector:
    matchLabels:
      app: sample-app
  endpoints:
  - port: http
    interval: 30s
    path: /metrics
EOF

This ServiceMonitor tells Prometheus to:

Find services with label app: sample-app
Scrape the http port
Collect metrics every 30 seconds
Use the /metrics path

Verify the ServiceMonitor was created:

oc get servicemonitor -n %OPENSHIFT_USERNAME%-observability-demo

Expected output

NAME                 AGE
sample-app-monitor   10s

Generate some traffic to create metrics:

oc exec -n %OPENSHIFT_USERNAME%-observability-demo deployment/sample-app -c debug -- sh -c \
  'for i in $(seq 1 60); do
     r=$(od -An -N1 -tu1 /dev/urandom | tr -d " ");
     case $(( r % 10 )) in
       7|8) ep="/err" ;;
       9)   ep="/internal-err" ;;
       *)   ep="/" ;;
     esac;
     curl -s localhost:8080$ep > /dev/null;
     sleep $(awk "BEGIN{srand(); printf \"%.2f\\n\", rand() * 0.4}");
   done'

This sends 60 HTTP requests using the debug sidecar: ~70% to / (200 OK), ~20% to /err (404), and ~10% to /internal-err (500). The mix populates http_requests_total with multiple code label values, producing visible error rates alongside the success rate in Prometheus.

Wait 1-2 minutes for Prometheus to scrape the metrics, then query them:

Go to Observe → Metrics in the OpenShift console.

Enter this PromQL query:
```
http_requests_total{}
```
Click Run Queries

You should see results showing the counter values for each pod.
Query the rate of requests over time, you can add a new query:
```
rate(http_requests_total{}[5m])
```
This shows requests per second averaged over 5 minutes.

Verify

Check that Prometheus is collecting your application metrics:

✓ ServiceMonitor exists in %OPENSHIFT_USERNAME%-observability-demo namespace
✓ PromQL query http_requests_total returns data
✓ Multiple time series (1 per pod) are visible
✓ Rate query shows request rate calculation

What you learned: ServiceMonitor resources configure Prometheus to scrape application metrics. Once configured, metrics are automatically collected and queryable via PromQL.

Troubleshooting

Issue: PromQL query returns no data

Solution: . Wait 1-2 minutes for Prometheus to scrape metrics . Verify ServiceMonitor exists: oc get servicemonitor -n %OPENSHIFT_USERNAME%-observability-demo . Check that service labels match ServiceMonitor selector . Verify pods are running and exposing metrics: oc exec -n %OPENSHIFT_USERNAME%-observability-demo deployment/sample-app -c debug — curl localhost:8080/metrics

Issue: Metrics show 0 requests

Solution: Generate traffic (run the curl loop again) and wait for Prometheus to scrape the updated values.

Exercise 4: Create custom dashboards with COO

Querying metrics via PromQL is effective, but visualizing trends over time provides better insights. You’ll use the Cluster Observability Operator (COO) to create custom dashboards.

Understanding COO vs user workload monitoring

Your cluster has 2 independent monitoring systems:

User Workload Monitoring (CMO) (Exercise 1-3):

Managed by Cluster Monitoring Operator (CMO)
Cluster-wide monitoring for all namespaces
Namespace: openshift-user-workload-monitoring
Automatically discovers ServiceMonitors in any namespace
Best for: Standard application monitoring with built-in OpenShift integration
Limited customization options
You’ve already used this for your sample-app

Cluster Observability Operator (COO) (Exercise 4):

Independent operator that functions alongside CMO (no conflicts)
Creates namespace-scoped monitoring stacks via MonitoringStack CR
Namespace: observability-demo (can deploy to any namespace)
Label-based ServiceMonitor discovery via resourceSelector
Best for: Multi-tenant environments, custom dashboards, longer retention, team-scoped metrics
Highly customizable (retention periods, storage, collection methods)
Optional components: Thanos Querier, Alertmanager, UI plugins (Perses dashboards)

Key difference: COO uses a label selector (resourceSelector.matchLabels) to discover ServiceMonitors. Your sample-app ServiceMonitor from Exercise 3 doesn’t have this label, so it’s only visible to user workload monitoring (CMO), not COO.

Why use both?:

CMO provides out-of-the-box monitoring for all applications
COO provides advanced features: multi-tenancy, custom dashboards, longer retention
They coexist independently - no conflicts

In this exercise, you’ll create a second ServiceMonitor with the COO-specific label to demonstrate multi-tenant monitoring and custom Perses dashboards.

Steps

Verify the COO MonitoringStack is deployed:

oc get monitoringstack -n observability-demo

Expected output

NAME                   AGE
observability-stack    1h

Check the label selector the MonitoringStack uses to discover ServiceMonitors:

oc get monitoringstack observability-stack -n observability-demo -o jsonpath='{.spec.resourceSelector.matchLabels}'

Expected output

{"monitoring.rhobs/stack":"observability-stack"}

ServiceMonitors must have this label to be discovered by the COO stack.

You can also see the COO stack in the console: click on Topology in the left navigation and select the observability-demo namespace. You should see the observability-stack operator-backed service managed by cluster-observability-operator, consisting of Prometheus (3 pods), Thanos Querier (1 pod), and Alertmanager (2 pods). You may also notice the opentelemetry application containing the central-collector deployment — this is the OpenTelemetry Collector configured in a later module.

Monitoring topology showing COO stack in observability-demo namespace

Create a new ServiceMonitor with the COO label, for the COO we need to use the monitoring.rhobs/v1 API group instead of the monitoring.coreos.com group used by CMO. This is because COO uses a different Prometheus operator under the hood that watches for ServiceMonitors with the monitoring.rhobs/stack: observability-stack label.
```
cat <<EOF | oc apply -f -
apiVersion: monitoring.rhobs/v1
kind: ServiceMonitor.monitoring.rhobs
metadata:
  name: sample-app-coo
  namespace: %OPENSHIFT_USERNAME%-observability-demo
  labels:
    app: sample-app
    monitoring.rhobs/stack: observability-stack
spec:
  selector:
    matchLabels:
      app: sample-app
  endpoints:
  - port: http
    interval: 30s
    path: /metrics
EOF
```
Note the monitoring.rhobs/stack: observability-stack label - this tells COO to scrape this service. We keep the original ServiceMonitor from Exercise 3 to show how CMO and COO can run independently with different discovery methods (label-based for COO vs namespace-based for CMO).

Verify the COO Prometheus stack has picked up your ServiceMonitor:

First, check that Prometheus pods are running:

oc get pods -n observability-demo -l app.kubernetes.io/name=prometheus

Expected output

NAME                                READY   STATUS    RESTARTS   AGE
prometheus-observability-stack-0    2/2     Running   0          1h
prometheus-observability-stack-1    2/2     Running   0          1h
prometheus-observability-stack-2    2/2     Running   0          1h

Wait 1-2 minutes for Prometheus to discover the new ServiceMonitor, then verify it’s being scraped:

oc exec -n observability-demo prometheus-observability-stack-0 -c prometheus -- \
  curl -s http://localhost:9090/api/v1/targets | \
  grep -E 'sample-app.*%OPENSHIFT_USERNAME%-observability-demo' | head -5

Expected output (showing your application is being monitored)

..."job":"sample-app-coo/sample-app/0","namespace":"%OPENSHIFT_USERNAME%-observability-demo"...

If you see your application and namespace, the COO Prometheus has successfully discovered and is scraping your ServiceMonitor. If not, verify the ServiceMonitor has the correct label:

oc get servicemonitor.monitoring.rhobs sample-app-coo -n %OPENSHIFT_USERNAME%-observability-demo --show-labels

You should see monitoring.rhobs/stack=observability-stack in the labels.

Verify the pre-created Perses DataSource for the COO Prometheus:

A PersesDatasource named prometheus has been provisioned in your namespace as part of the workshop setup. It provides a shared connection to the COO Prometheus instance that all dashboards in your namespace can reference.

oc get persesdatasource prometheus \
  -n %OPENSHIFT_USERNAME%-observability-demo -o yaml

Expected output

apiVersion: perses.dev/v1alpha1
kind: PersesDatasource
metadata:
  name: prometheus
  namespace: %OPENSHIFT_USERNAME%-observability-demo
spec:
  config:
    default: true
    display:
      name: "COO Prometheus"
      description: "COO MonitoringStack Prometheus in observability-demo"
    plugin:
      kind: PrometheusDatasource
      spec:
        proxy:
          kind: HTTPProxy
          spec:
            url: 'http://observability-stack-prometheus.observability-demo.svc.cluster.local:9090'

The default: true field means that dashboards in your namespace which don’t specify a datasource name will automatically use this one. Dashboards that do reference it use name: prometheus.

Create a Perses dashboard to visualize your metrics:

Now create a dashboard that references the datasource:

cat <<EOF | oc apply -f -
apiVersion: perses.dev/v1alpha1
kind: PersesDashboard
metadata:
  name: sample-app-dashboard
  namespace: %OPENSHIFT_USERNAME%-observability-demo
  labels:
    monitoring.rhobs/stack: observability-stack
spec:
  display:
    name: Sample Application Metrics
    description: HTTP request metrics for sample-app
  duration: 1h
  panels:
    httpRequestRate:
      kind: Panel
      spec:
        display:
          name: HTTP Request Rate
          description: HTTP request metrics for sample-app
        plugin:
          kind: TimeSeriesChart
          spec:
            yAxis:
              show: true
              label: ""
              format:
                unit: requests/sec
        queries:
          - kind: TimeSeriesQuery
            spec:
              plugin:
                kind: PrometheusTimeSeriesQuery
                spec:
                  datasource:
                    kind: PrometheusDatasource
                    name: prometheus
                  query: sum(rate(http_requests_total{namespace="%OPENSHIFT_USERNAME%-observability-demo"}[5m]))
                    by (pod)
    httpErrorRate:
      kind: Panel
      spec:
        display:
          name: Error Rate %
        plugin:
          kind: TimeSeriesChart
          spec:
            yAxis:
              show: true
              label: ""
              format:
                unit: percent-decimal
        queries:
          - kind: TimeSeriesQuery
            spec:
              plugin:
                kind: PrometheusTimeSeriesQuery
                spec:
                  datasource:
                    kind: PrometheusDatasource
                    name: prometheus
                  query: sum(rate(http_requests_total{namespace="%OPENSHIFT_USERNAME%-observability-demo",
                    code=~"[45].."}[5m])) by (pod)
    httpDuration:
      kind: Panel
      spec:
        display:
          name: Duration
        plugin:
          kind: TimeSeriesChart
          spec:
            legend:
              position: bottom
              mode: table
            yAxis:
              show: true
              label: ""
              format:
                unit: seconds
        queries:
          - kind: TimeSeriesQuery
            spec:
              plugin:
                kind: PrometheusTimeSeriesQuery
                spec:
                  datasource:
                    kind: PrometheusDatasource
                    name: prometheus
                  query: sum(rate(http_request_duration_seconds_sum{namespace="%OPENSHIFT_USERNAME%-observability-demo"}[5m]))
                    by (pod) /
                    sum(rate(http_request_duration_seconds_count{namespace="%OPENSHIFT_USERNAME%-observability-demo"}[5m]))
                    by (pod)
                  seriesNameFormat: Avg Latenct
          - kind: TimeSeriesQuery
            spec:
              plugin:
                kind: PrometheusTimeSeriesQuery
                spec:
                  datasource:
                    kind: PrometheusDatasource
                    name: prometheus
                  query: histogram_quantile(0.95,
                    sum(rate(http_request_duration_seconds_bucket{namespace="%OPENSHIFT_USERNAME%-observability-demo"}[5m]))
                    by (pod, le))
                  seriesNameFormat: P95 Latency
          - kind: TimeSeriesQuery
            spec:
              plugin:
                kind: PrometheusTimeSeriesQuery
                spec:
                  datasource:
                    kind: PrometheusDatasource
                    name: prometheus
                  query: histogram_quantile(0.99,
                    sum(rate(http_request_duration_seconds_bucket{namespace="%OPENSHIFT_USERNAME%-observability-demo"}[5m]))
                    by (pod, le))
                  seriesNameFormat: P99 Latency
  layouts:
    - kind: Grid
      spec:
        display:
          title: RED Metrics
          collapse:
            open: true
        items:
          - x: 0
            "y": 0
            width: 12
            height: 6
            content:
              \$ref: "#/spec/panels/httpRequestRate"
          - x: 12
            "y": 0
            width: 12
            height: 6
            content:
              \$ref: "#/spec/panels/httpErrorRate"
          - x: 0
            "y": 6
            width: 24
            height: 11
            content:
              \$ref: "#/spec/panels/httpDuration"
EOF

This creates a Perses dashboard with 2 panels that reference the shared datasource: request rate and total requests.

Generate traffic to populate the dashboard:

oc exec -n %OPENSHIFT_USERNAME%-observability-demo deployment/sample-app -c debug -- sh -c \
  'for i in $(seq 1 100); do
     r=$(od -An -N1 -tu1 /dev/urandom | tr -d " ");
     case $(( r % 10 )) in
       7|8) ep="/err" ;;
       9)   ep="/internal-err" ;;
       *)   ep="/" ;;
     esac;
     curl -s localhost:8080$ep > /dev/null;
     sleep $(awk "BEGIN{srand(); printf \"%.2f\\n\", rand() * 1.4 + 0.1}");
   done'

The mix of ~70% 200, ~20% 404, and ~10% 500 responses with random delays produces realistic traffic patterns — you will see separate time series per code label in both dashboard panels.

Access your custom dashboard in the OpenShift console:
- Navigate to Observe → Dashboards (Perses) in the left navigation
- In the dashboard dropdown, select Sample Application Metrics
- You should see a Request Metrics section with 2 panels: HTTP Request Rate and Total HTTP Requests, each showing a time-series graph for your sample-app pod
Verify the dashboard is using COO metrics:

The dashboard queries the COO Prometheus service in the observability-demo namespace through the Perses datasource (not user workload monitoring).
Explore and build dashboards with the standalone Perses UI:

A standalone Perses instance is available in the workshop environment for interactive dashboard creation. This is a great way to build and prototype dashboards visually before exporting them as YAML to apply to the cluster.

Navigate to the Perses UI at https://perses.%openshift_cluster_ingress_domain%

From the Perses UI you can:
- Browse existing dashboards and datasources
- Create new dashboards using the visual editor
- Build and test PromQL queries interactively against the COO Prometheus
- Export any dashboard as ready-to-apply Kubernetes YAML via Dashboard → Export → Download as YAML
  
  The exported YAML can be applied directly with oc apply -f, just remember to add the monitoring.rhobs/stack: observability-stack label so COO picks it up.
  
  The Perses standalone UI is a shared environment — use your %OPENSHIFT_USERNAME% prefix when naming dashboards to avoid conflicts with other workshop users.

Verify

Check that your COO dashboard is working:

✓ MonitoringStack exists in observability-demo namespace
✓ ServiceMonitor sample-app-coo has label monitoring.rhobs/stack: observability-stack
✓ PersesDatasource prometheus exists in your namespace with default: true
✓ PersesDashboard created successfully
✓ Dashboard appears in Observe → Dashboards
✓ Panels show request rate and total requests
✓ Graphs update with new data

What you learned:

COO is an independent operator that coexists with the Cluster Monitoring Operator (CMO)
MonitoringStack CR creates namespace-scoped monitoring with label-based ServiceMonitor discovery
COO provides multi-tenancy: different teams can have isolated monitoring stacks
Perses datasources are shared infrastructure—pre-created per namespace so dashboards can reference them by name
Perses dashboards are declarative (YAML) and versioned with your application
You can run multiple monitoring systems: cluster-wide (CMO) + team-scoped (COO)
COO offers more flexibility: longer retention, custom configurations, independent release cycles
Thanos Querier aggregates metrics from multiple Prometheus replicas for high availability

Troubleshooting

Issue: ServiceMonitor not picked up by COO

Solution: . Verify the label: oc get servicemonitor sample-app-coo -n %OPENSHIFT_USERNAME%-observability-demo --show-labels . Must have: monitoring.rhobs/stack=observability-stack . Wait 1-2 minutes for Prometheus to discover the new ServiceMonitor after creation

Issue: PersesDatasource prometheus not found

Solution: . The datasource is pre-provisioned by the workshop setup. If it is missing, contact the workshop facilitator. . To verify it exists: oc get persesdatasource prometheus -n %OPENSHIFT_USERNAME%-observability-demo

Issue: Dashboard not visible in console

Solution: . Verify PersesDashboard exists: oc get persesdashboard -n %OPENSHIFT_USERNAME%-observability-demo . Check for errors: oc describe persesdashboard sample-app-dashboard -n %OPENSHIFT_USERNAME%-observability-demo . Wait 1-2 minutes for the dashboard to appear in the console after creation

Issue: Dashboard shows "No data"

Solution: . Wait 2-3 minutes for Prometheus to scrape metrics . Verify Thanos Querier is running: oc get pods -n observability-demo -l app.kubernetes.io/name=thanos-querier . Generate traffic (run the curl loop again) . Check Prometheus targets: Port-forward to COO Prometheus and check /targets

Exercise 5: Configure alerting rules

Dashboards help you see current state, but alerts proactively notify you when problems occur. You’ll create an alerting rule that fires when request rates drop unexpectedly.

Steps

Create a PrometheusRule resource with an alerting rule:

cat <<EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-app-alerts-%OPENSHIFT_USERNAME%
  namespace: %OPENSHIFT_USERNAME%-observability-demo
  labels:
    app: sample-app
spec:
  groups:
  - name: sample-app
    interval: 30s
    rules:
    - alert: LowRequestRate
      expr: sum(rate(http_requests_total{namespace="%OPENSHIFT_USERNAME%-observability-demo"}[5m])) < 0.1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Low request rate detected"
        description: "Application in namespace {{ \$labels.namespace }} is receiving fewer than 0.1 requests/second for more than 2 minutes."
EOF

This alert fires if request rate drops below 0.1 requests/second for more than 2 minutes.

Verify the PrometheusRule was created:

oc get prometheusrule -n %OPENSHIFT_USERNAME%-observability-demo

Check alert status in the console:

Navigate to Observe → Alerting in the OpenShift console.
- Click on the Alerting rules tab
- Search for "LowRequestRate"
- You should see the alert in Inactive or Pending state
Trigger the alert by stopping traffic:

Wait 2-3 minutes without generating any traffic. The alert should transition from Inactive → Pending → Firing.

Generate traffic to resolve the alert:

oc exec -n %OPENSHIFT_USERNAME%-observability-demo deployment/sample-app -c debug -- sh -c \
  'for i in $(seq 1 200); do
     r=$(od -An -N1 -tu1 /dev/urandom | tr -d " ");
     case $(( r % 10 )) in
       7|8) ep="/err" ;;
       9)   ep="/internal-err" ;;
       *)   ep="/" ;;
     esac;
     curl -s localhost:8080$ep > /dev/null;
     sleep $(awk "BEGIN{srand(); printf \"%.2f\\n\", rand() * 0.9 + 0.1}");
   done'

After a few minutes, the alert should transition back to Inactive.

Verify

Check that your alerting rules are working:

✓ PrometheusRule exists in %OPENSHIFT_USERNAME%-observability-demo namespace
✓ Alert appears in Observe → Alerting → Alerting rules
✓ Alert transitions through states: Inactive → Pending → Firing
✓ Alert resolves when traffic increases

What you learned: PrometheusRule resources define alerting conditions. Alerts transition through states (Inactive → Pending → Firing) based on PromQL expressions and duration thresholds.

Troubleshooting

Issue: Alert not visible in console

Solution: . Verify PrometheusRule exists: oc get prometheusrule -n %OPENSHIFT_USERNAME%-observability-demo . Check syntax errors in PromQL expression . Wait 1-2 minutes for Prometheus to reload configuration

Issue: Alerting page shows Restricted access with prometheuses/api or alertmanagers/api forbidden

Solution: . This indicates missing RBAC for monitoring API subresources used by the OpenShift console . Ask a cluster admin to run make deploy (with cluster-admin credentials) so workshop monitoring API RoleBindings are applied . Verify access after RBAC update:

oc auth can-i get prometheuses.monitoring.coreos.com/api -n openshift-user-workload-monitoring
oc auth can-i get alertmanagers.monitoring.coreos.com/api -n openshift-user-workload-monitoring

Refresh the Alerting page after both checks return yes

Issue: Alert never fires

Solution: . Verify condition is met (request rate < 0.1/sec) . Check for duration (must be in low state for 2 minutes) . Generate traffic then stop to test alert firing

Learning outcomes

By completing this module, you should now understand:

✓ The 3 pillars of observability (metrics, logs, traces) and when to use each signal type
✓ Observability methodologies: RED (request-driven), USE (resources), Golden Signals
✓ OpenShift’s monitoring architecture: Platform (CMO), User Workload (CMO), and COO
✓ Cluster Observability Operator (COO): Independent multi-tenant monitoring with MonitoringStack CR
✓ How ServiceMonitor resources configure Prometheus to scrape application metrics
✓ Writing PromQL queries to analyze application performance and calculate rates
✓ Creating declarative Perses dashboards for custom visualizations with COO
✓ Configuring PrometheusRule resources to proactively alert on application issues

Business impact:

You’ve implemented the foundation of proactive monitoring. Instead of discovering issues through user complaints, you now have:

Real-time visibility into application request rates and patterns
Declarative dashboards versioned alongside your application code
Team-scoped monitoring (COO) isolated from cluster infrastructure monitoring
Automated alerts that notify you before problems impact customers

Next steps: Module 2 will add centralized logging with LokiStack, enabling you to correlate metrics with detailed log messages for faster root cause analysis.

Module summary

You successfully demonstrated how user workload monitoring provides application visibility.

What you accomplished:

Verified the user workload monitoring stack (Prometheus, Thanos, Alertmanager)
Deployed a sample application with Prometheus-compatible metrics
Configured ServiceMonitor to automatically collect application metrics
Created namespace-scoped monitoring with Cluster Observability Operator (COO)
Built declarative Perses dashboards to visualize application performance
Implemented alerting rules to proactively detect application issues

Key concepts mastered:

ServiceMonitor: Declaratively configures which services Prometheus scrapes
MonitoringStack: Namespace-scoped monitoring with label-based resource discovery
PromQL: Query language for analyzing time-series metrics data
PersesDashboard: Declarative YAML-based dashboard as code
PrometheusRule: Defines alerting conditions and notification thresholds
User workload monitoring vs COO: Cluster-wide vs namespace-scoped monitoring

Metrics collected:

http_requests_total: Counter of all HTTP requests by status code
up: Target availability (1 = healthy, 0 = down)
Custom application metrics from your deployed services

Continue to Module 2 to add centralized logging capabilities.