3. SRE-Practical work 3 Monitoring and Alerting Setup
This practical work focuses on setting up monitoring and alerting using tools like
Prometheus and Grafana. Students will implement monitoring for a sample application, create
dashboards, and set up alerts to understand the application's performance and potential issues.
Practice Instruction:
• If you have not yet completed the first four (4) tasks from practical works 1-2,
please start with them.
• If you have already successfully completed those four tasks, please begin with
Task 5.
Software requirements:
• Installation rights on a computer to configure tools such as Prometheus,
Grafana, Node Exporter, etc.
• A code editor (for example, Visual Studio Code, Atom) for editing
configuration files and scripts.
Documentation:
• Make sure that all settings are documented, including configuration changes, so
that they can be reproduced or fixed in the future.
• It is useful to work in teams or study groups (2 students) to share ideas and
solutions.
• Always monitor system resources when installing and launching new software
to ensure optimal performance.
• Prepare a task completion report with screenshots and descriptions.
• Write conclusions about what you learned in this practical work.
Save the file and reload/restart Prometheus for changes to take effect.
Instructions:
a. Access Prometheus Web UI
- Open your web browser.
- Navigate to http://localhost:9090/graph. This is the default Prometheus UI,
where you can execute PromQL queries.
b. Execute Basic Queries
- In the "Expression" input box, type a metric name, for example:
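- up: This shows whether each target that Prometheus scrapes is reachable (1 = up,
0 = down). It does not report how long a target has been running.
- node_cpu_seconds_total: This provides the cumulative CPU time in seconds, broken
down by CPU core and mode.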
- Click "Execute" and view the raw metric values below.
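The same queries can also be sent to the Prometheus HTTP API rather than typed into the UI. The sketch below is a minimal illustration, assuming Prometheus listens on localhost:9090 as above; the helper names build_query_url and parse_instant_vector are hypothetical, and SAMPLE is a hand-written response, not real output.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, expr):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(body):
    """Extract (labels, value) pairs from an instant-vector JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    # Each result carries its labels in "metric" and [timestamp, "value"] in "value".
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


# Hand-written sample response for the query "up" (illustrative only).
SAMPLE = """{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "up", "job": "node-exporter",
              "instance": "localhost:9100"},
   "value": [1700000000, "1"]}]}}"""

url = build_query_url("http://localhost:9090", "up")
targets = parse_instant_vector(SAMPLE)
for labels, value in targets:
    state = "up" if value == 1.0 else "down"
    print(labels["instance"], state)
```

To run it against a live server, the URL could be fetched with urllib.request.urlopen and the response body passed to parse_instant_vector.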
c. Use PromQL for Complex Queries
- Explore the usage of functions and operators in PromQL. For instance:
- rate(node_cpu_seconds_total[5m]): This computes the per-second average rate
of increase of each time series over the last five minutes.
- Note: rate is intended for counters, that is, metrics whose values only go up over
time (apart from resets to zero when a process restarts).
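To make the idea behind rate more concrete, here is a simplified sketch of a per-second rate computed from raw counter samples. It is not Prometheus's actual implementation (which also extrapolates to the window boundaries); the function name simple_rate and the sample values are assumptions made for illustration.

```python
def simple_rate(samples):
    """Per-second average increase of a counter over the sampled window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    A drop in value is treated as a counter reset (restart from zero).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset, the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return increase / duration


# Hypothetical samples taken every 60 s over a 5-minute window;
# the counter resets between t=120 and t=180.
window = [(0, 100.0), (60, 130.0), (120, 160.0),
          (180, 10.0), (240, 40.0), (300, 70.0)]
print(simple_rate(window))
```

Here the total increase is 30 + 30 + 10 + 30 + 30 = 130 over 300 seconds, so the reset does not produce a negative rate.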
Objective 2: Use multi-dimensional data to filter and aggregate results.
Instructions:
a. Use Label Selectors
- You can refine your queries using label selectors. These allow you to filter
metrics based on their associated labels.
- Example: node_cpu_seconds_total{job="node-exporter",
instance="localhost:9100"}. This narrows down the metric to the node-exporter
job from the localhost:9100 instance.
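The filtering done by a label selector can be pictured as a match over each series' label set. The sketch below is an illustrative model only, assuming series are represented as plain dictionaries; the function name matches and the sample series are hypothetical. It follows two PromQL rules: regex matchers are fully anchored, and a missing label behaves like an empty string.

```python
import re


def matches(labels, equal=None, regex=None):
    """Return True if a label set satisfies all = and =~ matchers."""
    for name, want in (equal or {}).items():
        if labels.get(name, "") != want:
            return False
    for name, pattern in (regex or {}).items():
        # fullmatch mirrors PromQL's fully anchored regex matching.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


# Hypothetical label sets for three series.
series = [
    {"job": "node-exporter", "instance": "localhost:9100", "cpu": "0", "mode": "idle"},
    {"job": "node-exporter", "instance": "localhost:9100", "cpu": "0", "mode": "user"},
    {"job": "prometheus", "instance": "localhost:9090"},
]

# Equivalent of {job="node-exporter", instance="localhost:9100"}
selected = [s for s in series
            if matches(s, equal={"job": "node-exporter", "instance": "localhost:9100"})]

# Equivalent of {mode=~"idle|iowait"}
idle_like = [s for s in series if matches(s, regex={"mode": "idle|iowait"})]
print(len(selected), len(idle_like))
```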
b. Aggregation with PromQL
- PromQL supports various aggregation operators to provide summary
information.
- sum: Adds up the values of all matching series, collapsing their label dimensions
into a single result unless by or without is used.
- avg: Computes the average value across the matching series.
- Combine aggregation operators with by or without to specify which label
dimensions to consider.
Example: sum(rate(node_cpu_seconds_total[5m])) by (job): This aggregates the CPU
rate for each job separately.
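As a rough model of what aggregation with by does, the sketch below groups series by the chosen labels and then sums or averages each group. The function names sum_by and avg_by and the sample rates are assumptions for illustration, not Prometheus code.

```python
from collections import defaultdict


def sum_by(series, by):
    """Sum values per group, where a group is defined by the `by` labels."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in by)
        totals[key] += value
    return dict(totals)


def avg_by(series, by):
    """Average values per group defined by the `by` labels."""
    counts = defaultdict(int)
    for labels, _ in series:
        counts[tuple(labels.get(name, "") for name in by)] += 1
    return {key: total / counts[key] for key, total in sum_by(series, by).items()}


# Hypothetical per-series CPU rates (labels, value).
rates = [
    ({"job": "node-exporter", "cpu": "0", "mode": "user"}, 0.25),
    ({"job": "node-exporter", "cpu": "0", "mode": "idle"}, 0.75),
    ({"job": "node-exporter", "cpu": "1", "mode": "idle"}, 1.0),
    ({"job": "other", "cpu": "0", "mode": "idle"}, 0.5),
]

per_job = sum_by(rates, ["job"])      # analogous to sum(...) by (job)
per_job_avg = avg_by(rates, ["job"])  # analogous to avg(...) by (job)
print(per_job)
```

Labels not listed in by (here cpu and mode) disappear from the result, which is why each job ends up with a single value.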
Instructions:
a. Create/Edit a Dashboard
- From the Grafana main menu, click on the "+" icon and select "Dashboard".
- Alternatively, navigate to an existing dashboard that you'd like to edit.
b. Add a Variable
- On the dashboard screen, click on the cogwheel/settings icon on the top, then
select "Variables".
- Click on "New Variable".
- Choose a name for your variable, and for the Type, select "Query".
- Under "Data source", choose "Prometheus".
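- In the "Query" box, type a query such as label_values(job) to fetch the values of
the job label. (A bare selector such as {job=~".+"} matches every series that has a
job label rather than returning only the job names.)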
- Save your changes.
c. Update a Panel with the Variable
- Return to your dashboard and edit a panel.
- In your metric query, you can now reference the variable by using
$VariableName (replace "VariableName" with the name you gave to your
variable). For instance, if you're looking to filter metrics by job, it could look
like node_cpu_seconds_total{job="$JobName"}.
- Save the panel.
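Conceptually, the variable is replaced with its selected value before the query is sent to Prometheus. The sketch below imitates that substitution with Python's string.Template for a single-value variable; it is only an analogy, since Grafana's own interpolation also handles multi-value variables and formatting options. The variable name JobName comes from the example above, and the value node-exporter is an assumed selection.

```python
from string import Template

# Panel query as written in Grafana, with the dashboard variable in place.
panel_query = Template('node_cpu_seconds_total{job="$JobName"}')

# Value picked in the dashboard's variable dropdown (assumed for illustration).
expanded = panel_query.substitute(JobName="node-exporter")
print(expanded)
```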
Objective 2: Explore Grafana's transformations and overrides
Instructions:
a. Add a Panel with Multiple Metrics
- Click on "Add Panel" in your dashboard.
- In the query section, add multiple metrics, such as node_cpu_seconds_total and
node_memory_MemAvailable_bytes.
b. Use Transformations
- With the panel still in edit mode, click on the "Transform" tab.
- Explore different transformations such as:
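- Reduce: To collapse each series into a single value (for example, last, mean, or max).
- Inner join (named "Join by field" in recent Grafana versions): To combine the
results of multiple queries on a shared field such as time.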
- Add field from calculation: To create a new field based on a calculation between
others.
- For a simple exercise, you can use the "Add field from calculation" to calculate
the percentage of used memory based on total and available memory.
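The calculation behind this exercise is (total - available) / total * 100. The sketch below works it through in Python, assuming the total comes from node_memory_MemTotal_bytes (a Node Exporter metric not named above) and the available memory from node_memory_MemAvailable_bytes; the function name and the sample sizes are hypothetical.

```python
def used_memory_percent(total_bytes, available_bytes):
    """Percentage of memory in use, given total and available bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - available_bytes) / total_bytes * 100.0


GIB = 1024 ** 3
# Hypothetical host with 8 GiB total and 2 GiB available.
percent_used = used_memory_percent(8 * GIB, 2 * GIB)
print(f"{percent_used:.1f}% used")
```

In Grafana, the same result could be reached either with chained "Add field from calculation" steps or directly in the query expression.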
c. Apply Overrides
- Move to the "Overrides" tab in the panel edit mode.
- Click "Add Override".
- You can then specify conditions, like "For field with name" or "For series with
name" and adjust properties like color, display name, etc.
As an example, you can set a different color for CPU and Memory in the same graph.
Task 8: Advanced Alerting and Recording Rules
By completing these advanced tasks, you will not only deepen your understanding of
Prometheus and Grafana's features and capabilities but also be better prepared to tackle real-
world monitoring and alerting challenges. Remember to always refer to official documentation
for detailed configurations and best practices.