3. SRE-Practical work 3 Monitoring and Alerting Setup

This document outlines a practical work assignment focused on setting up monitoring and alerting using Prometheus and Grafana. It includes detailed tasks for installing and configuring both tools, creating dashboards, setting up alerts, and expanding monitoring capabilities with Node Exporter. Additionally, it covers advanced topics such as querying in Prometheus, dynamic dashboarding in Grafana, and backup and restore procedures.

Uploaded by endofetta
Copyright © All Rights Reserved

Practical work 3 - Monitoring and Alerting Setup

This practical work focuses on setting up monitoring and alerting using tools like
Prometheus and Grafana. Students will implement monitoring for a sample application, create
dashboards, and set up alerts to understand the application's performance and potential issues.

Practice Instruction:
• If you have not yet completed the first four (4) tasks from practices 1-2,
please start with them.
• If you have already completed the first four tasks successfully, please
begin with Task 5.

Software requirements:
• Installation rights on a computer to configure tools such as Prometheus,
Grafana, Node Exporter, etc.
• A code editor (for example, Visual Studio Code, Atom) for editing
configuration files and scripts.

Documentation:
• Make sure that all settings are documented, including configuration changes, so
that they can be reproduced or fixed in the future.
• It is useful to work in teams or study groups (2 students) to share
ideas and solutions.
• Always monitor system resources when installing and launching new software
to ensure optimal performance.
• Prepare a task completion report with screenshots and descriptions.
• Write conclusions about what you learned in this practical work.

Task 1: Install and Configure Prometheus


1. Objective: Install Prometheus and verify its web UI.
• Download and extract the latest version of Prometheus.
• Run Prometheus with the default configuration.
• Visit http://localhost:9090 to verify Prometheus is running.
2. Objective: Configure Prometheus to scrape a target.
• Edit prometheus.yml to add a new scrape target.
• Reload Prometheus configuration.
• Visit http://localhost:9090/targets to verify the target is being scraped.
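
The scrape-target steps above can be sketched as a short shell session. This is only a sketch under assumptions: the file name prometheus.example.yml, the job name sample_app, and the target localhost:8080 are hypothetical placeholders, not values given by the assignment. promtool ships in the Prometheus release archive; the validation step is skipped if it is not on the PATH.

```shell
# Hypothetical file name; copy the result over prometheus.yml once it looks right.
CONFIG_FILE="prometheus.example.yml"

# Write a minimal configuration: Prometheus itself plus one extra target.
cat > "$CONFIG_FILE" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Assumed example target; replace with your application's host:port.
  - job_name: 'sample_app'
    static_configs:
      - targets: ['localhost:8080']
EOF

# Validate the file when promtool is available.
if command -v promtool >/dev/null 2>&1; then
  promtool check config "$CONFIG_FILE"
fi

echo "wrote $CONFIG_FILE"
```

After the real prometheus.yml is updated, Prometheus picks up the change on restart or when its process receives SIGHUP; the new job should then appear on the targets page.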

Task 2: Install and Configure Grafana


1. Objective: Install Grafana and verify its web UI.
• Install Grafana using the appropriate method for your OS.
• Start Grafana and verify it's running at http://localhost:3000.
2. Objective: Add Prometheus as a data source in Grafana.
• Log into Grafana using the default credentials (admin/admin).
• Add a new data source, selecting Prometheus.
• Configure the data source to point to your Prometheus instance at
http://localhost:9090.

Task 3: Create a Grafana Dashboard
1. Objective: Create a new dashboard and add a panel.
• Create a new dashboard in Grafana.
• Add a panel that displays the up-time of your Prometheus instance.
• Adjust the time range and refresh interval for your dashboard.
2. Objective: Import a pre-existing dashboard.
• Visit the Grafana dashboards page at https://grafana.com/grafana/dashboards.
• Search for a popular Prometheus dashboard, e.g., "Node Exporter Full".
• Import this dashboard into your Grafana instance and select the Prometheus data
source.

Task 4: Setup Alerts in Grafana


1. Create an alert rule for a panel.
Objective: Understand how to use Grafana's built-in alerting system to monitor specific
conditions on a dashboard panel.
Steps:
a. Edit a panel in your dashboard.
Navigate to your desired dashboard.
Choose the panel you want to attach an alert to.
Click on the panel title and select "Edit" from the dropdown menu.
b. Navigate to the "Alert" section.
Inside the panel editor, find the "Alert" tab.
Click on "Create Alert".
c. Set up a rule.
Name your alert rule for easy identification.
Set evaluation intervals.
Define conditions: For example, if you are monitoring Prometheus uptime (using the
metric up), you can set a condition like avg() of query(A, 5m, now) is below 0.9. This checks
if the average uptime over the last 5 minutes is below 90%, which would indicate Prometheus
was down for part of that time.
d. Test the alert rule.
Once you have defined the condition, click on the "Test Rule" button. Grafana will
evaluate the rule against the recent data and show if the alert would be firing or not.
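
To make the condition in step c concrete, the sketch below reproduces the arithmetic by hand. It assumes a 15-second scrape interval (so 20 samples in 5 minutes) and a made-up series of up values in which three scrapes failed; both are illustrative assumptions, not data from a real system.

```shell
# Hypothetical `up` samples for one target over 5 minutes (15s interval, 20 samples).
# 1 = scrape succeeded, 0 = scrape failed.
samples="1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1"
threshold="0.9"

# Average the samples, as avg() does over the query window.
avg=$(printf '%s\n' $samples | awk '{ sum += $1 } END { printf "%.2f", sum / NR }')

# Compare against the threshold; awk handles the floating-point comparison.
if awk -v a="$avg" -v t="$threshold" 'BEGIN { exit !(a < t) }'; then
  state="alerting"
else
  state="ok"
fi

echo "avg=$avg state=$state"
```

With 17 successful scrapes out of 20 the average is 0.85, which is below 0.9, so a rule with this condition would fire.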
2. Configure a notification channel.
Objective: Ensure Grafana can notify you or your team when specific conditions are
met.
Steps:
a. Set up a notification channel in Grafana.
From the Grafana side menu, click on the bell icon (🔔) which represents the alerting
menu.
Select "Notification channels" and click on "Add channel".
Choose your preferred notification method. Grafana supports various channels such as
Email, Slack, Webhook, and others.
For example, for Slack:
Name the channel (e.g., "Team Slack Alerts").
Type should be Slack.
In the URL, provide the incoming webhook URL from your Slack workspace.
Optionally, mention channel, user, or add an icon.
b. Test the notification channel.
Once all details are filled out, click on the "Test" button. This will send a test
notification to the channel you have configured.
Check the target channel (e.g., Slack or Email) to verify you've received the test alert.
Note: To effectively use the alerting system, ensure that the alert is linked to the
notification channel. In the panel's alert settings, under the Notifications section, select the
notification channel you created. This way, when the alert is triggered, it will send a
notification to the configured channel.

Task 5: Expand Monitoring Capabilities


1. Install and configure Node Exporter.
Objective: Broaden your monitoring scope by capturing OS and hardware metrics using
Node Exporter.
Steps:
a. Download and run Node Exporter on your machine.
- Navigate to the official Prometheus download page and get the appropriate
Node Exporter binary for your OS.
- Extract the downloaded tarball.
- Navigate to the extracted directory and run Node Exporter:
    ./node_exporter
b. Configure Prometheus to scrape metrics from Node Exporter.
- Edit your prometheus.yml configuration file.
- Add a new job under the scrape_configs section:
    scrape_configs:
      - job_name: 'node_exporter'
        static_configs:
          - targets: ['localhost:9100']
- Save the file and reload/restart Prometheus for changes to take effect.

c. Create or import a Grafana dashboard for Node Exporter metrics.


- Navigate to your Grafana instance.
- You can manually create a new dashboard and add panels for Node Exporter
metrics (e.g., node_cpu_seconds_total, node_memory_MemAvailable_bytes,
etc.)
- Alternatively, use the Grafana dashboard repository to find a pre-made Node
Exporter dashboard. Once found, import it into your Grafana instance.

2. Set up an alert based on Node Exporter metrics.


Objective: Ensure proactive monitoring by setting alerts on critical OS or hardware
metrics.
Steps:
a. Choose a critical metric from Node Exporter.
- Metrics like node_cpu_seconds_total, node_memory_MemFree_bytes, or
node_filesystem_free_bytes are common candidates.
- For this example, let's consider high CPU usage, represented by the expression:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

b. Create an alert in Grafana based on the chosen metric.


- Navigate to (or create) the panel that visualizes the chosen metric.
- Click on the panel title and select "Edit".
- Go to the "Alert" tab and create a new alert.
- Define a condition, e.g., alert when the average CPU usage over the past 5
minutes is greater than 90%.
- Under the "Notifications" section, select your notification channel (created in
the previous task).
- Save the dashboard.
By following these steps, you'll be well-prepared to monitor OS and hardware metrics
using Node Exporter and Grafana. This is a critical aspect of observability, allowing for timely
response to system issues and proactive infrastructure management.
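
The CPU usage expression from step a can be traced by hand. The sketch below assumes a machine with two CPUs and two made-up samples of the idle counter taken 15 seconds apart; the numbers are hypothetical and only illustrate the arithmetic that irate, avg, and the final subtraction perform.

```shell
# Hypothetical idle-counter samples: cpu, value at t1, value at t2 (seconds of idle time).
# The two samples are assumed to be 15 seconds apart.
interval=15
cpu_usage=$(awk -v dt="$interval" '
  { idle_rate = ($3 - $2) / dt; sum += idle_rate; n++ }   # per-CPU idle rate, as irate() yields
  END { printf "%.1f", 100 - (sum / n) * 100 }            # 100 minus the average idle percentage
' <<'EOF'
cpu0 1000.0 1003.0
cpu1 2000.0 2006.0
EOF
)
echo "cpu usage: ${cpu_usage}%"
```

Here the CPUs were idle for 20% and 40% of the interval, so the average idle share is 30% and the computed usage is 70%, which would not trip a 90% threshold.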

Task 6: Querying and Exploration in Prometheus


Objective 1: Familiarize yourself with Prometheus' query language (PromQL).

Instructions:
a. Access Prometheus Web UI
- Open your web browser.
- Navigate to http://localhost:9090/graph. This is the default Prometheus UI
where you can execute PromQL queries.
b. Execute Basic Queries
- In the "Expression" input box, type a metric name, for example:
- up: This shows, for each target Prometheus scrapes, whether the last scrape
succeeded (1) or failed (0).
- node_cpu_seconds_total: This provides the cumulative CPU time in seconds, per CPU and mode.
- Click "Execute" and view the raw metric values below.
c. Use PromQL for Complex Queries
- Explore the usage of functions and operators in PromQL. For instance:
- rate(node_cpu_seconds_total[5m]): This computes the per-second average rate
of increase of each time series over the last five minutes.
- Note: rate is intended for counters, that is, metrics that only go up over
time.
Objective 2: Use multi-dimensional data to filter and aggregate results.
Instructions:
a. Use Label Selectors
- You can refine your queries using label selectors. These allow you to filter
metrics based on their associated labels.
- Example: node_cpu_seconds_total{job="node_exporter",
instance="localhost:9100"}. This narrows down the metric to the node_exporter
job (the job_name configured in Task 5) from the localhost:9100 instance.
b. Aggregation with PromQL
- PromQL supports various aggregation operators to provide summary
information.
- sum: Adds up the values of all matching series.
- avg: Computes the average value across all matching series.
- Combine aggregation operators with by or without to specify which label
dimensions to consider.
Example: sum(rate(node_cpu_seconds_total[5m])) by (job): This aggregates the CPU
rate for each job separately.
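
The same queries can also be run from a terminal through the Prometheus HTTP API. The sketch below is one way to do that, assuming Prometheus listens on http://localhost:9090 and curl is installed; the helper name promql and the PROM_URL variable are hypothetical conveniences, not part of Prometheus.

```shell
# Base URL of the Prometheus server; override with PROM_URL if it differs.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# promql QUERY: hypothetical helper that runs an instant query through the
# HTTP API (/api/v1/query) and prints the JSON response.
promql() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example calls (they need a running Prometheus, so they are left commented out):
# promql 'up'
# promql 'node_cpu_seconds_total{job="node_exporter", instance="localhost:9100"}'
# promql 'sum(rate(node_cpu_seconds_total[5m])) by (job)'
```

Passing the expression with --data-urlencode avoids having to escape braces, quotes, and brackets by hand.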

Task 7: Advanced Grafana Dashboarding


Objective 1: Use variables in Grafana for dynamic dashboards

Instructions:
a. Create/Edit a Dashboard
- From the Grafana main menu, click on the "+" icon and select "Dashboard".
- Alternatively, navigate to an existing dashboard that you'd like to edit.
b. Add a Variable
- On the dashboard screen, click on the cogwheel/settings icon on the top, then
select "Variables".
- Click on "New Variable".
- Choose a name for your variable, and for the Type, select "Query".
- Under "Data source", choose "Prometheus".
- In the "Query" box, you can type a query such as label_values(up, job) to fetch all job names.
- Save your changes.
c. Update a Panel with the Variable
- Return to your dashboard and edit a panel.
- In your metric query, you can now reference the variable by using
$VariableName (replace "VariableName" with the name you gave to your
variable). For instance, if you're looking to filter metrics by job, it could look
like node_cpu_seconds_total{job="$JobName"}.
- Save the panel.
Objective 2: Explore Grafana's transformations and overrides
Instructions:
a. Add a Panel with Multiple Metrics
- Click on "Add Panel" in your dashboard.
- In the query section, add multiple metrics, such as node_cpu_seconds_total and
node_memory_MemAvailable_bytes.
b. Use Transformations
- With the panel still in edit mode, click on the "Transform" tab.
- Explore different transformations such as:
- Reduce: To consolidate a group of series.
- Inner join: To combine multiple queries.
- Add field from calculation: To create a new field based on a calculation between
others.
- For a simple exercise, you can use the "Add field from calculation" to calculate
the percentage of used memory based on total and available memory.
c. Apply Overrides
- Move to the "Overrides" tab in the panel edit mode.
- Click "Add Override".
- You can then specify conditions, like "For field with name" or "For series with
name" and adjust properties like color, display name, etc.
As an example, you can set a different color for CPU and Memory in the same graph.
Task 8: Advanced Alerting and Recording Rules

Objective 1: Set up recording rules in Prometheus.


- Identify a frequently used or complex query in Grafana.
- Configure a recording rule in a rule file that is listed under rule_files in
prometheus.yml, to pre-compute and store the result of this query.
- Verify the new recorded metric in the Prometheus web UI.
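
A recording rule lives in a rule file that prometheus.yml references through rule_files. The sketch below writes one such file for the CPU usage expression used in Task 5. The file name recording_rules.yml, the group name, and the recorded metric name instance:node_cpu_usage:percent are hypothetical choices; the promtool check is skipped if promtool is not on the PATH.

```shell
# Hypothetical file name for the rule file.
RULES_FILE="recording_rules.yml"

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: node_cpu
    rules:
      # Pre-compute the CPU usage percentage per instance.
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
EOF

# Prometheus loads rule files listed under rule_files in prometheus.yml, e.g.:
#   rule_files:
#     - "recording_rules.yml"

# Validate the rules when promtool is available.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_FILE"
fi

echo "wrote $RULES_FILE"
```

Once Prometheus has reloaded its configuration, the recorded series can be queried by its new name in the web UI like any other metric.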
Objective 2: Configure multi-stage alerting.
- Create a multi-condition alert in Grafana (e.g., alert on high CPU usage only if
high memory usage is also detected).
- Configure the alert to have multiple levels (e.g., warning and critical) with
different thresholds.

Task 9: Blackbox Monitoring with Prometheus

Objective 1: Install and configure the Blackbox Exporter.


- Download and run the Blackbox Exporter.
- Configure Prometheus to use the Blackbox Exporter to check the availability of
a set of web pages or services.
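
A scrape job for the Blackbox Exporter usually rewrites the target address so that Prometheus scrapes the exporter and passes the page to probe as a parameter. The sketch below writes such a job to a scratch file. It assumes the exporter runs locally on its default port 9115 with the http_2xx module from its default configuration; the file name and the probed URL https://example.com are hypothetical placeholders.

```shell
# Hypothetical file name; merge the job into the scrape_configs of your prometheus.yml.
SNIPPET_FILE="blackbox_scrape.example.yml"

cat > "$SNIPPET_FILE" <<'EOF'
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]          # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com   # assumed example page to probe
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the Blackbox Exporter itself (default port 9115).
      - target_label: __address__
        replacement: localhost:9115
EOF

echo "wrote $SNIPPET_FILE"
```

With this job in place, each probed URL shows up as its own instance, and the exporter reports whether the probe succeeded and how long it took.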
Objective 2: Create a Grafana panel visualizing the availability and response
times.
- Add panels that display the probe status (probe_success) and response times
(probe_duration_seconds) for the targets being monitored with the Blackbox Exporter.

Task 10: Backup and Restore

Objective 1: Backup your Prometheus data.


- Stop the Prometheus service.
- Create a backup of the Prometheus data directory.
- Start the Prometheus service.
Objective 2: Restore Prometheus from a backup.
- Stop the Prometheus service.
- Restore the data directory from the backup.
- Verify that the data is intact after starting Prometheus.
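
The backup and restore steps can be sketched as two small shell functions built on tar. The function names are hypothetical, and the example paths assume Prometheus was started from its extracted directory, where the data directory defaults to ./data. Stopping and starting Prometheus is left out because it depends on how you run it.

```shell
# backup_prometheus DATA_DIR ARCHIVE: pack the data directory into a tar.gz archive.
backup_prometheus() {
  local data_dir="$1" archive="$2"
  tar -czf "$archive" -C "$(dirname "$data_dir")" "$(basename "$data_dir")"
}

# restore_prometheus ARCHIVE PARENT_DIR: unpack the archive under PARENT_DIR,
# recreating the data directory with its original name.
restore_prometheus() {
  local archive="$1" parent_dir="$2"
  mkdir -p "$parent_dir"
  tar -xzf "$archive" -C "$parent_dir"
}

# Example (stop Prometheus first, start it again afterwards):
# backup_prometheus ./data prometheus-backup.tar.gz
# restore_prometheus prometheus-backup.tar.gz .
```

Archiving the directory by its base name keeps the archive independent of the absolute path, so it can be restored under a different parent directory if needed.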

By completing these advanced tasks, you will not only deepen your understanding of
Prometheus and Grafana's features and capabilities but also be better prepared to tackle
real-world monitoring and alerting challenges. Remember to always refer to official documentation
for detailed configurations and best practices.
