Kathy Mitton, Tivoli Storage Manager Server Manager
TSM Symposium, September 2011
TSM Administration Revisited
2011 IBM Corporation
Abstract
This session will explore TSM server operations, daily maintenance, and best practices to optimize your TSM server. The speaker will also discuss the administrative and reporting capabilities for the server along with examples and rationale for managing and scheduling server maintenance tasks.
Agenda
- Revisiting TSM practices
- Lifecycle
- Best Practices Workflow
- Scripts and Sequencing
- Schedules
- Operational Limits
- Monitoring
- References
Revisit TSM Practices Periodically To Keep Pace With Change
TSM environments are impacted by:
- data growth
- changing business needs
- people/personnel changes
Many TSM installations propagate new TSM instances based on:
- when the administrator first learned the product
- when the administrator last had to develop a solution for the challenges faced by the organization at that time
Frequently, newly deployed servers will duplicate existing servers in order to:
- simplify administration: the more uniform things are, the easier it is to administer
- leverage the time and effort spent in figuring out an earlier server deployment and configuration
Revisiting your TSM administrative setup may allow you to:
- further streamline your operations
- identify and exploit newer TSM functionality
Goals for TSM practices:
- simplify administration
- reduce potential for errors
TSM And Your Business Are Dynamic
TSM server and database (DB2) change based on:
- Growth
- Changes to H/W and vendors
Storage hierarchy changes based on:
- Capacity requirements
- Changes to H/W and vendors
- Performance and cost needs of an organization
Client workloads change over time:
- More clients
- More data
- Different types of clients
- Network/infrastructure changes
Take a look at your TSM environment:
- Are you using and exploiting the best functions and features TSM offers?
- Has your environment changed such that TSM is not being used optimally?
Have you looked at reporting lately?
Reporting and monitoring were introduced in V6.1.
Release-to-release improvements in:
- Install
- Configuration
- Deployment
- More reports
An aggregated view of reporting and monitoring for the entire TSM environment.
Have you looked at Administration Center lately?
With TSM 6.2, the Administration Center can be used to orchestrate the push of updates to Windows clients.
Utilizing Best Practice Work Flow Will Minimize Problems
Best practice concepts for Tivoli Storage Manager are based on:
- Field experiences
- Observations from customer implementations
- Feedback through problem reports, market requirements, business partner discussions, and other communication channels
- Development insight based on the design and implementation of the algorithms, processes, and code
Generally, the topics discussed are applicable to both V5.x and V6.x TSM servers
- There may be command syntax shown that is specific to a newer release
- Some topics or discussion points may be applicable only to a newer release
  - For example, the importance of BACKUP VOLHIST is much more significant for V6.x
Overview of Server Workflow Daily Cycle
[Diagram: the daily cycle over time, alternating between client workload (data ingest: backup, archive, HSM) and server workload (server maintenance activities).]
TSM Wheel of Life Overview
Whether viewed as a sine wave cycle or the Wheel of Life, some view of the cyclic nature of TSM operations and the daily support for these operations is helpful
Observing System Resource Relative to Workflow Cycle Will Help Provide Guidance For Changes
The peaks during workload are limited by the total available resources on the machine (CPU, memory, I/O throughput, etc.)
The client workloads are usually done using schedules
- Most often, the main data ingest is through a nightly backup window, which may be one or more schedules initiating the backup of various groups of clients
The server actions are the back-end maintenance actions necessary to:
- protect the client data by performing storage pool backups
- position the data appropriately in the hierarchy based on policies, storage management, and the data flow through the server
- perform the other server operations to keep the database, storage hierarchy, and system healthy and ready for the next set of actions
Client operations may happen (and often do) throughout the day
- For example, archive operations for DB logs can occur as needed, as opposed to being limited to the nightly ingest window
- Resources such as mount points need to be considered for these always-possible operations
Identify Peak Workload and Task Overlap For Workflow Improvement
During the client workload phase, the server resources (storage, CPU, memory, and I/O bandwidth) should be devoted to supporting the workload
- At the peak of the client workload, the majority of the server resources should be in support of the client workload
- At 6.x, we've tested over 500 concurrent client sessions and have seen that server performance can degrade between 500 and 700 concurrent sessions
- The number of concurrent sessions any single server might achieve will be highly dependent on server resources
During the server workload phase, the server resources are dedicated to managing the recently received data from the client workload
- These resources are necessary for the storage, policy management, and maintenance of the server
Optimal server size is based on whether all operations can complete in a 24-hour period
When workloads overlap or are not given sufficient resources, impacts may occur:
- Less CPU and memory available to support a given operation
- Performance degradation
- Insufficient space
- Data placement may be sub-optimal
- Operations may fail
Server Workflow Goals and Priorities
[Diagram: TSM server, database (DB2), and storage hierarchy.]
Disaster recovery and availability:
- Onsite recovery through DB restore or clustering (where available)
- Offsite recovery through DB restore + copy storage pools
- Other offsite recovery techniques
Protect the server:
- Data movement activities (reclamation, migration)
- Expiration
- Identify processing for deduplication-enabled environments
Protect the client data:
- Storage pool backup
- Copy active
- Database backup
Illustration and Best Practice Sequencing of Server Workflows
[Diagram: server workflows sequenced over time.]
Protecting the client data:
- STORAGE POOL BACKUP
- COPY ACTIVEDATA
- DATABASE BACKUP
- BACKUP VOLHIST/DEVCONFIG
Protecting the server:
- EXPIRE INVENTORY
- RECLAMATION
- MIGRATION
Prepare and execute for disaster recovery:
- DELETE VOLHIST
- MOVE DRMEDIA
- PREPARE
Also shown: Identify, Table Reorganization
Other Workflow Observations
Flows and sequencing are not absolute
- There may be reasons to sequence things differently
- Proof is in providing capabilities specific to your environment
Database backup for V6 changes the paradigms compared to V5
- TYPE=FULL is being used predominantly
  - This causes the pruning of the ARCHIVE log space
  - Necessary for proper care and feeding of server health
  - Helps manage or mitigate the amount of storage needed for archive logs
- In practice, TYPE=INCR may back up close to the same amount and take almost the same amount of time as TYPE=FULL
Take extra precautions to protect the volume history file for restore purposes
- Make multiple copies of the file
- Store to many different locations
- Critical for RESTORE processing for the server
- Unlike V5, V6 volume history can't be built by hand, and without volume history the TSM server is NOT restorable
Deduplication exhibits different characteristics than traditional server processes:
- IDENTIFY can be configured as ALWAYS on
  - We do see environments that use it as an always-on process
- Reclamation workloads increase based on having to reclaim based on policy deletion (expiration) and also deduplicated chunk elimination
V6 reorganization will require some isolated cycles
Employing Disruptive Technologies May Change Workflows
Disruptive technologies are being used to bring new capabilities to TSM environments
- TSM processes and activities may be eliminated or require re-sequencing as a result of disruptive technology
- External events or capabilities may affect what needs to be done within TSM
One example of disruptive technology is offsite disaster recovery/availability:
- Disk subsystem
  - Replication of TSM db/log/homeDir to target site
  - Replication of FILE or other DISK storage pools to target site
  - Critical consideration: consistency groups
- VTL
  - Replication of storage pool to target site
  - Critical consideration: how to reconcile against available server database at target site
- TSM V6 HADR
Disruptive Technologies Example
[Diagram: Server A (Primary) replicating to Server A (Disaster Recovery).]
Replication of server database (DB, log) using either:
- Device level replication with consistency groups
- V6.2 server using database HADR
Replication of storage pool(s) using:
- Disk device level replication with consistency groups
- VTL to VTL system replication
Validate Your Scripts Implement Best-Practice Workflow
We still see scripts not exploiting the PARALLEL and SERIAL sequencing enhancements, even though these script enhancements have been around for many years
- Many scripts may not be running optimally because they aren't exploiting the hardware
- Or they may not be sequenced optimally based on the total workload that needs to be accomplished
This workflow can be implemented via:
- Scheduled administrative actions (schedule of TYPE=ADMIN)
- Scripts
Important structure/sequencing commands:
- PARALLEL
  - Starts individual threads for each action within a parallel execution block
  - Synchronizes progress by awaiting results from each parallel execution task
- SERIAL
  - Indicates commands should be run serially, one after the other, on the server thread
  - Used to re-converge or synchronize from a PARALLEL set of operations to a SERIAL (single-threaded) set of operations
To exploit PARALLEL and SERIAL script constructs, commands should:
- Use WAIT=YES where available
- Also consider using DURATION=NN where necessary to better manage time
Illustration of PARALLEL and SERIAL
Single command (thread) until PARALLEL is encountered.
PARALLEL: 5 commands run in parallel.
SERIAL: re-converge to a single thread when the SERIAL keyword is encountered.
Example Script
PARALLEL
BACKUP STGPOOL X WAIT=YES
BACKUP STGPOOL Y WAIT=YES
BACKUP STGPOOL Z WAIT=YES
SERIAL
PARALLEL
MIGRATE STGPOOL X HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES
MIGRATE STGPOOL Y HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES
MIGRATE STGPOOL Z HIGHMIG=nn LOWMIG=mm RECLAIM=NO WAIT=YES
EXPIRE INVENTORY DURATION=qq RESOURCE=nn WAIT=YES
SERIAL
PARALLEL
RECLAIM STGPOOL X THRESHOLD=nn DURATION=qq WAIT=YES
RECLAIM STGPOOL Y THRESHOLD=nn DURATION=qq WAIT=YES
RECLAIM STGPOOL Z THRESHOLD=nn DURATION=qq WAIT=YES
SERIAL
BACKUP DB TYPE=FULL WAIT=YES
BACKUP VOLHIST FILENAMES=/path1/volhist,/path2/volhist,/path3/volhist
BACKUP DEVCONFIG FILENAMES=/path1/dc,/path2/dc,/path3/dc
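
To review how a script like this will be sequenced before it is run, it can help to group its lines into stages offline. The Python sketch below is an illustration only, not part of TSM: the function name script_stages is hypothetical, and it assumes the semantics described in this session (commands after PARALLEL run together until SERIAL, and commands after SERIAL run one at a time). The sample script is a shortened copy of the example above.

```python
def script_stages(lines):
    """Group script lines into stages.

    Commands that follow PARALLEL form one stage that runs concurrently.
    Commands that follow SERIAL each form their own single-command stage.
    This mirrors the description in the slides; it does not talk to a server.
    """
    stages = []
    parallel_group = None  # None means we are in serial mode
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        keyword = line.upper()
        if keyword == "PARALLEL":
            if parallel_group:
                stages.append(parallel_group)
            parallel_group = []
        elif keyword == "SERIAL":
            if parallel_group:
                stages.append(parallel_group)
            parallel_group = None
        elif parallel_group is not None:
            parallel_group.append(line)
        else:
            stages.append([line])
    if parallel_group:
        stages.append(parallel_group)
    return stages


# Shortened copy of the example script, used only as sample input.
SCRIPT = """
PARALLEL
BACKUP STGPOOL X WAIT=YES
BACKUP STGPOOL Y WAIT=YES
BACKUP STGPOOL Z WAIT=YES
SERIAL
BACKUP DB TYPE=FULL WAIT=YES
BACKUP VOLHIST FILENAMES=/path1/volhist
""".splitlines()

for number, stage in enumerate(script_stages(SCRIPT), start=1):
    print(f"stage {number}: {len(stage)} command(s) -> {stage}")
```

Each printed stage must finish before the next one starts, so the listing gives a quick view of how many tasks compete for resources at each step.
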
Script Illustrated
Parallel: Storage pool backups (x3)
Serial, then Parallel: Migration (x3) and Expiration
Serial, then Parallel: Reclamation (x3)
Serial: BACKUP DB, BACKUP VOLHIST, BACKUP DEVCONFIG
Use Simple Visualization to Validate Your Schedule Optimization
Using visualization techniques to depict schedules and their relationships to one another can be useful
Helpful to:
- Determine what is running when
- Identify the overlaps between schedules
- Understand, when schedules are running, what resources are needed and what load or constraints this puts on the server
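
Before drawing a chart, the same schedule window information can be checked for overlaps programmatically. The following Python sketch is one possible approach, not a TSM feature: the function names are hypothetical, the start times and durations are made-up sample values, and it assumes that no window crosses midnight.

```python
from itertools import combinations


def to_minutes(hhmm):
    """Convert an "HH:MM" string to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def find_overlaps(schedules):
    """Return (name_a, name_b, shared_minutes) for every overlapping pair.

    schedules maps a schedule name to (start "HH:MM", duration in hours).
    Assumption: no window crosses midnight.
    """
    windows = {}
    for name, (start, hours) in schedules.items():
        begin = to_minutes(start)
        windows[name] = (begin, begin + round(hours * 60))

    found = []
    for (a, (a0, a1)), (b, (b0, b1)) in combinations(windows.items(), 2):
        shared = min(a1, b1) - max(a0, b0)
        if shared > 0:
            found.append((a, b, shared))
    return found


# Hypothetical start times and durations, for illustration only.
SCHEDULES = {
    "StgPoolBk_Start": ("06:00", 2.0),
    "Migr_Start": ("07:30", 1.5),
    "Expiration": ("09:00", 1.0),
    "DB Backup": ("10:00", 1.0),
}

for a, b, minutes in find_overlaps(SCHEDULES):
    print(f"{a} overlaps {b} for {minutes} minutes")
```

Windows that only touch end to start are not reported, so back-to-back schedules count as correctly sequenced.
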
Administrative Schedules with Overlaps
[Chart: Administrative Schedule Overlap and Sequencing (Current). Time of day from 0:00 to 22:00 for the schedules DB Backup, Backup VolHist, Expiration, Reclaim_Copy, Reclaim_Tape, StgPoolBk_Start, and Migr_Start, with two overlaps marked (Overlap 1 and Overlap 2).]
An example of using a spreadsheet and schedule window information to visualize schedule sequencing and overlaps.
Administrative Schedules Reconfigured to Reduce Overlap
[Chart: Administrative Scheduling Overlap and Sequencing (Proposed). Time of day from 0:00 to 22:00 for the schedules Backup Stgpool, Migration, Expiration, Reclamation, DB Backup, and Prepare.]
Proposed adjustments:
- Eliminate most overlap
- The only remaining overlap is expiration and migration, which generally contend for different resources
Improve Reliability By Staying Within TSM Operational Limits
The server operational limit is the point at which:
- One or more critical server health and maintenance operations cannot be contained within the time available to do them
- Available system resources (CPU, RAM, etc.) are exhausted at or prior to peak load being satisfied and supported
Many different indicators exist that may signal a server is at its operational limits. Depending upon how/why the limit is reached, symptoms may be:
- Degraded performance
- Failed operations
- No overt sign at all, with the limit not known until disaster recovery of the server or clients is needed
TSM Operational Limits: Database Backup
Backing up the server database is critical to the health and maintenance of the server
- Required in order to prune space from ARCHIVE log directories
- Needed in order to restore the database and recover from local device failure for database or active log
TSM V6.1 was tested with database sizes up to 1 TB; V6.2 was tested up to 2 TB
- Operational limits due to DB size may be encountered PRIOR to these upper-end tested values
- Will vary by customer based on workload, infrastructure, and organizational requirements (RTO)
DB backup duration should periodically be evaluated to determine if a limit is reached and another server is needed:
- Time needed to back up the database exceeds the time available
- Time to restore the database is longer than the recovery time objective (RTO) for a disaster recovery (real or simulated)
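
The two checks above are simple comparisons, so they are easy to automate as part of a periodic review. The Python sketch below is only an illustration: the function name and the sample hours are hypothetical, and the measured durations would have to come from your own records of backups and restore tests.

```python
def db_backup_limit_reasons(backup_hours, window_hours, restore_hours, rto_hours):
    """Return the reasons, if any, that the DB backup operational limit is reached."""
    reasons = []
    if backup_hours > window_hours:
        reasons.append("backup time exceeds the time available")
    if restore_hours > rto_hours:
        reasons.append("restore time exceeds the RTO")
    return reasons


# Hypothetical measurements: 3 h backup in a 4 h window, 6 h restore against a 4 h RTO.
print(db_backup_limit_reasons(3.0, 4.0, 6.0, 4.0))
```

An empty list means neither condition is met for the values supplied; it does not rule out other operational limits.
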
TSM Operational Limits: Saturation
TSM is limited to the hardware and resources available to it
Operational limit may be reached when:
- Server overruns/saturates available CPU on the system at peak workload or at less than peak workload
- Server overruns available RAM on the system and drives high pagefile use
- I/O bandwidth is saturated:
  - DB or active log performance degraded because I/O can't keep up
  - Storage pool actions performance degraded because I/O can't keep up
- Saturation or overrun of CPU, RAM, or I/O bandwidth is reached at or prior to achieving peak workload
  - For example, in the lab we've demonstrated that more than 1500 concurrent client sessions to the server pushed it to saturation with available memory and CPU such that performance degraded significantly
Process Rule of Thumb
Many server processes can be run explicitly or implicitly in parallel (concurrently).
- Example of explicit: scheduling migration and expiration to run at the same time
- Example of implicit: expiration (V6) provides the RESOURCE parameter, which indicates the number of threads that will be used to perform the process
As a general rule of thumb, do not specify more concurrent processes than there are CPU cores on the machine
- For example, a 16-way box running expiration and deduplication identify could be configured to run with:
  - Expiration RESOURCE= set to a value in the range of 4 to 8
  - 4-6 identify processes running
- In this case, if 8 and 6 were used, that would provide 8 expiration threads and 6 identify processes
  - These 14 tasks would roughly align with 14 of the available 16 CPU cores on the box
  - This would leave 2 CPU cores for support of other operations and tasks
This is not an absolute and will vary by processor architecture and system workload.
- This is a starting point for server configuration and for avoiding overrunning the available resources
- Tied to a monitoring strategy where I/O, CPU, and memory are being watched, this will assist with managing an appropriate workload and sequencing on a server
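
The arithmetic behind this rule of thumb can be written down as a small budget check. The Python sketch below uses the figures from the 16-way example; the function name core_budget and the dictionary labels are hypothetical, and the result is a planning aid rather than a guarantee of how the server will behave.

```python
def core_budget(cpu_cores, concurrent_tasks):
    """Return (tasks_in_use, cores_left) for a planned set of concurrent tasks.

    concurrent_tasks maps a label to the number of threads or processes planned.
    A negative cores_left means the plan exceeds the rule of thumb.
    """
    in_use = sum(concurrent_tasks.values())
    return in_use, cpu_cores - in_use


# Figures from the 16-way example: 8 expiration threads and 6 identify processes.
plan = {"expiration RESOURCE": 8, "identify processes": 6}
in_use, left = core_budget(16, plan)
print(f"{in_use} tasks planned, {left} cores left for other work")
```
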
TSM Operational Limits: Possibilities
Many operational limits are resource or time based
Taking steps to improve infrastructure may result in faster operations and may mitigate or remedy the operational limit
For example, if the operational limit is database backup:
- Using a faster device for the DB backup may eliminate the limit
- Improving the I/O subsystem and bandwidth for the DB and logs may address the issue
In cases where it is not possible or practical to resolve the limit via improved or changed infrastructure, this may represent a cap for the existing server and a need to implement another TSM server and balance workload onto it
Monitoring: What's Available
TSM provides:
- Messages
  - Logged to the local server activity log
  - Event routing to many different targets
- Summary information
- Accounting records
- V6 Reporting and Monitoring
- Administration Center
Third-party vendor tools are also available:
- Reporting tools
- Monitoring capabilities and tools
TSM Best Practice Monitoring May Involve More than Simply TSM
TSM is a large, multi-threaded software application. It exploits or has dependencies on:
- CPU: the application and database perform many calculation/instruction-intensive operations
- I/O to disk:
  - This relates to the database, active log, and archive log
  - Bottlenecks such as not enough parallel I/O capability or insufficient bandwidth (small channels) can affect server performance, scalability, and throughput
- I/O to storage hierarchy:
  - This can be disk (TSM device classes of type DISK or FILE) and sequential media (real tape and VTLs)
  - Often controllers or other virtualized appliances are used for storage devices (SVC, VTL, etc.)
  - Devices may be locally attached (SCSI) or fiber attached (SAN)
- Network: TSM is a client/server application with its client operations almost entirely network driven
Consider Monitoring Items Internal and External to TSM
External to TSM:
- TSM is application software. It relies upon:
  - Operating system
  - Drivers
  - Devices
  - Firmware
Internal to TSM:
- Client operations
- Server processes
- Other server tasks such as memory management, scheduling, etc.
TSM Topology: External to TSM
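[Diagram: TSM topology external to TSM. The server/host (AIX, TSM server) connects through a NIC to the LAN/WAN, through an HBA to the SAN, and through SCSI to direct-attach devices (disk, tape).]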
Monitoring the Server/Host and Local Devices
Example: System is AIX
- Most host hardware (NIC, HBA, SCSI, planar, etc.) logs to the system error log
  - errpt -a to see device-reported errors
- Monitor system resources:
  - Tools like topas or others can be used to periodically look at the system and assess health
  - topas can be configured to run periodically (AIX 6.1)
    - Inittab entry: /usr/bin/topasrec -L -s 300 -R 1 -r 6 -o /etc/perf/daily/ -ypersistent=1
    - -r 6 indicates how many to keep around
    - Raw data viewed/formatted using the topasout command
    - Stats can be collected over time to see historical trends
- Check for filesystems running low on space
  - In particular /, /opt,
- Monitor paging space
  - Growth of page file use without corresponding increases in load or activity may indicate a resource leak and can affect performance or ultimately lead to a server termination
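
The filesystem space check can also be scripted so that it runs on a timer rather than by hand. The Python sketch below is an assumption-laden example, not an AIX or TSM tool: the function names and the 10 percent threshold are hypothetical, and only "/" is checked so that the example runs anywhere.

```python
import shutil


def percent_free(total_bytes, free_bytes):
    """Return free space as a percentage of total (0.0 if total is zero)."""
    if total_bytes == 0:
        return 0.0
    return 100.0 * free_bytes / total_bytes


def low_space(paths, min_free_percent=10.0):
    """Return the paths whose filesystem has less free space than the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        if percent_free(usage.total, usage.free) < min_free_percent:
            low.append(path)
    return low


# Only "/" is checked here; add other mount points of interest, such as /opt.
print(low_space(["/"]))
```

Recording the percentages over time, rather than only the alerts, also supports the historical trend view mentioned above.
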
Monitoring the LAN/WAN
Usually monitored during exception or periods of degradation
Network teams/owners typically have monitoring tools in place to:
- Identify and alert to outages
- Identify and alert to degradation
From the TSM perspective:
- Symptoms would be failed client operations due to communication issues (socket error, send error, receive error)
- Not usually evaluated or investigated unless issues are occurring
Monitoring for a SAN and SAN attached devices
Errors are not centrally logged
- Usually logged to the specific device that is generating the event
- Oftentimes individual error logs need to be accessed and evaluated
Errors can be anywhere in the chain from the HBA to the device:
- Fiber Channel switch, router gateway
- Library
- Drive
- Disk array
- SAN controller (SVC or equivalent)
Virtualization can hide/mask errors
- VTL, SAN controller, etc. are systems unto themselves running an embedded host, OS, drivers, devices, etc.
- Evaluation of health may require vendor involvement, as the relationship between logs, devices, and errors or symptoms may not be surfaced to the end user
TSM Monitoring: Client Operations
Client operations:
- TSM clients (backup/archive, TDP, API, etc.) are the core end users representing the data being protected by TSM
- Operations may be:
  - Scheduled
  - Manually initiated
  - Automatically initiated (such as archiving log files from a DB2/UDB client)
Monitor for:
- Failed sessions or schedules
- Schedule issues such as missed or failed schedule events
- Unusual or unexpected session termination
Clients log ANExxxx messages to the server
- If the client encounters a local error (an error on the client system) while an operation is in flight, many of these will be reported and logged to the server
Client operations record summary information (SELECT * FROM SUMMARY) as well as logging messages to the server activity log
TSM Monitoring: Server Processes
Server processes:
- Server maintenance tasks such as expiration, storage pool backup, migration, and reclamation are done as server processes
- Many processes can be WAIT=YES (synchronous) or WAIT=NO (asynchronous)
- Server processes ALL issue process started and ended messages
  - End messages report statistics as applicable
  - End messages report SUCCESS or FAILURE of an operation
Monitor for:
- Failed processes
- Cancelled processes will report as failed; there should be other messages logged to the activity log indicating the cancellation
If processes are not succeeding, evaluate:
- Was this an appropriate time/reason for the process to be run?
- If it is an insufficient resource issue, can additional resources be made available?
- Or can the process be initiated at a different time when resources are available?
Server processes record summary information (SELECT * FROM SUMMARY) as well as logging messages to the server activity log
TSM Monitoring: Analysis Using Message Tokens
Activity log messages are tagged with session and process tokens.
- Session token example: ANR1234E xxxxx (SESSION: 985)
- Process token example: ANR2345E xxxx (PROCESS: 38291)
- Session and process token example: ANR3456E xxxx (SESSION: 765 PROCESS: 998512)
Use:
- If a process or session encounters an error, query the activity log using the session or process number to see all the messages for that action
- Other messages before and after the failure may be found easily in this fashion
  - For example: QUERY ACTLOG SEARCH=(SESSION: 985) to search for all the messages for session 985
- More details about the events leading up to the error, or about the actual error itself, may be seen by doing this
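
When activity log output has been saved to a file, the same tokens can be used to filter it offline. The Python sketch below is a minimal example that assumes the token layout shown in the sample messages above; the function names are hypothetical, and the log lines are the placeholder messages from this slide, not real server output.

```python
import re

SESSION_RE = re.compile(r"SESSION: (\d+)")
PROCESS_RE = re.compile(r"PROCESS: (\d+)")


def tokens(message):
    """Return (session, process) numbers found in a message, or None for each."""
    session = SESSION_RE.search(message)
    process = PROCESS_RE.search(message)
    return (
        int(session.group(1)) if session else None,
        int(process.group(1)) if process else None,
    )


def messages_for_session(log_lines, session_id):
    """Return every log line tagged with the given session number."""
    return [line for line in log_lines if tokens(line)[0] == session_id]


# Placeholder messages copied from the examples above.
LOG = [
    "ANR1234E xxxxx (SESSION: 985)",
    "ANR2345E xxxx (PROCESS: 38291)",
    "ANR3456E xxxx (SESSION: 765 PROCESS: 998512)",
]

for line in LOG:
    print(tokens(line), line)
print(messages_for_session(LOG, 985))
```

Filtering by process number works the same way using the second element of the returned pair.
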
Conclusion
Server workflow priorities:
- Protect client data
- Maintain the server
- Protect the server
Priorities then provide sequencing of actions, which can be orchestrated via scheduled (TYPE=ADMIN) scripts
- Scripts are structured using PARALLEL and SERIAL semantics to sequence actions and manage resources while satisfying the workflow priority actions
Operational limits:
- Have been defined
- Steps to identify them and possible actions have been discussed
Monitoring considerations have been discussed for:
- Server topology
- The server itself
A few useful links
V6 Deployment best practices: http://www-01.ibm.com/support/docview.wss?uid=swg21421060
Database Reorganization: http://www-01.ibm.com/support/docview.wss?uid=swg21452146
Memory requirements for V6: http://www-01.ibm.com/support/docview.wss?uid=swg21450229
TSM configured for HADR: http://www.ibm.com/developerworks/wikis/display/tivolistoragemanager/Electronic+vaulting+using+deduplicated+remote+copy+storage+pools