
HCIE-Storage V2.5 Learning Guide

The Huawei HCIE-Storage Learning Guide provides comprehensive information on Huawei's storage solutions, including hybrid flash, all-flash, and distributed storage systems. It covers technical deep dives, storage design, performance tuning, solutions for backup and disaster recovery, and maintenance and troubleshooting practices. The document emphasizes the advanced architectures and technologies that enhance performance, reliability, and scalability for enterprise storage needs.


Huawei Storage Certification Training

HCIE-Storage
Learning Guide

V2.5

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2020. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any
means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

HUAWEI and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of
their respective holders.

Notice
The purchased products, services and features are stipulated by the contract made
between Huawei and the customer. All or part of the products, services and features
described in this document may not be within the purchase scope or the usage scope.
Unless otherwise specified in the contract, all statements, information, and
recommendations in this document are provided "AS IS" without warranties,
guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has
been made in the preparation of this document to ensure accuracy of the contents, but
all statements, information, and recommendations in this document do not constitute
a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.


Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129
People's Republic of China

Website: http://e.huawei.com

Huawei Proprietary and Confidential


Copyright © Huawei Technologies Co., Ltd
HCIE-Storage Learning Guide Page 2

Contents

1 Storage Technical Deep Dive ............................................................................................. 1


1.1 Hybrid Flash Storage Technical Deep Dive................................................................................................................1
1.1.1 Product Positioning .......................................................................................................................................................1
1.1.2 Software and Hardware Architectures ....................................................................................................................2
1.1.3 Key Technologies ...........................................................................................................................................................5
1.1.4 Application Scenarios ...................................................................................................................................................8
1.2 All-Flash Storage Technical Deep Dive .......................................................................................................................9
1.2.1 Product Positioning .......................................................................................................................................................9
1.2.2 Software and Hardware Architectures ................................................................................................................. 10
1.2.3 Key Technologies ........................................................................................................................................................ 13
1.2.4 Application Scenarios ................................................................................................................................................ 21
1.3 Distributed Storage Technical Deep Dive ............................................................................................................... 22
1.3.1 Product Positioning .................................................................................................................................................... 22
1.3.2 Software and Hardware Architectures ................................................................................................................. 23
1.3.3 Application Scenarios ................................................................................................................................................ 32
2 Storage Design and Performance Tuning .................................................................... 33
2.1 Storage Resource Planning and Design................................................................................................................... 33
2.1.1 Overview ....................................................................................................................................................................... 33
2.1.2 Methods and Processes ............................................................................................................................................ 35
2.1.3 Application Cases........................................................................................................................................................ 40
2.2 Storage System Performance Tuning....................................................................................................................... 57
2.2.1 Overview ....................................................................................................................................................................... 57
2.2.2 Application-Layer Performance Tuning................................................................................................................ 63
2.2.3 Host-Layer Performance Tuning ............................................................................................................................ 67
2.2.4 Network-Layer Performance Tuning..................................................................................................................... 81
2.2.5 Storage-Layer Performance Tuning ...................................................................................................................... 84
3 Storage Solution ............................................................................................................... 97
3.1 Application Practices of the Storage Backup Solution ........................................................................................ 97
3.1.1 Overview ....................................................................................................................................................................... 97
3.1.2 Backup System ............................................................................................................................................................ 97
3.1.3 Solution Planning ....................................................................................................................................................... 98
3.2 Application Practices of the Active-Standby Storage DR Solution ................................................................. 101
3.2.1 Overview ..................................................................................................................................................................... 101
3.2.2 Planning ...................................................................................................................................................................... 102

3.3 Application Practices of the HyperMetro Storage DR Solution ...................................................................... 109


3.3.1 Overview ..................................................................................................................................................................... 109
3.3.2 Technologies .............................................................................................................................................................. 110
3.3.3 Planning ...................................................................................................................................................................... 115
3.4 Application Practices of the Storage Data Migration Solution ..................................................................... 119
3.4.1 Overview ..................................................................................................................................................................... 119
3.4.2 Technologies .............................................................................................................................................................. 120
3.4.3 Planning ...................................................................................................................................................................... 127
4 Storage Maintenance and Troubleshooting .............................................................. 130
4.1 Storage System O&M Management ...................................................................................................................... 130
4.1.1 Overview ..................................................................................................................................................................... 130
4.1.2 Tools............................................................................................................................................................................. 131
4.1.3 Management ............................................................................................................................................................. 136
4.2 Storage System Troubleshooting ............................................................................................................................ 141
4.2.1 Overview ..................................................................................................................................................................... 141
4.2.2 Preparations ............................................................................................................................................................... 142
4.2.3 Case Analysis ............................................................................................................................................................. 143

1 Storage Technical Deep Dive

1.1 Hybrid Flash Storage Technical Deep Dive


1.1.1 Product Positioning
Huawei storage can be classified into all-flash storage, hybrid flash storage, and
distributed storage.
Huawei all-flash storage is built on the next-generation Kunpeng hardware platform and
uses SmartMatrix to establish a full-mesh, end-to-end NVMe architecture. It supports
multiple advanced protection technologies such as RAID-TP to tolerate failure of seven
out of eight controllers. In addition, the use of FlashLink and intelligent chips accelerates
service processing from end to end.
Huawei hybrid flash storage leverages a brand-new hardware architecture and intelligent
processors to accelerate services. It supports flexible scale-out and balances loads among
controllers for hot backup to ensure system reliability. Storage faults are transparent to
hosts. In addition, it converges SAN and NAS on a unified platform for easy resource
management.
Huawei distributed storage integrates HDFS, block, object, and file storage services. It
supports erasure coding and FlashLink and allows coexistence of x86 and Kunpeng
platforms. It also provides service acceleration and intelligent I/O scheduling.
Business development leads to an increasing amount of service data, which poses
ever-higher demands on storage systems. Traditional storage systems cannot meet
these demands and face bottlenecks such as inflexible performance expansion,
complex management of diverse devices, difficulty in reusing legacy devices, and
maintenance costs that occupy a large part of the total cost of ownership (TCO). To
address these problems, Huawei launched OceanStor hybrid flash storage systems.
The storage systems incorporate block- and file-level data services and storage protocols
and provide industry-leading performance and a variety of efficiency improvement
mechanisms. All these advantages provide customers with comprehensive high-
performance storage solutions, maximize customers' return on investment (ROI), and
meet the requirements of large-scale databases for online transaction processing (OLTP)
or online analytical processing (OLAP), high-performance computing, digital media,
Internet operation, central storage, backup, disaster recovery, and data migration.
Featuring the cutting-edge hardware structure and block-and-file unified software
architecture, Huawei OceanStor hybrid flash storage systems combine advanced data
application and protection technologies to meet the storage requirements of medium-
and large-sized enterprises for high performance, scalability, reliability, and availability.

Brand-new architecture: The latest-generation multi-core CPU and SmartMatrix 3.0


architecture enable the storage systems to support up to 32 controllers and 192 PB of all-
flash capacity for linear performance increase.
Ultimate convergence: SAN and NAS are converged to provide elastic storage, simplify
service deployment, improve storage resource utilization, and reduce TCO.
Outstanding performance: The flash-optimized technology gives full play to SSD
performance. Inline deduplication and compression are supported. Loads are balanced
among controllers that serve as hot backup for each other, delivering higher reliability.
Resources are centrally stored and easily managed.
For more information, log in to https://support.huawei.com to obtain the relevant
product documentation.

1.1.2 Software and Hardware Architectures


1.1.2.1 Hardware Architecture
The 7-nanometer Arm processors with high performance and low power consumption
simplify the design of the storage printed circuit board (PCB), occupy less internal space,
and offer better heat dissipation. With a compact hardware design, the storage systems
can provide more interface modules for customers in smaller footprints and less power
consumption.
The CPUs and control modules are switched to the Huawei-developed Kunpeng
architecture. Onboard fan modules and BBUs are smaller in size. Two hot-swappable
interface modules are added, but FCoE and IB ports are not supported. Back-end disk
enclosures support SAS 3.0 and Huawei-developed RDMA high-speed ports.
For details about product forms, log in to the Huawei Data Storage Infocenter, and
access 3D Interactive Multimedia to obtain 3D images.
For more hardware information, see the corresponding product documentation.
 Front-end and back-end full interconnection architecture
The front-end and back-end interconnect I/O modules distribute data flows among
controllers for load balancing. The new interface modules are fully shared by all
controllers, can be flexibly deployed, and provide high bandwidth.
Full interconnection among four controllers: The FC front-end interface modules and
back-end interface modules are fully interconnected with all controllers to avoid I/O
forwarding at the front end and back end.
Upgrade over a single link: When a host connects to a single controller and the
controller is upgraded, front-end interface modules automatically forward I/Os to
other controllers without affecting the host.
No link disconnection upon reset: In the event of a controller reset or fault, interface
modules automatically forward I/Os to other controllers without affecting the host.
Multi-controller redundancy: The system tolerates the failures of three out of four
controllers.
Next-generation power protection: BBUs are built into controllers. If a controller is
removed, its BBU supplies power to flush cache data to system disks. Concurrent
removal of multiple controllers does not cause data loss.

 Full hardware redundancy


All components and channels of the system adopt full redundancy design,
eliminating single points of failure. Each component and channel can detect, repair,
and isolate faults independently to ensure stable system running.
 Guard-style data encryption
Guard-style data encryption consists of a Huawei storage system, guard-style
encryption machine, and key management center (KMC). The guard-style encryption
machine implements data encryption and decryption. The storage system stores
encrypted data. The KMC manages keys and controls their life cycles.
Note: The guard-style encryption machine supports only Fibre Channel links. It does
not support NAS protocol encryption or value-added features such as snapshot,
clone, remote replication, and LUN migration.
 SED
The self-encrypting drives (SEDs) work with the internal key manager of the storage
system or the external key manager to implement static data encryption and ensure
data security.
1. Internal key manager: SEDs use the AES 256 encryption algorithm to encrypt
data without affecting storage performance. The internal key manager is a built-
in key management application in the storage system. Huawei hybrid flash
storage supports TPM for key protection. The internal key manager is easy to
deploy, configure, and manage. Therefore, you do not need to deploy an
independent key management system. It manages the lifecycle of AKs of SEDs.
The internal key manager provides rich functions, such as key generation,
update, destruction, backup, and restoration. If there is no higher security
requirement and the key management of the entire data center is used only for
the storage system, you are advised to use the internal key manager.
2. External key manager: uses the standard KMIP and TLS protocols. The storage
system can use a third-party key manager server (KMS) to manage keys for data
encryption. If centralized key management is required when multiple data
centers exist, you are advised to use the external key manager. The external key
manager provides rich functions, such as key generation, update, destruction,
backup, and restoration, and supports the high availability (HA) mode. Keys are
synchronized between two KMSs in real time to ensure key reliability.
The data encryption feature uses the AES 256 algorithm to encrypt user data on
storage to ensure the confidentiality, integrity, and availability of user data. The AES
algorithm is based on substitution and permutation. AES adopts several different
methods to perform substitution and permutation. It is an iterative, symmetric key
block cipher that can use 128-bit, 192-bit, and 256-bit keys. In addition, it can use
128-bit (16-byte) blocks to encrypt and decrypt data. Unlike a public-key cipher,
which uses a key pair, a symmetric-key cipher uses the same key to encrypt and
decrypt data. The block cipher returns encrypted data with the same number of bits
as the input data. Iterative encryption uses a loop structure in which the input data is
repeatedly substituted and permuted.
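AES itself is not part of the Python standard library, but the symmetric-key property described above can be illustrated with a toy cipher. The XOR "cipher" below is purely illustrative (it is not AES and offers no real security); it shows that the same key both encrypts and decrypts, and that the output has the same size as the input:

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric 'cipher': XOR with a repeating key.
    Illustrative only -- NOT AES and NOT secure."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"32-byte-demo-key-0123456789abcd!"   # 32 bytes, like an AES-256 key
plaintext = b"user data block"

ciphertext = xor_cipher(plaintext, key)
recovered = xor_cipher(ciphertext, key)     # the same key decrypts

assert recovered == plaintext               # symmetric: one key for both directions
assert len(ciphertext) == len(plaintext)    # output size equals input size
```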
SEDs provide two-layer security protection by using an authentication key (AK) and a
data encryption key (DEK).

1. AK mechanism: After data encryption has been enabled, the storage system
activates the AutoLock function of SEDs and uses AKs assigned by a key
manager. Access to the SEDs is protected by AutoLock and only the storage
system itself can access its SEDs. When the storage system accesses an SED, it
acquires an AK from the key manager. If the AK is consistent with the SED's, the
SED decrypts the data encryption key (DEK) for data encryption/decryption. If
the AKs are inconsistent, read and write operations will fail.
2. DEK mechanism: After AutoLock authentication has passed, the SED uses its
hardware circuits and internal DEK to encrypt or decrypt the data that is written
or read. The DEK encrypts data after it has been written into disks. The DEK
cannot be acquired separately, which means the original information on an SED
cannot be recovered in a mechanical way after it is removed from the storage
system.
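The two-layer AK/DEK scheme can be sketched as follows. The class and the key-wrapping method below are illustrative stand-ins, not Huawei's actual on-disk implementation: the DEK exists only in wrapped form, and a wrong AK fails the AutoLock check before the DEK can ever be used.

```python
import hashlib

class ToySED:
    """Sketch of an SED's two-layer key scheme (illustrative only)."""

    def __init__(self, ak: bytes, dek: bytes):
        # The DEK is stored only in wrapped (encrypted) form; the AK unwraps it.
        self._ak_hash = hashlib.sha256(ak).digest()
        self._wrapped_dek = bytes(d ^ k for d, k in zip(dek, self._ak_hash))

    def unlock(self, ak: bytes) -> bytes:
        """AutoLock check: the presented AK must match before the DEK is usable."""
        ak_hash = hashlib.sha256(ak).digest()
        if ak_hash != self._ak_hash:
            raise PermissionError("AK mismatch: read and write operations fail")
        return bytes(d ^ k for d, k in zip(self._wrapped_dek, ak_hash))

dek = b"0123456789abcdef0123456789abcdef"
sed = ToySED(ak=b"key-from-key-manager", dek=dek)
assert sed.unlock(b"key-from-key-manager") == dek   # correct AK recovers the DEK
try:
    sed.unlock(b"wrong-ak")
except PermissionError:
    pass  # wrong AK: access denied, DEK never exposed
```

Because the DEK is never stored in the clear, removing the disk from the storage system does not allow the original data to be recovered, matching the behavior described above.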

1.1.2.2 Software Architecture


 RAID 2.0+
Different from traditional RAID that has fixed member disks, RAID 2.0+ enables block
virtualization of data on disks. All disks in a storage system are divided into chunks at
a fixed size. Multiple chunks from disks are automatically selected at random to form
a chunk group (CKG) based on the RAID algorithm. A CKG is further divided into
extents at a fixed size. These extents are allocated to different volumes. Volumes are
presented as LUNs or file systems.
Underlying media virtualization and upper-layer resource virtualization implement
rapid data reconstruction and intelligent resource allocation, which shorten the data
reconstruction time from 10 hours to 30 minutes, improve the reconstruction speed
by 20 times, and significantly reduce the impact on services and the probability of
multi-disk failures during the reconstruction process. All disks in a storage pool
participate in reconstruction and only service data is reconstructed. The
reconstruction mode is changed from the traditional many-to-one to many-to-many.
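The chunk/CKG/extent layering described above can be sketched as follows. The chunk and extent sizes and the RAID member counts are illustrative values, not product defaults:

```python
import random

CHUNK_SIZE_MB = 64          # fixed chunk size (illustrative value)
EXTENT_SIZE_MB = 4          # fixed extent size (illustrative value)

def build_ckg(disks, members, parity):
    """Pick one free chunk from each of `members` randomly chosen disks
    to form a chunk group (CKG) under a RAID-style layout."""
    chosen = random.sample(disks, members)           # chunks come from different disks
    ckg = [(d["id"], d["free_chunks"].pop()) for d in chosen]
    data_chunks = members - parity
    extents = (data_chunks * CHUNK_SIZE_MB) // EXTENT_SIZE_MB
    return ckg, extents

# 8 disks, each divided into fixed-size chunks
disks = [{"id": f"disk{i}", "free_chunks": list(range(100))} for i in range(8)]

# e.g. an 8-member CKG with 2 parity chunks (RAID 6-like layout)
ckg, extent_count = build_ckg(disks, members=8, parity=2)

assert len({disk_id for disk_id, _ in ckg}) == 8    # every chunk is on a distinct disk
assert extent_count == (6 * 64) // 4                # usable capacity carved into extents
```

Because chunks are drawn from many disks, a disk failure affects CKGs spread across the pool, which is what allows many-to-many reconstruction of only the service data.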
 Convergence of SAN and NAS
For hybrid flash storage, the software protocol stacks of NAS and SAN can be used at
the same time, and NAS and SAN are converged on the resource allocation and
management plane.
 SmartMatrix 3.0 for full load balancing
The SmartMatrix 3.0 architecture of the OceanStor Kunpeng series features full
switching, full virtualization, full redundancy, and service load balancing. Also, it
integrates advanced technologies such as E2E data integrity field (DIF), memory error
checking and correcting (ECC), and transmission channel cyclic redundancy check
(CRC). By combining these features and technologies, SmartMatrix 3.0 enables linear
performance growth, promises the maximum scalability, and delivers 24/7 high
availability. Therefore, the OceanStor Kunpeng series are capable of satisfying critical
service demands of large data centers.
First, it balances load among the four controllers: the cache of each controller's
LUNs is evenly and persistently mirrored to the other three controllers.
Second, the four controllers implement balanced takeover: if one controller fails,
the remaining three take over all of its LUNs.

Finally, the home controller of a LUN or file system can be switched over among the
four controllers within seconds.
 End-to-end data integrity protection
The ANSI T10 Protection Information (PI) standard provides a way to check data
integrity during access to a storage system. The check is implemented based on the
PI field defined in the T10 standard. This standard adds an 8-byte PI field to the end
of each data block to implement data integrity check. In most cases, T10 PI is used to
ensure data integrity within a storage system.
Data Integrity Extensions (DIX) further extends the protection scope of T10 PI and
implements data integrity protection from applications to host HBAs. Therefore, a
combination of DIX and T10 PI can implement complete end-to-end data protection
from applications to disks.
Huawei hybrid flash storage system not only uses T10 PI to ensure the integrity of
internal data, but also supports a combination of DIX and T10 PI to protect end-to-
end data integrity from applications to disks. The storage system validates and
delivers the PI field in real time. If the host does not support PI, the storage system
adds the PI field to a host interface and delivers the field. In a storage system, PI
fields are forwarded, transmitted, and stored together with user data. Before user
data is read by a host again, the storage system uses PI fields to check the
correctness and integrity of user data.
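The 8-byte PI field (guard, application tag, and reference tag) can be sketched as follows. Here `zlib.crc32` stands in for the CRC-16 guard that T10 actually specifies, so this is a simplified illustration rather than a standards-conformant implementation:

```python
import struct
import zlib

SECTOR = 512  # bytes of user data per block

def add_pi(block: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Append an 8-byte PI field: 2-byte guard, 2-byte application tag,
    4-byte reference tag (guard simplified to truncated CRC-32)."""
    guard = zlib.crc32(block) & 0xFFFF
    return block + struct.pack(">HHI", guard, app_tag, ref_tag)

def check_pi(protected: bytes, ref_tag: int) -> bytes:
    """Verify guard and reference tag before returning data to the host."""
    block, pi = protected[:SECTOR], protected[SECTOR:]
    guard, _app, ref = struct.unpack(">HHI", pi)
    if guard != (zlib.crc32(block) & 0xFFFF) or ref != ref_tag:
        raise IOError("PI check failed: data corrupted or misplaced")
    return block

data = bytes(range(256)) * 2                      # one 512-byte sector
stored = add_pi(data, app_tag=0, ref_tag=7)       # ref tag ~ logical block address
assert len(stored) == SECTOR + 8                  # 8-byte PI appended per block
assert check_pi(stored, ref_tag=7) == data        # clean read passes the check

try:
    check_pi(stored, ref_tag=8)                   # block returned for the wrong address
except IOError:
    pass  # misplaced data is detected via the reference tag
```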
 Controller faults transparent to hosts
If any storage controller is faulty, the front-end interface modules quickly redirect the
I/Os of the faulty controller to other controllers. The host is unaware of the fault. The
Fibre Channel links remain connected, and services are running properly. No alarm or
event is reported.

1.1.3 Key Technologies


Huawei hybrid flash storage supports concurrent access of SAN and NAS, providing
optimal access paths for different services to achieve optimal access performance.
Integration of block and file storage eliminates the need for NAS gateways and reduces
procurement costs. It applies to industries such as government, transportation, finance,
and carriers, and scenarios such as databases, video surveillance, and VDI.
The block and file service features of Huawei hybrid flash storage support many
advanced functions. For details, see the training materials.
 SmartTier
1. Block data tiering: Based on the performance requirements of applications,
SmartTier separates SSDs, SAS disks, and NL-SAS disks into the high-
performance, performance, and capacity storage tiers. Each tier can be used
independently, or two or three tiers can be combined to provide data storage
space. SmartTier goes through three phases: data monitoring, data placement
analysis, and data relocation. Data monitoring and data placement analysis are
automated by the storage system, and data relocation is initiated manually or by
a user-defined policy. SmartTier improves storage system performance and
reduces storage costs to meet enterprises' requirements for both performance
and capacities. By preventing historical data from occupying expensive storage

media, SmartTier ensures effective investment and eliminates energy


consumption caused by useless capacities, reducing the TCO and optimizing the
cost-effectiveness.
2. File data tiering: A storage pool may be composed of multiple media, such as
SSDs and HDDs. SmartTier automatically promotes files to high-performance
media (SSDs) and demotes files to large-capacity media (HDDs, including SAS
and NL-SAS disks) based on user-configured tiering policies. Users can specify
tiering policies by file name, file size, file type, file creation time, and SSD usage.
SmartTier is applicable when file lifecycle management is required, such as
financial check images, medical images, semiconductor simulation design, and
reservoir analysis. The services in these scenarios have demanding requirements
for performance in the early stage and have low requirements for performance
later. The following describes an example. In the reservoir analysis scenario,
small files are imported to the storage system for the first time, which are
frequently accessed and have high performance requirements. After the small
files are processed by professional analysis software, large files are generated.
These large files have low performance requirements. Users can configure tiering
policies based on file sizes. To be specific, small files are stored in SSDs and large
files are stored in HDDs (such as low-cost NL-SAS disks). In this way, SmartTier
helps reduce customers' costs while meeting the performance requirements.
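The file-size tiering policy from the reservoir analysis example can be sketched as a simple placement rule (the 64 MB threshold is an illustrative value, not a product default):

```python
def place_file(name: str, size_mb: float, threshold_mb: float = 64) -> str:
    """File-size tiering policy sketch: small, frequently accessed files
    stay on SSD; large result files are demoted to HDD."""
    return "SSD" if size_mb < threshold_mb else "HDD"

# hypothetical reservoir-analysis workload: small traces in, large results out
files = {"trace_001.dat": 2, "trace_002.dat": 5, "analysis_result.bin": 4096}
placement = {name: place_file(name, size) for name, size in files.items()}

assert placement["trace_001.dat"] == "SSD"          # small input files stay hot
assert placement["analysis_result.bin"] == "HDD"    # large results go to low-cost media
```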
 SmartCache
SmartCache, which serves as a basic read cache module of a storage system, uses
SSDs as the expansion of the RAM cache in the storage system to store clean hotspot
data that the RAM cache cannot hold.
SmartCache enhances the performance of storing hotspot data by LUN or file system.
After SmartCache is enabled for a LUN or file system, the RAM cache delivers hotspot
data to SmartCache. SmartCache maps the data in the memory to SSDs and saves
the data to SSDs. When new I/Os are read from the storage system, the system
preferentially searches for the requested data in the RAM cache. If no data is found
in the RAM cache, the system attempts to find the data in SmartCache. If the data is
found, the system reads the data from the SSDs and returns the data to the host.
When the amount of data cached in SmartCache reaches the upper limit, the least
recently accessed cache blocks are selected based on the Least Recently Used (LRU)
algorithm. The mapping entries in the lookup table are cleared to evict data. Data
writing and data eviction are repeated regularly. In this way, SmartCache stores data
with relatively high popularity.
SmartCache applies to services that have hotspot areas and intensive random read
I/Os, such as databases, OLTP applications, web services, and file services.
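The lookup-table and LRU eviction behavior described above can be sketched as follows (capacity accounting and block addressing are heavily simplified):

```python
from collections import OrderedDict

class ToySmartCache:
    """LRU lookup-table sketch of an SSD read cache behind the RAM cache.
    Capacity is in cache blocks; eviction clears the mapping entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.table = OrderedDict()          # block address -> cached data (simplified)

    def read(self, addr):
        if addr in self.table:
            self.table.move_to_end(addr)    # hit: mark as most recently used
            return self.table[addr]
        return None                         # miss: caller falls back to disk

    def insert(self, addr, data):
        self.table[addr] = data
        self.table.move_to_end(addr)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least recently used block

cache = ToySmartCache(capacity=2)
cache.insert("blk1", "A")
cache.insert("blk2", "B")
cache.read("blk1")                          # blk1 becomes most recently used
cache.insert("blk3", "C")                   # capacity exceeded: evict blk2

assert cache.read("blk2") is None           # least recently used block was evicted
assert cache.read("blk1") == "A" and cache.read("blk3") == "C"
```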
 SmartMulti-Tenant
SmartMulti-Tenant allows tenants to create multiple virtual storage systems in one
physical storage system. With SmartMulti-Tenant, tenants can share hardware
resources and safeguard data security and confidentiality in a multi-protocol unified
storage architecture.
SmartMulti-Tenant isolates the management operations, services, and networks
among tenants. Data is inaccessible among tenants, achieving secure isolation.

1. Management isolation: Each tenant has its own administrators. The tenant
administrators can configure and manage their own storage resources only
through the GUI or RESTful APIs. Tenant administrators support role-based
permission control. When creating a tenant administrator, you must select the
role corresponding to the permission.
2. Service isolation: Each tenant has its own file systems, users/user groups, and
shares/exports. Users can access the file systems of the tenant only through the
tenant LIFs. The service data (mainly file systems, quotas, and snapshots), service
access, and service configurations (mainly NAS protocol configurations) are
isolated among multiple tenants.
3. Network isolation: Tenant networks are separated by VLANs and LIFs, preventing
unauthorized hosts from accessing the tenants' storage resources. Tenants use
logical interfaces (LIFs) to configure services. A LIF belongs only to one tenant to
achieve port isolation in a logical sense.
 SmartQoS
SmartQoS is also called intelligent service quality control. It dynamically allocates
storage system resources to meet specific performance goals of certain applications.
SmartQoS enables you to set upper limits on IOPS or bandwidth for certain
applications. Based on the upper limits, SmartQoS can accurately limit performance
of these applications, thereby preventing them from contending for storage resources
with critical applications.
SmartQoS uses the I/O priority scheduling and I/O traffic control technologies based
on LUN, file system, or snapshot to ensure the service quality of data services.
 SmartDedupe and SmartCompression
SmartDedupe and SmartCompression provide the data simplification service for file
systems and thin LUNs. They not only save storage space for customers but also
reduce the total cost of ownership (TCO) of the enterprise IT architecture.
SmartDedupe deletes duplicate data blocks from a storage system to save storage
space. Huawei hybrid flash storage system supports inline deduplication. That is, only
the newly written data is deduplicated.
SmartCompression reorganizes data to save storage space and improve the data
transfer, processing, and storage efficiency without any data loss. Huawei hybrid
flash storage system supports inline compression. That is, only the newly written data
is compressed.
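Inline deduplication can be illustrated with a fingerprint table: each newly written block is hashed, and a block whose fingerprint already exists only increments a reference count instead of being stored again. This is a generic sketch under assumed names, not Huawei's on-disk format:

```python
import hashlib

class DedupStore:
    """Inline dedup sketch: fingerprint new writes, store unique content once.
    Real systems dedupe fixed-size blocks with far richer metadata."""
    def __init__(self):
        self.blocks = {}   # fingerprint -> stored data
        self.refs = {}     # fingerprint -> reference count

    def write(self, data):
        fp = hashlib.sha256(data).hexdigest()
        if fp in self.blocks:
            self.refs[fp] += 1       # duplicate: only bump the refcount
        else:
            self.blocks[fp] = data   # new content: store it once
            self.refs[fp] = 1
        return fp
```

Note that only data written after the feature is enabled passes through this path, which matches the inline (new-writes-only) behavior described above.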
 HyperReplication
1. Synchronous remote replication: Data is synchronized in real time to ensure data
consistency and minimize data loss in the event of a disaster.
2. Asynchronous remote replication: Data is periodically synchronized to minimize
service performance deterioration caused by the latency of long-haul data
transmission.
 HyperMetro
HyperMetro (also called active-active feature) is a key technology of the active-active
data center solution. It ensures high-level data reliability and service continuity for
users. HyperMetro is an array-level active-active technology. Two active-active
storage systems can be deployed in the same equipment room, the same city, or two
places that are 100 km away from each other.
 HyperMirror
HyperMirror is the volume mirror software of Huawei hybrid flash storage system.
HyperMirror allows users to create two physical copies of a LUN. Each LUN copy can
reside in a local resource pool or be an external LUN. Each LUN copy has the same
virtual capacity as the mirror LUN. When a server writes data to a mirror LUN, the
storage system simultaneously writes the data to each copy of the mirror LUN. When
a server reads data from a mirror LUN, the storage system reads data from one copy
of the LUN. Even if one copy of a mirror LUN is temporarily unavailable (for
example, when the storage system that provides the storage pool is unavailable), the
server can still access the mirror LUN. The system memorizes LUN areas where data
has been written and synchronizes the changed data with the LUN copy after the
copy is available again.
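The write-to-both, read-from-one, and change-tracking behavior described above can be sketched as follows. This is an illustrative model only, not HyperMirror's internals:

```python
class MirrorLUN:
    """Mirror-LUN sketch: writes go to both copies; reads are served from an
    available copy; writes made while a copy is down are remembered and
    resynchronized when the copy returns."""
    def __init__(self):
        self.copies = [{}, {}]        # two physical copies (block -> data)
        self.online = [True, True]
        self.dirty = set()            # blocks written while a copy was down

    def write(self, block, data):
        for i in (0, 1):
            if self.online[i]:
                self.copies[i][block] = data
            else:
                self.dirty.add(block)   # remember the changed area

    def read(self, block):
        for i in (0, 1):
            if self.online[i] and block in self.copies[i]:
                return self.copies[i][block]
        return None

    def resync(self, i):
        """Copy only the changed blocks back to the recovered copy."""
        for block in self.dirty:
            self.copies[i][block] = self.copies[1 - i][block]
        self.dirty.clear()
        self.online[i] = True
```

Because only the dirty set is replayed, resynchronization after a temporary outage is incremental rather than a full copy.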
 HyperLock
With the development of technologies and the explosive growth of information, secure access to and use of data have become critically important. As required by laws
and regulations, important data such as case documents of courts, medical records,
and financial documents can only be read but cannot be written within a specific
period. Therefore, measures must be taken to prevent such data from being
tampered with. In the storage industry, write once read many (WORM) is the most
commonly used method to archive and back up data, ensure secure data access, and
prevent data tampering.
The WORM feature of Huawei hybrid flash storage systems is also called HyperLock.
After data is written to a file, the write permission of the file is removed so that the
file enters the read-only state. In the read-only state, the file can be read but cannot
be deleted, modified, or renamed. The WORM feature can prevent data from being
tampered with, meeting data security requirements of enterprises and organizations.
A file system with the WORM feature can be configured only by the administrator.
WORM modes are classified into regulatory compliance WORM (WORM-C) and
enterprise WORM (WORM-E) based on the administrator's permissions. The WORM-
C mode is mainly used in archiving scenarios where data protection mechanisms are
implemented in compliance with laws and regulations. The WORM-E mode is mainly
used for internal management of enterprises.
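The WORM life cycle above (writable until committed, then read-only for a protection period) can be modeled as a small state machine. This is an illustrative sketch, not HyperLock's actual interface or clock handling:

```python
class WormFile:
    """WORM (write once read many) state sketch: once committed, the file is
    read-only until its protection period expires."""
    def __init__(self, data):
        self.data = data
        self.locked_until = None   # None = not yet committed, still writable

    def commit(self, now, protection_period):
        """Remove write permission: the file enters the read-only state."""
        self.locked_until = now + protection_period

    def write(self, now, data):
        if self.locked_until is not None and now < self.locked_until:
            raise PermissionError("file is WORM-protected")
        self.data = data

    def read(self):
        return self.data   # reading is always allowed
```

Delete and rename operations would be refused under the same check as `write`, which is what prevents tampering during the protection period.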
 HyperVault
Huawei hybrid flash storage systems provide the HyperVault feature to implement
intra-system or inter-system file system data backup and restoration.
HyperVault supports local backup and remote backup.
1.1.4 Application Scenarios
 Multi-site disaster recovery
Hybrid flash storage can be used in the cascading and parallel topologies of the geo-redundant DR solution. The highlights of the geo-redundant DR solution are the interoperability among entry-level, mid-range, and high-end storage arrays, second-level RPO and minute-level RTO for asynchronous replication, and DR Star.
If the DR center fails, the remaining sites automatically establish the replication relationship to provide continuous data protection. After the standby replication relationship is activated, incremental data is replicated without changing the RTO.
Configuration of DR Star can be done at a single site for simplified management.
 Application scenarios of storage tiering
The requirements for performance and reliability vary with service applications. For
example, the CRM system and bill transaction system are hot data applications, and
the backup system is a cold data application. Huawei all-flash storage, hybrid flash
storage, and distributed storage can be used together to implement data integration
and tiering and provide support for storage with different SLAs.
1.2 All-Flash Storage Technical Deep Dive
1.2.1 Product Positioning
Based on the current application status of storage products and future development
trend of storage technologies, Huawei launches the next-generation all-flash series high-
end storage systems for medium- and large-sized enterprise data centers. By focusing on
core services of medium- and large-sized enterprises (such as enterprise-level data
centers, virtual data centers, and cloud data centers), Huawei all-flash storage series can
meet their service requirements for strong performance, robust reliability, and high
efficiency.
Huawei all-flash series storage systems adopt the next-generation SmartMatrix
architecture, which ensures service continuity in the event that one out of two controller
enclosures fails or seven out of eight controllers fail. This architecture can meet the
requirements of medium- and large-sized enterprises for the reliability of core services. In
addition, all-flash storage systems of some models are equipped with AI chips to meet
the requirements of different service applications, such as large database Online
Transaction Processing (OLTP)/Online Analytical Processing (OLAP), high-performance
computing, digital media, Internet operation, centralized storage, backup, disaster
recovery, and data migration in data centers.
Huawei all-flash series storage systems provide not only excellent storage services for
data centers, but also various data backup and disaster recovery solutions to ensure
smooth and secure service operations. In addition, all-flash storage systems of some
models provide easy-to-use management modes and convenient local or remote
maintenance modes, significantly reducing device management and maintenance costs.
For more information, log in to http://support.huawei.com to obtain the relevant product
documentation.
1.2.2 Software and Hardware Architectures
1.2.2.1 Hardware Architecture
For details about product forms, log in to the Huawei Data Storage Infocenter, and
access 3D Interactive Multimedia to obtain 3D images.
For more hardware information, see the corresponding product documentation.
 Full interconnection among controllers
1. Full interconnection of controllers in a controller enclosure: The controller
enclosures used by Huawei all-flash storage systems of some models can contain
four controllers. Each controller is an independent hot-swappable service
processing unit and provides three pairs of RDMA high-speed links. A controller
is fully cross-connected to the other three controllers through the passive
backplane. Thanks to the full interconnection of controllers, data transmitted
between controllers does not need to be forwarded by a third component,
achieving balanced, fast, and efficient access. Controllers reside in the same
controller enclosure and do not require external cables or switches. Therefore,
the deployment of controllers is simple and easy to operate. In addition, the
passive backplane uses only passive lines, providing high reliability. (The
controller enclosures used by Huawei all-flash storage systems of some models
accommodate two controllers that use non-shared interface modules.)
2. Full interconnection of controller enclosures: If Huawei all-flash storage systems
of some models need to be expanded to eight controllers, use the direct-
connection networking mode. Each controller enclosure provides four scale-out
shared interface modules for networking, as shown in the following figure. The
scale-out shared interface module is directly connected to the four controllers in
the local controller enclosure. At the same time, it receives and sends data from
the four controllers in another controller enclosure. In this way, the eight
controllers are fully interconnected. Compared to switched networking, this
switch-free networking mode saves half of cables and two switches, which
reduces the cost and simplifies management.
 E2E NVMe design
NVMe provides reliable NVMe commands and data transmission. NVMe over Fabrics
extends NVMe to various storage networks to reduce the overhead for processing
storage network protocol stacks and achieve high concurrency and low latency.
NVMe over Fabrics can map NVMe commands and data to multiple fabric links,
including Fibre Channel, InfiniBand, RoCE v2, iWARP, and TCP.
Huawei all-flash storage systems support end-to-end NVMe (some models do not
support NVMe-based back-end networking), including:
1. NVMe over FC, NVMe over RoCE v2 (planned), and NVMe over TCP/IP (planned)
are supported for the networks between hosts and storage systems.
2. 32 Gbit/s NVMe over FC and 100 Gbit/s NVMe over Fabrics (NOF) ports are supported.
3. iSCSI over TCP/IP reduces CPU consumption and latency from network protocol
stacks.
4. NVMe multi-queue polling designed for multi-core Kunpeng 920 CPUs enables
lock-free processing of concurrent I/Os, fully utilizing the computing capacities of
processors.
5. Read requests to NVMe SSDs are prioritized, accelerating response to read
requests when data is being written into NVMe SSDs.
6. With the end-to-end NVMe design, the Huawei all-flash storage system offers a
minimum access latency of less than 100 μs.
 Chipset
With long-term continuous accumulation and investment in the chipset field, Huawei
has developed some key chipsets for storage systems, such as the front-end interface
chipset (Hi1822), Kunpeng 920 chipset, Ascend AI chipset (Ascend 310), SSD
controller chipset, and baseboard management controller (BMC) chipset (Hi1710).
They are also integrated into Huawei all-flash storage systems.
1. Interface module chip: The Hi182x (IOC) chip is independently developed by
Huawei in the storage interface chip field. It integrates multiple protocol
interfaces, such as ETH interfaces of 100 Gbit/s, 40 Gbit/s, 25 Gbit/s, and 10
Gbit/s, and Fibre Channel interfaces of 32 Gbit/s, 16 Gbit/s, and 8 Gbit/s. This
chip features high interface density, rich protocol types, and flexible ports,
creating unique values for storage.
2. Kunpeng 920 chipset: The Kunpeng 920 chipset is a processor chipset
independently developed by Huawei. It features strong performance, high
throughput, and high energy efficiency to meet diversified computing
requirements of data centers. It can be widely used in scenarios such as big data
and distributed storage. The Kunpeng 920 chipset supports various protocols
such as DDR4, PCIe 4.0, SAS 3.0, and 100 Gbit/s RDMA to meet the requirements
of a wide range of scenarios.
3. Ascend AI chipset: The Ascend chipset is the first AI chipset independently
developed by Huawei. It features ultra-high computing efficiency and low power
consumption. Currently, it is the most powerful AI SoC for computing scenarios.
Ascend 310 is the first chipset of the Ascend series. It is manufactured by using
the 12 nm process and delivers a computing power of up to 16 TOPS (INT8).
4. SSD controller chip: HSSDs use Huawei-developed next-generation enterprise-
level controllers, which provide SAS 3.0 and PCIe 3.0 interfaces and feature high
performance and low power consumption. The chip uses enhanced ECC and
built-in RAID technologies to extend the SSD service life to meet enterprise-level
reliability requirements. In addition, this chip supports the latest DDR4, 12 Gbit/s
SAS, and 8 Gbit/s PCIe rates as well as Flash Translation Layer (FTL) hardware
acceleration to provide stable performance at a low latency for enterprise
applications.
5. BMC chipset: Hi1710 is a BMC chipset, including the A9 CPU, 8051 co-processor,
sensor circuit, control circuit, and interface circuit. It supports the intelligent
platform management interface (IPMI), which monitors and controls the
hardware components of the storage system. It provides various functions,
including system power control, controller monitoring, interface module
monitoring, power supply and BBU management, and fan monitoring.
1.2.2.2 Software Architecture
Huawei all-flash storage supports multiple advanced features, such as HyperSnap,
HyperMetro, and SmartQoS. Maintenance terminal software such as SmartKit and
eService can access the storage system through the management network port or serial
port. Application server software such as OceanStor BCManager and UltraPath can access
the storage system through iSCSI or Fibre Channel links.
 Fully balanced architecture
The full-balanced (active-active) architecture balances the service load and data
distribution of the entire storage system, simplifying storage resource planning.
Customers only need to consider the total storage capacity and performance
requirements of the storage system. LUNs are not owned by any controller, but
evenly distributed. That is, data on a LUN is divided into slices based on the
granularity of 64 MB. Slices are distributed to different vNodes (each vNode matches
a CPU) based on the hash (LUN ID + LBA) result.
Global load balancing: The hash algorithm is used to determine which controller of
the storage system should process a read/write request from each host based on the
LBA of the request. Huawei-developed multipathing software UltraPath, front-end
interconnect I/O modules (FIMs), and controllers negotiate the same hash method
and parameters to implement intelligent distribution of read and write requests.
UltraPath and FIMs work together to directly distribute a read/write request to the
optimal processing controller, avoiding forwarding between controllers. If UltraPath
and FIMs are not used, after receiving a host request, a controller forwards the
request to the corresponding processing controller based on the hash result of
request's LBA, ensuring that host requests are balanced between controllers.
Global write cache balancing: Both the data amount and data hotspots are balanced.
Global storage pool balancing: Disk utilization, wear and service life, data
distribution, and hotspot data are all balanced.
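The slice-based distribution described above (64 MB slices hashed by LUN ID and address to a vNode) can be sketched as follows. CRC32 is used as a stand-in hash for illustration; the storage system's real hash function and parameters are internal:

```python
import zlib

SLICE_SIZE = 64 * 2**20   # 64 MB slices, as described above

def vnode_for(lun_id, offset, n_vnodes):
    """Map a LUN byte offset to a vNode: all I/Os within the same 64 MB
    slice land on the same vNode, and slices spread evenly across vNodes
    by hashing (LUN ID + slice number)."""
    slice_no = offset // SLICE_SIZE
    key = ("%d:%d" % (lun_id, slice_no)).encode()
    return zlib.crc32(key) % n_vnodes
```

This is also why UltraPath and the FIMs must negotiate the same hash method: only then can a request be sent directly to the controller that owns the slice, avoiding inter-controller forwarding.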
 SmartMatrix full-interconnection architecture
The SmartMatrix full-interconnection architecture uses a high-speed, fully-
interconnected passive backplane to connect to multiple controllers. Interface
modules (Fibre Channel and back-end expansion) are connected to the backplane in
a fully shared manner, allowing hosts to access any controller through any port. The
SmartMatrix architecture allows close coordination between controllers, simplifies
software models, and achieves active-active fine-grained balancing, high efficiency,
and low latency.
1. Full sharing of front-end interface modules: High-end storage systems use front-
end interconnect I/O modules (FIMs), which can be accessed by the four
controllers in the controller enclosure at the same time. After a host I/O request
reaches the FIM, the FIM directly distributes the I/O request to the processing
controller.
2. Full interconnection of controllers: Controllers in a controller enclosure are
interconnected through the 100 Gbit/s RDMA links on the backplane. (Only
OceanStor Dorado 3000 V6 uses the 40 Gbit/s RDMA links.) For scale-out to
multiple controller enclosures, any two controllers are directly connected to
avoid data forwarding.
3. Disk enclosure interconnection across controller enclosures: Huawei all-flash
storage systems of some models support back-end interconnect I/O modules
(BIMs), which allow a smart disk enclosure to be connected to two controller
enclosures and accessed by eight controllers simultaneously. This technology,
together with continuous mirroring, allows the system to tolerate failure of 7 out
of 8 controllers. However, Huawei all-flash storage systems of some models do
not support BIMs. For these models, a disk enclosure can be accessed by only
one controller enclosure and continuous mirroring is not supported.
 FlashLink®
Huawei all-flash storage systems use the FlashLink® technology to provide high
input/output operations per second (IOPS) at a steady low latency.
FlashLink® associates storage controllers with SSDs by using a series of technologies
for flash media, maximizing the performance of flash storage. FlashLink® provides the
following key technologies designed for flash media: intelligent multi-core
technology, large-block sequential write, multi-stream partitioning, smart disk
enclosure offload, end-to-end I/O priority, AI, and end-to-end NVMe. It ensures
steady low latency and high IOPS of Huawei all-flash storage systems.
1.2.3 Key Technologies
The key technologies of all-flash storage are described from three aspects: high reliability,
high performance, and high security.
1.2.3.1 High Reliability
 Multiple cache copies
To ensure the low latency of write I/Os on the host, Huawei all-flash storage system
stores the data written by the host in the cache (memory) and then writes the
cached data to disks in the background. To ensure the reliability of cache data,
Huawei all-flash storage system provides up to three copies of cache data.
By default, the storage system provides two copies of cache data. For data with the
same LBA, the storage system creates a pair relationship between two controllers
within a controller enclosure. Upon receiving write requests from the host, the
system writes the data to both the caches on the paired controllers and returns
acknowledgement to the host after data is successfully written to both caches to
ensure data consistency. In the event that a controller is faulty, data cached on the
other controller can be destaged or accessed, ensuring zero data loss.
 Intra-disk dynamic RAID
For data on an HSSD, RAID groups are created by physical dies to reduce the
probability of data loss caused by silent corruption.
Each package on an HSSD consists of multiple physical dies. RAID 4 is used for
redundancy of data written to the dies, preventing data loss in the event of a single
die failure. The uncorrectable bit error rate (UBER) of SSDs in the industry is about
10^-17. With the support of intra-disk RAID, the UBER of HSSDs is reduced to 10^-18
(decreasing by one order of magnitude). For example, if intra-disk RAID is not used,
a bad block occurs when 11 PB of data is written to an SSD. If intra-disk RAID is
used, a bad block occurs when 110 PB of data is written to an SSD.
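The redundancy scheme above is XOR parity: the parity die holds the XOR of all data dies in a stripe, so any single lost die can be rebuilt from the survivors. A minimal sketch of the arithmetic:

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid4_parity(dies):
    """Parity for a stripe across dies: XOR of all data dies."""
    parity = bytes(len(dies[0]))
    for d in dies:
        parity = xor_bytes(parity, d)
    return parity

def rebuild(dies, lost, parity):
    """Recover the lost die by XOR-ing the surviving dies with the parity."""
    out = parity
    for i, d in enumerate(dies):
        if i != lost:
            out = xor_bytes(out, d)
    return out
```

Because XOR is its own inverse, removing every surviving die's contribution from the parity leaves exactly the lost die's data.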
 Wear leveling and anti–wear leveling
The SSD controller uses software algorithms to monitor and balance the P/E cycles
on blocks in the NAND flash. This prevents over-used blocks from failing and extends
the service life of the NAND flash.
1. Intra-disk wear leveling: HSSDs support dynamic and static wear leveling.
Dynamic wear leveling enables the SSD to write data preferentially to less-worn
blocks to balance P/E cycles. Static wear leveling allows the SSD to periodically
detect blocks with fewer P/E cycles and reclaim their data, ensuring that blocks
storing cold data can participate in wear leveling. HSSDs combine the two
solutions to ensure wear leveling.
2. Global wear leveling: The biggest difference between SSDs and HDDs lies in that
the amount of data written to SSDs is no longer unlimited, and the service life of
an SSD is inversely proportional to the amount of data written to the SSD.
Therefore, an all-flash storage system requires load balancing between SSDs to
prevent overly-used SSDs from failing. FlashLink® uses controller software and
disk drivers to regularly query the SSD wear degree from the SSD controller. In
addition, FlashLink® evenly distributes data to SSDs based on LBAs/fingerprints
to level the SSD wear degree.
3. Anti-wear leveling: When SSDs are approaching the end of their service life as
their wear degrees have reached 80% or above, multiple SSDs may fail
simultaneously if global wear leveling is still in use, resulting in data loss. In this
case, the system enables anti-global wear leveling to avoid simultaneous
failures. The system selects the most severely worn SSD and writes data to it as
long as it has idle space. This wears out that SSD faster than the others, so you
are prompted to replace it sooner, avoiding simultaneous failures.
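The selection policies above differ only in direction: wear leveling steers writes toward the least-worn blocks (or SSDs), while anti-wear leveling deliberately steers them toward the most-worn one near end of life. A sketch, with the erase-count field name as an assumption:

```python
def pick_block(blocks, anti_wear=False):
    """Choose a write target by program/erase cycle count. Normal wear
    leveling prefers the least-worn block; anti-wear leveling prefers the
    most-worn one so devices reach end of life at different times."""
    if anti_wear:
        return max(blocks, key=lambda b: b["pe_cycles"])
    return min(blocks, key=lambda b: b["pe_cycles"])
```

In practice the system would switch to the anti-wear policy only once wear degrees cross a threshold such as the 80% figure mentioned above.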
 LDPC and FSP algorithms
HSSDs use the LDPC algorithm and the FSP technology to ensure data reliability.
Enhanced Low-Density Parity-Check (LDPC) algorithm: provides higher error
correction capability than that required by flash chips to ensure device reliability.
LDPC refers to a kind of linear codes defined through a check matrix. LDPC consists
of four modules: Encode, Decode, Soft-bit Logic, and DSP Logic. When data is written
to the NAND flash, the system generates the LDPC parity data and writes it to the
NAND flash with the raw data. When data is read from the NAND flash, the LDPC
parity data is used to check and correct the data. In case an error occurs in data in
the NAND flash, the HSSD enables LDPC hard decoding to correct the error. If the
error cannot be corrected, the HSSD enables shift read to save data. If shift read fails,
the HSSD attempts to enable soft read to save data. If soft read also fails, the HSSD
enables read retry to recover data. If the data still fails to be recovered, the data on
other dies is used to perform the XOR operation to recover user data.
Intelligent FSP algorithm: Based on the characteristics of 3D TLC media, the
intelligent FSP algorithm provides faster and more reliable data storage services.
 DIF
DIF stands for Data Integrity Field. Data path protection refers to the protection of
data integrity and correctness on I/O paths. In addition to ECC and CRC for key
memories (such as DDR), SSDs can enable LBA check and DIF to protect data. DIF is
the protection information that is added to the end of the user data sector. This
information ensures the data consistency between the host and the SSD.
 Data inspection algorithm
After data has been stored in NAND flash for a long term, data errors may occur due
to read interference, write interference, or random failures. Risks can be detected and
handled in advance through inspection, preventing data loss. HSSDs use read
inspection and write inspection to prevent data retention errors. Read inspection
traverses data in the NAND flash quickly and observe data transitions. Data
experiencing high BIT transitions will be relocated in time. Read inspection prevents
data errors due to random failure or read interference. The interval between write
inspections changes with the temperature. Write inspection checks the data retention
and relocates data that has been stored for an excessively long time. Write inspection
prevents errors due to long-time storage. Background inspection helps to identify
risks in advance, preventing most NAND flash errors and improving data reliability.
 E2E protection information (PI)
During data transmission within a storage system, data passes through multiple
components over various channels and undergoes complex software processing. Any
problem during this process may cause data errors. The scenario where a problem is
not immediately detected but found in subsequent data access is called silent data
corruption.
ANSI T10 Protection Information (PI) provides a method of verifying data integrity
during access to a storage system. It ensures that errors occurred during transmission
are detected and rectified in time, preventing silent data corruption. The check is
implemented based on the PI field defined in the T10 standard. This standard adds
an 8-byte PI field to the end of each data block to implement data integrity check. In
most cases, T10 PI is used to ensure data integrity within a storage system.
Huawei all-flash storage system supports T10 PI. Upon reception of data from a host,
the storage system inserts an 8-byte PI field to every 512 bytes of data before
performing internal processing such as forwarding to other nodes or saving the data
to the cache. After the data is written to disks, the disks verify the PI fields of the
data to detect any change to the data between reception and destaging to the disks.
As shown in the figure in the training materials, the green point indicates that a PI is
inserted into the data. The blue points indicate that a PI is calculated for the 512-
byte data and compared with the saved PI to verify data correctness.
When the host reads data, the disks verify the data to prevent changes to the data. If
any error occurs, the disks notify the upper-layer controller software, which then
recovers the data by using RAID. To prevent errors on the path between the disks
and the front end of the storage system, the storage system verifies the data again
before returning it to the host. If any error occurs, the storage system reads the data
from the disks in transparent mode or recovers the data using RAID to ensure end-
to-end data reliability from the front-end interface modules to the back-end disks.
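The 8-byte T10 PI field consists of a 2-byte guard tag (a CRC of the 512-byte sector, using the T10-DIF polynomial 0x8BB7), a 2-byte application tag, and a 4-byte reference tag. The sketch below shows how such a field can be computed and verified; it illustrates the standard's layout, not Huawei's internal implementation:

```python
def crc16_t10dif(data):
    """CRC-16 with the T10-DIF polynomial (0x8BB7), used for the guard tag."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc

def protect(sector, ref_tag, app_tag=0):
    """Append an 8-byte PI field (guard, application tag, reference tag)
    to a 512-byte sector, following the T10 PI layout."""
    assert len(sector) == 512
    guard = crc16_t10dif(sector)
    pi = guard.to_bytes(2, "big") + app_tag.to_bytes(2, "big") + ref_tag.to_bytes(4, "big")
    return sector + pi

def verify(protected, ref_tag):
    """Recompute the guard CRC and check the reference tag."""
    sector, pi = protected[:512], protected[512:]
    return (crc16_t10dif(sector) == int.from_bytes(pi[:2], "big")
            and int.from_bytes(pi[4:], "big") == ref_tag)
```

Any bit flip inside the sector changes the guard CRC, and a sector written to the wrong address fails the reference-tag check, which is how silent data corruption is caught before it reaches the host.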
 Multi-layer redundancy and fault tolerance design
Huawei all-flash storage system uses the SmartMatrix multi-controller architecture
that supports linear expansion of system resources. The controller enclosure uses the
IP interconnection design and supports linear IP scale-out between controller
enclosures.
The management plane, control plane, and service plane are physically separated (in
different VLANs), and served by different components. Each plane can independently
detect, rectify, and isolate faults. Faults on the management plane and control plane
do not affect services. Service plane congestion does not affect system management
and control.
All components in Huawei all-flash storage system work in redundancy mode,
eliminating single points of failure. Huawei all-flash storage system provides multiple
redundancy protection mechanisms for the entire path from the host to the storage
system. Service continuity is guaranteed even if multiple field replaceable units
(FRUs) allowed by the redundancy scheme are faulty simultaneously or successively.
 Key technologies for high reliability
1. Three copies across controller enclosures: For data with the same LBA, Huawei
all-flash storage system creates a pair between two controllers to form a dual-
copy relationship and creates a third copy in the memory of another controller. If
there is only one controller enclosure, three copies are stored on different
controllers in this controller enclosure, preventing data loss when any two
controllers become faulty at the same time. If there are two or more controller
enclosures, the third copy can be stored on a controller in another controller
enclosure. This prevents data loss when any two controllers are faulty at the
same time or a single controller enclosure (with four controllers) is faulty.
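The placement rule above (two copies in one enclosure, the third in another enclosure when one exists) can be sketched as a simple selection function. The tuple representation is an assumption for illustration:

```python
def place_cache_copies(controllers):
    """Choose controllers for three cache copies: two in the first copy's
    enclosure, and the third in another enclosure when one exists, so a
    whole-enclosure failure still leaves a copy. `controllers` is a list of
    (controller_id, enclosure_id) pairs."""
    first = controllers[0]
    same_encl = [c for c in controllers[1:] if c[1] == first[1]]
    other_encl = [c for c in controllers if c[1] != first[1]]
    second = same_encl[0]
    third = other_encl[0] if other_encl else same_encl[1]
    return [first, second, third]
```

With a single enclosure, all three copies land on distinct controllers there (tolerating any two controller failures); with two enclosures, the third copy moves off-enclosure (additionally tolerating loss of a whole controller enclosure).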
2. Continuous cache mirroring: This prevents data loss or service interruption when
a controller is faulty and a new controller fault occurs before that controller
recovers. This ensures service continuity to the maximum extent. Generally, data
blocks 1 and 2 on controller A and data blocks 1 and 2 on controller B are
mirrored to each other, respectively. If controller A is faulty, data blocks on controller
B are mirrored to controllers C and D, ensuring data redundancy. If controller D
becomes faulty before controller A recovers, data blocks on controllers B and C
are mirrored to each other to ensure data redundancy. If two controller
enclosures (housing eight controllers) are deployed, continuous mirroring is
implemented within a controller enclosure as long as two or more controllers in
the controller enclosure are working properly. If only one controller in a
controller enclosure is normal, the system mirrors data to a controller in the
other controller enclosure until only one controller is available in the storage
system. Continuous mirroring and back-end full interconnection allow up to
seven out of eight controllers to fail at the same time, achieving high service
availability.
 Non-disruptive upgrade within seconds
Huawei all-flash storage systems of some models impose protection measures
against component failures and power failures. In addition, advanced technologies
are used to reduce risks of disk failures and data loss, ensuring high reliability of the
system. Huawei all-flash storage systems provide multiple advanced data protection
technologies to protect data against catastrophic disasters and ensure continuous
system running.
 High availability architecture
Tolerating simultaneous failure of two controllers: The global cache provides three
cache copies across controller enclosures. If two controllers fail simultaneously, at
least one cache copy is available. A single controller enclosure can tolerate
simultaneous failure of two controllers with the three-copy mechanism.
Tolerating failure of a controller enclosure: The global cache provides three cache
copies across controller enclosures. A smart disk enclosure connects to 8 controllers
(in 2 controller enclosures). If a controller enclosure fails, at least one cache copy is
available.
Tolerating successive failure of 7 out of 8 controllers: The global cache provides
continuous mirroring to tolerate successive failure of 7 out of 8 controllers (on 2
controller enclosures).
 Zero interruption upon controller failure
The front-end ports are the same as common Ethernet ports. Each physical port
provides one host connection and has one MAC address.
Local logical interfaces (LIFs) are created for internal links. Four internal links
connect to all controllers in an enclosure. Each controller has a local LIF.
IP addresses are configured on the LIFs of the controllers. The host establishes IP
connections with the LIFs.
If the LIF goes down upon a controller failure, the IP address automatically fails over
to the LIF of another controller.
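The failover behavior above amounts to reassigning IP addresses from a failed controller's LIF to a surviving one. A minimal sketch under assumed names:

```python
class LifManager:
    """LIF failover sketch: each IP address is owned by one controller's LIF;
    when a controller fails, its IPs move to a surviving controller so host
    IP connections can be re-established."""
    def __init__(self, controllers):
        self.owner = {}   # ip -> controller currently hosting the LIF
        self.alive = {c: True for c in controllers}

    def assign(self, ip, controller):
        self.owner[ip] = controller

    def fail(self, controller):
        """Mark a controller down and fail its IPs over to a survivor."""
        self.alive[controller] = False
        survivors = [c for c, up in self.alive.items() if up]
        for ip, c in self.owner.items():
            if c == controller:
                self.owner[ip] = survivors[0]
```

Because the IP address itself moves, the host simply reconnects to the same address and lands on a healthy controller.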

1.2.3.2 High Performance
 Protocol offload with DTOE
Traditional NIC: The CPU must spend significant resources processing each MAC frame
and the TCP/IP protocol (checksum and congestion control).
TOE: The NIC offloads the TCP/IP protocol. The system only processes the actual TCP
data flow. High latency overhead still exists in kernel mode interrupts, locks, system
calls, and thread switching.
DTOE's advantages: Each TCP connection has an independent hardware queue to
avoid the lock overhead. The hardware queue is operated in user mode to avoid the
context switching overhead. In addition, the polling mode reduces the latency, and
better performance and reliability can be achieved.
 I/O acceleration
Huawei all-flash storage systems support end-to-end NVMe and provide high-
performance I/O channels.
NVMe over FC, NVMe over RoCE v2 (planned), and NVMe over TCP/IP (planned) are
supported for the networks between hosts and storage systems.
NVMe over RoCE v2 is supported for the networks between storage controllers and
disk enclosures.
NVMe provides reliable NVMe commands and data transmission. NVMe over Fabrics
extends NVMe to various storage networks to reduce the overhead for processing
storage network protocol stacks and achieve high concurrency and low latency.
Huawei uses self-developed ASIC interface modules, SSDs, and enclosures for high-
speed end-to-end NVMe channels, taking full advantage of unique designs in
protocol parsing, I/O forwarding, service priority, and hardware acceleration.
The Huawei-developed ASIC interface module offloads TCP/IP protocol stack
processing, reducing latency by 50%. It responds to the host directly from its chip,
reducing I/O interactions, and evenly distributes I/Os. In addition, it supports
lock-free processing with multi-queue polling.
Huawei-developed ASIC SSDs and enclosures prioritize read requests on SSDs for
prompt responses to hosts. Smart disk enclosures have CPUs, memory, and hardware
acceleration engines to offload data reconstruction for lower latency. They also
support lock-free processing with multi-queue polling.
 Intelligent multi-level cache
Data IQ identifies the access frequency of metadata and data, and uses the DRAM
cache to accelerate reads on LUN and pool metadata. Reads on file system metadata
and data are accelerated by using DRAM for the hottest data and SCM cache for the
second-hottest data. This reduces latency by 30%.
 FlashLink
1. Multi-core technology: Huawei-developed CPUs provide the industry's largest
number of CPUs and CPU cores in a single controller. Host I/O
requests are distributed to vNodes based on the intelligent distribution
algorithm. Services are processed in vNodes in an end-to-end manner, avoiding
cross-CPU communication overheads, cross-CPU remote memory access
overheads, and CPU conflicts. In this way, the storage performance increases
linearly as the number of CPUs grows. All CPU cores are grouped in the vNode.
Each service group corresponds to a CPU core group. The CPU cores in a service
group run only the corresponding service code. In this way, different service
groups do not interfere with each other. Different services are isolated and run
on different cores through service grouping, avoiding CPU contention and
conflicts between service groups. In a service group, each core uses an
independent data structure to process service logic. This prevents the CPU cores
in a service group from accessing the same memory structure, and implements
lock-free design between CPU cores.
2. Large-block sequential write: Flash chips in SSDs can withstand a limited number
of erase times. In traditional RAID overwrite mode, if data on an SSD becomes
hotspot data and is frequently modified, the number of erase times of the
corresponding flash chip will be quickly used up. Huawei all-flash storage
systems provide the large-block sequential write mechanism. In this mechanism,
controllers detect the data layouts on Huawei-developed SSDs and aggregate
discrete writes of multiple small blocks into sequential writes of large blocks.
Disk-controller collaboration is implemented to write data to SSDs in sequence.
This technology enables RAID 5, RAID 6, and RAID-TP to perform only one I/O
operation and avoid multiple read and write operations caused by discrete writes
of multiple small blocks. This makes the write performance of RAID 5, RAID 6,
and RAID-TP almost the same.
3. Hot and cold data separation: Hot data and cold data are identified in the
storage system. The cooperation between SSDs and controllers improves the
garbage collection performance, reduces the number of erase times on SSDs,
and extends the service life of SSDs. Data with different change frequencies is
written to different SSD blocks, which can reduce garbage collection. Metadata is
modified more frequently than user data, so the metadata and user data are
written into different SSD areas. The data in garbage collection is also different
from the newly written data in terms of coldness and hotness, and they are also
written into different SSD areas. In an ideal situation, garbage collection would
expect all data in a block to be invalid so that the whole block could be erased
without data movement. This would minimize write amplification.
4. I/O priority adjustment: Resource priorities are assigned to different I/O types to
ensure I/O processing based on the SLAs. A highway has normal lanes for
general traffic, but it also has emergency lanes for vehicles which need to travel
faster. Similarly, priority adjustment lowers latency by setting different priorities
for different types of I/Os by their SLAs for resources.
5. Smart disk enclosure: The latest generation of Huawei-developed smart disk
enclosures is used. Each smart disk enclosure provides CPU and memory
resources, which can be used to offload tasks such as disk reconstruction. This
reduces the load of controllers and ensures that reconstruction does not affect
service performance when a disk is faulty. Take RAID 6 (21+2) as an example.
During the reconstruction of a traditional disk enclosure, if disk D1 is faulty, the
controller must read disks D2 to D21 and column P, and then calculate D1. A
total of 21 data blocks must be read from the disks. The calculation and
reconstruction processes consume a large number of CPU resources of the
controller. During the reconstruction of a smart disk enclosure, data read
commands are sent to the smart disk enclosure. The smart disk enclosure reads
data locally and uses the data to calculate parity data. Then, the smart disk
enclosure only needs to transmit the parity data to the controller, significantly
saving the network bandwidth. Each smart disk enclosure houses two expansion
modules with built-in Kunpeng chipsets and memory resources. The smart disk
enclosures offload data reconstruction from controller enclosures to save
controller resources.
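The reconstruction offload in item 5 can be illustrated with simple XOR parity (a sketch only: real RAID 6 adds a second, Galois-field parity, and the block sizes here are toy values). The point is that the enclosure combines its local surviving blocks into a single partial-parity block before anything crosses the network:

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length data blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# Toy stripe: four data blocks D1..D4 plus one XOR parity block P.
data = [bytes([d] * 8) for d in range(1, 5)]
parity = xor_blocks(data)                     # P = D1 ^ D2 ^ D3 ^ D4

# D1 is lost. The smart enclosure XORs the survivors it holds locally
# (D2..D4 and P) and ships only this one resulting block to the
# controller, instead of sending every surviving block over the network.
recovered_d1 = xor_blocks(data[1:] + [parity])
```

With a 21+2 stripe, as in the text above, this is the difference between transmitting 21 surviving blocks to the controller and transmitting one locally computed parity block per enclosure.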
1.2.3.3 High Security


 Software integrity protection
The software package uses an internal digital signature and a product package
digital signature. After the software package is sent to the customer over the
network, the upgrade module of the storage system verifies the digital signatures
and performs the upgrade only after the verification is successful. This ensures the
integrity and uniqueness of the upgrade package and internal software modules.
 Trusted and secure boot of hardware
Secure boot establishes a hardware root of trust (which is tamperproof) to
implement layer-by-layer authentication. This builds a trust chain across the entire
system to achieve predictable system behavior.
Huawei all-flash series storage system builds secure boot based on the hardware root
of trust (RoT) to ensure that the software loaded during the boot process is not
tampered with by hackers or malware.
Software verification and loading process for secure boot:
Verifying the signed public key of Grub: BootROM verifies the integrity of the signed
public key of Grub. If the verification fails, the boot process is terminated.
Verifying and loading Grub: BootROM verifies the Grub signature and loads Grub if
the verification is successful. If the verification fails, the boot process is terminated.
Verifying the status of the software signature certificate: Grub verifies the status of
the software signature certificate based on the certificate revocation list. If the
certificate is invalid, the boot process is terminated.
Verifying and loading the OS: Grub verifies the OS signature and loads the OS if the
verification is successful. If the verification fails, the boot process is terminated.
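The verify-then-load chain above can be sketched as follows (a simplified stand-in: real secure boot verifies digital signatures against the hardware root of trust and checks the certificate revocation list, whereas this toy compares SHA-256 digests):

```python
import hashlib

def digest(image: bytes) -> str:
    """Stand-in for signature verification: a SHA-256 digest of the image."""
    return hashlib.sha256(image).hexdigest()

def secure_boot(grub_image, os_image, trusted_grub_digest, trusted_os_digest):
    """Each stage is verified before it is loaded; any failure halts boot."""
    if digest(grub_image) != trusted_grub_digest:
        return "boot terminated: Grub verification failed"
    if digest(os_image) != trusted_os_digest:
        return "boot terminated: OS verification failed"
    return "OS loaded"

grub, os_img = b"grub image", b"os image"
result = secure_boot(grub, os_img, digest(grub), digest(os_img))
```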
 Data encryption
SEDs and internal or external key managers are configured to work with storage
systems to implement static data encryption, ensuring data security.

1. Internal Key Manager, which is the built-in key management system of the
storage system, is designed with the best practices of NIST SP 800-57. It
generates, updates, backs up, restores, and destroys keys, and provides
hierarchical key protection. Internal Key Manager is easy to deploy, configure,
and manage. It is recommended if FIPS 140-2 certification is not required and
the key management system is only used by the storage systems in a data
center.
2. External Key Manager uses the standard KMIP and TLS protocols and complies
with the key security requirements of FIPS 140-2 SafeNet. External Key Manager
is recommended if FIPS 140-2 certification is required or multiple systems in a
data center require centralized key management.
 Role-based permission management
Preset default roles (as listed in the following table): Default roles of system
management users and tenant management users are preset in the system.

Role Group | Default Role | Permission
System group | super_administrator | Super administrator, who has all permissions of the system.
System group | administrator | Administrator, who has all permissions except user management and security configuration.
System group | security_administrator | Security administrator, who has the security configuration and policy management permissions of the system.
System group | san_resource_administrator | SAN resource administrator, who has the permission to manage and configure SAN resources in the system.
System group | dataProtection_administrator | Data protection administrator, who has the data protection permissions of the system, including remote replication and HyperMetro.
System group | Remote device administrator | Remote device administrator, who has the remote configuration permission.
User-defined roles: The system allows users to define roles with permissions as required.
Create a role and select the function permissions and object permissions required by the role.

1.2.4 Application Scenarios


 Storage virtualization
Huawei all-flash storage system incorporates server virtualization optimization
technologies such as vStorage APIs for Array Integration (VAAI), VMware vStorage
APIs for Storage Awareness (VASA), and Site Recovery Manager (SRM). It also
employs numerous key technologies in virtual machines (VMs) to deploy VMs fast,
enhance VMs' bearing capability and operation efficiency, and streamline storage
management in virtual environments, helping you easily cope with storage in virtual
environments.
 Application scenarios of multi-protocol access
The storage system allows NFS and CIFS shares to be configured for the same file
system concurrently, so Huawei all-flash storage systems can serve SMB and NFS
services at the same time.

1.3 Distributed Storage Technical Deep Dive


1.3.1 Product Positioning
Designed for mass data storage scenarios, the Huawei distributed storage series provides
diversified storage services for various applications, such as virtualization/cloud resource
pools, mission-critical databases, big data analysis, high-performance computing (HPC),
video, content storage, backup, and archiving, helping enterprises release the value of
mass data.
Huawei scale-out file storage adopts a fully symmetric distributed architecture. With its
industry-leading performance, large-scale scale-out capability, and ultra-large single file
system, Huawei scale-out file storage provides users with shared unstructured data
storage resources. It is a scale-out file storage system oriented to massive unstructured
data storage applications. It improves the storage efficiency of IT systems and simplifies
the workload and migration process to cope with the growth and evolution of
unstructured data.
Huawei intelligent distributed storage is a large-scale scale-out intelligent distributed
storage product developed by Huawei. A cluster provides standard interfaces for upper-
layer applications, such as the block, HDFS, and object services, eliminating
complex operation problems caused by siloed storage systems. Its diverse and adaptable
range of features provides stable support for complex services, maximized efficiency for
diversified data, and cost-effective storage for massive data.
1. Block service supports SCSI and iSCSI interfaces and provides upper-layer applications
with massive storage pools that can be obtained on demand and elastically
expanded, significantly improving the preparation efficiency of application
environments. It is an ideal storage platform for private clouds, containers,
virtualization, and database applications.
2. HDFS service provides a decoupled storage-compute big data solution based on
native HDFS. The solution implements on-demand configuration of storage and
compute resources, provides consistent user experience, and helps reduce the total
cost of ownership (TCO). It can coexist with the legacy coupled storage-compute
architecture. Typical application scenarios include big data for finance, Internet log
retention, and government sectors.
3. Object service supports a single bucket carrying a maximum of 100 billion objects
without performance deterioration. This eliminates the trouble of bucket
reconstruction for large-scale applications. Typical scenarios include production,
storage, backup, and archiving of financial electronic check images, audio and video
recordings, medical images, government and enterprise electronic documents, and
Internet of Vehicles (IoV).
For more information, log in to http://support.huawei.com to obtain the relevant product
documentation.

1.3.2 Software and Hardware Architectures


For details about product forms, log in to the Huawei Data Storage Infocenter, and
access 3D Interactive Multimedia to obtain 3D images.
For more hardware information, see the corresponding product documentation.
Industry-leading distributed architecture: Huawei distributed storage adopts a fully
distributed architecture, including distributed management cluster, hash data routing
algorithm, stateless engine, and smart cache. This architecture frees the entire storage
system of single points of failure (SPOFs).
High performance and reliability: Huawei distributed storage implements load balancing
among all disks. Distributed data storage eliminates hot data. Effective routing
algorithms and the distributed cache technology ensure high performance.
Rapid parallel data restoration upon failures: Data is fragmented across the resource
pool. If a disk is faulty, these fragments are automatically reconstructed by restoring
data copies in parallel across the pool.
Easy expansion and ultra-large capacity: The distributed stateless engines in Huawei
distributed storage support scale-out, smooth and parallel capacity expansion of storage
and computing resources, and non-silo ultra-large capacity expansion.
The overall software architecture supports various enterprise features, such as second-
level asynchronous replication of block services and active-active. The architecture is
microservice-oriented, in which the block service, HDFS service, and object service can
share the Persistence service.
The meanings of some terms in the software architecture in some teaching materials are
as follows:
Protocol: indicates a storage protocol layer. The block, object, HDFS, and file services
support local mounting access over iSCSI or VSC, S3/Swift access, HDFS access, and NFS
access, respectively.
VBS: indicates a block access layer of the block service. User I/Os are delivered to VBS
through iSCSI or VSC.
EDS-B: provides the block service with enterprise-level features, and receives and
processes I/Os from VBS.
EDS-F: provides the HDFS service.
OBS service: provides the object service.
DP: provides data protection.
Persistence layer: provides persistent storage, EC, and multi-copy capabilities. Plog
clients provide append-only access to plogs.
Infrastructure: provides infrastructure capabilities for the storage system, such as
scheduling and memory allocation.
OAM: indicates the storage management plane, which provides functions such as
deployment, upgrade, capacity expansion, monitoring, and alarming.

1.3.2.1 High Performance


 DHT routing technology
In Huawei distributed storage, the block service uses the distributed hash table (DHT)
routing algorithm. Each storage node stores a small proportion of data, and the data
is routed and stored using the DHT routing algorithm. This DHT algorithm features
balance and monotonicity.
1. Balance: Data is distributed to all nodes as evenly as possible, thereby balancing
loads among nodes.
2. Monotonicity: When new nodes are added to the system, the system
redistributes data among nodes. Data migration is implemented only on the new
nodes, and the data on the existing nodes is not significantly adjusted.
Traditional storage systems typically employ the centralized metadata management
mechanism, which allows metadata to record the disk distribution of the LUN data
with different offsets. For example, the metadata may record that the first 4 KB of
data in LUN1+LBA1 is distributed on LBA2 of the 32nd disk. Each I/O operation
initiates a query request for the metadata service. As the system scale grows, the
metadata size also increases. However, the concurrent operation capability of the
system is subject to the capability of the server accommodating the metadata service.
In this case, the metadata service may become a performance bottleneck of the
system.

During system initialization, Huawei distributed storage system sets partitions for
each disk based on the value of N and the number of disks. For example, the default
value of N is 3600 for two-copy backup. If the system has 36 disks, each disk has 100
partitions. The partition-disk mapping is configured during system initialization and
dynamically adjusted based on the number of disks. The partition-disk mapping table
occupies only a small space, and Huawei distributed block storage nodes store the
mapping table in the memory for rapid routing. Huawei distributed block storage
does not employ the centralized metadata management mechanism and therefore
does not have performance bottlenecks incurred by the metadata service.
Huawei distributed block storage logically divides a LUN by every 1 MB of space. For
example, a LUN of 1 GB space is divided into 1024 slices of 1 MB space.
When an application accesses block storage, the SCSI command carries the LUN ID,
LBA ID, and I/O data to be read/written. The OS forwards the message to the VBS of
the local node. The VBS generates a key based on the LUN ID and LBA ID. The key
contains rounding information of the LBA ID based on the unit of 1 MB. The result
calculated using DHT hash indicates the partition. The specific disk is located based
on the partition-disk mapping recorded in the memory. The VBS forwards the I/O to
the OSD to which the disk belongs. For example, if an application needs to access the
4 KB data identified by an address starting with LUN1+LBA1, Huawei distributed
storage first constructs "key=LUN1+LBA1/1M", calculates the hash value of this
key, performs a modulo-N operation on the hash to get the partition number, and
then obtains the disk holding the data from the partition-disk mapping.
Each OSD manages a disk. During system initialization, the OSD divides the disk into
slices of 1 MB and records the slice allocation information in the metadata
management area of the disk. After receiving an I/O from the VBS, the OSD searches
for the data fragment information on the disk based on the key, reads or writes the
data, and returns the data to the VBS. In this way, the entire data routing process is
completed.
The DHT routing technology helps Huawei distributed storage quickly locate the
specific location where data should be stored based on service I/Os, avoiding
searching and computing in massive data. This technology uses Huawei-developed
algorithms to ensure that data is balanced among disks. In addition, when hardware
is added or removed (due to faults or capacity expansion), the system automatically
and quickly adjusts the hardware to ensure the validity of data migration, automatic
and quick self-healing, and automatic resource balancing.
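The routing steps above can be sketched in Python (an illustrative stand-in: zlib.crc32 substitutes for the storage system's actual DHT hash function, and the partition-disk mapping is simplified to a round-robin layout built at initialization):

```python
import zlib

N_PARTITIONS = 3600              # default value of N for two-copy backup
SLICE = 1024 * 1024              # LUNs are logically divided into 1 MB slices

def route_io(lun_id: int, lba: int, partition_to_disk: dict) -> int:
    """Return the disk holding the 1 MB slice that contains (lun_id, lba)."""
    key = f"LUN{lun_id}+{lba // SLICE}"        # LBA rounded to its 1 MB slice
    partition = zlib.crc32(key.encode()) % N_PARTITIONS
    return partition_to_disk[partition]        # in-memory mapping lookup

# With 36 disks, each disk is assigned 100 of the 3600 partitions.
mapping = {p: p % 36 for p in range(N_PARTITIONS)}
disk = route_io(lun_id=1, lba=4096, partition_to_disk=mapping)
```

Because the mapping table is small and held in memory, every I/O is routed with one hash and one lookup, with no query to a centralized metadata service.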
 Dynamic intelligent partitioning and static disk selection algorithms
During data persistence, Huawei distributed storage adopts two layers of algorithms to
optimize the performance and reliability of distributed storage. One is the dynamic
intelligent partitioning algorithm, which selects partitions (PTs) for newly created
Plogs. The other is the local static disk selection algorithm, which selects OSDs for each PT.
1. The dynamic intelligent partitioning algorithm introduces an adaptive negative
feedback mechanism to achieve superb reliability and performance. Its major
improvements and objectives are as follows:
a. Write reliability is not degraded. If the partition corresponding to a Plog falls
into a faulty disk, the Plog is discarded and a new Plog is selected to write
data.
b. Loads are balanced and hotspots are eliminated. In random access
scenarios, polling or distributed hash algorithms cannot fully ensure
balanced data layouts and disk access performance. For example, in some
storage systems based on the CRUSH hash algorithm, the utilization
difference between OSDs reaches 20%, causing continuous occurrence of
hotspot disks. In addition, disk fault recovery, slow disk hotspots, and QoS
traffic control affect system performance. The Plog Manager of Huawei
distributed storage periodically collects the available capacity and I/O load
of disks, nodes, and partitions, and intelligently identifies hotspots and fast
and slow disks.
2. The local static disk selection algorithm aims to optimize the local balancing of
the mapping between a partition and an OSD. Optimization is achieved in the
following scenarios:
a. Data balancing performance is optimized without affecting reliability when
nodes are added, reducing invalid data migration. In the algorithm, cyclic
selection of partitions by partition group rather than conventional disk-
based balancing is used. Intra-group balancing significantly reduces invalid
data migration without degrading reliability. The invalid data migration
ratio is only 1% to 3% of that of the CRUSH algorithm, and the actual
invalid data migration ratio is only 0.05% to 0.8%.
b. In multi-copy scenarios, the balancing of primary partitions is optimized.
Common algorithms only focus on the balancing of initial primary copies.
Once a node or a disk becomes faulty, primary copies are selected again for
many data blocks. Whether the new primary copies are evenly distributed
affects the current service performance and the performance of subsequent
incremental data synchronization. The existing solution is static fixed
selection of primary copies, that is, selecting secondary copy 1 first and then
secondary copy 2 after a primary copy becomes faulty. Huawei distributed
storage adopts the dynamic primary selection solution. In the solution, it
selects a primary disk based on the workloads of the disks where the
secondary copies reside to prevent some disks from becoming hotspots after
a fault occurs. The improved algorithm enhances the data balancing speed
by up to 28.8% in actual tests.
c. In EC scenarios, the number of supported concurrent data reconstruction
tasks is improved and performance is increased. The correlation between PT
groups and global disks is also used for small-scale calculation. The smaller
the number of data blocks read from a single normal disk, the more the
disks involved in load balancing. With the improvement, the disk read
latency is shortened by up to 45% during system restoration.
 Intelligent CPU partitioning algorithms
Intelligent CPU partitioning algorithms are categorized into the CPU grouping
algorithm and CPU core splitting algorithm. The algorithms group CPUs and divide
cores based on service types, identify I/O types (host I/Os, replication I/Os, and
background I/Os), and intelligently schedule I/Os based on priorities to ensure stable
read and write latency.
CPU grouping algorithm: Read and write I/Os and other I/Os are deployed in
different groups to avoid mutual interference. Dedicated cores are used for key
services to reduce latency. Multiple cores are shared and allocated for load balancing
among multiple services.
CPU core splitting algorithm: A request is executed on the same core to implement
the "Run-to-Complete" scheduling principle. No thread switchover is performed on
the same core to ensure operation atomicity. This avoids frequent multi-core
switchovers and improves the CPU cache hit ratio.
 EC intelligent aggregation technology
Erasure coding (EC) increases the computing overhead, and a poor EC design brings
more write penalties. Therefore, the performance of products using EC may be
significantly lower than that of products using multi-copy storage.
Write penalty of EC: Take 4+2 redundancy as an example, with a 32 KB full stripe. If
the data to be written is smaller than 32 KB (for example, 16 KB), the 16 KB is
written first. On a subsequent write to the same stripe, the 16 KB written earlier
must be read back and combined with the new data before being written to disks,
causing read overhead. This problem does not occur when full stripes are delivered.
Intelligent aggregation EC based on append write ensures EC full-stripe writes at all times,
reducing read/write network amplification and disk amplification by several times.
Data is aggregated at a time, reducing the CPU computing overhead and providing
ultimate peak performance. In the cache, data of multiple LUNs is aggregated into a
full stripe, reducing write amplification and improving performance.
Huawei distributed storage provides intelligent I/O aggregation to use different
policies for different I/Os, ensuring the read/write performance.
1. Large I/Os are formed into EC stripes and directly written to disks without being
cached, saving cache resources. When SSDs are used as cache media, the service
life of SSDs can be extended.
2. Small I/Os are written to the cache and an acknowledgement is returned
immediately.
3. In the log cache, small I/Os of different LUNs are aggregated into large I/Os to
significantly increase the probability of aggregation and improve performance.
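The full-stripe aggregation in item 3 can be sketched as follows (a toy model assuming an 8 KB stripe unit, so a 4+2 full stripe carries 32 KB of data; parity generation and the per-LUN bookkeeping are omitted):

```python
STRIPE_UNIT = 8 * 1024        # assumed stripe-unit size
DATA_UNITS = 4                # 4+2 EC: four data units per stripe

class StripeAggregator:
    """Log-cache sketch: small writes (possibly from different LUNs) are
    acknowledged once cached and flushed only as complete full stripes,
    so EC never needs a read-modify-write of old data."""
    def __init__(self):
        self.buffer = bytearray()
        self.flushed = []      # each entry models one full-stripe disk write

    def write(self, data: bytes):
        self.buffer.extend(data)               # ack to host after caching
        full = DATA_UNITS * STRIPE_UNIT
        while len(self.buffer) >= full:
            stripe = bytes(self.buffer[:full]) # + 2 parity units in real EC
            del self.buffer[:full]
            self.flushed.append(stripe)

agg = StripeAggregator()
for _ in range(4):
    agg.write(b"x" * 16 * 1024)                # four 16 KB small writes
```

Four 16 KB writes fill exactly two 32 KB full stripes, so both flushes go to disk without reading anything back.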
 Adaptive global deduplication and compression
Huawei distributed storage systems support adaptive inline and post-process global
deduplication and compression, which can provide ultimate space reduction and
reduce users' TCO with proper resource consumption. Adaptive inline and post-
process deduplication indicates that inline deduplication automatically stops when
the system resource usage reaches the threshold. Data is directly written to disks for
persistent storage. When system resources are idle, post-process deduplication starts.
After the deduplication is complete, the compression process starts. The compression
is 1 KB aligned. The LZ4 algorithm is used to support HZ9 deep compression to
obtain a better compression ratio. Deduplication and compression can be enabled or
disabled as required.
Huawei distributed storage supports global deduplication and compression, as well
as adaptive inline and post-process deduplication. Deduplication reduces write
amplification of disks before data is written to disks.
Huawei distributed storage adopts the opportunity table and fingerprint table
mechanism. After data enters the cache, the data is sliced into 8 KB data fragments.
The SHA-1 algorithm is used to calculate 8 KB data fingerprints. The opportunity
table is used to reduce invalid fingerprint space, thereby reducing cost in memory.
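The opportunity-table mechanism can be sketched as follows (an interpretation of the scheme above, with invented structure: a fingerprint is promoted to the in-memory fingerprint table only after it is seen a second time, so unique data never inflates the table):

```python
import hashlib

CHUNK = 8 * 1024     # data entering the cache is sliced into 8 KB fragments

class Deduplicator:
    def __init__(self):
        self.opportunity = set()     # fingerprints seen exactly once so far
        self.fingerprints = {}       # fingerprint -> location of stored copy

    def write(self, chunk: bytes, location: int) -> int:
        fp = hashlib.sha1(chunk).hexdigest()   # 8 KB data fingerprint
        if fp in self.fingerprints:
            return self.fingerprints[fp]       # duplicate: reference the copy
        if fp in self.opportunity:
            self.fingerprints[fp] = location   # second sighting: track it
        else:
            self.opportunity.add(fp)           # first sighting: cheap record
        return location

dedup = Deduplicator()
a = dedup.write(b"A" * CHUNK, 100)
b = dedup.write(b"A" * CHUNK, 200)
c = dedup.write(b"A" * CHUNK, 300)   # deduplicated against the tracked copy
```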
 Multi-level cache technology
1. The write cache process is as follows:
Step 1: Data is written to the RAM-based write cache (Memory Write Cache).
Step 2: Data is written to the SSD WAL cache (for large I/Os, data is directly
written to HDDs) and a message is returned to the host indicating that the write
operation is complete.
Step 3: When the memory write cache reaches a specified watermark, data starts
to be flushed to disks.
Step 4: Data of large I/Os is written to the HDDs directly. Data of small I/Os is
first written to the SSD write cache and then to the HDDs after the small I/Os
have been aggregated into large I/Os.
Note: If the data written in step 1 exceeds 512 KB, it is directly written to the
HDDs, as described in step 4.
2. The read cache process is as follows:
Step 1: Data is read from the memory write cache. If the read I/O is hit, data is
returned. Otherwise, the system proceeds to step 2.
Step 2: Data is read from the memory read cache. If the read I/O is hit, data is
returned. Otherwise, the system proceeds to step 3.
Step 3: Data is read from the SSD write cache. If the read I/O is hit, data is
returned. Otherwise, the system proceeds to step 4.
Step 4: Data is read from the SSD read cache. If the read I/O is hit, data is
returned. Otherwise, the system proceeds to step 5.
Step 5: Data is read from the HDDs.
Note: Pre-fetched data, such as sequential data, is cached to the memory read
cache.
Hotspot data identified during the read process is cached to the SSD read cache.
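Steps 1 to 5 of the read path can be sketched as a tiered lookup (a toy model using dictionaries as cache tiers; the promotion policy is simplified to "any HDD read is a hotspot candidate"):

```python
def cached_read(key, mem_write, mem_read, ssd_write, ssd_read, hdd):
    """Check each cache tier in order; fall through to HDD on a miss."""
    for tier in (mem_write, mem_read, ssd_write, ssd_read):
        if key in tier:
            return tier[key]
    data = hdd[key]
    # Hotspot data identified during reads is cached to the SSD read cache.
    ssd_read[key] = data
    return data

tiers = dict(mem_write={}, mem_read={}, ssd_write={}, ssd_read={},
             hdd={"blk7": b"payload"})
first = cached_read("blk7", **tiers)    # miss in every cache, read from HDD
second = cached_read("blk7", **tiers)   # now served from the SSD read cache
```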
 Append only plog technology
Huawei distributed storage supports HDD and SSD media at the same time. HDDs
and SSDs have great differences in technical parameters such as bandwidth, IOPS,
and latency. Therefore, I/O patterns applicable to both media are different. Huawei
distributed storage leverages Append Only Plog technology for unified management
of HDDs and SSDs. The Append Only Plog technology provides the optimal disk
writing performance model for media. Small I/O blocks are aggregated into large
ones, and then large I/O blocks are written to disks in sequence. This write mode
complies with the characteristics of disks.
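The append-only write model can be sketched as follows (a minimal illustration: every write lands sequentially at the log tail, and an index repoints each block to its newest version, leaving the old version as garbage for later collection):

```python
class AppendOnlyPlog:
    """Data is never modified in place; writes are sequential appends."""
    def __init__(self):
        self.log = bytearray()
        self.index = {}          # block id -> (offset, length) of live data

    def write(self, block_id, data: bytes):
        offset = len(self.log)   # always the tail: media-friendly sequential I/O
        self.log.extend(data)
        self.index[block_id] = (offset, len(data))  # old version becomes garbage

    def read(self, block_id) -> bytes:
        off, length = self.index[block_id]
        return bytes(self.log[off:off + length])

plog = AppendOnlyPlog()
plog.write("blk1", b"old version!")
plog.write("blk1", b"new version!")      # appended, not overwritten in place
```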
 Intelligent load balancing technology
Huawei distributed storage evenly distributes data and metadata on nodes and
automatically balances loads, eliminating metadata access bottlenecks and ensuring
system performance during capacity expansion. Load can also be automatically
balanced between existing nodes in the event of node failures or capacity expansion.
Larger capacity and higher performance can be achieved without modifying the
application configuration.
Intelligent load balancing is used for access based on domain names (in active-
standby mode). It supports partitioning. Each partition can be configured with a
unique domain name and a customized load balancing policy.

1.3.2.2 High Reliability


 Data reconstruction
Generally, distributed storage uses all disks to participate in data reconstruction.
However, Huawei distributed storage uses the self-developed high-performance
algorithm to ensure the validity and balance of data migration through one-off
computing and minimize the data migration workload. In addition, Huawei
distributed storage combines the write cache processing to shorten the data
reconstruction time and reduce the impact of data reconstruction on service
performance.
 Erasure coding turbo
Huawei distributed storage can also use erasure coding (EC) to ensure data
reliability. Compared with the three-copy scheme, EC provides higher disk utilization
without compromising storage reliability.
The EC-based data protection technology is based on distributed and inter-node
redundancy. Huawei distributed storage uses the self-developed Low Density Erasure
Code (LDEC) algorithm. It is an MDS array code based on the XOR and Galois field
multiplication. The minimum granularity is 512 bytes. It supports Intel instruction
acceleration and various mainstream ratios. Data written into the system is divided
into N data strips, and then M redundant data strips are generated (both N and M
are integers). These data strips are stored on N+M nodes.
EC Turbo drives higher space utilization and provides a data redundancy solution that
ensures stable performance and reliability when faults occur.
 E2E data integrity check
Distributed storage provides high-end enterprise storage reliability and has little
impact on system performance.
End-to-end data integrity check: Write requests are verified at the system access point (the VBS process), and host data is re-verified by the OSD process before being written to disks. If the integrity check succeeds, the data is written to disks; otherwise, the data is returned to the EDS or VBS node for verification and is written again. When the host reads data, the integrity check is performed on the VBS. If the check fails, the host retries the read or performs self-healing.
Huawei distributed storage provides the data integrity check capabilities of high-end enterprise storage. It uses the CRC32 algorithm to protect 4 KB user data blocks. In addition, it supports the host logical block addressing (LBA) check and the disk LBA check, addressing silent data corruption problems such as bit errors and read/write skew.
Improved self-healing capability: If the data integrity check fails, other redundant
data is used to automatically repair incorrect data. If local redundancy fails to repair
the data, HyperMetro redundant data can be used to repair the data.

Minimum impact on performance: Data I/Os and verification metadata are written to disks atomically, in a single operation.
Periodic verification: When the system service load is light, the system automatically
starts periodic background verification.
Both real-time verification and periodic background verification are supported. Real-time verification: write requests are verified on the VBS, host data is re-verified on the OSD before being written to disks, and data read by the host is verified on the VBS. Periodic background verification: the system automatically enables background data integrity checks and self-healing when the workload is light.
In total, three verification mechanisms are provided: CRC32 protection of 4 KB user data blocks, the host LBA check, and the disk LBA check. Two self-healing mechanisms are provided: local redundant data is used first to repair faulty data; if that fails, HyperMetro redundant data is used.
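The per-block CRC32 protection described above can be modeled with Python's standard zlib module. This is an illustrative sketch of the verification flow, not Huawei's implementation; the block layout and error handling are assumptions.

```python
import zlib

BLOCK_SIZE = 4096  # the guide states 4 KB user data blocks are protected

def write_blocks(data):
    """Split data into 4 KB blocks, storing each with its CRC32.
    (In the real system, data and verification metadata are written
    to disk atomically.)"""
    return [(data[i:i + BLOCK_SIZE], zlib.crc32(data[i:i + BLOCK_SIZE]))
            for i in range(0, len(data), BLOCK_SIZE)]

def read_blocks(blocks):
    """Re-verify every block on read; a CRC mismatch indicates silent
    corruption and would trigger a retry or self-healing from redundant data."""
    out = bytearray()
    for chunk, crc in blocks:
        if zlib.crc32(chunk) != crc:
            raise IOError("integrity check failed, triggering self-healing")
        out += chunk
    return bytes(out)

blocks = write_blocks(b"x" * 10000)
assert read_blocks(blocks) == b"x" * 10000

# Simulate silent corruption in the first block: the stored CRC no longer
# matches, so the read-time check catches it.
blocks[0] = (b"y" + blocks[0][0][1:], blocks[0][1])
try:
    read_blocks(blocks)
except IOError as err:
    print(err)
```

A checksum alone detects corruption but cannot repair it; repair relies on the redundant copies or EC strips described earlier.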
 Sub-health management technology
Sub-health may cause slow system response or service interruption. Huawei
distributed storage supports fast sub-health check and isolation. It uses the fast-fail
function to control the impact on system performance within 5 seconds, ensuring
that latency-sensitive services are not interrupted in case a sub-health fault occurs.
1. Disk sub-health management: includes intelligent detection, diagnosis, isolation, and warning. Huawei distributed storage can generate alarms and isolate disks for mechanical disk faults, slow disks, SMART alarms, and UNC errors. It can also cope with problems such as SSD card faults, slow cards, high temperature, capacitor failures, and erase counts exceeding the sub-health threshold. Through intelligent detection, Huawei distributed storage collects SMART messages, I/O latency statistics, real-time I/O latency, and I/O errors. Clustering and slow-disk detection algorithms are used to diagnose abnormal disks or RAID controller cards. For isolation and warning, Huawei distributed storage notifies the MetaData Controller (MDC) to isolate disks and report alarms after diagnosis. The MDC controls the distributed cluster node status, data distribution rules, and data rebuilding rules.
2. Network sub-health management: If a NIC failure, rate decrease, link packet loss, port fault, or intermittent disconnection occurs or exceeds the corresponding sub-health threshold, Huawei distributed storage locates the fault, generates an alarm, and isolates the faulty network resources through multi-level detection, intelligent diagnosis, and level-by-level isolation and warning. Multi-level detection: The local network of a node quickly detects exceptions such as intermittent disconnections, packet errors, and abnormal negotiated rates. In addition, nodes are intelligently selected to send detection packets in an adaptive manner to identify link latency exceptions and packet loss. Smart diagnosis: Network ports, NICs, and links are diagnosed based on networking models and error messages. Level-by-level isolation and warning: Network ports, links, and nodes are isolated based on the diagnosis results, and alarms are reported.
3. Sub-health management for processes/services: includes cross-process/service
detection, intelligent diagnosis, and isolation and warning. Cross-process/service
detection: If the I/O access latency exceeds the specified threshold, an exception
is reported. Smart diagnosis: Huawei distributed storage diagnoses processes or
services with abnormal latency using the majority voting or clustering algorithm
based on the reported abnormal I/O latency of each process or service. Isolation
and warning: Abnormal processes or services are reported to the MDC for
isolation and alarms are reported. If CPU resources are used up or memory faults
occur, the system can locate sub-health faults and isolate related nodes.
4. Fast-fail: ensures that the I/O latency of a single sub-healthy node is controllable. Average I/O latency detection checks whether the average I/O latency exceeds the threshold and whether a response is returned for each I/O. If no response is returned, a path-switching retry is triggered: read I/Os are served from other copies or recalculated based on EC, and write I/Os are written to Plogs so that space is allocated on other disks.
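A minimal model of the average-latency detection behind fast-fail might look as follows. The 5-second bound comes from the text above; the class name, sampling window, and API are hypothetical.

```python
from collections import deque

class FastFail:
    """Sketch of average I/O latency detection for one node.
    threshold_ms reflects the 5-second bound stated in the guide;
    the sliding-window size is an assumed value."""

    def __init__(self, threshold_ms=5000, window=128):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # most recent latencies only

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def path_switch_needed(self):
        """True means: serve reads from another copy (or rebuild via EC)
        and redirect writes to Plogs so space is allocated on other disks."""
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold_ms

ff = FastFail()
for _ in range(10):
    ff.record(2)              # healthy I/Os, low latency
assert not ff.path_switch_needed()
for _ in range(100):
    ff.record(9000)           # sub-healthy node: latency spikes
assert ff.path_switch_needed()
```

The key design point is bounding the latency seen by latency-sensitive services: rather than waiting indefinitely for a slow node, the I/O path is switched once the average crosses the threshold.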
 Fast failover
Cluster management (CM): manages clusters.
Service node management (SNM): provides functions such as process monitoring and
fault recovery.
If a node is temporarily faulty or subhealthy, a switchover is performed. The cluster
controller generates a temporary view based on the sub-healthy node, switches
services to other healthy nodes, and writes the replication I/Os to the temporary data
storage node.
The system checks every 5 minutes whether the access latency of the sub-healthy node has recovered. If it has, the temporary data is pushed back to the original node by a background process; once the transfer is complete, the temporary view is deleted and services are switched back to the original node. If the latency does not recover, the system removes the faulty node from the cluster and performs global reconstruction.
 Hardware faults transparent to the system
1. Memory protection is enabled upon power failures, ensuring data security.
2. Hot-swappable SAS disks in RAID 1 are used as system disks.
3. Power and fan modules are redundant.
4. Swappable mainboards and a cable-free design are adopted, significantly increasing node reliability while reducing replacement time by 80%.
 Cabinet-level reliability
In multi-copy mode, different data copies are distributed in different cabinets. For
example, if the 3-copy storage mode is configured for a storage pool containing eight
cabinets, the system can still provide services when two cabinets become faulty.
In EC mode, data and parity fragments are distributed in different cabinets. For
example, if the 4+2 EC scheme is configured for a storage pool containing eight
cabinets, the system can still provide services when two cabinets become faulty.
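The two examples above follow a simple rule that can be expressed directly, assuming every copy or fragment lands in a distinct cabinet:

```python
def cabinets_tolerated(copies=None, ec=None):
    """Cabinet failures tolerated when copies/fragments are spread one per
    cabinet: copies - 1 for multi-copy mode, or the parity count M for an
    N+M EC scheme. A simplified model, not a sizing tool."""
    if copies is not None:
        return copies - 1
    n, m = ec
    return m

assert cabinets_tolerated(copies=3) == 2   # 3-copy pool: two cabinets may fail
assert cabinets_tolerated(ec=(4, 2)) == 2  # 4+2 EC pool: two cabinets may fail
```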

1.3.3 Application Scenarios


 Private cloud and virtualization
Huawei distributed storage provides mass data storage resource pools featuring on-
demand resource provisioning and elastic capacity expansion for virtualization and
cloud. It improves storage deployment, expansion, and operation and maintenance
(O&M) efficiency using general-purpose servers. Typical scenarios include Internet-
finance channel access clouds, development and testing clouds, cloud-based services,
B2B cloud resource pools in carriers' BOM domains, and e-Government cloud.
 Mission-critical database
Huawei distributed storage delivers enterprise-grade capabilities, such as distributed
active-active storage and consistent low latency, to ensure efficient and stable
running of data warehouses and mission-critical databases, including online
analytical processing (OLAP) and online transaction processing (OLTP).
 Big data analysis
Huawei distributed storage provides a decoupled storage-compute solution for big
data, which integrates traditional data silos and builds a unified big data resource
pool for enterprises. It also leverages enterprise-grade capabilities, such as elastic
large-ratio erasure coding (EC) and on-demand deployment and expansion of
decoupled compute and storage resources, to improve big data service efficiency and
reduce TCO. Typical scenarios include big data analytics for finance, carriers (log
retention), governments, and Safe City.
 Content storage and backup archiving
An enterprise-level object service resource pool with high performance and reliability
can meet requirements of real-time online services such as Internet data, online
audio and video data, and enterprise web disks. It delivers large throughput, enables
frequent access to hotspot data, and implements long-term storage and online
access. Typical scenarios include storage, backup, and archiving of financial electronic
check images, audio and video recordings, medical images, government and
enterprise electronic documents, and Internet of Vehicles (IoV).
For example, Huawei OceanStor 100D distributed storage block service can be used
in scenarios such as BSS, MSS, OSS, and VAS. In object storage scenarios, its
advantages are as follows:
Stable, low latency for customer access: Latency stays below 80 ms, meeting the stability requirements of continuous video writes and improving the access experience of end users.
High concurrent connections: Millions of video connections are supported, ensuring stable performance.
On-demand use: Storage resources can be used and paid for on demand as services grow, reducing the TCO.

2 Storage Design and Performance Tuning

2.1 Storage Resource Planning and Design


2.1.1 Overview
2.1.1.1 Planning and Design
Proper planning and design of storage resources can help you improve the fault tolerance
capability and reduce the impact of risks.
Actually, the boundary between planning and design is not obvious. They have similar
objectives and face similar challenges.
 The overall objective of planning and design is to better meet storage requirements
with the minimum investment.
 The key challenge that you will face in planning and design is the uncertainty of
requirements.
You can understand planning and design in this way: Macro planning is often regarded as
the beginning of a project. You need to follow the existing rules and framework to find
the best choice. However, design focuses more on the details of the specific problems and
solutions.
Planning usually consists of two modules: strategy and design.
Strategy specifies design principles, such as business objectives, development
requirements, and technology selection. A good design usually begins with clear and
explicit objectives, requirements, assumptions, and restrictions. Such information can
make it easy for a design project to be implemented without problems. It is often
necessary to balance the best practices of technology and the ultimate objectives and
requirements of the organization. Even a perfect technical design cannot be delivered if it
does not meet the requirements of the organization.
The design should be easy to understand, deploy, and use. This includes standardized naming rules, port group and switching standards, and storage standards. Overly complex standards increase management and maintenance costs during deployment and future use. Generally, manageability, high availability, scalability, and data security must be considered during design, and costs should be kept as low as is reasonable. A scalable environment reduces the expansion cost when the organization grows in the future. Data security must also be considered; for users, data security allows no accident. Design is implemented
based on the service level of the user. For important services, high availability should be
considered when hardware damage occurs and during maintenance.

2.1.1.2 Planning and Design Principles


Planning and design are the basis of implementation. The involved principles and factors
to be considered may change according to the actual situations.
Some important planning and design principles are listed below:
 Functionality: The planned and designed system must meet basic requirements for
functions.
 Feasibility: The planned and designed solution must be feasible.
 Compliance: The planned and designed system complies with related laws and
regulations.
 Cost-effectiveness: The system construction cost and the operation and maintenance
(O&M) cost must be considered.
 Sustainable development: The development and expansion of services must be
considered. For example, the elastic expansion of the system needs to be planned.
 Changeability and scalability: For example, modular planning and design can help
cope with possible service requirement changes in the future.
 Security: Data security of the system must be ensured.
 Standardization, modularization, energy saving, environmental protection, and so on
The principles listed in the training materials are related to each other. In actual
application, other principles and factors may be involved in the planning and design
process to ensure smooth project implementation based on the project background.

2.1.1.3 Storage Planning, Design, and Implementation


Information survey and collection: Collect service information (such as network topology
and host information) and detailed storage requirements (involving disk domains and
storage pools) to know the service running status and predict the service growth trend.
Based on the collected information and specific requirements, analyze storage capacity,
IOPS, storage pool, management, and advanced functions, examine the feasibility of the
implementation solution, and determine the implementation roadmap of key
requirements.
Engineering survey: Onsite survey plays an important role in a project. An accurate and
detailed survey report is very valuable. It not only lays an important foundation for
correct product delivery, but also provides important data and reference information for
the subsequent project design/data settings and project construction. Survey engineers
have responsibilities for providing helpful suggestions and measures and assisting
customers actively in project preparations.
Compatibility check: Check the compatibility based on the host operating system version,
host multipathing information, host application system information, Huawei storage
version, storage software, and multipathing software information provided by the
customer.

LLD planning and design: Output the LLD solution and document. Submit the LLD
document to the customer for review. Modify the document based on the customer's
comments.
Value-added feature planning and design: Plan and design value-added features based
on the customer's requirements and purchased value-added features.

2.1.2 Methods and Processes


2.1.2.1 Basic Phases
 Survey
The survey phase consists of determining the project scope and collecting design
data. Understanding customer needs and project details and identifying key
stakeholders are key success factors in providing designs that meet customer needs.
In this phase, the project scope and business objectives are determined.
The most effective way to get information before design is to listen as much as
possible. In the early discussions with the business unit, objectives, restrictions, and
risks of all projects must be covered.
If objectives, requirements, or restrictions cannot be met or risks are generated,
discuss with key personnel of the business unit about the problems and
recommended solutions as soon as possible to avoid project delay.
When making analysis and other design choices with business units, consider the
performance, availability, scalability, and other factors that may affect the project.
Also consider how to reduce risks during design, and discuss the costs brought
thereof.
Record all discussion points during design for future acceptance.
An excellent design must meet several final goals, which are influenced by some
strategic principles of the organization.
 Conceptual Design
Conceptual design emphasizes content simplification, differentiates priorities, and
focuses on long-term and overall benefits.
Conceptual design is to determine service objectives and requirements of a project. It
determines the entities affected by the project, such as business units, users,
applications, processes, management methods, and physical machines. It also
determines how project objectives, requirements, and constraints apply to each
entity. A system architecture that meets both service objectives and restrictions must
be output. For example, the availability, scalability, performance, security, and
manageability must meet the requirements, the cost can be controlled, and other
restrictions are met.
 HLD
HLD is short for high level design. HLD is to determine the network scope and
function deployment solution based on the network architecture. It includes the
interaction process of important services within the scope of the contract, function
allocation of all services on each NE, peripheral interconnection relationships of NEs,
interconnection requirements and rules, and quantitative calculation of interconnection resources.
 LLD
Based on the network topology, signaling route, and service implementation solution
output in the network planning phase, low level design (LLD) defines the design
contents, provides data configuration principles, and guides data planning activities.
In this way, the storage solution can be successfully implemented to meet customer
requirements. LLD covers detailed hardware information and implementation
information. For example:
 Distribution of data centers and layout of equipment rooms
 Server quantity, server type, service role, network configuration, and account
information
 Storage hardware quantity and type, RAID name and level, and disk size and
type
 Network topology, switch configuration, router configuration, access control, and
security policy

2.1.2.2 Storage Planning and Design Content


2.1.2.2.1 Project Information
 Information Collection
Information collection is the first step of planning and design and the basis for
subsequent activities. Comprehensive, timely, and accurate identification, filtering,
and collection of raw data are necessary for ensuring information correctness and
effectiveness.
The information to be collected in a storage project includes the project background,
service information, network information, and detailed customer requirements. In
addition, the project schedule must be collected.
Live network information includes device information, network topology, and service
information.
When collecting information about customer requirements, focus on information
about the customer's current pain points of services, whether the storage product
(involving storage capacity and concurrency) meets the service growth requirements,
and system expansion analysis for the future.
The project schedule includes the project delivery time and key time points. In the
schedule, you must clarify the time required to complete the work specified in the
delivery scope of a certain phase, tasks related to the delivery planned in each time
period, and milestones as well as time points of events.
 Requirement Analysis
Based on the collected information and specific requirements, analyze storage
capacity, IOPS, storage pool, management, and advanced functions, examine the
feasibility of the implementation solution, and determine the implementation
roadmap of key requirements.

For storage resource planning and design, requirement analysis refers to the process
of analyzing and sorting out the requirements or needs involved in a project to form
a complete and clear conclusion.
Requirement analysis involves functional requirements, non-functional requirements,
and standards and constraints.
In addition to core functions, availability, manageability, performance, security, and
cost must also be considered in requirement analysis.
Availability: indicates the probability and duration of normal system running during a
certain period. It is a comprehensive feature that measures the reliability,
maintainability, and maintenance support of the system.
Manageability: Storage manageability includes integrated console, remote
management, traceability, and automation.
1. Integrated console: integrates the management functions of multiple devices and
systems and provides end-to-end integrated management tools to simplify
administrators' operations.
2. Remote management: manages systems through the network on the remote
console. These devices or systems do not need to be managed by personnel at
the deployment site.
3. Traceability: ensures that the management operation history and important
events can be recorded.
4. Automation: The event-driven mode is used to implement automatic fault
diagnosis, periodic and automatic system check, and alarm reporting when the
threshold is exceeded.
Performance: Indicators of a physical system are designed based on the service level
agreement (SLA) for the overall system and different users. Performance design
covers not only performance indicators required by normal services, but also
performance requirements in abnormal cases, such as the burst peak performance,
fault recovery performance, and DR switchover performance.
Security: Security design must provide all-round security protection for the entire
system. The following aspects must be included: physical layer security, network
security, host security, application security, virtualization security, user security,
security management, and security service. Multiple security protection and
management measures are required to form a hierarchical security design.
Cost: The cost is always considered. An excellent design should always focus on the
total cost of ownership (TCO). When calculating the TCO, consider all associated
costs, including the purchase cost, installation cost, energy cost, upgrade cost,
migration cost, service cost, breakdown cost, security cost, risk cost, reclamation cost,
and handling cost. The cost and other design principles need to be coordinated based
on balance principles and best practices.
2.1.2.2.2 Hardware Planning
 Device Selection
After receiving a device selection requirement, conduct industrial comparison and
communication to determine the appropriate technical standards. Then, select more
than one supplier to perform device tests. Finally, output the device selection report
to provide technical basis for device procurement and acceptance.
The key principles of device selection are high product quality and cost-effectiveness. Storage devices are selected based on capacity, throughput, and IOPS. Business requirements vary with scenarios, so costs must also be considered during the evaluation.
In actual applications, device selection may involve other indicators, including but not limited to the following:
 Reliability: The main components of the equipment should have a long mean time between failures (MTBF).
 Maintainability: The equipment should be easy to repair once a fault occurs.
 Operability: The man-machine interface of the equipment should be user-friendly and easy to operate.
 Cutting-edge technologies: On the premise of meeting the requirements, the technologies adopted by the equipment should be kept at an advanced level to prolong its technical lifecycle.
 Compatibility: The compatibility with legacy devices must be considered.
 Cost-effectiveness: The equipment should have a reasonable price and controllable cost, including the initial purchase cost, power consumption cost, and maintenance cost.
 Energy conservation and environmental protection: The equipment should have low power consumption and be environmentally friendly.
 Brand reputation: In practice, brand reputation is also used as a reference for device selection.

 Compatibility Check
Check the compatibility based on the host operating system version, host
multipathing information, host application system information, Huawei storage
version, storage software, and multipathing software information provided by the
customer.
Use Huawei storage interoperability navigator to query the compatibility between
storage systems and application servers, switches, and cluster software, and evaluate
whether the live network environment meets the storage compatibility requirements.
If Huawei storage devices are used, you are advised to use Huawei storage
interoperability navigator to plan compatibility.

2.1.2.2.3 Network Planning


Network planning is a process of analyzing requirements based on the network (storage
network) construction objectives, designing the logical and physical structures of the
network based on the requirement analysis result, and preparing technical documents for
network installation and configuration.
Network planning requires iteration and optimization. It is difficult to meet all future
requirements through one-time planning and design.
During network planning, it is critical to draw the network topology. The networking
modes of hybrid flash storage and all-flash storage are as follows, while the networking
modes of distributed storage are more complex.
1. Direct-connection network: An application server is connected to different controllers
of a storage system to form two paths for redundancy. The path between the
application server and the LUN's controller is the optimum one and the other path is
standby.
2. Single-switch network: Switches can increase the number of host ports to improve
the access capability of the storage system. Moreover, switches extend the
transmission distance by connecting remote application servers to the storage
system. As only one switch is available in this mode, a single point of failure may
occur. There are four paths between the application server and storage system. The
two paths between the application server and LUN's controller are the optimum
paths, and the other two paths are standby. In normal cases, the two optimum paths
are used for data transmission. If one optimum path is faulty, UltraPath selects the
other optimum path for data transmission. If both paths are faulty, UltraPath uses
the two standby paths for data transmission. After an optimum path recovers,
UltraPath switches data transmission back to the optimum path again.
3. Dual-switch networking: With two switches, single points of failure can be prevented,
boosting the network reliability. There are four paths between the application server
and storage system. The UltraPath software works in the same way as that in the
multi-path single-switch networking mode.
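The optimum/standby behavior described in these networking modes can be approximated by a simple selection rule. This is a simplified model for illustration, not UltraPath's actual policy; the path attributes used here are assumptions.

```python
def pick_paths(paths):
    """Use all healthy optimum paths; only when none remain,
    fall back to the healthy standby paths."""
    optimum = [p for p in paths if p["optimum"] and p["up"]]
    if optimum:
        return optimum
    return [p for p in paths if p["up"]]

# Single-switch networking example from above: four paths, two optimum
# (toward the LUN's owning controller) and two standby.
paths = [
    {"name": "P1", "optimum": True,  "up": True},
    {"name": "P2", "optimum": True,  "up": True},
    {"name": "P3", "optimum": False, "up": True},
    {"name": "P4", "optimum": False, "up": True},
]
assert [p["name"] for p in pick_paths(paths)] == ["P1", "P2"]

paths[0]["up"] = paths[1]["up"] = False   # both optimum paths fail
assert [p["name"] for p in pick_paths(paths)] == ["P3", "P4"]
```

When an optimum path recovers (its "up" flag returns to True), the same rule naturally switches transmission back to it, mirroring the failback behavior described above.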
When designing the network topology, you may need to consider the interconnection
between different storage networks or even between sites. In addition, IP address
planning, network port planning, and VLAN planning must be considered during network
planning.
2.1.2.2.4 Service Planning
 Basic Service Planning
When planning basic services, use different planning items for block services and file
services. The planning of basic services can be divided into storage resource planning
and access permission planning.
Using Huawei all-flash storage devices as an example, when planning block services,
you need to consider storage pools, LUNs, mapping views, and management users.
To prevent misoperations from affecting the stability of the storage system and the
security of service data, the storage system sets user roles (and levels) to control the
operation permissions and scope of users. When planning local users of a storage
system, you should properly plan and allocate roles and levels to different users.

Using Huawei all-flash storage as an example, when planning file services, you need
to consider storage pools, file systems, network planning, and NFS/CIFS shares.
In addition, you need to consider the planning of the host side (queue depth and I/O
alignment), network side (switch configuration, zone or VLAN division), and
application side (database parameters, files, and file groups) based on the
application type.
 Advanced Feature Planning
Plan and design advanced storage features based on customer requirements and
purchased advanced features.
In addition, some advanced features must be planned and designed before being
used.
For example, before creating a remote replication task, you need to plan the network
and data. Remote replication involves the primary and secondary storage systems.
Therefore, before creating a remote replication task, you need to plan the
networking mode of remote replication and replication links between storage
systems. Data planning includes capacity planning and bandwidth planning. The
requirements of remote replication for the system capacity and network bandwidth
must be considered.
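As a worked example of the bandwidth planning mentioned above, a rough link-sizing rule divides the daily changed data by the replication window. The 30% overhead margin below is an assumed figure for protocol overhead and bursts, not a Huawei recommendation.

```python
def replication_bandwidth_mbs(daily_change_gb, window_hours, margin=1.3):
    """Required link bandwidth (MB/s) so that one day's changed data
    fits within the replication window. margin is an assumed allowance
    for protocol overhead and traffic bursts."""
    return daily_change_gb * 1024 / (window_hours * 3600) * margin

# 500 GB of daily changes replicated within an 8-hour window:
print(round(replication_bandwidth_mbs(500, 8), 1))  # 23.1 MB/s
```

Capacity planning works the same way in reverse: the secondary site must hold the replicated data plus any snapshot or journal space the replication feature consumes.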
 Solution Planning
Understand common solution types and plan and design solutions based on user
requirements.

2.1.3 Application Cases


2.1.3.1 All-Flash Storage Planning and Design
2.1.3.1.1 Planning Available Capacity
The available capacity of storage systems must be properly planned to ensure sufficient
capacity for service data.
You are advised to use eDesigner to plan the available capacity of Huawei centralized
storage devices and calculate the capacity to be purchased.
When planning the available capacity, you need to consider factors including but not
limited to the nominal capacity of a single disk, hot spare capacity, and RAID usage.
Disk type: A disk type in a disk domain corresponds to a storage tier of a storage pool. If
the disk domain does not have a specific disk type, the corresponding storage tier cannot
be created for a storage pool.
Nominal capacity: The disk capacity defined by the vendor is different from that defined
by the operating system. As a result, the nominal capacity of a disk is different from that
displayed in the operating system.
 Disk capacity defined by disk vendors: 1 GB = 1000 MB, 1 MB = 1000 KB, and 1 KB =
1000 bytes
 Disk capacity calculated by operating systems: 1 GB = 1024 MB, 1 MB = 1024 KB, and
1 KB = 1024 bytes
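The two definitions above explain why a disk's nominal capacity differs from what the operating system reports; the conversion is straightforward:

```python
def nominal_to_os_gib(nominal_gb):
    """Convert vendor-rated capacity (decimal GB, 1000**3 bytes each)
    to the capacity the operating system reports (binary GB, 1024**3
    bytes each)."""
    return nominal_gb * 1000**3 / 1024**3

# A disk sold as 4 TB (4000 GB nominal) appears as about 3725.3 GB in the OS.
print(round(nominal_to_os_gib(4000), 1))  # 3725.3
```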

Hot spare capacity: The storage system provides hot spare space to take over data from
failed member disks.
RAID usage: indicates the capacity used by parity data at different RAID levels.
Disk bandwidth performance: The total bandwidth provided by the back-end disks of a
storage device is the sum of the bandwidth provided by all disks. The minimum value is
recommended during device selection.
RAID level: A number of RAID levels have been developed, but just a few of them are still
in use.
I/O characteristics: Write operations consume most of disk resources. The read/write ratio
describes the ratio of read and write requests. The disk flushing ratio indicates the ratio
of disk flushing operations when the system responds to read/write requests.
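Putting the hot spare and RAID usage factors together, a back-of-the-envelope estimate of available capacity can be sketched as below. This simplification ignores metadata overhead and the decimal/binary conversion, so eDesigner remains the authoritative tool; the 8+2 RAID 6 layout in the example is an assumption.

```python
def available_capacity_tb(disks, disk_tb, hot_spares, data_cols, parity_cols):
    """Rough available capacity: subtract the hot spare disks, then apply
    the RAID data-to-total ratio (e.g. RAID 6 at 8+2 keeps 8/10 of raw
    space for data). A planning estimate only."""
    raw_tb = (disks - hot_spares) * disk_tb
    return raw_tb * data_cols / (data_cols + parity_cols)

# 25 x 4 TB disks, the capacity of one disk reserved as hot spare,
# RAID 6 at 8+2: (25 - 1) * 4 * 8 / 10 = 76.8 TB of estimated usable space.
print(available_capacity_tb(25, 4, 1, 8, 2))  # 76.8
```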
2.1.3.1.2 Project Information
 Collecting Application Information
During engineering survey, you are advised to collect application information,
including but not limited to data disaster recovery, virtualization platform, and
database platform. The following table is an example of data disaster recovery
information.

Data disaster recovery and backup information (fill in these fields only if disaster recovery is involved):
 Service type: Type of the service requiring disaster recovery, such as databases and streaming media
 Data volume: Total amount of service data requiring disaster recovery and periodic data increments (unit: TB)
 Link: Disaster recovery link type (FC/iSCSI) and bandwidth (unit: MB/s)
 Distance: Network distance from the disaster recovery center to the production center (unit: km)
 Network construction method: Self-constructed network or leased lines
 RTO: Expected shortest period from service interruption to service recovery
 RPO: Longest period allowing data loss
 Backup software: Server's native backup software name and version
 Others: Whether it is necessary to connect to third-party arrays; whether user data needs to be migrated between the existing and new storage systems

The collection items related to the virtualization platform and database platform are
more complex. For details, see Installation and Initialization > Site Planning
Guide > Collecting Live Network Information in the corresponding product
documentation.
 Collecting Compatibility Evaluation Information
During the engineering survey, you are advised to collect compatibility evaluation
information, including but not limited to software and hardware information used for
SAN function compatibility evaluation and backup software compatibility evaluation.

Compatibility Evaluation Information

| Category | Item |
| --- | --- |
| Software and hardware information used for SAN function compatibility evaluation | Host and host model |
| | OS type and version |
| | Patch version |
| | HBA type and model (FC/iSCSI) |
| | Number of HBAs |
| | Multipathing software name and version |
| | Switch vendor and model |
| | Maximum rate of switch and firmware version |
| | HostAgent version |
| | Vendor and model of the cascaded switch (if involved) |
| | Maximum rate and firmware version of the cascaded switch (if involved) |
| | (Optional) Cluster software and version (filled in based on client installation conditions) |
| | Volume management software |
| | File system format/Raw device |
| | Back-end storage model (if involved) |
| | Back-end storage version (if involved) |
| Software and hardware information used for backup software compatibility evaluation | Storage product model and version |
| | Backup and archiving software |
| | Storage product role (example: data sources or medium for backup and archiving) |
| | Backup and archiving networking mode |
| | Tape library model |
| | Tape library drive model and quantity |
| | Firmware version of the tape library drive |
| | Model and firmware version of the Fibre Channel switch connected to the tape library |
| | Type of the interface used by the tape library |
| | Operating system of the archiving and backup server (name, patch, kernel, and bits) |
| | Operating system of the archiving and backup client (name, patch, kernel, and bits) |

2.1.3.1.3 Hardware Planning


 Layout Planning
Take Huawei all-flash storage as an example. In the delivery scenario where storage
devices are not delivered by cabinet, you need to check whether the cabinet has
sufficient space for installing storage devices and select a cabinet with sufficient
depth based on storage device installation conditions before planning the network.
Before installing a storage system, work out an appropriate device layout plan to
ensure that the storage devices operate properly.
 Cascading Solution Planning
Take Huawei all-flash storage as an example. Observe the principles of cascading
disk enclosures and then cascade disk enclosures in the correct way. In the rear view
of a disk enclosure, expansion module A is located on the upper part and expansion
module B is located on the lower part. The expansion modules are labeled on the
rear of the disk enclosure.
The cascading rules of Huawei all-flash storage are as follows:
1. SAS disk enclosures and smart disk enclosures cannot be mixed in the same
expansion loop.
2. If more than two disk enclosures are cascaded, you are advised to set up
multiple loops based on the number of available ports. Try to configure the
same number of disk enclosures for each loop. For smart disk enclosures, you are
advised to configure the same number of loops on each RDMA interface module
based on the number of RDMA interface modules on each controller enclosure.
3. The number of disk enclosures connected to the expansion ports on the
controller enclosure and the number of disk enclosures connected to the back-
end ports cannot exceed the upper limit.
4. The expansion modules in the controller enclosure's slot H connect to each disk
enclosure's expansion module A, and those in slot L connect to each disk
enclosure's expansion module B.
5. A pair of SAS ports can cascade up to two SAS disk enclosures. One disk
enclosure is recommended.
6. A pair of RDMA ports can cascade up to two smart disk enclosures. One disk
enclosure is recommended.
7. Storage pools created in the storage system support disk-level redundancy policy
(common RAID mode) and enclosure-level redundancy policy (cross-enclosure
RAID mode).
 When creating a storage pool using the enclosure-level redundancy policy,
ensure that the disks in the storage pool come from at least four disk
enclosures and each disk enclosure houses at least three disks of each
capacity type.
 When a loop connects to only one SAS disk enclosure or smart disk
enclosure, the disk-level or enclosure-level redundancy policy can be
configured for disks in the disk enclosure.
 When two SAS disk enclosures are cascaded in a loop: If the forward
redundancy connection is used, the disk-level or enclosure-level redundancy
policy can be configured for disks in the level-1 SAS disk enclosure, and only
the disk-level redundancy policy can be configured for disks in the level-2
SAS disk enclosure. If the forward and backward redundant connections are
used, the disk-level or enclosure-level redundancy policy can be configured
for the disks in the two SAS disk enclosures in the loop.
 When two smart disk enclosures are cascaded in a loop, the configurable
redundancy policies are the same for the forward redundant connection and
for the forward and backward redundant connections: the disk-level or
enclosure-level redundancy policy can be configured for disks in the level-1
smart disk enclosure in the loop, and only the disk-level redundancy policy
can be configured for disks in the level-2 smart disk enclosure.
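The enclosure-level redundancy prerequisite stated in rule 7 can be expressed as a simple check. The helper below is an invented sketch (not a Huawei API): a storage pool qualifies only if it draws disks from at least four disk enclosures, with at least three disks of each capacity type per enclosure.

```python
# Sketch of the enclosure-level redundancy prerequisite: >= 4 enclosures,
# and >= 3 disks of each capacity type in every contributing enclosure.

def enclosure_redundancy_ok(enclosures):
    """enclosures: list of dicts mapping capacity type -> disk count,
    one dict per disk enclosure contributing to the pool."""
    if len(enclosures) < 4:
        return False
    return all(count >= 3 for enc in enclosures for count in enc.values())

pool = [{"3.84TB": 3}, {"3.84TB": 3}, {"3.84TB": 4}, {"3.84TB": 3}]
print(enclosure_redundancy_ok(pool))      # True
print(enclosure_redundancy_ok(pool[:3]))  # False: only three enclosures
```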
2.1.3.1.4 Network Planning
 Front-End Network Planning
On a dual-link direct-connection network, an application server is connected to two
controllers of the storage system to form two paths for redundancy. The path
between the application server and the LUN's controller is the optimum one and the
other path is standby. In normal cases, UltraPath selects the optimum path for data
transfer. If the optimum path is down, UltraPath selects the standby path for data
transfer. After an optimum path recovers, UltraPath switches data transfer back to
the optimum path again. The dual-link direct-connection network is the simplest and
most cost-effective connection mode of the storage network.
The multi-link single-switch networking mode adds one switch on the basis of dual-
link direct connection, improving data access and forwarding capabilities. A switch
expands host ports to improve the access capability of the storage system. Moreover,
switches extend the transmission distance by connecting remote application servers
to the storage system. Since only one switch is available in this networking mode, it
is vulnerable to single points of failure. There are four paths between the application
server and storage system. The two paths between the application server and LUN's
controller are the optimum paths, and the other two paths are standby.
In multi-link single-switch networking mode, UltraPath selects two optimum paths
for data transfer in normal cases. If one optimum path is faulty, UltraPath selects the
other optimum path for data transmission. If both optimum paths are faulty,
UltraPath uses the two standby paths for data transmission. After an optimum path
recovers, UltraPath switches data transfer back to the optimum path again.
Multi-link dual-switch networking adds one switch on the basis of multi-link single-
switch networking to offer dual-switch forwarding. With two switches, the network is
protected against single points of failure, which improves the network reliability.
There are four paths between the application server and storage system. The
UltraPath software works in the same way as that in the multi-link single-switch
networking mode.
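The path-selection behavior described above can be illustrated with a small sketch. The data model below is invented for illustration (this is not UltraPath code): healthy optimum paths are preferred, standby paths take over only when no optimum path is up, and recovery of an optimum path automatically restores the preferred selection.

```python
# Illustrative sketch of UltraPath-style path selection: prefer healthy
# optimum paths (those reaching the LUN's owning controller); fall back
# to standby paths only when no optimum path is available.

def select_paths(paths):
    """paths: list of dicts like {"name": "opt-1", "optimum": True, "up": True}.
    Returns the paths that would carry data transfer."""
    optimum_up = [p for p in paths if p["optimum"] and p["up"]]
    if optimum_up:
        return optimum_up
    return [p for p in paths if p["up"]]  # standby paths take over

# Multi-link single-switch example: two optimum and two standby paths,
# with one optimum path already failed.
paths = [
    {"name": "opt-1", "optimum": True,  "up": False},
    {"name": "opt-2", "optimum": True,  "up": True},
    {"name": "stb-1", "optimum": False, "up": True},
    {"name": "stb-2", "optimum": False, "up": True},
]
print([p["name"] for p in select_paths(paths)])  # ['opt-2']
```

Re-running the selection after marking `opt-2` down would return both standby paths, mirroring the failover behavior described in the text.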
 Storage Port Planning
Storage ports are planned based on device models. For details, see the product
documentation of the corresponding device.
Planning Ethernet ports: Ethernet ports are physically visible on a device. They are the
basis for creating VLANs, bound ports, and logical ports. You can bind multiple
Ethernet ports into one port to improve bandwidth and data transfer efficiency.
Planning bound ports: Bind multiple Ethernet ports and specify the bound port name
for higher bandwidth and better redundancy. Port binding provides more bandwidth
and higher redundancy for links. Although ports are bound, each host still transmits
data through a single port. Therefore, the total bandwidth can be increased only
when there are multiple hosts. Determine whether to bind ports based on site
requirements. After ports are bound, their MTU changes to the default value. In
addition, you need to configure the port mode of the switch. Take Huawei switches
as an example. You must set the ports on a Huawei switch to work in static LACP
mode. The link aggregation modes vary with switch manufacturers. If a switch from
another manufacturer is used, contact technical support of the switch manufacturer
for specific link aggregation configurations. In addition, the port binding mode of the
storage system has some restrictions. For example, read-only users are not allowed
to bind Ethernet ports, and management network ports do not support port binding.
Pay attention to these restrictions when planning port binding.
Planning VLANs: VLANs logically divide the Ethernet port resources of the storage
system into multiple broadcast domains. In a VLAN, when service data is being sent
or received, a VLAN ID is configured for the data, so that the networks and services
of VLANs are isolated, further ensuring service data security and reliability. VLANs are
created based on Ethernet ports or bound ports. One physical port can belong to
multiple VLANs. A bound port instead of one of its member ports can be used to
create a VLAN. The VLAN ID ranges from 1 to 4094. You can enter a single VLAN ID or
enter VLAN IDs in batches.
Planning logical ports: Logical ports are created based on physical Ethernet ports,
bound ports, or VLANs and used for service operation. A physical port can be
configured with logical ports on the same network segment or different network
segments. Different physical ports on the same controller can be configured with
logical ports on the same network segment or different network segments. Logical
ports can be created based on Ethernet network ports, bound ports, or VLAN ports
only if these ports are not configured with any IP addresses. If logical ports are
created based on Ethernet network ports, bound ports, or VLAN ports, the Ethernet
network ports, bound ports, or VLAN ports can only be used for one storage service,
such as block storage service or file system storage service. When creating a logical
port, you need to specify an active port. If the active port fails, another standby port
will take over the services.
 Planning IP Address Failover
You need to plan policies for IP address failover to meet service requirements. The
planning items of storage ports and IP address failover have been provided in the
training materials. This section introduces IP address failover.
IP address failover: A logical IP address fails over from a faulty port to an available
port. In this way, services are switched from the faulty port to the available port
without interruption. After the faulty port recovers, it takes the services back.
During the IP address failover, services are switched from the faulty port to an
available port, ensuring service continuity and improving the reliability of paths for
accessing file systems. Users are not aware of this process. The essence of IP address
failover is a service switchover between ports. The ports can be Ethernet ports, bound
ports, or VLAN ports. IP addresses can float not only between logical ports that are
created based on Ethernet ports but also between logical ports that are created
based on other ports.
When a controller is faulty, the port selection priority for IP address failover is as
follows: ports in the quadrant of the peer controller > ports in the quadrants of other
controllers (in the sequence of quadrants A, B, C, and D). The following figure shows
the IP address failover upon a port fault.
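The priority order above can be sketched as follows. The port and quadrant representation is a hypothetical model invented for illustration, not the storage system's actual implementation:

```python
# Sketch of the failover priority described above: when a controller fails,
# prefer ports in the peer controller's quadrant, then ports in the
# remaining quadrants in the order A, B, C, D.

def failover_candidates(ports, peer_quadrant):
    """ports: list of (port_name, quadrant) tuples for available ports.
    Returns the ports ordered by failover preference."""
    order = [peer_quadrant] + [q for q in "ABCD" if q != peer_quadrant]
    rank = {q: i for i, q in enumerate(order)}
    return sorted(ports, key=lambda p: rank[p[1]])

ports = [("P0", "A"), ("P1", "B"), ("P2", "C"), ("P3", "D")]
# If the failed controller's peer sits in quadrant B, its port is tried first.
print(failover_candidates(ports, "B"))  # [('P1', 'B'), ('P0', 'A'), ('P2', 'C'), ('P3', 'D')]
```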
Before planning an IP address failover policy, you need to understand the meaning of
a failover group. Failover groups are categorized into system failover groups, VLAN
failover groups, and customized failover groups.
1. System failover group (Ethernet port and bond port failover group): The storage
system automatically adds all Ethernet ports and bound ports of the cluster to
the system failover group.
2. VLAN failover group: When VLANs are created, the storage system automatically
adds VLANs with the same ID to a failover group, that is, each VLAN ID
corresponds to a failover group.
3. Customized failover group: You can add ports whose IP addresses can float to a
customized failover group. The port failover policy in the customized group does
not require a symmetric network. During the failover, select the same type of
ports based on the order in which the ports are added to the customized failover
group. If such ports are unavailable, select available ports of other types based
on the preceding order.
2.1.3.1.5 Service Planning
 Permission Planning
To prevent misoperations from compromising the storage system stability and service
data security, the storage system defines user levels and roles to determine user
permission and scope of permission.
1. User level: controls the operation or access permissions of users. However, not all
storage products adopt the concept of user level. For details, see the
corresponding product documentation. If the permissions of an administrator or
a read-only user on a device do not meet actual requirements, for example, if a
user's access permissions need to be upgraded to management permissions or a
user's management permissions need to be downgraded to access permissions,
the super administrator can adjust the permission level of that user.
2. User role: defines the scope of operations that a user can perform or the objects
that a user can access. For details, see the corresponding product
documentation. The storage system provides two types of roles: preset roles and
custom roles.
The following table uses Huawei all-flash storage as an example to describe some
preset roles and their permissions.

| Preset Role | Permissions |
| --- | --- |
| Super administrator | All permissions over the system |
| Administrator | All permissions except user management, security configuration, batch operations, and high-risk maintenance operations |
| Security administrator | System security configuration permissions, including security rule management, certificate management, KMC management, and data destruction management |
| SAN resource administrator | System SAN resource management permissions, including storage pool management, LUN management, mapping view management, host management, port management, and background configuration task management |
| Data protection administrator | Data protection management permissions, including LUN management, local data protection management, remote data protection management, HyperMetro management, and background configuration task management |
| Remote device administrator | Cross-device data protection management permissions, including remote replication management, HyperMetro management, 3DC management, LUN management, and mapping view management. This role is used for remote authentication in cross-device data protection scenarios. |
| Monitor administrator | Routine O&M permissions, such as information collection, performance collection, and inspection. This role does not have permission to manage SAN resources, data protection, and security configuration. |
| Non-privileged administrator | Basic system permissions, including the permissions to query system information, user information, and role information. This role can be used to query information only on the CLI management page. It is displayed as Empty role on the CLI management page. |

In the Huawei storage system, after logging in to DeviceManager, you can choose
Settings > User and Security > Users and Roles > Role Management to view the
permissions and operation scope of the current account.
 Allocating and Using Space (Block Service)
Take Huawei all-flash storage as an example. When planning space allocation and
usage, you need to consider storage pools, LUNs, and mapping views.
A storage pool is a container that stores storage space resources. To better utilize the
storage space of a storage system, you need to properly plan the redundancy policy,
RAID policy, and hot spare policy of a storage pool based on actual service
requirements.
When creating a storage pool, you can set an alarm threshold for capacity usage. The
default threshold is 80%. Capacity alarm is particularly important when thin LUNs
are used. Set a proper alarm threshold based on the data growth speed to prevent
service interruption due to insufficient capacity of the storage pool.
The storage system supports two redundancy policies: disk-level redundancy and
enclosure-level redundancy. The enclosure-level redundancy policy can be set only for
new storage pools. You cannot modify the redundancy policy (disk-level or enclosure-
level redundancy) for a storage pool that has been created.
 Disk redundancy: Chunks in a chunk group come from different SSDs. With this
redundancy policy used, the system can tolerate disk failures within the RAID
redundancy capacity.
 Enclosure redundancy: Chunks in a chunk group come from different SSDs and
are distributed in different enclosures if possible. In addition, the number of
chunks in each enclosure does not exceed the RAID redundancy. This policy
enables the system to tolerate a single disk enclosure failure without service
interruption or data loss.
Disk redundancy delivers RAID usage equal to or higher than that of enclosure
redundancy, while enclosure redundancy provides higher reliability.
 RAID usage: At the same RAID level, the RAID usage of the disk-level
redundancy policy is greater than or equal to that of the enclosure-level
redundancy policy. For example, the storage system has four disk enclosures
configured, each of which has five disks configured. Two storage pools
containing all disks with the disk-level redundancy policy and the enclosure-level
redundancy policy are created, respectively. The hot spare policy is set to low
(one disk) and the RAID level is set to RAID-TP. For the storage pool with the
disk-level redundancy policy, the number of RAID columns is 19, and the RAID
usage is calculated in the formula: [(19 – 3)/19] x 100% = 84.21%. For the
storage pool with the enclosure-level redundancy policy, the number of RAID
columns is 7, and the RAID usage is calculated in the formula: [(7 – 3)/7] ×
100% = 57.14%.
 Reliability: Enclosure redundancy can tolerate an enclosure failure while disk
redundancy cannot. Enclosure redundancy provides a higher reliability than disk
redundancy.
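The RAID usage figures quoted above follow from a single formula, which can be checked as follows. The parity-column counts (1 for RAID 5, 2 for RAID 6, 3 for RAID-TP) are standard:

```python
# Usable fraction of a RAID stripe: (data columns) / (total columns).

def raid_usage(columns: int, parity: int) -> float:
    """RAID usage as a percentage, given stripe width and parity columns."""
    return (columns - parity) / columns * 100

# Disk-level redundancy, RAID-TP, 19 columns: [(19 - 3)/19] x 100%
print(round(raid_usage(19, 3), 2))  # 84.21
# Enclosure-level redundancy, RAID-TP, 7 columns: [(7 - 3)/7] x 100%
print(round(raid_usage(7, 3), 2))   # 57.14
```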
When planning RAID policies, note that different RAID levels have similar impact on
performance. As the redundancy level increases, the performance slightly
deteriorates.
The performance of different I/O models (random/sequential read/write) is different.
Random read I/O performance is equivalent to sequential read I/O performance.
Sequential write I/O performance is superior to random write I/O performance.
In the storage planning and deployment stage, the most suitable RAID level should
be selected based on service requirements, and the performance differences, space
utilization, and reliability should also be considered. Recommended RAID policy
configuration: RAID-TP is recommended for core services (such as billing systems of
carriers or financial A-class online transaction systems). For non-core services, RAID 6
or RAID 5 is recommended.
In addition, in the enclosure-level redundancy policy, the storage system provides two
protection levels: RAID 6 and RAID-TP.
For hot spare policy planning, note that Huawei all-flash storage system adopts RAID
2.0+ underlying virtualization technology. The hot spare space is distributed on each
member disk in the storage pool. For ease of understanding, the hot spare space is
represented by the number of hot spare disks on DeviceManager. If spare disks are
sufficient, the default number of hot spare disks is 1, which can meet the
requirements of most scenarios. Even if the hot spare space is used up, the storage
system can still use the free space of the storage pool for reconstruction to ensure
the reliability. If the remaining capacity of a storage pool is about to be used up and
the storage system is configured with the data protection feature, replace the faulty
disk in a timely manner. If the total number of hard disks is less than 13, the number
of hot spare disks must meet the following condition: Total number of hard disks –
Number of hot spare disks ≥ 5.
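The hot spare constraint above can be expressed as a small check. This is a sketch of the stated rule only; the guide places no such lower-bound condition on pools of 13 or more disks:

```python
# Sketch of the hot-spare rule: with fewer than 13 disks in total, at
# least 5 non-spare member disks must remain in the pool.

def hot_spare_count_valid(total_disks: int, hot_spares: int) -> bool:
    if total_disks < 13:
        return total_disks - hot_spares >= 5
    return True  # no such constraint stated for larger pools

print(hot_spare_count_valid(8, 3))  # True: 8 - 3 = 5 member disks remain
print(hot_spare_count_valid(8, 4))  # False: only 4 member disks remain
```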
To achieve the optimal performance of the storage system, you need to select a
proper policy for LUNs based on the actual data storage situation. The key
parameters include the application type, number of LUNs, and LUN capacity.
The planning of the application type is closely related to the actual application type.
Take the SQL Server OLAP application as an example. You are advised to select
SQL_Server_OLAP. The corresponding policy is that the application request size is 32
KB and SmartCompression is enabled. The storage system compresses all written
data. SmartDedupe is disabled. The default deduplication granularity is 8 KB. Each 8
KB page in the SQL Server database contains a header with a unique field. Therefore,
enabling SmartDedupe does not reduce data storage space. You are not advised to
enable SmartDedupe.
Regarding the capacity and number of LUNs, different from traditional RAID groups
with 10 to 20 disks, a storage pool that leverages the RAID 2.0+ mechanism creates
LUNs using all disks in a disk domain, which can have dozens of or even more than
100 disks. To bring disk performance into full play, it is recommended that you
configure LUN capacity and quantity based on the following rules:
Total number of LUNs in a disk domain ≥ Number of disks x 4/32 (4 is the proper
number of concurrent access requests per disk, and 32 is the default maximum
queue depth of a LUN). You are advised to create 8 to 16 LUNs to store SQL Server
data files and 2 to 8 LUNs to store tempdb files. This configuration provides a
maximum capacity of 32 TB for a database. If you need a larger capacity, change the
number of LUNs as required.
In addition, it is recommended that the capacity of a single LUN be less than or equal
to 2 TB and, within that limit, as large as possible to reduce management overheads.
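The LUN-count rule above is easy to apply with a helper function. Per the guide, 4 is the suggested number of concurrent access requests per disk and 32 the default maximum queue depth of a LUN; the function name is illustrative:

```python
import math

# Minimum LUNs needed to keep all disks in a disk domain busy:
# Total LUNs >= Number of disks x 4 / 32, rounded up.

def min_lun_count(disk_count: int, per_disk_concurrency: int = 4,
                  lun_queue_depth: int = 32) -> int:
    return math.ceil(disk_count * per_disk_concurrency / lun_queue_depth)

print(min_lun_count(96))   # 96 x 4 / 32 = 12 LUNs
print(min_lun_count(100))  # 12.5, rounded up to 13 LUNs
```

For a 96-disk domain, the 8 to 16 data-file LUNs recommended for SQL Server comfortably satisfy this minimum.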
A mapping view defines logical mappings among LUNs, array ports, and host ports.
You are advised to create a mapping view based on the following rules:
A LUN group is an object designed to facilitate LUN resource management. Typically,
LUNs that serve the same service should be added to the same LUN group.
A host group is a set of hosts that share storage resources. Each host contains
multiple initiators (host ports). You are advised to create a host for each server and
add all initiators of the server to the host.
Port groups help you allocate storage ports in a more fine-grained manner. Port
groups are not mandatory. However, it is recommended that you allocate a port
group for the SQL Server application to facilitate O&M and reduce performance
impacts between applications. To prevent single points of failure, a port group must
contain at least one port on each controller.
 Allocating Space (File Service)
Using Huawei all-flash storage as an example, when planning space allocation and
usage, you need to consider the storage pools, file systems, and shares.
The planning for a storage pool is similar to that for block services.
The following table lists the parameters that need to be considered during file
system planning.

| Parameter | Description |
| --- | --- |
| Name | Name of a file system. |
| Tenant | vStore to which the file system belongs. |
| Storage pool | Storage pool to which the file system belongs. |
| Security mode | NTFS is applicable to scenarios where CIFS user permissions are controlled by Windows NT ACL. UNIX is applicable to scenarios where the permissions of NFS users are controlled by Unix Mode, NFSv4 ACL, or NFSv3 ACL. |
| Capacity | Capacity of the file system to be planned, which indicates the maximum capacity allocated to the thin file system. That is, the total capacity dynamically allocated to the thin file system cannot exceed this value. |
| Capacity Alarm Threshold (%) | Alarm threshold of the planned file system capacity. An alarm will be generated when the threshold is reached. |
| Application Type | Application type of the file system. The storage system presets application types for typical application scenarios. In file service scenarios, the possible options are NAS_Default, NAS_Virtual_Machine, NAS_Database, NAS_Large_File, and Office_Automation. The Application Request Size and File System Distribution Algorithm parameters are set for preset application types. The value of Application Request Size is 16 KB for NAS_Default, NAS_Virtual_Machine, and Office_Automation, 8 KB for NAS_Database, and 32 KB for NAS_Large_File. If Application Type is set to NAS_Default, NAS_Large_File, or Office_Automation, File System Distribution Algorithm is Directory balance mode, in which directories are evenly allocated to each controller by quantity. If Application Type is set to NAS_Virtual_Machine or NAS_Database, File System Distribution Algorithm is Performance mode, in which directories and files are preferentially allocated to the access controller to improve their access performance. When SmartCompression and SmartDedupe licenses are imported to the system, the preset application types also display whether SmartCompression and SmartDedupe are enabled. Application Type cannot be changed once configured. You are advised to set the value based on the service I/O model. |
| SmartCache Partition | Indicates whether to add the file system to a SmartCache partition. Adding a file system to a SmartCache partition shortens the response time for reading the file system. |
| Share | Specifies the NFS or CIFS shares of the file system. |
| Snapshot Directory Visibility | Indicates whether the directory of the file system snapshots is visible. |
| Auto Atime Update | Determines whether to enable the function of automatically updating Atime. Atime indicates the last file system access time. After this function is enabled, Atime is updated every time data in the file system is accessed. |

2.1.3.2 Distributed Storage Planning and Design


2.1.3.2.1 Project Survey
The cover of the Survey Report must be completed during engineering survey. The basic
information in the Survey Report includes information about the project, survey, and
personnel.
Perform onsite surveys at the sites that are ready for them. Clarify the
requirements for supporting facilities, such as the power supply, ground bar, air
conditioner, cable tray, and raised floor, to the customer in advance. Determine device
positions, cabling, and reserved space for expansion jointly with the customer. Consult
the customer for required data, including the building parameters of equipment rooms,
the distance between equipment rooms, and the available number of optical cable cores.
Survey each item carefully and record the data accurately; do not make subjective
assumptions based on experience.
2.1.3.2.2 Node Selection
Select a node type based on the actual application requirements. Note that different
products apply to different node types. For details, see the product documentation of the
corresponding product. The converged services of Huawei intelligent distributed storage
support P100, C100, P110, and C110 nodes.
2.1.3.2.3 Network Planning (Block Service)
The network planning details vary with distributed storage products. The networking
solution details of the same product also vary according to the services provided by the
product. For details, see the network planning guide of the corresponding product.
The case in the training material is about the converged service networking. The relevant
content is only a brief introduction. In actual situations, the networking of converged
services is complex.
This section supplements the training materials. The block service is used as an example
to describe the network planning of Huawei intelligent distributed storage.
 Network Introduction
1. The service network (iSCSI network) is used for communication between
compute nodes and VBS according to the iSCSI protocol.
2. The BMC network connects to Mgmt ports of storage nodes to enable remote
device management.
3. The management network manages and maintains the storage system.
4. The storage network is used for data communication between VBS and OSD
nodes or between OSD nodes.
 Front-end storage network: communication network between VBS and OSD
nodes.
 Back-end storage network: communication network between OSDs.
 Network shared by front-end and back-end storage: data communication
network shared between OSDs as well as between VBS and OSDs.
 Basic Concepts
1. Management node: a server on which the FSM management process is deployed.


It provides O&M functions, such as alarm reporting, monitoring, logging, and
configuration, for the OceanStor 100D block service.
 Management nodes can be deployed on VMs or physical servers.
 Management nodes can be deployed in active/standby mode but cannot be
deployed in standalone mode.
2. Storage node: a server that provides storage resources.
3. Compute node: a server that runs application systems.
4. Control Virtual Machine (CVM): When the VMware vSphere platform is used,
extra hosts or VMs are required to deploy the VBS service because VBS cannot
be deployed directly on ESXi hosts. The Linux VMs on ESXi hosts where VBS is
deployed are called CVMs.
 Physical Network Requirements
1. Management node
 Management nodes can be deployed on VMs or physical servers.
 Management nodes can be deployed in active/standby mode.
 Management network:
 A management network port can be a GE or 10GE network port.
 The number of management network ports can be one or two. You are
advised to bind two management network ports in bond1 mode.
 Local link IPv6 addresses starting with FE80 and multicast IPv6 addresses
starting with FF00 cannot be used. HyperMetro and HyperReplication are not
supported when IPv6 addresses are used.
2. Compute node and storage node
The network requirements are the same for compute nodes and storage nodes:
 Management network: the same requirements as for management nodes (GE
or 10GE ports, one or two ports bound in bond1 mode, and the same IPv6
restrictions).
 Service network:
 Port binding is recommended to aggregate network ports between compute
nodes and storage nodes. This improves network redundancy and reliability.
 The service network requirements apply to scenarios where VBS is deployed
on storage nodes.
 If the service network (iSCSI) uses TCP/IP, you are advised to configure the
bond4 mode. Binding ports across NICs is supported.
 Local link IPv6 addresses starting with FE80 and multicast IPv6 addresses
starting with FF00 cannot be used. HyperMetro and HyperReplication are not
supported when IPv6 addresses are used.
 Storage network:
 Port binding is recommended to aggregate network ports between compute
nodes and storage nodes. This improves network redundancy and reliability.
 The storage network supports TCP/IP, IB, and RoCE.
 TCP/IP: Ports can be bound in bond1 or bond4 mode, and bond4 is
recommended. Ports can be bound across NICs. At least two 10GE or 25GE
ports are required.
 IB: Ports can be bound only in bond1 mode. On an IB network, the bond1
mode enables load balancing. Binding ports across NICs is supported. At
least two 56 Gbit/s or 100 Gbit/s ports are required.
 RoCE: P100, C100, and F100 support the bond1 and bond4 modes. P110,
C110, and F110 support only the bond1 mode. Cross-NIC binding is not
supported. At least two 25GE ports are required. RoCEv2 NICs are
recommended for RoCE networks.
 Local link IPv6 addresses starting with FE80 and multicast IPv6 addresses
starting with FF00 cannot be used. HyperMetro and HyperReplication are not
supported when IPv6 addresses are used.

The requirements for the IP address type of each network are as follows:
 When the TCP/IP protocol is used on each network, IPv4 and IPv6 addresses are
supported.
 In the same cluster, the IP address type on the same network must be the same.
 The IP address type of the front-end storage network must be the same as that
of the back-end storage network.
 Deployment mode
The Huawei intelligent distributed block storage service can be deployed in two
modes: converged deployment, in which compute and storage services run on the
same nodes, and separate deployment, in which compute nodes and storage nodes
are different servers.
The networking schemes, node port planning, and switch port planning of the two
deployment modes are different. For details, see the network planning guide of the
corresponding product.
2.2 Storage System Performance Tuning


2.2.1 Overview
 Performance indicators
Before conducting a performance analysis, you must understand performance
indicators. Performance indicators provide information on the operating status of a
storage system. Input/output operations per second (IOPS), bandwidth, and latency
are critical indicators that directly reflect the I/O performance of the current services.
1. IOPS indicates the number of I/Os processed per second. It measures the I/O
processing capability of storage devices and is a major performance indicator for
online transaction processing (OLTP) services and SPC-1 benchmark testing.
2. Bandwidth indicates the data volume that can be processed per second. It is
expressed in MB/s or GB/s. Bandwidth is used to measure storage system
throughput. Bandwidth can also be used to measure system performance for
database online analytical processing (OLAP), Media & Entertainment (M&E),
and video surveillance services.
3. Latency indicates the time interval from when an I/O request is initiated to
when it is completed. It is expressed in milliseconds (ms). Common indicators
include the average response time and maximum response time. For example,
the latency of OLTP services must be less than 10 ms and that for the virtual
desktop infrastructure (VDI) scenario must be less than 30 ms. The latency
requirements of video on demand (VOD) and video surveillance services vary
with bit rates.
4. The fluctuation rate is a derived performance indicator. For services that require
consistent performance, large performance fluctuations can lead to service
faults. Values needed to measure the fluctuation rate include the maximum
value, minimum value, and standard deviation. The fluctuation rate is commonly
calculated as: Standard deviation/Average value x 100%.
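The fluctuation-rate calculation (standard deviation over mean, as a percentage) can be sketched with awk; the five latency samples below are invented purely for illustration:

```shell
# Fluctuation rate = standard deviation / mean x 100%, over five hypothetical
# latency samples in ms; sumsq accumulates squares for the variance
printf '%s\n' 10 12 11 9 13 | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)
    printf "mean=%.2f ms stddev=%.2f ms fluctuation=%.1f%%\n", mean, sd, sd / mean * 100
  }'
```

For these samples it prints mean=11.00 ms stddev=1.41 ms fluctuation=12.9%, i.e. a fairly steady latency series.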
In application scenarios where the I/O size is less than 64 KB, IOPS is the main
performance indicator of storage systems.
In application scenarios where the I/O size is greater than or equal to 64 KB,
bandwidth is the main performance indicator of storage systems.
These performance indicators are related to each other.
1. Relationship between IOPS and bandwidth: Bandwidth = IOPS x Average I/O size.
Small I/Os are often used in IOPS tests, and large I/Os are often used in
bandwidth tests.
2. Relationship between IOPS and latency: for a single outstanding I/O, IOPS =
1000/Average response time (ms). Faster storage response means higher IOPS.
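These relationships can be checked numerically; the figures below (20,000 IOPS, 8 KB I/O size, 0.5 ms latency) are assumed for illustration, not taken from any device:

```shell
# Bandwidth = IOPS x average I/O size; single-stream IOPS = 1000 / latency (ms)
awk 'BEGIN {
  iops = 20000; io_kb = 8; latency_ms = 0.5
  printf "bandwidth = %.2f MB/s\n", iops * io_kb / 1024
  printf "single-stream IOPS = %.0f\n", 1000 / latency_ms
}'
```

This prints bandwidth = 156.25 MB/s and single-stream IOPS = 2000; real workloads reach higher IOPS than the single-stream figure by keeping many I/Os outstanding concurrently.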
 Terms
The following table lists some common terms that are related to performance and
performance issues.

 Slow disk: A slow disk responds to I/Os slowly. Consequently, the read and write
performance of the slow disk is obviously lower than that of other disks.
 Dirty data: New data that has been cached but has not yet been persistently
stored to disks.
 Deduplication: The process in which the system compares, verifies, and deletes
duplicate data on storage devices.
 Write amplification: An unexpected phenomenon in SSDs in which the amount
of physical data actually written is a multiple of the amount of logical data to
be written.
 Garbage collection: The process during which the system copies data from the
valid pages in a flash memory block to a blank block, and then erases the
original block.
 OP space: Over-provisioning (OP) means that an SSD reserves a part of its flash
memory space for internal use. This space cannot be accessed by users, and its
capacity is determined by the controller.

 System workflow and bottlenecks
Logical modules refer to the software. Hardware resources refer to hardware of the
service system. When an I/O request is delivered, the logical modules in the service
system process the request from the top down.
Hardware resources provide support for the normal operation of each logical
module. If a logical module is incorrectly designed or configured, hardware resources
will be exhausted or restricted, or respond slowly. As a result, performance issues
occur.
In performance tuning, you need to know the performance requirements and find
out which hardware resources are the bottleneck and the corresponding causes by
streamlining the I/O process. Then you can tune the performance by adjusting the
designs and configurations of the logical modules.
CPUs are the brain of a system and are the most important resources. CPUs are
consumed by the following applications:
1. Software and hardware interrupts triggered by the running of operating system
kernels and user applications
2. System context switches
3. Iowait
4. VMs
As CPUs are busy, they are apt to become the major bottleneck of a system. The
CPU usage is used to determine whether a CPU has become a bottleneck.
Generally, when the CPU usage is greater than 90%, the CPU has become the
system bottleneck. Many problems may then occur, for example, a high soft
interrupt rate, a high context switch rate, a high I/O wait rate, and long queue
depths. In this case, use a CPU performance analysis tool to analyze and optimize
the applications that cause the CPU bottleneck.
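To sample the aggregate CPU usage discussed above, the counters in /proc/stat can be diffed over an interval. This helper follows the standard Linux /proc/stat field layout; the function name and the optional file parameter are our own, added so the parsing can be exercised against a mock file:

```shell
# read_cpu prints "busy idle" jiffies parsed from the aggregate "cpu" line of a
# /proc/stat-format file: busy = user+nice+system+irq+softirq+steal, and
# idle+iowait counts as idle time
read_cpu() {
  awk '/^cpu / { print $2+$3+$4+$7+$8+$9, $5+$6 }' "${1:-/proc/stat}"
}
set -- $(read_cpu); b1=$1; i1=$2
sleep 1
set -- $(read_cpu); b2=$1; i2=$2
awk -v b=$((b2-b1)) -v i=$((i2-i1)) 'BEGIN { printf "CPU usage: %.1f%%\n", 100*b/(b+i) }'
```

A value that stays above 90% over repeated samples matches the bottleneck condition described above.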
 Performance tuning process and precautions
The tuning process is iterative and progressive. The result of each tuning is used in
subsequent system running. Running performance indicators should be monitored
continuously.
System performance depends on good design; tuning skills are only an auxiliary
means of improving system performance. System stability decreases as the load
increases: each system has a performance inflection point, and when the load
exceeds that threshold, the fluctuation rate increases sharply.
In actual applications, hosts, switching network, and storage devices are involved in
an I/O process. A performance problem may occur on any part of an I/O path.
Performance problems may be caused by host applications, network congestion, or
back-end storage systems.
 Process flow for storage system performance tuning and performance problem
demarcation
From the overall perspective, the layers that affect services include the application
layer, host layer, network and link layer, and storage layer.
Service configuration changes at the host layer are common. If configuration changes
occur, the system performance may be affected. This also applies to the network
layer and storage layer.
With intelligent software and devices, check whether alarms are generated. If yes,
handle them first. This rule also applies to changes. There are intelligent alarming
systems available for software and devices at the application, host, network, and
storage layers. If an alarm is detected, handle that first (see the troubleshooting
presentation). After the alarm is cleared, there is a high probability that the adverse
impact on performance is removed.
This course focuses on performance tuning from the perspective of storage. For
details about how to handle alarms and problems at other layers, see the related
performance tuning guides.
A performance problem may occur on any part of an I/O path. It may be caused by
host applications, network congestion, or back-end storage systems. Service
performance is limited by the performance bottlenecks in a system. It is therefore
important to check all parts of the I/O path to find out where the problem is and
eliminate it.
The common method for preliminarily demarcating a performance problem is to
compare the average latency differences between hosts and storage devices to
determine whether the problem is caused by the host, network and link, or storage
device.
If the read latency on both the host side and the storage side is large and the
difference is small, the problem may be caused by storage devices. The common
possible causes are that the disk performance, disk enclosure performance, and
mirror bandwidth reach the upper limit. If the write latency on both the host side and
the storage side is large, it cannot be determined that the problem occurs on the
storage device because the write latency includes the time for data transmission from
the host to the storage device. In this case, you need to check all possible causes.
If the host latency is significantly greater than the storage latency, the host may be
misconfigured or the network links may be faulty. For example, I/Os may be stuck
due to insufficient concurrency capability of block devices or HBAs, the host CPU
usage may reach 100%, a bandwidth bottleneck may occur, the switch may be
configured improperly, or the multipathing software may select incorrect paths.
After confirming that the problem is caused by the host, storage device, or network,
analyze and rectify the problem.
The training material does not describe how to view the latency. This slide describes
how to view the latency on common operating systems.
 Checking the latency on a Linux host
1. Method 1: Use the iostat tool to query the I/O latency on a Linux host. As shown
in the following figure, await indicates the average time for processing each I/O
request (in ms), that is, the I/O response time.

2. Method 2: Use the Linux performance test tool Vdbench to query the latency. As
shown in the following figure, resp indicates the average time for processing
each I/O request (in ms), that is, the I/O response time.

3. Other methods: You can also use the performance measurement function of the
service software to query the host latency, for example, the AWR report of
Oracle.
 Checking the latency on a Windows host
1. Method 1: Use IOmeter, a common Windows performance test tool, to query the
host latency.

2. Method 2: Use the Windows Performance Monitor to monitor the disk
performance and obtain the host latency.
The Windows Performance Monitor is a performance monitoring tool integrated in
the Windows operating system. It can be used to monitor the performance of the
CPU, memory, disks, network connection, and applications. The following table lists
the disk performance indicators related to latency.

Indicator Sub-Indicator Description

Average time for the


Avg. Disk sec/Transfer storage system to process
each I/O. The unit is ms.

Average time for the


Latency indicators Avg. Disk sec/Read storage system to process
each read I/O.

Average time for the


Avg. Disk sec/Write storage system to process
each write I/O.

The process of using the Windows Performance Monitor to monitor disk performance
and obtain the host latency is as follows:
 On the Windows desktop, choose Start > Run. In the Run dialog box, enter
perfmon to open the performance monitoring tool page.
 In the Performance Monitor window, choose Monitoring Tools > Performance
Monitor from the navigation tree on the left, and then click the Add button to
add performance items.

 In the Add Counters dialog box, select PhysicalDisk, add the performance items
to be monitored, click Add, and click OK.
 The Windows Performance Monitor starts to monitor the disk performance in
real time.

2.2.2 Application-Layer Performance Tuning


Service systems run diversified applications. Based on the I/O characteristics and
performance requirements, the application scenarios are classified into three types:

 OLTP: small blocks (2 KB to 8 KB), random access, 20% to 60% writes, high
concurrency. Performance requirement: high IOPS and low latency.
 OLAP: large blocks (64 KB to 512 KB), multi-channel sequential access, 90%+
reads. Performance requirement: high bandwidth.
 VDI: small blocks (< 64 KB), random access, 80%+ reads. Performance
requirement: high IOPS.
Online transaction processing (OLTP) involves mostly small random I/Os and is sensitive
to response latency. Online analytical processing (OLAP) is the most important
application of the data warehouse system. Most OLAP applications involve multi-channel
large sequential I/Os and are sensitive to bandwidth. VDI involves mostly small random
I/Os with high hit ratios and is sensitive to response latency.
In addition, there is the typical I/O model used by SPC-1. SPC-1 is an authoritative
and well-recognized SAN performance benchmark developed by the Storage
Performance Council (SPC). It uses small random I/Os and is sensitive to IOPS and
latency.
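As an illustration, an OLTP-like load of this shape could be described to Vdbench (the test tool used elsewhere in this guide) roughly as follows; the device path, thread count, and the exact 60/40 read/write split are assumptions, not values from this course:

```
sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=64
wd=wd1,sd=sd1,xfersize=8k,rdpct=60,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=300,interval=1
```

Here seekpct=100 makes the access pattern fully random and xfersize=8k keeps I/Os in the small-block range of the OLTP profile above.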
 OLTP
OLTP is a type of database application that allows many users to perform
transactional operations online.
From a database perspective, the service characteristics of OLTP are as follows:
1. Reading, writing, and changing of each transaction involve a small amount of
data.
2. Database data must be up-to-date. Therefore, high database availability is
required.
3. Many users are accessing the database simultaneously.
4. The database must be highly responsive and able to complete a transaction
within seconds.
From a storage perspective, the service characteristics of OLTP are as follows:
1. Every I/O is small in size, typically ranging from 2 KB to 8 KB.
2. Disk data is accessed randomly.
3. At least 30% of data is generated by random writing operations.
4. Redo logs are written frequently.
The following describes the OLTP monitoring items and their recommended
thresholds:
 Controller average I/O response time (μs): average response latency of all I/Os
processed by a controller in a measurement period. Set the threshold based on
service requirements: an alarm is triggered when the latency reaches three
times the specified value and cleared when the latency drops to two times the
specified value. For example, if a user requires the OLTP service latency to be
15 ms, an alarm is triggered when the latency reaches 45 ms and cleared when
it decreases to 30 ms.
 Controller average CPU usage (%): CPU usage in a measurement period (5s by
default), indicating the service load of the CPU. An alarm is triggered when the
CPU usage reaches 90% and cleared when it drops to 85%.
 Front-end port average response time (μs): average response latency of all I/Os
processed by a front-end port in a measurement period. Set the threshold based
on service requirements: an alarm is triggered at three times the specified value
(for example, at 45 ms for a required latency of 15 ms).
 Disk usage (%): disk usage in a measurement period (5s by default), indicating
the service load of a disk. An alarm is triggered when the disk usage reaches
90% and cleared when it drops to 70%.
 LUN average I/O response time (μs): average response latency of all read/write
I/Os on a LUN in a measurement period. The threshold rule is the same as for
the controller response time (alarm at three times the specified value; for
example, at 45 ms for a required latency of 15 ms).
 Storage pool average I/O response time (μs): average response latency of all
read/write I/Os in a storage pool in a measurement period. The threshold rule is
the same as above (alarm at three times the specified value).
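The trigger-at-3x, clear-at-2x threshold rule is a hysteresis pattern, which can be sketched as a small filter; the 15 ms target and the latency samples here are illustrative values of our own:

```shell
# Raise an alarm when latency >= 3x target, clear it when it falls to <= 2x;
# between the two thresholds the alarm state is held, which avoids flapping
awk -v target=15 '
  BEGIN { trigger = 3 * target; clear = 2 * target; alarm = 0 }
  {
    if (!alarm && $1 >= trigger)   { alarm = 1; print $1 " ms -> alarm raised" }
    else if (alarm && $1 <= clear) { alarm = 0; print $1 " ms -> alarm cleared" }
  }' <<'EOF'
12
50
40
28
EOF
```

Fed the samples 12, 50, 40, 28 it raises the alarm at 50 ms, holds it through 40 ms (above the 30 ms clear threshold), and clears it at 28 ms.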

The performance bottleneck of random services usually lies on disks. The following
table lists the empirical values of random IOPS of disks (for reference only).
 SATA disk: 30 to 60
 SAS disk: 100 to 200
 FC disk: 100 to 200
 SSD: 5000 to 10000
 OLAP
The technology behind OLAP services uses multidimensional structures to provide
rapid access to data for analysis. OLAP is a type of application that allows users to
execute complex statistics queries in a database for an extended time.
From a database perspective, the service characteristics of OLAP are as follows:
1. No data or only a small amount of data is modified.
2. The data query process is complex.
3. The data is used at a gradually declining frequency.
4. The query output is usually in the form of a statistical value.
From a storage perspective, the service characteristics of OLAP are as follows:
1. Every I/O is large in size, typically ranging from 64 KB to 1 MB.
2. Data is read in sequence.
3. When read operations are being performed, write operations are performed in a
temporary tablespace.
4. Online logs are seldom written. The number of write operations increases only
when data is loaded in batches.
The following describes the OLAP performance monitoring items and their
recommended thresholds:
 Controller average CPU usage (%): CPU usage in a measurement period (5s by
default), indicating the service load of the CPU. An alarm is triggered when the
CPU usage reaches 90% and cleared when it drops to 85%.
 Front-end port bandwidth (MB/s): average bandwidth of a front-end port in a
measurement period. The recommended threshold depends on the port type.
Fibre Channel port: an alarm is triggered when the bandwidth reaches 650
MB/s and cleared when it drops to 500 MB/s. 10GE port: an alarm is triggered
when the bandwidth reaches 600 MB/s and cleared when it drops to 500 MB/s.
 Disk usage (%): disk usage in a measurement period (5s by default), indicating
the service load of a disk. An alarm is triggered when the disk usage reaches
90% and cleared when it drops to 70%.
 Back-end port bandwidth (MB/s): average bandwidth of a back-end SAS port in
a measurement period. An alarm is triggered when the bandwidth reaches
1200 MB/s and cleared when it drops to 1000 MB/s.

The performance bottleneck of sequential services usually lies in channels. The
following lists the theoretical bandwidth of several channels (for reference only):
 8 Gbit/s FC: 800 MiB/s
 10GE: 1000 MiB/s
Note: MiB is a binary unit (1 MiB = 2^20 bytes = 1,048,576 bytes), whereas MB is
usually treated as a decimal unit (1 MB = 10^6 bytes).
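As a rough sanity check on the 8 Gbit/s FC figure (assuming 8b/10b line encoding, which 8 Gbit/s Fibre Channel uses, so only 80% of the line rate carries payload):

```shell
# 8 Gbit/s line rate x 0.8 (8b/10b encoding efficiency) / 8 bits per byte
awk 'BEGIN { printf "%.0f MB/s\n", 8e9 * 0.8 / 8 / 1e6 }'
```

This prints 800 MB/s, matching the reference value above.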
 Precautions for database application performance
This part is not closely related to storage and is not fully presented in the teaching
materials. It is only used for knowledge extension and reference.
1. About indexes
 Indexes are used to reduce the I/Os of the select and update statements,
thereby shortening the execution time. However, the insert speed slows
down and extra space is required to store indexes.
 Indexes are created only when necessary. You do not need to create indexes
for small tables. Delete or disable useless indexes.
 Select proper indexes. For example, create bitmap indexes for columns used
for classification. Create a function index for a function that is often used as
a where clause for query. Create anti-key indexes if a large number of
concurrent transactions are concentrated on a small amount of hotspot
data. Create local indexes for partition tables in the OLAP system to shorten
the time for inserting data in batches.
 The fields for which indexes should be created include the fields that are
often used in the query expression, fields that are often used for table
joining, fields with a high degree of differentiation, and fields of foreign
keys.
 The fields for which indexes should not be created include the fields used
only in functions and expressions, fields that are frequently updated, and
fields with a low degree of differentiation. In addition, do not create indexes
for the performance of a single SQL statement.
 Unnecessary indexes must be deleted after being identified. Deleting useless
indexes can release more space and reduce the database load, improving
the performance of DML operations.
2. About partitions
 Partitions are used to improve the performance of large tables. When the
number of records in a table exceeds 100 million, the table needs to be
partitioned.
 During the query, the Oracle database automatically ignores irrelevant
partitions to reduce the query time.
 Hash partitioning can be used to balance the load of tablespaces.
 This feature facilitates file management, backup, and restoration.
3. About compression
 When CPU resources are sufficient but I/O resources are insufficient,
compression can be used.
 Compression helps the storage system reduce I/O operations and save
storage space.
 Compression is especially suitable for a large amount of regular data.

2.2.3 Host-Layer Performance Tuning


2.2.3.1 Performance Monitoring at the Host Layer
2.2.3.1.1 Linux System Performance Monitoring
 iostat
This command is used to monitor the I/O performance.
Command: iostat -kx 1 60

[root@XXX ~]# iostat -kx 1 60


Linux 3.10.0-1062.12.1.el7.x86_64 (XXX) 12/15/2020 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.57 0.00 1.41 0.78 0.00 94.24

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await
r_await w_await svctm %util

 rrqm/s: number of read requests merged per second that were queued to the
device.
 wrqm/s: number of write requests merged per second that were queued to the
device.
 r/s: number of read requests per second.
 w/s: number of write requests per second.
 rkB/s: number of kilobytes read from the device per second.
 wkB/s: number of kilobytes written to the device per second.
 avgrq-sz: average size (in sectors) of the requests issued to the device.
 avgqu-sz: average length of the request queue for the device.
 await: average total time (queuing plus service) of each I/O request, in ms.
 svctm: average service time of each I/O request, in ms.
 %util: percentage of time in which the I/O queue was not empty during the
measurement period.
Sequential services: The value of %util should be close to 100%, and the values of
rkB/s and wkB/s should approach the theoretical bandwidth of the channel. For an
Oracle database, the value of avgrq-sz should match the configured multiblock I/O
size. Random services: The values of r/s and w/s should be close to the theoretically
calculated IOPS values, and the value of await should be less than 30 ms.
This command can also be used to analyze the HBA status of the host. In the output
of the iostat command, avgqu-sz indicates the average queue depth of the block
devices of a LUN. If the value is greater than 10 for a long time, the problem is
probably caused by concurrency limits. As a result, I/O requests are piled up at the
block device layer of the host instead of being delivered to the storage side. In this
case, you can modify the HBA concurrency.

[root@XXX ~]# iostat -kx 1


Linux 3.10.0-1062.12.1.el7.x86_64 (XXX) 12/15/2020 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.57 0.00 1.41 0.78 0.00 94.24

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await
r_await w_await svctm %util
sda 0.05 2.14 0.09 1.46 1.27 14.41 20.21 0.00 0.69
0.38 0.06
sdf 0.00 0.00 0.01 0.99 0.36 11.71 24.07 0.00 0.30
0.27 0.03
sdg 0.00 0.00 0.00 0.00 0.00 0.04 62.78 0.00 1.38
0.27 0.00
sdh 0.00 0.00 0.00 0.00 0.02 0.00 8.01 0.00 0.11
0.10 0.00
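If avgqu-sz stays high as described above, one place to look is the block-layer queue depth exposed in sysfs. This sketch assumes the standard Linux /sys/block layout; the helper name and the optional root parameter are ours, added so the parsing can be tried against a mock tree:

```shell
# Print each block device's nr_requests (the block-layer queue depth); a
# persistently high avgqu-sz together with a small nr_requests suggests raising
# it, e.g. echo 256 > /sys/block/<dev>/queue/nr_requests (as root)
show_queue_depth() {
  root=${1:-/sys}
  for q in "$root"/block/*/queue; do
    [ -e "$q/nr_requests" ] || continue
    dev=${q%/queue}; dev=${dev##*/}
    printf '%s nr_requests=%s\n' "$dev" "$(cat "$q/nr_requests")"
  done
}
show_queue_depth
```

HBA-level concurrency limits are separate from this and are usually adjusted through the HBA driver's module parameters.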
Page 69

 top
The top command is a common performance analysis tool in the Linux operating
system. It can display the resource usage of each process in the system in real time to
help analyze whether the memory of the host is insufficient or whether the CPU
usage is high.

[root@XXX ~]# top


top - 16:20:36 up 24 min, 1 user, load average: 0.00, 0.01, 0.04
Tasks: 85 total, 1 running, 84 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3879860 total, 3049276 free, 239044 used, 591540 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 3407268 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND


1337 root 20 0 3490408 75392 14588 S 0.3 1.9 0:01.42 java
1 root 20 0 125380 3892 2604 S 0.0 0.1 0:01.34 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd

 The load average values in the first line show the average numbers of processes
in the running queue in the past 1, 5, and 15 minutes, respectively.
 The second line shows the number of processes in each state.
 The third line shows the CPU usage.
 us indicates the CPU usage in user mode.
 sy indicates the CPU usage in kernel mode.
 id indicates the CPU idle rate, which cannot be low. If the value is less than
10%, check whether it becomes a bottleneck.
 si indicates the CPU usage of software interrupts, which is related to
network interface cards and should not be high.
 The fourth line displays the usage of the physical memory. If the remaining
memory space is small, it will become a performance bottleneck.
 The fifth line shows the usage of the swap partition. If the remaining space is
small, it will become a performance bottleneck.
 sar
Run the following command to query the CPU performance:

[root@XXX ~]# sar -u -P ALL 1 60


Linux 3.10.0-1062.12.1.el7.x86_64 (XXX) 12/15/2020 _x86_64_ (2 CPU)
04:18:32 PM CPU %user %nice %system %iowait %steal %idle
04:18:33 PM all 0.00 0.00 0.00 0.00 0.00 100.00
04:18:33 PM 0 1.00 0.00 0.00 0.00 0.00 99.00
04:18:33 PM 1 0.00 0.00 0.00 0.00 0.00 100.00

Run the following command to query the network performance:

[root@XXX ~]# sar -n DEV 1 100


Linux 3.10.0-1062.12.1.el7.x86_64 (XXX) 12/15/2020 _x86_64_ (2 CPU)
04:49:06 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
04:49:07 PM eth0 1.00 0.00 0.06 0.00 0.00 0.00 0.00

04:49:07 PM lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Run the following command to query the memory performance:

[root@XXX ~]# sar -r 1 100


Linux 3.10.0-1062.12.1.el7.x86_64 (XXX) 12/15/2020 _x86_64_ (2 CPU)
04:50:25 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
kbactive kbinact kbdirty
04:50:26 PM 3056880 822980 21.21 28980 542328 821064 21.16
427648 272720 48
04:50:27 PM 3056936 822924 21.21 28980 542328 821064 21.16
427772 272724 48

 cat
On a Linux host, you can run the cat /proc/cpuinfo command to view the CPU
frequency of the host and check whether CPU underclocking has occurred.

[root@XXX ~]# cat /proc/cpuinfo


Processor :0
vendor_id : GenuineIntel
cpu family :6
model : 85
model name : Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
stepping :7
microcode: 0x1
cpu MHz : 3000.000
cache size : 30976 KB
physical id :0
siblings :2
core id :0
cpu cores :1
apicid :0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
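To spot underclocking at a glance, the live frequency can be compared with the nominal frequency embedded in the model name. This parsing helper is our own sketch and assumes the x86 /proc/cpuinfo layout shown above; the optional file parameter exists only so the logic can be tried against a sample file:

```shell
# Extract the nominal GHz from "model name" and the live MHz from "cpu MHz";
# a live value well below nominal suggests throttling or a power-saving governor
show_freq() {
  awk -F: '
    /model name/ { if (match($2, /[0-9.]+GHz/)) nominal = substr($2, RSTART, RLENGTH) }
    /cpu MHz/    { printf "current=%s MHz nominal=%s\n", $2 + 0, nominal; exit }
  ' "${1:-/proc/cpuinfo}"
}
show_freq
```

For the output above this would report current=3000 MHz against a nominal 3.00GHz, i.e. no underclocking.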

2.2.3.1.2 Windows System Performance Monitoring


 Task Manager
On a Windows host, you can use the Windows performance monitor or Windows
task manager to query the memory and CPU usage of the host. The following figure
shows the Windows Task Manager window.

 DirectX diagnostic tool


On the Windows host, choose Start > Run. In the Run dialog box, enter DXDIAG and
click OK to check whether the CPU underclocking occurs.

 Server Manager
An outdated HBA driver will result in large I/O splitting, inadequate concurrency, and
long I/O latency. These problems often occur in low load, full hit, and bandwidth-
intensive scenarios.
Windows host: Open Server Manager in Windows, and select the corresponding HBA
in the device list. The driver version is displayed on the Driver tab page of the device
properties panel.

2.2.3.2 Host-Layer Performance Tuning


2.2.3.2.1 Linux Disk Access Mode
You are advised to set the disk access mode to DirectIO when conducting common
performance tests in Linux.
The Linux operating system has a kernel buffer. Linux caches I/O data in a page
cache: data is first copied to the operating system kernel's buffer and then copied
from that buffer to the address space of the application. This mode is called
BufferedIO. In DirectIO mode, data is transferred directly between the user-space
buffer and the disk, bypassing the page cache.
When Vdbench or the dd command is used to test performance, BufferedIO is used by
default if DirectIO is not specified. The default page size of the Linux page cache is
4 KB. If the specified I/O size is greater than 4 KB, the following symptoms occur:
 The host splits I/Os in the kernel buffer and aggregates small I/Os at the block
device layer. As a result, the host CPU overhead is high.
 In multi-task concurrency scenarios, small I/Os cannot be completely aggregated into
large I/Os at the block device layer and then written to disks. Consequently, the
specified I/O model is changed, and the bandwidth performance of disks may fail to
reach the maximum.
To set the disk access mode to DirectIO, use either of the following methods:
 For a test using Vdbench, set openflags to o_direct in sd.

sd=default, openflags=o_direct, threads=32



 For the dd command, set oflag to direct when testing the write performance and set
iflag to direct when testing the read performance.

dd if=/dev/zero of=/testw.dbf bs=4k oflag=direct count=100000


dd if=/dev/sdb of=/dev/null bs=4k iflag=direct count=100000

2.2.3.2.2 Linux Block Device Parameters


Block device settings are an important factor affecting host performance. To improve
system performance, you need to properly configure the queue depth, scheduling
algorithm, prefetch size, and I/O alignment for block devices.
Note that the path of the file queried by running the cat command varies with the
operating system version. The following commands are for reference only.
 Queue depth
The queue depth determines the maximum number of concurrent I/Os that can be
written to block devices. The default value is 128 in Linux. You are not advised to
change the value. You can run the cat command to query the queue depth of the
current block device.

linux-XXX:~ # cat /sys/block/sdc/queue/nr_requests


128

When testing a system's peak performance, you can raise the queue depth to
increase host write I/O pressure and the probability that I/Os are merged in the
queue. You can run the following command to temporarily change the queue depth
of a block device:

echo 256 > /sys/block/sdc/queue/nr_requests

You can temporarily change the queue depth of a block device for performance
tuning. After the application server is restarted, the queue depth of the block device
is restored to the default value.
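If many sd* devices need the same temporary queue depth, the echo above can be wrapped in a small loop. The function below is a sketch, not from the original guide; the optional base-directory argument is an assumption added so the loop can be exercised against a mock tree, and it defaults to /sys/block:

```shell
# Temporarily set nr_requests for every sd* block device under a base
# directory (default /sys/block). Unmatched globs and read-only entries
# are skipped silently.
set_queue_depth() {
    depth=$1
    base=${2:-/sys/block}
    for q in "$base"/sd*/queue/nr_requests; do
        [ -w "$q" ] || continue
        echo "$depth" > "$q"
    done
}

# Example: set_queue_depth 256
```

As with the single-device echo, the change does not survive a reboot.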
 Scheduling algorithm
Linux kernel 2.6 supports four block device scheduling algorithms: Completely Fair
Queuing (CFQ), NOOP, deadline, and anticipatory (AS).
 CFQ I/O scheduling
 In the latest kernel versions and distributions, CFQ is selected as the default
I/O scheduler and is the best choice for general-purpose servers.
 CFQ tries to distribute I/O bandwidth evenly among access requests to avoid
process starvation and achieve low latency. It is a compromise between the
deadline and anticipatory schedulers.
 CFQ is the best choice for multimedia applications (such as video and audio)
and desktop systems.
 CFQ assigns a priority to each I/O request. The I/O priority is independent of
the process priority: read and write operations of a high-priority process do
not automatically inherit a high I/O priority.

 NOOP (elevator scheduling)
 In Linux 2.4 and earlier versions, NOOP is the only I/O scheduling algorithm.
NOOP implements a simple first in first out (FIFO) queue and organizes I/O
requests like an elevator: when a new request arrives, NOOP attempts to
merge it with an adjacent pending request targeting the same medium.
 NOOP benefits write operations more than read operations. NOOP is the best
choice for flash devices, RAM, and embedded systems.
 Deadline (deadline scheduling)
 Similar to NOOP, Deadline classifies and merges I/Os, sorting them by time
and disk area.
 Deadline ensures that requests are served within an adjustable deadline. The
read deadline is shorter than the write deadline by default, which prevents
read operations from being starved by writes.
 Deadline is the best choice for database environments (such as Oracle RAC
and MySQL).
 AS (anticipatory)
 AS is essentially similar to Deadline. However, after the last read operation,
AS waits up to 6 ms for a nearby read request before scheduling other I/O
requests.
 AS can anticipate subsequent read requests from the application and thus
improve read performance, at the cost of some write operations.
 While waiting, AS can aggregate multiple small write streams into a large
one; write latency is traded for maximum write throughput.
 AS is applicable to environments with a large number of write operations,
such as file servers. AS performs poorly in database environments.
Generally, the default scheduling algorithm is CFQ. Note that supported scheduling
algorithms may vary with operating systems.
You can run the cat command to query the scheduling algorithm of the current block
device.

linux-XXX:~ # cat /sys/block/sdc/queue/scheduler


noop deadline [cfq]

Improper scheduling algorithm configurations will affect system performance, for
example, in terms of I/O concurrency. You can run the following command to
temporarily modify the scheduling algorithm of a block device for performance
tuning:

echo noop > /sys/block/sdc/queue/scheduler

The scheduling algorithm will be restored to the default value after an application
server restarts.
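The active scheduler is the bracketed entry in the sysfs file shown above. The small helper below is a scripting convenience sketch added here (not from the original guide):

```shell
# Print the active I/O scheduler (the entry shown in brackets) from a
# sysfs scheduler file such as /sys/block/sdc/queue/scheduler.
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Example: active_scheduler /sys/block/sdc/queue/scheduler
```

For the sample output "noop deadline [cfq]", the helper prints "cfq".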

 Prefetch function
The prefetch function of Linux is similar to the prefetch algorithm of a storage
device. The function applies only to sequential reads. It identifies sequential streams
and prefetches the amount of data specified by read_ahead_kb (expressed in KB).
For example, the default prefetch size of the SUSE 11 operating system is 512 KB.
You can run the cat command to query the prefetch size of the current block device.

linux-XXX:~ # cat /sys/block/sdc/queue/read_ahead_kb


512

You can increase the prefetch size to improve performance when many large files
need to be read. You can run the following command to change the prefetch size of
a block device:

echo 1024 > /sys/block/sdc/queue/read_ahead_kb

 I/O alignment
If master boot record (MBR) partitions are created in Linux or Windows earlier than
Windows Server 2003, the first 63 sectors of a disk are reserved for the MBR and
partition table. The first partition starts from the sixty-fourth sector by default. As a
result, misalignment occurs between data blocks (database or file system) of hosts
and data blocks stored in disks, causing poor I/O processing efficiency. This type of
performance tuning may not be involved in new versions of operating systems. It is
not mentioned in the training materials and is only for knowledge extension.
When creating MBR partitions in Windows earlier than Windows Server 2003, you are
advised to run the diskpart command to set partition alignment.

diskpart> select disk 1


diskpart> create partition primary align=4096

In a Linux operating system, you can use either of the following methods to resolve
I/O misalignment:
Method 1: Change the start location of partitions. When creating MBR partitions in
Linux, you are advised to run the fdisk command in expert mode and set the start
location of the first partition as the start location of the second extent on a LUN.
(The default extent size is 4 MB.) The following quick command is used to create an
MBR partition in /dev/sdb. The partition uses all space of /dev/sdb. The start sector
is set to 8192, that is, 4 MB.

printf "n\np\n1\n\n\nx\nb\n1\n8192\nw\n" | fdisk /dev/sdb

Method 2: Use GPT partitions. The following quick command is used to create a GPT
partition in /dev/sdb. The partition uses all space of /dev/sdb. The start sector is set
to 8192, that is, 4 MB.

parted -s -- /dev/sdb "mklabel gpt" "unit s" "mkpart primary 8192 -1" "print"
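Whether an existing partition is aligned can be verified from its start sector, which sysfs exposes as, for example, /sys/block/sdb/sdb1/start. The helper below is a sketch that assumes 512-byte sectors and the 4 MB extent size mentioned above:

```shell
# Report whether a partition start sector falls on a 4 MB extent boundary
# (8192 sectors x 512 bytes = 4 MB).
is_aligned() {
    if [ $(( $1 % 8192 )) -eq 0 ]; then
        echo "sector $1: aligned"
    else
        echo "sector $1: misaligned"
    fi
}

# Example: is_aligned "$(cat /sys/block/sdb/sdb1/start)"
```

A legacy MBR partition starting at sector 63 is reported as misaligned, while one starting at sector 8192 is aligned.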
2.2.3.2.3 Decrease in CPU Clock Speed
During off-peak hours, hosts in some power-saving running modes may automatically
decrease their CPU clock speed to reduce power consumption. However, this decrease
affects the interaction between the host and storage systems, resulting in higher I/O
latency on the host. Therefore, if customers are sensitive to latency during off-peak
hours, you can configure the host running mode so that the CPU clock speed does not
decrease, delivering optimal performance.
As mentioned earlier, you can use the DirectX diagnostic tool to view the CPU clock speed
on a Windows host. On a Linux host, you can run the cat /proc/cpuinfo command to
view the CPU clock speed of the current host.
To ensure that the CPU clock speed does not decrease, you can change the host running
mode to high-performance mode on a Windows host. Navigation path: Start > Control
Panel > System and Security > Power Options > Choose or Customize a Power Plan >
High Performance. After selecting Category in View by, you can view the System and
Security item.
On a Linux host, you can perform the following operations to configure the host running
mode to ensure that the host CPU clock speed does not decrease:
 Run the cd command to go to the /sys/devices/system/cpu/ directory.
 Run the ll command to check the number of CPUx processes in the
/sys/devices/system/cpu directory. In CPUx, x indicates an integer.
 Run the echo performance command for each CPUx in the
/sys/devices/system/cpu/ directory.

[root@XXX ~]# cd /sys/devices/system/cpu


[root@XXX cpu]# ll
total 0
drwxr-xr-x 7 root root 0 May 25 08:35 cpu0
drwxr-xr-x 7 root root 0 May 25 08:35 cpu1
drwxr-xr-x 7 root root 0 May 25 08:35 cpu10
drwxr-xr-x 7 root root 0 May 25 08:35 cpu11
drwxr-xr-x 7 root root 0 May 25 08:35 cpu2
drwxr-xr-x 7 root root 0 May 25 08:35 cpu3
drwxr-xr-x 7 root root 0 May 25 08:35 cpu4
drwxr-xr-x 7 root root 0 May 25 08:35 cpu5
drwxr-xr-x 7 root root 0 May 25 08:35 cpu6
drwxr-xr-x 7 root root 0 May 25 08:35 cpu7
drwxr-xr-x 7 root root 0 May 25 08:35 cpu8
drwxr-xr-x 7 root root 0 May 25 08:35 cpu9
drwxr-xr-x 3 root root 0 May 28 14:43 cpufreq
drwxr-xr-x 2 root root 0 May 28 14:43 cpuidle
-r--r--r-- 1 root root 4096 May 25 08:35 kernel_max
-r--r--r-- 1 root root 4096 May 28 14:43 offline
-r--r--r-- 1 root root 4096 May 25 08:35 online
-r--r--r-- 1 root root 4096 May 28 14:43 possible
-r--r--r-- 1 root root 4096 May 28 14:43 present
--w------- 1 root root 4096 May 28 14:43 probe
--w------- 1 root root 4096 May 28 14:43 release

-rw-r--r-- 1 root root 4096 May 28 14:43 sched_mc_power_savings
-rw-r--r-- 1 root root 4096 May 28 14:43 sched_smt_power_savings
[root@XXX cpu]# echo performance > cpu0/cpufreq/scaling_governor
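The per-CPU steps above can be combined into one loop. This sketch (not from the original guide) takes an optional base directory, an assumption added for testability, and defaults to /sys/devices/system/cpu; CPUs whose governor file is not writable are skipped:

```shell
# Apply the "performance" governor to every CPUx that exposes cpufreq.
set_performance_governor() {
    base=${1:-/sys/devices/system/cpu}
    for g in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$g" ] || continue   # skip unmatched globs and read-only entries
        echo performance > "$g"
    done
}

# Example (as root): set_performance_governor
```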

2.2.3.2.4 HBA
 Concurrency limit on an HBA
Concurrency limit on an HBA indicates the maximum number of I/Os that can be
concurrently transmitted to each LUN. In high-concurrency service scenarios, the
performance is low due to insufficient concurrency of HBAs and block devices.

Performance Indicator  Description
Concurrency limit      Maximum number of I/Os that can be transferred by an HBA in a time slice. The value of this parameter is adjustable. You are advised to set it to the maximum value to prevent I/Os from being blocked in HBAs.
Maximum I/O size       Maximum I/O size that an HBA can deliver without splitting I/Os.
Maximum bandwidth      Maximum bandwidth of a single HBA port. You need to add HBAs and network connection ports dynamically based on the actual storage bandwidth requirement.
Maximum IOPS           Maximum IOPS of a single HBA port. You need to add HBAs and network connection ports dynamically based on the actual storage IOPS requirement.

On the Windows operating system, the default number of concurrent I/Os allowed by an
HBA is 128 in most cases but may be smaller in some versions. For example, in
Windows Server 2012 R2, the default number of concurrent I/Os for Emulex HBAs is 32.
If the number of concurrent I/Os is small, the host pressure cannot be completely
transferred to the storage device. If the latency on the host side differs significantly
from that on the storage device side, you can query the number of concurrent I/Os
allowed by the current HBA by using the management software provided by the HBA
vendor, and change the number as required.
On the Linux operating system, HBA queue parameters vary depending on the HBA
type and driver. For details, see the specifications provided by HBA vendors. For
example, a QLogic dual-port 8 Gbit/s Fibre Channel HBA allows the maximum queue
depth of each LUN to be 32. If the difference between the latency on the host and
storage sides is large, run the iostat command to check whether the concurrency
bottleneck is reached.

In the command output, avgqu-sz indicates the average queue depth of the block
devices where LUNs are created. If the value stays above 10 for a long time, the
problem is probably caused by the concurrency limit: I/O requests pile up at the
block device layer of the host instead of being delivered to the storage side. In this
case, increase the concurrency limit of the HBA.
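Scanning iostat output for saturated queues can be scripted. The awk filter below is a sketch that assumes avgqu-sz is the ninth column of `iostat -x` output, which holds for common sysstat versions but should be verified against the header line of your version:

```shell
# Print devices whose avgqu-sz (assumed to be column 9 of "iostat -x"
# output) exceeds 10. On the header line the field is non-numeric, so
# "$9 + 0" evaluates to 0 and the line is skipped.
# Usage: iostat -x | high_queue_devices
high_queue_devices() {
    awk '$9 + 0 > 10 {print $1}'
}
```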
 Driver issue
An outdated HBA driver can cause large I/O splitting, insufficient concurrency, and
long I/O latency. These problems often occur in low-load, full-hit, and bandwidth-
intensive scenarios. If the HBA driver version is outdated, you are advised to update
it to the latest version.
For a Windows host, open the Server Manager window in Windows, and select the
corresponding HBA from the device list. The driver version is displayed on the Driver
tab page of the device properties panel.

For a Linux host, you can run the lsscsi command to check the host ID
corresponding to the current HBA. As shown in the following command output, the
host IDs of the HBAs are 5 and 6.

[root@XXX ~]# lsscsi


[0:2:0:0] disk IBM ServeRAID M5015 2.12 /dev/sda
[1:0:0:0] cd/dvd TSSTcorp DVD-ROM TS-L333H ID03 /dev/sr0
[5:0:0:0] disk HUASY XXXXXX 2105 -
[5:0:0:1] disk HUASY XXXXXX 2105 -
[5:0:0:2] disk HUASY XXXXXX 2105 -
[5:0:0:3] disk HUASY XXXXXX 2105 -
[5:0:0:4] disk HUASY XXXXXX 2105 -
[5:0:0:5] disk HUASY XXXXXX 2105 -
[5:0:0:6] disk HUASY XXXXXX 2105 -
[5:0:0:7] disk HUASY XXXXXX 2105 -
[5:0:1:0] disk HUASY XXXXXX 2105 -
[5:0:1:1] disk HUASY XXXXXX 2105 -
[5:0:1:2] disk HUASY XXXXXX 2105 -
[5:0:1:3] disk HUASY XXXXXX 2105 -
[5:0:1:4] disk HUASY XXXXXX 2105 -
[5:0:1:5] disk HUASY XXXXXX 2105 -
[5:0:1:6] disk HUASY XXXXXX 2105 -
[5:0:1:7] disk HUASY XXXXXX 2105 -
[6:0:0:0] disk HUASY XXXXXX 2105 -
[6:0:0:1] disk HUASY XXXXXX 2105 -
[6:0:0:2] disk HUASY XXXXXX 2105 -
[6:0:0:3] disk HUASY XXXXXX 2105 -
[6:0:0:4] disk HUASY XXXXXX 2105 -
[6:0:0:5] disk HUASY XXXXXX 2105 -
[6:0:0:6] disk HUASY XXXXXX 2105 -
[6:0:0:7] disk HUASY XXXXXX 2105 -
[6:0:1:0] disk HUASY XXXXXX 2105 -
[6:0:1:1] disk HUASY XXXXXX 2105 -
[6:0:1:2] disk HUASY XXXXXX 2105 -
[6:0:1:3] disk HUASY XXXXXX 2105 -
[6:0:1:4] disk HUASY XXXXXX 2105 -
[6:0:1:5] disk HUASY XXXXXX 2105 -
[6:0:1:6] disk HUASY XXXXXX 2105 -
[6:0:1:7] disk HUASY XXXXXX 2105 -

Run the cd command to go to the corresponding host directory.

[root@XXX ~]# cd /sys/class/scsi_host/host5

Run the cat command to display the information about driver_version.

[root@XXX ~]# cat driver_version


8.04.00.13.11.3-k
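To collect the driver version of every HBA in one pass, the per-host cd/cat steps can be looped. This sketch (not from the original guide) takes an optional base directory, an assumption added for testability, defaulting to /sys/class/scsi_host:

```shell
# Print the driver version reported by each SCSI host adapter in sysfs.
show_driver_versions() {
    base=${1:-/sys/class/scsi_host}
    for h in "$base"/host*; do
        [ -r "$h/driver_version" ] || continue   # not every host exposes one
        printf '%s: %s\n' "${h##*/}" "$(cat "$h/driver_version")"
    done
}

# Example: show_driver_versions
```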

2.2.4 Network-Layer Performance Tuning


2.2.4.1 Link Bandwidth Bottlenecks
The bandwidth of the entire system is affected by the front-end link bandwidth,
controller bandwidth, and back-end disk bandwidth. The front-end link bandwidth is
the most common bottleneck.
To estimate the front-end link bandwidth, you need to understand the host-storage
network plan and clarify how the switches are cascaded. Bandwidth is restricted by the
weakest point on the network. For example, if two optical fibers connect a host and a
switch but only one 8 Gbit/s Fibre Channel fiber connects the switch and a controller,
the maximum unidirectional bandwidth is about 750 MB/s.
The following table lists the measured bandwidth of common Fibre Channel links. Note
that the maximum bandwidth of a single Fibre Channel link is reached in the large
sequential I/O scenario. In the case of random small I/Os, the maximum bandwidth of
the link decreases.

Link                             Unidirectional Bandwidth (MB/s)  Bidirectional Bandwidth (MB/s)
8 Gbit/s Fibre Channel links     750                              1400
16 Gbit/s Fibre Channel links    1500                             2900
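The unidirectional figure for an 8 Gbit/s link can be sanity-checked with quick arithmetic: 8GFC signals at an 8.5 Gbaud line rate with 8b/10b encoding (10 line bits per payload byte), giving a theoretical payload ceiling of about 850 MB/s, which frame and protocol overhead reduce to the measured figure of roughly 750 MB/s:

```shell
# Theoretical payload ceiling of an 8 Gbit/s FC link:
# 8.5 Gbaud line rate / 10 line bits per payload byte = 850 MB/s.
awk 'BEGIN { printf "8GFC payload ceiling: %.0f MB/s\n", 8.5e9 / 10 / 1e6 }'
```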

Ports on host HBAs, Fibre Channel switches, and Fibre Channel optical modules on
storage devices can work at different rates. The actual transmission rate is the minimum
rate among these rates. If the working rate of a link is improper or a link is faulty, the
system performance may deteriorate. In addition, load balancing and redundancy need to
be configured for links to ensure better performance and higher reliability.

2.2.4.2 Multipathing Software


Multipathing software ensures redundancy, reliability, and high performance of links
between a host and a storage system. UltraPath, multipathing software developed by
Huawei, is recommended. Note that if OS native or third-party multipathing software is
used, incorrect paths may be selected due to improper configurations or compatibility
issues.
All descriptions about the multipathing software in this document are based on the
assumption that the customer uses Huawei UltraPath. If OS native or third-party
multipathing software is used, see the related documents provided by the corresponding
multipathing software vendor.
The operations described in this document are performed on Windows and Linux hosts.
For details about how to query the multipathing software on other operating systems,
see the multipathing software user guide of the corresponding operating system.
 Multipathing status
Before querying the multipathing status, ensure that the link status is normal and the
owning controller is working properly.

On a Windows host, open the UltraPath Console and check whether the link status is
normal and whether the owning controller is working properly.
On a Linux host, run the upadmin show path command to check whether physical
paths are normal.
In some scenarios, links may be lost, that is, some initiator ports are absent. On
UltraPath, check whether the number of links is the same as the number of physical
connections. If link loss occurs in the IOPS-intensive scenario, I/O forwarding
performance deteriorates. If link loss occurs in the bandwidth-intensive scenario, a
physical connection is reduced, and the bandwidth decreases.
 Configuring the multipathing software
If the front-end intelligent balancing mode, load balancing algorithm, or other
parameters of UltraPath are set inappropriately, I/O loads on links will be
unbalanced or the sequence of I/Os will be affected. Both problems prevent optimal
performance delivery and adversely affect bandwidth capability.
On a Windows host, use the UltraPath Console to query the multipathing parameter
settings. Navigation path: System > Global Settings
On a Linux host, run the upadmin show upconfig command to query the
multipathing parameter settings.
 Intelligent front-end balancing
Huawei UltraPath 21.6.0 or later works closely with storage systems. UltraPath uses
the intelligent distribution algorithm to calculate the shard value of each I/O and
searches for the corresponding vNode in the storage system based on the shard
value. Based on the search result, UltraPath distributes the I/O to the front-end link
corresponding to the vNode. This prevents data forwarding between vNodes.
 Load balancing algorithm
If the load balancing algorithm of UltraPath is inappropriate, I/O loads on each link
may be unbalanced. Consequently, the optimal performance cannot be achieved.
UltraPath 21.6.0 supports three types of load balancing policies: min-queue-depth,
round-robin, and min-task.
min-queue-depth: The policy obtains the number of I/Os at each path in real time
and delivers new I/Os to the path with the fewest I/Os. When an application server
delivers I/Os to a storage system, the queue with the minimum number of I/Os takes
precedence over other queues in sending I/Os. Min-queue-depth is the default path
selection algorithm and provides the optimal performance in most cases.
round-robin: When an application server delivers I/Os to a storage system, UltraPath
sends the first set of I/Os through path 1 and second set of I/Os through path 2, and
so on. In this way, each path can be fully utilized. This algorithm is used together
with the policy of load balancing among controllers. Generally, the round-robin
algorithm has adverse impact on bandwidth performance. I/Os are delivered to each
logical path in sequence without considering the link load. As a result, a link may
become congested with too much pressure, which affects the sequence of I/Os.
min-task: When sending I/Os to the storage system, the application server calculates
the total data volume based on the block size of each I/O request and delivers the

I/Os to the path with the minimum data volume. The algorithm is seldom used and
has the same impact on performance as the min-queue-depth policy.
 Number of consecutive I/Os for load balancing
The number of consecutive I/Os for load balancing indicates the number of
consecutive I/Os delivered by hosts on a path at a time. For example, the
Loadbalance io threshold parameter in the UltraPath software of the Linux
operating system indicates the number of consecutive I/Os for load balancing. The
default value is 100, indicating that the host delivers 100 consecutive I/Os on each
selected path. A Linux block device aggregates consecutive I/Os. If multiple I/Os
delivered on a path use consecutive addresses, the block device aggregates them into
a large I/O and sends the large I/O to the storage system, thereby improving
efficiency.
In some scenarios, improper setting of this parameter will affect the performance.
On the Windows operating system, the default value of this parameter is 1. You are
advised to retain the default value because Windows does not aggregate consecutive
I/Os at a block device layer the way Linux does.
On the Linux operating system, it is advised to set this parameter to a smaller value
if large I/Os are involved. The default maximum size of I/Os that can be processed by
a Linux block device is 512 KB. That is, a normal or aggregated I/O will never be
larger than 512 KB. For example, if a 512 KB I/O is delivered, it cannot be
aggregated. In this case, the recommended value of this parameter is 1.
In scenarios where all I/Os are random, I/Os are barely aggregated. To reduce the
overhead, it is advised to set this parameter to 1.

2.2.4.3 Ping-Pong Effect


As shown in the teaching material, in the multipathing networking environment of a
cluster, the multipathing software is configured for automatically switching over the
working controller of a LUN. When two application servers 1 and 2 access the same LUN
(whose owning controller is controller A), if one link of application server 1 is
disconnected, as shown in the figure, application server 1 switches the LUN's working
controller to controller B. Because the two links of application server 2 are working
properly, application server 2 attempts to switch the LUN's working controller back to
controller A. As a result, the two application servers keep switching the LUN's working
controller.
Such repeated switchover of the LUN's working controller is called the ping-pong effect.
This effect compromises LUN access performance and makes I/O timeouts likely to
occur on application servers.
If the ping-pong effect occurs in a storage system, take the following measures:
1. Disable automatic LUN switchover in multipathing software. For details, see the
multipathing software user guide of the corresponding version. For Huawei
OceanStor UltraPath, log in to http://support.huawei.com and search for the related
document. For third-party multipathing software, obtain related documents from the
official channels of the third-party vendor.
2. Recover the interrupted link as soon as possible. Ensure that the link between each
node and each storage controller works properly.

2.2.4.4 Switch Ports


Generally, hosts and storage devices are not directly connected on the live network. To
ensure reliability and load balancing, Fibre Channel switches are used for cross
networking. If zones for Fibre Channel switches or VLANs for IP switches are incorrectly
configured, the load delivered by the host to the storage device may be halved or
otherwise reduced. As a result, the host load is insufficient, affecting system performance.
Zoning for FC switches is similar to VLAN division on Ethernet switches. Zoning isolates
links, shrinking fault domains and reducing competition for links among multiple hosts
and services.
In a point-to-point zoning plan, each zone contains only one initiator and one target. This
policy minimizes interference between devices in different zones.
For example, each of two hosts is connected to two switches to form a crossover
network. Each switch is connected to controller A and controller B through two 8 Gbit/s
FC optical fibers. No zone is configured, as shown in the figure on the left.
Theoretically, the bandwidth can reach 3.2 GB/s in this networking environment.
However, when the dd command is used to test the performance, the maximum read
bandwidth reaches only about 2.4 GB/s. The eight physical paths between the two
switches and the storage devices are shared by the two hosts. As a result, the two hosts
compete for the links during service peak hours, so the bandwidth fluctuates and cannot
reach the maximum.
Zones can be configured to isolate the shared links. As shown in the figure on the right,
each host uses four fixed links between the switches and controllers. This prevents
multiple hosts from competing for the links. After zones are configured, the bandwidth
can reach 3.2 GB/s during the performance test.
In addition to zoning, you can group shared links for isolation purposes by using
multipathing software to disable paths or binding a LUN group with a port group.
Zones can be configured on the GUI or by running commands. Switch vendors may be
required to provide assistance in troubleshooting.

2.2.4.5 Network Performance Parameters


In Linux, all TCP/IP tuning parameters are stored in the /proc/sys/net/ directory. The
training materials briefly list some tuning parameters and explanations of how they work.
Other tuning parameters are not closely related to storage and therefore are not
described here.
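As one concrete example, the TCP socket buffer limits often examined during storage network tuning live under this directory. The snippet below only displays them; it is an illustration, not a recommendation of specific values:

```shell
# Show the min/default/max TCP receive and send buffer sizes (bytes).
for f in /proc/sys/net/ipv4/tcp_rmem /proc/sys/net/ipv4/tcp_wmem; do
    [ -r "$f" ] || continue   # skip if the kernel does not expose the file
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```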

2.2.5 Storage-Layer Performance Tuning


2.2.5.1 Information Collection
Information collection is the basis for locating problems and optimizing performance on
a storage device. For example, collecting information about I/O characteristics helps users
analyze the controller CPU performance more efficiently and accurately. Information that
needs to be collected includes the types of services provided by the storage system, I/O
characteristics, and a storage resource plan.
 Service types and I/O characteristics

Before locating performance problems on the storage side, you need to know the
service types and I/O characteristics on the storage side.
Service types include the Oracle database online transaction processing (OLTP)
service, online analytical processing (OLAP) service, virtual desktop infrastructure
(VDI) service, and Exchange mail service. By understanding and analyzing service
types, you can determine whether to pay attention to IOPS or bandwidth for the
current service and whether a low latency is required.
I/O characteristics include the I/O size, read/write ratio, cache hit ratio, and hotspot
distribution. You can use DeviceManager to monitor the system performance. By
understanding and analyzing I/O characteristics, you can determine whether the
current services are random I/Os or sequential I/Os, large I/Os or small I/Os, and
whether read services or write services are dominant.
 Storage resource plan
Before locating performance problems on the storage side, you need to learn about
the current storage resource plan to locate the fault domain. You can use
DeviceManager or the information collection function of Toolkit to collect storage
resource planning information. A storage resource plan covers the following aspects:
1. Product model, specifications, and software version. To ensure load balancing
among multiple controllers, the number of interface modules and the number of
front-end and back-end ports on each controller must be basically the same.
Otherwise, a controller may bear heavy load while other controllers may bear
light load.
2. Number of front-end and back-end interface modules and number of connected
ports. For bandwidth-intensive services, the number of interface modules that
provide bandwidth for back-end ports must match the number of interface
modules that provide bandwidth for front-end ports, ensuring that the
bandwidth of the storage system can be fully utilized.
3. LUN properties, including whether deduplication and compression are enabled
for a LUN.
4. Value-added feature configurations, such as whether value-added functions such
as snapshot and remote replication are configured. Due to the implementation
mechanism of value-added features, extra performance overheads are
generated. A large number of metadata operations (such as LUN initialization
and space allocation) are required. In addition, many non-host I/Os (such as
remote replication records and deletion of differences) may be generated.
Therefore, you need to know the value-added functions configured in the
storage system.

2.2.5.2 Impact of Internal Transactions and Value-Added Features on Performance
Internal transactions refer to the response operations performed by the storage system in
the background based on user operations, such as reconstruction after a disk failure,
balancing during disk capacity expansion, and global garbage collection. Due to the
implementation mechanism of internal transactions and value-added features, a large

number of metadata operations are required and many non-host I/Os may be generated,
which inevitably affects host services of the storage system.
When locating performance problems, first check whether internal transactions or
value-added features are running. Common background tasks of this kind include
migration, reconstruction, and pre-copy.
Data copies are generated by pre-copy and reconstruction tasks. When there are a large
number of data copies, current service performance is affected. Therefore, during a
performance test, check whether the storage system is running pre-copy or
reconstruction tasks by running the following CLI command. When a pre-copy or
reconstruction task exists in a disk domain, you can view the progress of each task. If no
pre-copy, reconstruction, or balancing task exists in the disk domain, the command
returns no task information.

admin:/>show disk_domain task disk_domain_id=0

 Migration
SmartMigration supports LUN migration within a storage device or between storage
devices. Data migration involves a large number of read and write copies, impacting
the system performance. If the migration speed is set to a high value, the host I/O
performance will be significantly affected. If the migration speed is set to a low
value, the host I/O performance will be slightly impacted.
 Reconstruction
If a disk is faulty, the reconstruction function will recover data to newly assigned hot
spare space using RAID redundancy. This process will generate a large number of
RAID calculating overheads and data copies.
 Pre-copy
Disk data pre-copy enables a storage system to routinely check its hardware status
and migrate data from any faulty disk to minimize the risks of data loss. Data copies
are generated in the pre-copy process.
 Others
Other value-added features such as snapshot and remote replication increase
overheads and complicate I/O processing procedures, thereby undermining the
system performance. For details about the impact of value-added features on host
performance, see the corresponding feature guide.

2.2.5.3 CPU Performance Analysis


CPU capability is the most critical factor in determining a storage controller's maximum
performance. You need to check the CPU performance of controllers first when locating
performance problems based on I/O paths.
High CPU usage by a storage system increases the scheduling and I/O latencies.
A storage system's CPU usage varies significantly with different I/O models and
networking modes. You can view the CPU usage of the current controller by using the
performance monitoring function of DeviceManager or by running CLI commands.
Navigation path for DeviceManager performance monitoring: Monitor > Performance
Monitoring > Analysis. Create a controller metric graph as prompted and select the
average CPU usage. For details about how to create a metric graph, see "Creating a
Metric Graph" in the corresponding product documentation.
You can also run the show performance controller command to display the CPU usage
of a controller.

admin:/> show performance controller controller_id=0A

If the overall CPU usage is high for a long time, the controller performance has reached
the upper limit. In this case, you are advised to migrate some services to other storage
systems to reduce service load.
You can set the CPU usage threshold of a storage system. The default threshold is 90%.
When the CPU usage of a controller in a storage system exceeds the threshold, the
storage system is triggered to collect information. The collected information is stored in
the /OSM/coffer_data/omm/perf/exception_info/ directory. The total size of files in this
directory cannot exceed 14 MB. If the total size exceeds 14 MB, the earliest files will be
overwritten. The collected information is used for subsequent performance tuning or
problem locating and analysis.
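The threshold-and-overwrite behavior described above can be sketched as follows. This is an illustrative model only: `maybe_collect` and the file representation are hypothetical, not the storage system's actual implementation; only the 90% default threshold and the 14 MB directory cap come from the text.

```python
from collections import deque

MAX_DIR_BYTES = 14 * 1024 * 1024  # total size cap of the collection directory

def maybe_collect(cpu_usage_pct, files, new_file, threshold=90.0):
    """If CPU usage exceeds the threshold, append a new diagnostics file
    and evict the oldest files until the directory fits under 14 MB.
    `files` is a deque of (name, size_bytes) tuples, oldest first."""
    if cpu_usage_pct <= threshold:
        return files                 # below threshold: no collection
    files.append(new_file)
    while sum(size for _, size in files) > MAX_DIR_BYTES:
        files.popleft()              # overwrite (drop) the earliest file
    return files
```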

2.2.5.4 Performance Analysis of Front-End Ports


Front-end ports process host I/Os. Check relevant factors and identify potential
bottlenecks affecting the storage system's performance.
 Viewing front-end ports
Before analyzing the performance of front-end ports, confirm the positions of the
interface modules and the number, working status, and rates of connected ports.
You can use DeviceManager or run the show port general command on the CLI to
view this information. Then, ensure that the following requirements are met:
1. The number of front-end ports connected to the two controllers must be the
same. The connected ports must be evenly distributed on multiple front-end
interface modules to ensure load balancing between controllers and front-end
interface modules. For example, two 8 Gbit/s Fibre Channel interface modules
are deployed on controller A and controller B respectively, and each controller is
connected to the switch through four optical fibers. In this case, each Fibre
Channel interface module should be connected using two optical fibers.
2. Ensure that the working rates of the front-end ports displayed on
DeviceManager or CLI are consistent with the actual specifications and do not
decrease. When the port rate decreases, check the network, switches, and port
optical modules for faults and replace the faulty component.
3. Ensure that the displayed working mode is consistent with the actual networking
mode. P2P indicates switch-based connections, and FC-AL indicates direct
connections.
 Checking the concurrency pressure of front-end ports
Before testing the maximum performance of a storage system, ensure that the host
concurrency pressure is high enough. If the number of concurrent tasks on the host is
large enough but the performance value (IOPS/bandwidth) is not high, the host
pressure may not have been transferred to the storage front-end or the storage
performance may have reached the bottleneck.
In addition to comparing the latency, you can also use either of the following
methods to check the concurrency pressure of front-end ports for supplementary
analysis.
1. Method 1: Perform calculation by using a formula. This method applies to
scenarios where pressure is constant, for example, using test tools such as
IOmeter and Vdbench to test the performance under constant pressure. The
number of concurrent tasks is generally fixed during the test, and can be derived
if the IOPS and latency are provided, to understand the front-end concurrency
pressure. The formula is as follows: IOPS = Number of concurrent tasks x
1000/Latency (unit: millisecond). For example, if the IOPS is 3546 and the
latency is 6.77 ms, the number of concurrent tasks is 24 (which is calculated by
3546 x 6.77/1000).
2. Method 2: Run related commands on the CLI to obtain the approximate front-
end concurrency. This method applies to scenarios where pressure changes. You
can run the show controller io io_type=frontEnd controller_id=XX command to
query the number of concurrent front-end I/Os of a specified controller. Run this
command multiple times to obtain a stable value that can approximate the
number of concurrent front-end tasks delivered to the controller. XX indicates
the controller ID.
If the latency is low, the number of concurrent tasks obtained by running the show
controller io command may be inaccurate. In this case, method 1 can be used for
analysis.
If the front-end concurrency pressure is not high enough, increase the host
concurrency pressure. If the front-end concurrency pressure remains the same, locate
the fault on the host.
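The relationship in Method 1 can be expressed as a one-line helper (a sketch of the formula above, not a storage-system API):

```python
def concurrency_from_iops(iops, latency_ms):
    """Invert IOPS = concurrency x 1000 / latency (ms) to estimate the
    number of concurrent tasks driving the front-end ports."""
    return iops * latency_ms / 1000.0

# Example from the text: 3546 IOPS at 6.77 ms latency -> about 24 tasks.
```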
 Checking whether front-end ports have bit errors
If the performance fluctuates frequently or declines unexpectedly, faults may have
occurred on front-end ports or links. In this case, check whether the front-end ports
experience bit errors.
You can check whether bit errors occur on front-end ports on DeviceManager, by
running the show port bit_error command on the CLI, or in the inspection report.
The bit error information of front-end ports varies with system versions. Bit errors on
the actual interface prevail.
If the number of bit errors on a front-end port increases continuously, performance
faults have occurred on the front-end port or link. In this case, you are advised to
replace the optical fiber or optical module.
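A continuously increasing bit-error counter is the signal to act on. Below is a minimal sketch of that trend check; how the counters are sampled from the show port bit_error output is assumed and not shown here.

```python
def bit_errors_increasing(samples):
    """Return True if a port's bit-error count strictly grows across
    consecutive polls, suggesting a faulty fiber or optical module."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
```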
 Front-end port performance indicators
Key performance indicators of front-end ports include the average read I/O response
time, average write I/O response time, average I/O size, IOPS, and bandwidth.
You can use DeviceManager or the CLI command to query these indicators.

Performance indicators of front-end ports vary with versions. Performance indicators
on the actual interface prevail.
On DeviceManager, choose Monitor > Performance Monitoring. On the Monitoring
Panel page, you can view typical performance indicators of various front-end ports. If
you want to query more performance indicators, create a controller metric graph on
the Analysis page as prompted. After the graph is created, you can view the relevant
information.
Run the show performance port command on the CLI.
By viewing performance indicators of front-end ports, you can analyze them and
preliminarily locate possible performance problems.
Average read/write latency: When locating performance problems, you need to
compare the average latency of front-end ports with that on the host to check
whether there is a large difference, and then determine whether the problem may
occur on the storage side. The I/O front-end latency does not include the latency for
interaction and link transmission between the storage device and the host.
Average I/O size: indicates the average size of I/Os received by the storage device. If
the I/O size is different from that of I/Os delivered by the host, I/O splitting or
merging has been performed by the host or the HBA driver.
IOPS and bandwidth: Compare the IOPS and bandwidth of each front-end port to
determine the differences in the service pressure delivered by the host connected to
each port and determine whether the front-end pressure is balanced. Pay attention
to whether the bandwidth of front-end ports is close to the theoretical maximum
bandwidth of a link. If the two values are approximate, the front-end bandwidth
becomes a performance bottleneck.
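The saturation check in the last point can be sketched as a quick calculation. This is an illustrative helper: it converts the nominal link rate (Gbit/s) to MB/s and ignores encoding overhead, so real usable bandwidth is somewhat lower than this line-rate figure.

```python
def port_saturated(measured_mbps, link_rate_gbps, threshold=0.9):
    """Flag a front-end port whose measured throughput (MB/s) approaches
    the link's theoretical maximum, indicating a bandwidth bottleneck.
    threshold=0.9 means 'within 90% of line rate' (an assumed cutoff)."""
    line_rate_mbps = link_rate_gbps * 1000 / 8   # Gbit/s -> MB/s
    return measured_mbps >= threshold * line_rate_mbps
```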

2.2.5.5 Cache Performance Analysis


Cache is a key module for improving system performance and user experience.
When analyzing cache performance, pay attention to the impact of the cache
configuration on the read and write performance of the storage system.
 Write policy (affecting write performance)
Cache write policies include write-back and write-through.
When HDDs are used, writing data to the cache is much faster than writing data to
HDDs. Therefore, performance in write-back mode is better than that in write-
through mode.
If SSDs are used, whether write-back or write-through provides better performance
will depend on the algorithm in use.
1. Write-back: A write success message is returned to the host as soon as each
write I/O arrives at the cache. Then, the cache sorts and combines data and
writes data to disks.
2. Write-through: Each write request is directly sent to its destination disk. The
write performance is subject to the disk performance, especially the key disk
parameters such as the disk type, rotational speed, and seek time. The reliability
of write-through is higher than that of write-back.

Whether cache write policy can be set depends on the product model. The cache
write policy cannot be set for some product models. In Huawei hybrid flash storage,
when running the create lun command on the CLI to create a LUN, you can set the
cache write policy.
If a BBU fails, only one controller is working, the temperature is too high, or the
number of faulty LUN pages exceeds the threshold, the write mode of LUNs may
switch from write-back to write-through (and the LUN health status may change to
write protection). In this situation, data can still be read but cannot be written,
because write protection prevents data on the disks from being modified.
You can query the properties of a LUN on DeviceManager or by running the show
lun general command on the CLI to obtain the current LUN health status and cache
write policy.
In the command output, Health Status indicates the health status of a LUN, Write
Policy indicates the cache write policy, and Running Write Policy indicates the
current cache write policy.
If the LUN health status is write protection, check for the cause, for example,
whether a BBU fails, only one controller is working, the temperature is too high, or
the number of LUN fault pages is higher than the threshold.
 High and low watermarks (affecting write performance)
The high or low watermark of a cache indicates the maximum or minimum amount
of dirty data that can be stored in the cache. An inappropriate high or low
watermark of a cache can cause the write performance to deteriorate.
When the amount of dirty data in the cache reaches the upper limit, the dirty data is
synchronized to disks at a high speed. When the amount of dirty data in the cache is
between the upper and lower limits, the dirty data is synchronized to disks at a
medium speed. When the amount of dirty data in the cache is below the lower limit,
the dirty data is synchronized to disks at a low speed.
Do not set an excessively large value for the high watermark. If the high watermark
value is excessively large, the page cache is small. When the front-end I/O traffic
surges, the I/Os become unstable and the latency is prolonged, adversely affecting
the write performance.
Do not set an excessively small value for the low watermark. If the low watermark
value is excessively small, cached data is frequently written to disks, decreasing the
write performance.
If the difference between the high and low watermarks is too small, back-end
bandwidth cannot be fully utilized.
The recommended high and low watermarks for a cache are 80% and 20%,
respectively.
If the current high and low watermarks of the cache do not meet the requirements,
adjust them based on the preceding rules and then check whether the performance
is improved. You can run the quota show pttinfo command to display the details
about the high and low watermarks of the cache or use SystemReporter to view the
performance indicators of the cache.

Changing the high or low watermark of a cache affects the frequency and size of
writing data from the cache to disks. Do not change the high or low watermark
unless required.
Whether the high and low watermarks can be changed depends on the product
model. For Huawei hybrid flash storage systems, you can run the change system
cache command on the CLI to change the high and low watermarks.
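The three flush speeds driven by the watermarks can be sketched as follows, using the recommended 80%/20% defaults. The function itself is illustrative, not the storage system's actual scheduler.

```python
def flush_speed(dirty_pct, high=80.0, low=20.0):
    """Pick the cache-to-disk flush speed from the proportion of dirty
    data in the cache, following the high/low watermark rules above."""
    if dirty_pct >= high:
        return "high"      # at or above the high watermark: flush fast
    if dirty_pct > low:
        return "medium"    # between the watermarks: flush at medium speed
    return "low"           # below the low watermark: flush slowly
```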
 Prefetch policy (affecting read performance)
Regarding data reading, the cache prefetches data to increase the cache hit ratio and
reduce the number of read I/Os delivered to disks, thereby minimizing the latency
and providing higher performance.
When reading data from a LUN, a storage system prefetches data from disks to the
cache based on the specified policy. A storage system supports four prefetch policies:
non-prefetch, intelligent prefetch, constant prefetch, and variable prefetch.
1. Non-prefetch: Data is not prefetched. Instead, data requested by a host is read
only from disks. This policy is suitable for scenarios where all I/Os are random.
2. Intelligent prefetch: Data prefetch is dynamically determined based on I/O
characteristics. Only in the case of sequential I/Os, data is prefetched. Intelligent
prefetch is the default policy and recommended in most scenarios.
3. Constant prefetch: The size of data prefetched each time is a predefined fixed
value. This policy is suitable for scenarios where multiple channels of sequential
fixed-size I/Os are delivered in a LUN. This policy can be used in media &
entertainment (M&E) and video surveillance scenarios.
4. Variable prefetch: The size of data prefetched each time is a multiple of the I/O
size. (The multiple is user-defined, ranging from 0 to 1024.) This policy is
suitable for scenarios where multiple channels of sequential I/Os are delivered
and the I/Os vary in size. This policy can be used in M&E and video surveillance
scenarios.
If a prefetch policy is set inappropriately, excessive or insufficient prefetch may occur.
Excessive prefetch indicates that the amount of prefetched data is much larger than
the amount of data that is actually read. For example, if the constant prefetch policy
is set in a scenario where all I/Os are random, excessive prefetch definitely occurs.
Excessive prefetch will cause poor read performance, fluctuations in the read
bandwidth, or the failure to reach the maximum read bandwidth of disks. To
determine whether excessive prefetch occurs, you can use SystemReporter to
compare the read bandwidth of a storage pool with that of the disk domain where
the storage pool resides. Excessive prefetch occurs if the read bandwidth of a storage
pool is much lower than that of the disk domain where the storage pool resides. If
excessive prefetch occurs, change the prefetch policy to non-prefetch.
Insufficient prefetch indicates that the amount of prefetched data is insufficient for
sequential I/Os. As a result, all I/Os must be delivered to disks to read data, and no
data is hit in the cache. For example, if the non-prefetch policy is set in the event of
small sequential I/Os (database logs), insufficient prefetch definitely occurs.
Insufficient prefetch leads to a relatively long I/O latency. To determine whether
insufficient prefetch occurs, you can use SystemReporter to query the read cache hit
ratio of a LUN. If the read cache hit ratio of the LUN mapped to the database log is
low, you can set the prefetch policy to intelligent prefetch or constant prefetch.
Note that the prefetch policy of some product models cannot be changed.
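The four policies can be sketched as a single dispatcher. Assumptions in this sketch: offsets are in KB, "sequential" means each I/O starts exactly where the previous one ended, and the prefetch multiple is illustrative; real intelligent prefetch uses richer I/O-pattern heuristics than this.

```python
def prefetch_size(io_offsets_kb, io_size_kb, policy, constant_kb=128, multiple=4):
    """Return how much data (KB) to prefetch under each policy:
    'none', 'constant', 'variable' (multiple is user-defined, 0-1024),
    or 'intelligent' (prefetch only when recent I/Os are sequential)."""
    sequential = all(b == a + io_size_kb
                     for a, b in zip(io_offsets_kb, io_offsets_kb[1:]))
    if policy == "none":
        return 0
    if policy == "constant":
        return constant_kb
    if policy == "variable":
        return io_size_kb * multiple
    # intelligent: dynamically decide based on I/O characteristics
    return io_size_kb * multiple if sequential else 0
```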

2.2.5.6 Storage Pool Performance Analysis


 RAID level
A storage pool provides storage space for upper-layer services. RAID policies must be
properly planned for a storage pool based on actual service requirements so that
multiple disks can work at the same time, improving the overall I/O processing
capability of the system and data security.
In the storage planning and deployment stage, the most suitable RAID level should
be selected based on service requirements, and the performance differences, space
utilization, and reliability should also be considered.
As an algorithm, RAID combines disks in a certain way and performs data striping. In
this way, multiple disks can work at the same time, improving the overall I/O
processing capability of the system.
For different I/O models, the reliability, read/write performance, and disk utilization
of the storage system vary with RAID levels. Different RAID levels provide different
read and write performance:
1. Random read: The performance of all RAID levels is equivalent.
2. Random write: Each random write will cause extra I/O overheads because of
RAID parity algorithms and mirror processing methods. This effect is called the
write penalty. The larger the write penalty, the poorer the performance of
random writes. Therefore, the random write performance of RAID 10 is better
than that of RAID 5, whereas that of RAID 5 is better than that of RAID 6.
3. Sequential read: In the RAID 2.0+ architecture, all disks in a disk domain can
provide read performance because no independent parity disks are required. For
this reason, RAID 5 and RAID 6 deliver better sequential read performance than
RAID 10 (due to mirroring).
4. Sequential write: In sequential write scenarios, data can be written into disks in
units of stripes. Parity bit writing (in RAID 5 and RAID 6) and mirror-based dual
writes (RAID 10) cause extra write overheads. In sequential writes, write
performance becomes poorer as the number of parity bits grows.
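The random-write ordering above (RAID 10 better than RAID 5, RAID 5 better than RAID 6) follows directly from the classic write-penalty factors, sketched here. The figures are the textbook values for small random writes, not measurements of any particular array.

```python
# Disk I/Os generated per host random write:
# RAID 10 mirrors each write (2); RAID 5 reads old data and old parity,
# then writes new data and new parity (4); RAID 6 maintains two parity
# values, adding two more I/Os (6).
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def effective_write_iops(raw_disk_iops, raid_level):
    """Host-visible random-write IOPS after the RAID write penalty."""
    return raw_disk_iops // WRITE_PENALTY[raid_level]
```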
 RAID stripe depth
The stripe depth determines the maximum size of I/Os written to disks after host
I/Os are processed using the RAID algorithm. For I/Os with different characteristics,
the stripe depth affects performance to some extent. Not all storage systems allow
you to set stripe depth.
For Huawei hybrid flash storage, the stripe depth is specified when you run the
create storage_pool command on the CLI to create a storage pool. The stripe depth
cannot be changed after being specified. The default stripe depth is 128 KB.

Random small I/Os: indicate I/Os smaller than 16 KB. Random I/Os cannot be
aggregated in the cache and their original sizes are typically retained when they are
written to disks. Therefore, if the stripe depth (128 KB by default) is a multiple of
the I/O size, there is only a slight chance that a small I/O will be split across stripes.
In random small I/O scenarios, the stripe depth has little impact on I/O performance.
You are advised to retain the default stripe depth of 128 KB.
Sequential small I/Os: Multiple sequential small I/Os in the cache are aggregated into
a large I/O, ideally equal in size to the stripe depth, and then written into a disk. A
large stripe depth helps reduce the number of I/Os written into disks, improving the
data write efficiency. A stripe depth of 128 KB or larger is recommended.
Sequential and random large I/Os: indicate I/Os that are greater than 256 KB. If the
selected stripe depth is smaller than the I/O size, I/Os are split on the RAID layer,
affecting the data write efficiency. The maximum stripe depth, 512 KB, is
recommended.
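The three recommendations can be condensed into a small helper. This is a sketch of the guidance above only; the 256 KB boundary comes from the text, and real planning should also weigh the overall workload mix.

```python
def recommend_stripe_depth_kb(io_size_kb, sequential):
    """Map an I/O model to the recommended stripe depth (KB)."""
    if io_size_kb > 256:
        return 512      # large sequential/random I/Os: maximum stripe depth
    if sequential:
        return 128      # sequential small I/Os: 128 KB or larger
    return 128          # random small I/Os: keep the 128 KB default
```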
 Others
Huawei all-flash storage uses the ROW mechanism based on the characteristics of
SSDs. Compared with overwrite, full-stripe write avoids the disk read overhead
caused by write penalty and prevents parity data from being overwritten frequently.
After data in the logical space is overwritten, the corresponding data on the original
disk becomes garbage data. The garbage data is reclaimed and released in the
background. Garbage collection generates extra overheads. Therefore, after garbage
collection is started, service performance is affected to some extent.

2.2.5.7 LUN Performance Analysis


 LUN types
Some storage products provide the SmartThin function. After SmartThin is enabled,
the storage system creates thin LUNs and does not allocate the configured capacity
to LUNs at a time. Within the configured capacity of a LUN, the storage system
dynamically allocates the storage resources to the LUN based on the actual capacity
used by the host. Compared with thick LUNs, thin LUNs differ in read and write
performance.
Write performance: Thick LUNs are formatted when they are created, so a first write
involves only the host write I/O itself. For thin LUNs, space is allocated when new
data is written, leading to frequent metadata read and write I/Os; this lengthens the
write process and imposes extra pressure on the back-end disks. Therefore, for first
writes, thick LUNs outperform thin LUNs. When data is overwritten, no extra
overhead is generated because space has already been allocated on both thin and
thick LUNs. In this case, the performance of thin LUNs is equivalent to that
of thick LUNs.
Read performance: During sequential reads, storage resources are dynamically
allocated to thin LUNs based on user requirements. As space allocation is not
continuous in terms of time, the continuity of space mapped to disks is affected.
When creating a thick LUN, the system uses the automatic resource allocation
technology to allocate storage resources to the thick LUN at a time. This maximizes
the continuity of the space mapped to disks and improves the sequential read
efficiency of HDDs. Therefore, a thick LUN outperforms a thin LUN in sequential
reads. The address space for random reads is discrete. Therefore, the performance of
a thick LUN is equivalent to that of a thin LUN in random reads.
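The first-write overhead of thin LUNs comes from allocate-on-write. The toy model below illustrates the idea; the chunk size and in-memory structure are assumptions, while real systems track allocation in persistent metadata, which is exactly the extra work that slows first writes.

```python
class ThinLUN:
    """Toy model of thin provisioning: physical space is allocated per
    chunk on first write, unlike a thick LUN formatted at creation."""
    def __init__(self, chunk_kb=64):
        self.chunk_kb = chunk_kb
        self.allocated = set()          # chunk indices with backing space

    def write(self, offset_kb):
        chunk = offset_kb // self.chunk_kb
        first_write = chunk not in self.allocated
        if first_write:
            self.allocated.add(chunk)   # metadata update + space allocation
        return first_write              # True -> extra allocation overhead
```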
 Local access
LUN access by local controllers ensures storage system performance.
Local access to a LUN means that I/Os destined for a LUN are directly delivered to
the owning controller of that LUN. As shown in the figure in the training material, a
host is physically connected to controller A, the owning controller of LUN 1 is
controller A, and that of LUN 2 is controller B.
When the host accesses LUN 1, the access requests are delivered through controller
A. Such a LUN access mode is called local access.
When the host attempts to access LUN 2, the access requests are first delivered to
controller A. Then, controller A forwards them to controller B through the mirror
channel between controllers A and B. Finally, controller B delivers the access requests
to LUN 2. Such a LUN access mode is called peer access.
The peer access scenario involves the mirror channel between controllers. The
channel limitations affect LUN read/write performance. To prevent peer access,
ensure that there are physical connections between the host and controllers A and B.
If a host is physically connected to only one controller, configure the controller as the
owning controller of the LUN.
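Path selection between local and peer access can be sketched as follows. This is a simplification: real multipathing software weighs far more state than a single ownership check.

```python
def choose_path(host_links, owning_controller):
    """Prefer a physical link to the LUN's owning controller (local
    access); otherwise the I/O must be forwarded over the inter-controller
    mirror channel (peer access, the slower path)."""
    if owning_controller in host_links:
        return owning_controller, "local"
    return host_links[0], "peer"
```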

2.2.5.8 Performance Analysis on Back-End Ports and Disks


By analyzing the performance of back-end ports and disks, you can understand the
impact of back-end ports and disks on storage performance and identify potential
performance bottlenecks in the storage system.
 Performance analysis of back-end ports
Back-end ports are SAS ports that connect a controller enclosure to a disk enclosure
and provide disks with channels for reading and writing data. The impact of back-
end SAS ports on performance typically lies in disk enclosure loops.
A single port provides limited bandwidth. The bandwidth supported by the ports in a
loop must be higher than the total bandwidth of all disks in the disk enclosures that
compose the loop. In addition, as the number of disk enclosures in a loop grows, the
latency caused by expansion links increases, affecting the back-end I/O latency and
IOPS. Based on the preceding two points, if there are sufficient SAS ports:
1. It is recommended that disk enclosures be evenly distributed to multiple loops.
2. Ensure that disks inserted into each disk enclosure are of the same type and
quantity.
3. If a single controller has multiple back-end interface modules, it is advised to
distribute loops to multiple back-end interface modules.
4. It is not advised to connect a large number of disk enclosures in a single loop.
The proper number of disk enclosures to be connected depends on the product
model and port type.
You can use DeviceManager or run the show performance port command on the
CLI to view the performance indicators of back-end ports, including the back-end
port usage, total IOPS, and block bandwidth. Performance indicators of back-end
ports vary with versions. Performance indicators on the actual interface prevail.
 Disk performance analysis
Common storage media include SSDs, SAS disks, and NL-SAS disks. Each type of
storage media offers unique advantages and disadvantages in terms of performance
and cost. You must therefore select the disk type based on the service load and I/O
characteristics during storage planning.
In addition, the performance differences between tiers formed by different types of
disks are the basis of the tiered storage technology. Therefore, before configuring
storage services, you need to understand the performance characteristics and
differences of different types of disks.
1. SSD: SSDs do not have rotational latency of HDDs. For I/O models with obvious
access hotspots and sensitive response latency, especially for the random small
I/O read model commonly used in many database applications, SSDs have
obvious performance advantages over HDDs. However, for bandwidth-intensive
applications, SSDs deliver slightly higher performance than HDDs. In the tiered
storage technology, SSDs are used at the high-performance tier to bear high
IOPS pressure.
2. SAS disk: SAS disks store data on a series of high-speed rotating disks, providing
excellent performance, high capacity, and high reliability. There are two types of
revolutions per minute (RPMs): 10K RPM and 15K RPM. In the tiered storage
technology, SAS disks are used at the performance tier to provide excellent
performance, including stable latency, high IOPS, and large bandwidth. In
addition, the price of SAS disks is moderate.
3. NL-SAS disk: The rotation speed of NL-SAS disks is lower than that of SAS disks.
Generally, the rotation speed of NL-SAS disks is 7.2K RPM. NL-SAS disks can
provide the maximum capacity but low performance, and therefore are used at
the capacity tier in tiered storage. According to statistics, for most applications,
60% to 80% of the capacity bears light service load. Therefore, NL-SAS disks,
which can provide large capacity at a low price, are suitable for this part of
capacity. In addition, NL-SAS disks consume less power. Compared with SAS
disks, NL-SAS disks can reduce energy consumption per TB by 96%.
For a hybrid flash storage system, if a performance problem occurs and the front-end
storage is working properly, check whether the disk performance reaches the upper
limit. When the disk performance reaches the upper limit (the disk usage is close to
100%), storage performance is subject to back-end disk performance, and the IOPS
and bandwidth performance cannot be improved.
To ensure disk reliability and extend the disk service life, you are advised to set the
disk usage upper limit to 70%. If the usage of most disks is greater than 90%, you
are advised to add disks to the storage pool for capacity expansion or migrate
services to disks with better performance.
You can view the disk usage on DeviceManager or SystemReporter.
Note that the all-flash storage system uses only SSDs as storage media.
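The 70%/90% usage guidance above can be encoded as a quick triage helper. The thresholds come from the text; interpreting "most disks" as more than half is an assumption of this sketch.

```python
def disk_usage_advice(usages_pct):
    """Triage a set of disk usage percentages per the guidance above:
    expand or migrate if most disks exceed 90%, watch if any disk is
    over the recommended 70% ceiling, otherwise OK."""
    over_90 = sum(1 for u in usages_pct if u > 90)
    if over_90 > len(usages_pct) / 2:
        return "expand-or-migrate"
    if max(usages_pct) > 70:
        return "watch"
    return "ok"
```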
 Disk selection principles for disk domains
Not all storage systems need to select disks for disk domains.

For a storage system requiring disk domains, disk selection during disk domain
creation may affect the performance of the storage system. You can check whether
the types and capacities of disks in the same disk domain are the same on
DeviceManager.
To ensure performance, you can observe the following principles when selecting disks
for creating a disk domain:
1. Avoiding dual-port access in bandwidth-intensive scenarios: To ensure reliability,
each disk enclosure loop is connected to one SAS port on controller A and one
SAS port on controller B in the same engine. In this way, disks in the loop can be
accessed by both controllers at the same time. Dual-port access indicates that
both controllers A and B can deliver I/Os to disks in a disk domain at the same
time. Single-port access indicates that only one controller can deliver I/Os to
disks. For sequential services, dual-port access affects the sequence of I/Os
delivered to disks. In this case, dual-port access underperforms single-port access
in terms of bandwidth performance. Therefore, in scenarios with sequential I/Os
and high bandwidth requirements, such as the Media & Entertainment (M&E)
industry, you can plan single-port access. That is, set one owning controller for
LUNs in the storage pool corresponding to the disk domain. Dual-port access is
used in random I/O scenarios.
2. Avoiding intermixing of disks: Disks with different rotational speeds or capacities
have different I/O processing latency and bandwidth. Therefore, disks with poor
performance in a RAID group may become a bottleneck that restricts the
performance of the entire stripe group. In addition, problems such as fast and
slow disks and uneven disk usage may occur. If sufficient disks are available, you
are advised to select disks with the same rotational speed and capacity in the
same disk domain and avoid intermixing different types of disks.
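The no-intermixing rule is straightforward to check programmatically; a sketch follows, with disk attributes modeled as tuples (an assumption about the inventory format, not a real API).

```python
def disks_uniform(disks):
    """True if every disk selected for a disk domain shares the same
    type, rotational speed, and capacity, avoiding fast/slow-disk
    imbalance. Each disk is a (disk_type, rpm, capacity_gb) tuple."""
    return len(set(disks)) <= 1
```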

3 Storage Solution

3.1 Application Practices of the Storage Backup Solution


3.1.1 Overview
The Huawei backup solution is designed based on the RTO and RPO requirements of
service systems. It integrates a variety of DR protection technologies to meet the RTO
and RPO requirements of various services. The solution provides real-time backup and
replication functions to protect against data loss with an RTO and RPO of seconds, an
industry-leading copy data management technique to protect against data loss with an
RTO and RPO of minutes, and a conventional scheduled backup function to protect
against data loss with an RTO and RPO of hours.
By integrating a variety of DR protection technologies, the solution not only meets
various SLA requirements but also reduces the TCO.

3.1.2 Backup System


3.1.2.1 Backup System Components
1. Backup software
The backup software is the core of a backup system. It manages and executes backup
tasks, and manages and restores backup data.
2. Backup server
The backup server provides the operating system and hardware resources for the
backup software to run. The operating system running on the backup server must be
more secure than the production environment. For example, the Linux operating
system should be used and related security reinforcement measures should be taken.
3. Backup storage media
Backup data is stored on backup storage media. After backup data is generated, the
backup software writes the backup data to the backup storage media.
4. Backup network
The backup system network consists of the backup management network and the
backup transmission network. The backup management network is based on the
TCP/IP protocol and is only used to transmit backup system management data. The
backup transmission network supports both the TCP/IP-based network and the FC-
based network, depending on the backup transmission mode.

The backup system network can be deployed independently of the production service
network to prevent backup traffic from affecting services. Alternatively, the service
network can function as the backup system network.

3.1.3 Technologies

3.1.3.1 Standard Backup


The standard backup process consists of three steps:
1. The backup client (agent client) reads the data to be protected. For different
applications, the agent client can either be deployed on the production server
(agent-based backup) or be the built-in agent client of the Huawei Data
Protection Appliance (DPA) (agent-free backup).
2. The data is transferred from the production system to the DPA over the network (TCP).
3. The DPA receives the data and saves it to the backup storage.
For different backup modes, such as full backup, incremental backup, permanent
incremental backup, and differential backup, data is read in different ways. Accordingly,
either all of the data is transmitted, or only unique data is transmitted after deduplication.
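The difference between these read strategies can be sketched in a few lines of Python. This is an illustrative model only, not DPA code; the function and parameter names are hypothetical:

```python
from datetime import datetime

def select_files(mode, files, last_full, last_backup):
    """Pick which files a backup job reads.

    files       -- dict mapping path -> last-modified datetime
    last_full   -- time of the most recent full backup
    last_backup -- time of the most recent backup of any kind
    """
    if mode == "full":
        # Full backup reads everything, every time.
        return set(files)
    if mode == "differential":
        # Differential reads everything changed since the last FULL backup.
        return {p for p, mtime in files.items() if mtime > last_full}
    if mode == "incremental":
        # Incremental reads only what changed since the last backup of any kind.
        return {p for p, mtime in files.items() if mtime > last_backup}
    raise ValueError(mode)

# Example: three files, last full on Jan 5, last incremental on Jan 15.
files = {"a": datetime(2024, 1, 1),
         "b": datetime(2024, 1, 10),
         "c": datetime(2024, 1, 20)}
last_full = datetime(2024, 1, 5)
last_backup = datetime(2024, 1, 15)
```

A permanent incremental scheme simply repeats the "incremental" branch forever after one initial full backup, synthesizing full copies on the backup side.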

3.1.3.2 Advanced Backup


The advanced backup consists of three parts:
1. Production data capture: Data is captured in the native format. Format conversion
is not required. Data is accessible upon being mounted. SLA policies can be
customized based on applications. Retention duration, RPO, RTO, and data storage
locations are intuitively displayed.
2. Copy management
i. Permanent incremental backup: An initial full backup and N incremental
backups are performed. A full copy is available at each point in time at which an
incremental backup is performed. Damage to the copy at one incremental backup
point in time will not impede recovery from any other point in time.
ii. No rollback: Point-in-time copies created through virtual clone can point to
both the source data and the current incremental data and can be directly used
for recovery.
3. Copy access and usage: Backup data copies can be mounted in minutes without
moving data. Recovery speeds are not affected by the amount of data. The same
virtual copy can be mounted to multiple hosts simultaneously. Data can be
recovered from any point in time. A host automatically takes over the original
production applications after the virtual copy is mounted.

3.1.3.3 Continuous backup


Huawei Data Protection Appliance provides the continuous backup feature to
continuously capture I/O changes of production volumes based on volume-level I/O logs.
When the backup network has no bottleneck, the RPO can be close to 0. In addition, the
virtual snapshot technology is used to recover data to any point in time.

The client driver captures I/O changes of the production volume in real time and sends
the changed data to the log volume of the Data Protection Appliance for temporary
storage. In addition, the virtual snapshot technology is used to periodically generate
point-in-time copies. With the virtual snapshot data and I/O log data in the log volume,
data can be recovered to any point in time. According to the snapshot retention period in
the backup policy, snapshots and I/O logs in log volumes are periodically deleted to
release the backup capacity.
If the data of the production system is damaged, you need to select the historical time
with data to be recovered. Then, the latest virtual snapshot is found based on the
selected time point for recovery. The I/O log data between the snapshot time point and
the time point for recovery in the log volume is found, and the virtual snapshot data and
the I/O log data in the log volume are combined to generate a new virtual volume. In
this case, data can be recovered to any point in time at the I/O level.
Continuous backup has the following characteristics:
● The RPO is close to zero. The volume-based I/O log technology captures I/O changes
of production volumes in real time and backs up the I/O changes. In this way, data loss
can be approximately zero during recovery.
● Recovery to any point in time. The virtual snapshot and log technologies can be used
together to recover data to any point in time.
● Deduplication support. Deduplication at the target end reduces backup storage
capacity requirements.
● Compression support. Compression at the storage end reduces backup storage
capacity usage.
● Supports different recovery modes, such as recovery to the original location on the
original host, recovery to a different location on the original host, and recovery to a
different host, to meet requirements in different recovery scenarios.

3.1.3.4 Backup Solution Network


3.1.3.4.1 LAN-Base
LAN-Base backup consumes network resources because both data and control flows are
transmitted over the LAN. Consequently, when a large amount of data must be backed
up within a short period, network congestion is likely to occur.
Direction of backup data flows: The backup server sends a control flow to an application
server where an agent is installed over a LAN. The application server responds to the
request and sends the data to the backup server. The backup server receives and stores
the data to a storage device. The backup operation is complete.
Strengths:
- The backup system and the application system are independent of each other,
conserving hardware resources of application servers during backup.
Weaknesses:
- Additional backup servers increase hardware costs.

- Backup agents adversely affect the performance of application servers.


- Backup data is transmitted over a LAN, which adversely affects network performance.
- Backup services must be separately maintained, complicating management and
maintenance operations.
- Users must be highly proficient at processing backup services.
3.1.3.4.2 LAN-Free
Control flows are transmitted over a LAN, but data flows are not. LAN-Free backup
transmits data over a SAN instead of a LAN. The server that needs to be backed up is
connected to backup media over a SAN. When triggered by a LAN-Free backup client, the
media server reads the data that needs to be backed up and backs up the same data to
the shared backup media.
Direction of backup data flows: The backup server sends a control flow over a LAN to an
application server where an agent is installed. The application server responds to the
request and reads the production data. Then, the media server reads the data from the
application server and transmits the data to the backup media. The backup operation is
complete.
Strengths:
- Backup data is transmitted without using LAN resources, significantly improving backup
performance while maintaining high network performance.
Weaknesses:
- Backup agents adversely affect the performance of application servers.
- LAN-Free backup requires a high budget.
- Devices must meet certain requirements.

3.1.4 Solution Planning


3.1.4.1 Information Collection
Information collection involves project survey, customer requirement analysis, device
information collection, live network information collection, and application environment
information collection. The information collection process also applies to other
applications, such as VMs and databases.

3.1.4.2 Backend Capacity Planning Principles


1. For permanent incremental backup, the number of full backup copies is 1.
2. The recommended redundancy ratio is 20%.
3. The deduplication ratio varies with data types. The deduplication ratio is about 60%
for file systems, databases, and hybrid data, about 70% for virtualization
applications, and about 50% for mixed applications.
4. For advanced backup and continuous backup, reserve at least 30% of the production
volume capacity for copy mounting tests.
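As an illustration, the planning principles above can be combined into a rough sizing formula. The sketch below is a simplified estimate only — the function name, the interpretation of the deduplication ratio as the fraction of data removed, and the permanent incremental retention model are assumptions, not a Huawei sizing tool:

```python
def backup_capacity(full_size_tb, daily_change_tb, retention_days,
                    dedup_ratio, redundancy=0.20):
    """Rough backend capacity estimate for permanent incremental backup.

    full_size_tb     -- size of the single full copy (permanent incremental
                        keeps exactly one full backup copy)
    daily_change_tb  -- changed data captured by each incremental backup
    retention_days   -- number of incremental copies retained
    dedup_ratio      -- fraction of data removed by deduplication
                        (e.g. 0.6 for the ~60% quoted for databases)
    redundancy       -- reserved headroom; 20% is the recommended ratio
    """
    logical = full_size_tb + daily_change_tb * retention_days
    physical = logical * (1 - dedup_ratio)   # capacity after deduplication
    return physical * (1 + redundancy)       # plus reserved capacity
```

For example, a 10 TB database with 0.5 TB of daily change, 30 days of retention, and a 60% deduplication ratio would need roughly `backup_capacity(10, 0.5, 30, 0.6)` = 12 TB of backend capacity under these assumptions.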

3.2 Application Practices of the Active-Standby Storage DR


Solution
3.2.1 Overview
3.2.1.1 Positioning
As IT services become increasingly important, the continuous and secure running of IT
systems has become critical for normal enterprise operation. However, IT systems face
many threats, such as:
Device faults: IT systems cannot function properly. Replacing devices may cause service
disruption and result in a long recovery period.
Regional outages: The entire IT system fails, interrupting all services.
Data center fire: All IT systems are damaged. Data loss and service interruption will
deliver a fatal blow to enterprises.
Natural disasters: Floods, earthquakes, and mudslides may cause devastating damage to
IT systems, interrupting all services and causing all data to be lost.
To enhance IT system availability and address the preceding threats, Huawei launched
the active/standby DR solution. This solution enhances IT system availability by
establishing a DR center; if a disaster occurs at the production center, the DR center
quickly takes over production services to minimize service downtime and reduce
customer loss.
In addition to basic data DR functions, the active/standby DR solution can implement DR
for file systems, databases, and VM applications based on remote array replication and
DR management software.

3.2.1.2 Solution Highlights and Customer Benefits


1. Storage interconnection and communication
Huawei storage devices in the production and DR centers interwork to support DR
services at different performance levels.
2. Data replication
The array DR replication technology enables DR data to be copied at the storage
layer, reducing the impact of DR data replication on the production host
performance.
3. Various data replication modes
Supports synchronous and asynchronous replication of DR data. The optimal DR
replication mode is determined based on the existing network resources of users.
4. Quick service recovery
One-click DR and automation scripts replace complex manual operations, greatly
shortening service recovery time, improving DR efficiency, and reducing RTO.
5. Visualization
A visual display of the DR topology is provided to monitor the DR system in real time,
facilitating maintenance and enhancing system usability.

3.2.2 Technologies
3.2.2.1 HyperReplication
3.2.2.1.1 Introduction
HyperReplication is the remote replication feature developed by Huawei. The feature
provides flexible and powerful data replication functions to achieve remote data backup
and recovery, continuous support for service data, and disaster recovery. This feature
requires at least two OceanStor storage systems that can be placed in the same
equipment room, same city, or two cities up to 1000 km apart. The storage system that
provides data access for production services is the primary storage system, and the
storage system that stores backup data is the secondary storage system.
HyperReplication supports the following replication modes:
● HyperReplication/S for LUN: In this mode, data is synchronized between two storage
systems in real time to achieve full protection for data consistency, minimizing data loss
in the event of a disaster. However, production service performance is affected by the
data transfer latency.
● HyperReplication/A for LUN: In this mode, data is synchronized between two storage
systems periodically to minimize service performance deterioration caused by the latency
of long-distance data transmission. Production service performance is not affected by the
data transfer latency. However, some data may be lost if a disaster occurs.
● HyperReplication/A for file system: In this mode, data is synchronized between two
file systems periodically to minimize service performance deterioration caused by the
latency of long-distance data transmission. Production service performance is not
affected by the data transfer latency. However, some data may be lost if a disaster
occurs.

HyperReplication provides the storage array-based consistency group function for
synchronous or asynchronous remote replication between LUNs to ensure the consistency
of cross-LUN applications in disaster recovery replication. A consistency group is a
collection of pairs that have a service relationship with each other. For example, the
primary storage system may have three primary LUNs that respectively store the service
data, logs, and change tracking information of a database. If data on any of the three
LUNs becomes invalid, all data on the three LUNs becomes unusable. You can create a
consistency group for the pairs in which these LUNs reside. Upon actual configuration,
you create a consistency group and then manually add pairs to it. The consistency group
function preserves the dependency of host write I/Os across multiple LUNs, ensuring data
consistency on secondary LUNs.

In addition, HyperReplication allows data to be replicated through both Fibre Channel
and IP networks. Data can be transferred between the primary and secondary storage
systems through Fibre Channel or IP links.
3.2.2.1.2 Application scenarios

HyperReplication is mainly used for data backup and disaster recovery. Different remote
replication modes apply to different application scenarios:

● HyperReplication/S applies to backup and disaster recovery scenarios where the
primary site is near the secondary site, for example, in the same city (same data center
or campus).
● HyperReplication/A applies to backup and disaster recovery scenarios where the
primary site is far from the secondary site (for example, across countries or regions) or
the network bandwidth is limited.
In a specific application scenario, the replication mode must be determined based on the
distance and available bandwidth between sites. Typical application scenarios include
central backup for disaster recovery as well as mutual backup for disaster recovery.

3.2.2.1.3 HyperReplication/S for LUN

HyperReplication/S for LUN supports data disaster recovery of LUNs over short distances.
It applies to same-city disaster recovery that requires zero data loss. It concurrently writes
each host write I/O to both the primary and secondary LUNs and returns a write success
acknowledgement to the host after the data is successfully written to the primary and
secondary LUNs. Therefore, the RPO is zero.

The following describes the working principles of HyperReplication/S for LUN:


1. Initial synchronization: After a remote replication relationship is established between
the primary and secondary LUNs, initial synchronization is initiated.
a. All data on the primary LUN is replicated to the secondary LUN.
b. If the primary LUN receives a write request from the host during initial
synchronization, data is written to both the primary and secondary LUNs.

2. Dual-write state: After initial synchronization, data on the primary LUN is the same as
that on the secondary LUN. The normal I/O processing process is as follows:
a. The production storage system receives a write request from the host.
HyperReplication logs the address information instead of data content.
b. The data of the write request is written to both the primary and secondary LUNs. If a
LUN is in the write-back state, data will be written to the cache.
c. HyperReplication waits for the data write results from the primary and secondary
LUNs. If the data has been successfully written to the primary and secondary LUNs,
HyperReplication deletes the log. Otherwise, HyperReplication retains the log and enters
the interrupted state. The data will be replicated in the next synchronization.
d. HyperReplication returns the data write result. The data write result of the primary
LUN prevails.

3. Single-write state: If a user runs a specific command to split the pair, the replication
link is disconnected, or the data of a write request fails to be written to both the primary
and secondary LUNs, the remote replication pair enters the single-write state.

a. A write success acknowledgement is returned to the host immediately after the data
is written to the cache of the primary storage system.
b. After data in the primary cache is written to the primary LUN, data differences
between the primary LUN and the secondary LUN are recorded in the data change log
(DCL).

In medium- and large-sized database applications, data, logs, and modification


information are stored on different LUNs. If data on one of these LUNs is unavailable,
data on the other LUNs is also invalid. How to keep consistency between multiple remote
replication pairs must be considered if remote disaster recovery must be implemented for
these LUNs simultaneously. HyperReplication/S provides the consistency group function
to maintain the same synchronization pace among multiple remote replication pairs.
A consistency group is a collection of pairs that have a service relationship with each
other, ensuring data consistency in a scenario where a host writes data to multiple LUNs
on a single storage system. After data is written to a consistency group at the primary
site, all data in the consistency group is simultaneously copied to the secondary LUN
using the synchronization function of the consistency group, ensuring integrity and
availability of the data used for backup and disaster recovery. HyperReplication/S allows
users to add multiple remote replication pairs to a consistency group.
When users perform splitting, synchronization, or a primary/ secondary switchover or set
secondary LUNs to writable for a consistency group, the operation applies to all members
in the consistency group. When a link fault occurs, all members of the consistency group
enter the interrupted state together.
After the fault is rectified, data synchronization is performed again to ensure availability
of the data on the secondary storage system.
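The all-or-nothing behavior of a consistency group can be sketched as follows. This is a toy Python model; pair states in real storage systems are much richer:

```python
class ConsistencyGroup:
    """Toy consistency group: every state-changing operation is applied to
    all replication pairs in the group, never to a single member."""

    def __init__(self, pairs):
        self.pairs = list(pairs)     # e.g. data, log, change-tracking LUN pairs

    def _set_all(self, state):
        for pair in self.pairs:
            pair["state"] = state

    def split(self):
        self._set_all("split")

    def synchronize(self):
        self._set_all("synchronizing")

    def on_link_fault(self):
        # All members enter the interrupted state together, so the
        # secondary LUNs remain mutually consistent.
        self._set_all("interrupted")
```

Because the database's data, log, and change-tracking LUNs always change state together, the secondary copies always represent a single point in time across all three.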

3.2.2.1.4 HyperReplication/A for LUN

HyperReplication/A for LUN supports data disaster recovery of LUNs over long distances.
It applies to scenarios where a remote disaster recovery center is used and the impact on
production service performance must be reduced.
HyperReplication/A for LUN employs the multi-point-in-time caching technology to
periodically synchronize data between primary and secondary LUNs. All data changes to
the primary LUN since last synchronization will be synchronized to the secondary LUN.
Similar to HyperReplication/S, HyperReplication/A supports the consistency group
function. Users can create or delete a consistency group and add members to or delete
members from the group.
HyperReplication/A adopts the multi-point-in-time caching technology. The working
principles are as follows:
1. After an asynchronous remote replication relationship is set up between a primary LUN
at the primary site and a secondary LUN at the secondary site, an initial synchronization
is implemented to fully copy data from the primary LUN to the secondary LUN.

2. After the initial synchronization is complete, the secondary LUN data status becomes
Consistent (data on the secondary LUN is a copy of data on the primary LUN at a
specified past point in time). Then, the I/O process shown in the following figure starts.
1. When a replication period starts, snapshots are generated for primary and secondary
LUNs, and the points in time of these two LUNs are updated. The snapshot of the
primary LUN is X and that of the secondary LUN is X–1.
2. The data in a write request from the host is written to time segment X + 1 in the cache
of the primary LUN.
3. A write success response is returned to the host.
4. Differential data generated at point in time X is directly replicated to the secondary
LUN based on the DCL.
5. Both primary and secondary LUNs flush received data onto disks. The latest data on
the secondary LUN is the data at point in time X of the primary LUN.
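A simplified model of one replication period follows. This is illustrative Python, not the storage system's implementation; snapshots are modeled as plain dictionary copies:

```python
class AsyncReplicationPair:
    """Toy model of asynchronous replication using snapshots and a
    data change log (DCL), following the five steps above."""

    def __init__(self, primary):
        self.primary = dict(primary)
        self.secondary = dict(primary)   # after initial full synchronization
        self.dcl = set()                 # addresses changed since last period

    def host_write(self, addr, data):
        # Steps 2-3: the write lands in the primary cache and is acknowledged
        # immediately; the change is tracked in the DCL for the next period.
        self.primary[addr] = data
        self.dcl.add(addr)

    def run_period(self):
        # Step 1: take a snapshot, fixing the point in time to replicate.
        snapshot, changes = dict(self.primary), set(self.dcl)
        self.dcl.clear()                 # new writes belong to the next period
        # Step 4: replicate only the differential data recorded in the DCL.
        for addr in changes:
            self.secondary[addr] = snapshot[addr]
```

Note that host writes return as soon as they reach the primary cache, which is why production latency is unaffected, and why up to one period of data can be lost in a disaster.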

3.2.2.1.5 Key technologies


[Link Compression]
Link compression is an inline compression technology. In a remote replication task, data
is compressed on the primary storage system before transfer. Then the data is
decompressed on the secondary storage system, reducing bandwidth consumption in
data transfer. Link compression can:
● Efficiently transmit a large amount of data over a network with a low network
bandwidth.
● Remarkably reduce network bandwidth consumption upon data transmission,
improving network bandwidth utilization.
● Use fewer network bandwidth resources to complete remote replication, reducing
network bandwidth lease costs.
After link compression is enabled, data is compressed before transfer and decompressed
after transfer, increasing the data processing latency. Therefore, link compression is not
applicable to services that require a low latency, such as synchronous remote replication
services.
OceanStor Dorado storage systems support link compression in asynchronous remote
replication, effectively reducing network bandwidth consumption and improving transfer
performance.
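The compress-before-transfer round trip can be illustrated with a generic compressor such as zlib. This is an illustrative sketch; the actual link-compression algorithm used by the storage systems is not specified here:

```python
import zlib

def replicate_with_link_compression(chunk: bytes) -> bytes:
    """Compress on the primary, 'send' over the link, decompress on the
    secondary. Returns the on-the-wire payload."""
    wire = zlib.compress(chunk)          # inline compression before transfer
    restored = zlib.decompress(wire)     # decompression on the secondary
    assert restored == chunk             # the round trip must be lossless
    return wire

# Repetitive data (e.g. database redo logs) compresses well, so far fewer
# bytes cross the replication link than were written by the host.
data = b"redo log redo log redo log " * 100
wire = replicate_with_link_compression(data)
saving = 1 - len(wire) / len(data)
print(f"sent {len(wire)} of {len(data)} bytes ({saving:.0%} bandwidth saved)")
```

The extra compress/decompress step is also where the added latency comes from, which is why this fits asynchronous rather than synchronous replication.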

[High Fan-in & High Fan-out]


HyperReplication in OceanStor Dorado V6 series storage systems supports data
replication from 64 storage systems to one storage system for centralized disaster
recovery (64:1). This achieves disaster recovery resource sharing and greatly reduces the
cost in deploying disaster recovery devices.

3.2.2.2 BCManager
3.2.2.2.1 Introduction
eReplication is designed to manage DR services of data centers for enterprises. With its
excellent application-aware capabilities and Huawei storage products' value-added
features, eReplication ensures service consistency during DR, simplifies DR service
configuration, supports monitoring of the DR service status, and facilitates data recovery
and DR tests.

[Highlights]
• Simple and efficient
eReplication adopts an application-based management method and guides you through
the DR service configuration step by step. It supports one-click DR tests, planned
migration, fault recovery, and reprotection.
• Visualized
By graphically displaying physical topologies of global DR and logical topologies of
service protection, eReplication enables you to easily manage the entire DR process. The
status of protected groups and recovery plans is clear.
• Integrated
eReplication integrates storage resource management and can be used in various
application scenarios, such as active-passive DR, geo-redundant DR, HyperMetro DC, and
local protection, reducing the O&M cost and improving the O&M efficiency.
• Reliable
eReplication ensures that the applications and data at the DR site are consistent with
those at the production site. If the production site fails, the DR site immediately takes
over services from the production site, ensuring business continuity. The multi-site deployment
improves the reliability of the DR service management system. eReplication backs up
management data automatically so that the management system can be recovered
rapidly if a disaster occurs.

3.2.2.2.2 Software Architecture


eReplication is designed to protect host applications and VMs and to help recover data
when disasters occur. It is compatible with Windows, Linux, AIX, and Solaris. On these
platforms, eReplication works with value-added functions of Huawei storage arrays to
ensure data consistency, protect applications, and implement DR for applications.
Virtualization DR protects VMs in virtual environments, such as FusionSphere (in array-
based replication mode) and VMware vSphere (in host-based mode). Adopting the
browser/server (B/S) architecture, eReplication allows you to use a browser to manage DR.

eReplication contains two parts:


• eReplication Agent
Installed on a service host, the eReplication Agent is used to discover host applications,
protect application data consistency, and recover applications. It also transfers

information about application systems to the eReplication Server to help system


administrators manage and configure data protection and DR tasks at ease.
• eReplication Server
Installed on a DR management server, the eReplication Server provides DR service
process control, DR policy scheduling, and configuration data management of DR
resources and services. The eReplication Server contains the UI sub-system that provides
users a GUI to perform DR management in a visible manner.
If the protected objects are FusionSphere VMs, VMware VMs, or NAS file systems, you do
not need to install the eReplication Agent.

3.2.3 Planning
3.2.3.1 Information Collection
When collecting information, survey the project background, extract customer
requirements, and collect information about the devices, live network, and application
environment. If Fibre Channel switches are involved, collect their manufacturers, models,
versions, rates, and remaining ports, and determine whether the switches will be reused.

3.2.3.2 DR Solution Selection and Compatibility Confirmation


Focus on the customer's RTO and RPO requirements in the SLA. If the customer does not
require an RTO close to 0, the active/standby solution can be used.
Confirm the operating systems, databases, applications, and storage device types and
versions.

3.2.3.3 Bandwidth and Capacity Calculation


3.2.3.3.1 Capacity Calculation
1. Determine the data that needs to be backed up. To ensure that applications are
started at the DR site, different data must be protected for different applications.
2. Data that needs to be protected in the Oracle database.
a. To ensure that the database at the DR site can be started, the mandatory
replication data is Data, Control, and Log files. These three types of files must be
placed in the same consistency group.
b. FRA files can be replicated as required. It is not mandatory that FRA files and the
preceding three files be in the same replication consistency group.
c. If an Oracle RAC cluster is deployed on the production site, an RAC cluster or a
single Oracle node can be deployed on the DR site. If only one node is deployed,
you do not need to replicate the Grid file from the production site to the DR site.
3. Data that needs to be protected in the SQL Server database.
To ensure that the database at the DR site can be started, the mandatory replication
data is Data and Log files. These two types of files must be placed in the same
consistency group.

3.2.3.3.2 Bandwidth Calculation


1. The customer's network design with RPO = 0 must be implemented using the array-
based synchronous replication technology. The requirements for the network are as
follows:
a. You are advised to use optical fibers to connect the active and standby sites and
use the dual-link redundancy network.
b. It is recommended that the optical fiber distance between the active and standby
sites be less than or equal to 200 km and the latency between the active and
standby sites be less than 1 ms.
2. You are advised to implement the customer's network design with RPO > 0 using the
array-based asynchronous replication technology. The requirements for the network
are as follows:
a. It is recommended that the active and standby sites be connected through an IP
network. The distance between the active and standby sites must be less than
3000 km.
b. If the required RPO is less than 1 minute, the OceanStor V3 series must be
deployed.
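A rough way to turn these requirements into a link-bandwidth figure is to require that the changed data fits within the replication window. The formula below is illustrative only, with an assumed link-utilization factor; it is not an official sizing rule:

```python
def replication_bandwidth_mbps(daily_change_gb, replication_window_h,
                               link_utilization=0.7):
    """Rough asynchronous-replication bandwidth estimate: the day's changed
    data must cross the DR link within the replication window, at a
    realistic (assumed) sustained link utilization."""
    bits = daily_change_gb * 1024 ** 3 * 8           # changed data in bits
    seconds = replication_window_h * 3600            # window in seconds
    return bits / seconds / link_utilization / 1e6   # required Mbit/s
```

For example, replicating 100 GB of daily change within an 8-hour window at 70% utilization works out to roughly 43 Mbit/s of link bandwidth. Synchronous replication instead has to be sized for the peak write rate, since every host write crosses the link before being acknowledged.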
3.2.3.3.3 Network Design
1. Network design of the DR network:
a. It is recommended that dual redundancy network be implemented for Fibre
Channel DR links used by synchronous remote replication, thereby preventing a
single point of failure on the link side and ensuring zero RPO.
b. Asynchronous replication can use only one IP DR link.
2. Network design of the management network:
a. The active and standby sites are connected to the management network,
ensuring that the management server at the DR site can access the production
host and implement one-click automated DR management.
b. A management server, either a physical or virtual machine, must be deployed. It
is recommended that the management server be deployed at the DR site, so that
the server is still available when the production site is faulty.
3. Network design of the production site where both the active and standby sites use
Huawei storage.
You are advised to use the dual-Fibre Channel switch network to prevent the impact
of a single switch fault on services.
4. Network design of the production site that needs to be compatible with storage
devices from other vendors:
a. It is recommended that OceanStor V3 be deployed and heterogeneous
virtualization and volume mirroring functions be used, thereby implementing
local HA to prevent the failure of a third-party storage system from affecting
production services.
b. You are advised to use the dual-Fibre Channel switch network to prevent the
impact of a single switch fault on production services.

3.3 Application Practices of the HyperMetro Storage DR


Solution
3.3.1 Overview
3.3.1.1 Application Scenarios
3.3.1.1.1 Applicable Requirements
1. The disaster recovery (DR) link distance between the production site and the DR site
is less than or equal to 100 km.
2. RPO = 0, RTO ≈ 0.
3.3.1.1.2 Typical Scenarios
1. Local data center (DC) deployment: Storage systems are deployed in different
equipment rooms in the same data center. Hosts are deployed in a cluster. Hosts and
storage systems are connected through Fibre Channel or IP switches. Mirror-based
dual-write channels are deployed to ensure the normal running of active-active
services.
2. Cross-DC deployment: Generally, storage systems are deployed in two data centers in
the same city or two adjacent cities. The physical distance between the two data
centers is no greater than 300 km. Both data centers are running and can carry the
same services, improving the overall service capability and resource utilization. If one
data center is faulty, services are automatically switched to the other one.
Note: In the cross-DC deployment involving long-distance transmission (a minimum of 80
km for IP networking and 25 km for Fibre Channel networking), wavelength division
multiplexing (WDM) devices must be used to ensure a short transmission latency. Mirror-
based dual-write channels are deployed to ensure the normal running of active-active
services.

3.3.1.2 Highlights
1. Dual write ensures storage redundancy.
2. In the event of a fault in a storage system or production center, services can be
quickly switched to the DR center, ensuring zero data loss and service continuity.
3. This solution meets the requirements of RTO = 0 and RPO = 0.
4. Two data centers carry services at the same time, fully utilizing DR resources.
5. Services run 24/7.
6. If one storage system in the HyperMetro deployment performs poorly, the
performance of the HyperMetro system will be negatively affected.
Note: For some models, if HyperMetro is applied to typical cluster scenarios, RTO = 0; if
HyperMetro is applied to virtualization clusters, when a host in a cluster experiences a
fault, the VM automatically restarts and recovers on another host at an RTO ≈ 0.

3.3.2 Technologies
3.3.2.1 Basic Concepts
● Protected Object
For customers, the protected objects are LUNs or protection groups. That is, HyperMetro
is configured for LUNs or protection groups for data backup and disaster recovery.
1) Data protection can be implemented for each individual LUN.
2) Data protection can be implemented for a protection group, which consists of multiple
independent LUNs or a LUN group.
● Protection Group (PG) and LUN Group
A LUN group can be directly mapped to a host for the host to use storage resources. You
can group LUNs for different hosts or applications.
A protection group (PG) applies to data protection with consistency groups. You can plan
data protection policies for different applications and components in the applications. In
addition, you can enable unified protection for LUNs used by multiple applications in the
same protection scenario. For example, you can group the LUNs to form a LUN group,
map the LUN group to a host or host group, and create a protection group for the LUN
group to implement unified data protection of the LUNs used by multiple applications in
the same protection scenario.
 HyperMetro Domain
A HyperMetro domain allows application servers to access data across DCs. It consists of
a quorum server and the local and remote storage systems.
 HyperMetro Pair
A HyperMetro pair is created between a local and a remote LUN within a HyperMetro
domain. The two LUNs in a HyperMetro pair have an active-active relationship. You can
examine the state of the HyperMetro pair to determine whether operations such as
synchronization, suspension, or priority switchover are required by its LUNs and whether
such an operation is performed successfully.
 HyperMetro Consistency Group (CG)
A HyperMetro consistency group (CG) is created based on a protection group. It is a
collection of HyperMetro pairs that have a service relationship with each other. For
example, the service data, logs, and change tracking information of a medium- or large-
size database are stored on different LUNs of a storage system. Placing these LUNs in a
protection group and then creating a HyperMetro consistency group for that protection
group can preserve the integrity of their data and guarantee write-order fidelity.
 Creating a HyperMetro pair for an individual LUN
 Creating a HyperMetro consistency group for a protection group (formed by multiple independent LUNs)
 Creating a HyperMetro consistency group for a protection group (formed by a LUN group)

 Dual-Write
Dual-write enables the synchronization of application I/O requests with both local and
remote LUNs.
 DCL
Data change logs (DCLs) record changes to the data in the storage systems.
 Synchronization
HyperMetro synchronizes differential data between the local and remote LUNs in a
HyperMetro pair. You can also synchronize data among multiple HyperMetro pairs in a
consistency group.
 Pause
Pause is a state indicating the suspension of a HyperMetro pair.
 Force Start
To ensure data consistency in the event that multiple elements in the HyperMetro
deployment malfunction simultaneously, HyperMetro stops hosts from accessing both
storage systems. You can forcibly start the local or remote storage system (depending on
which one is normal) to restore services quickly.
 Preferred Site Switchover
Preferred site switchover indicates that during arbitration, precedence is given to the
storage system which has been set as the preferred site (by default, this is the local
storage system). If the HyperMetro replication network is down, the storage system that
wins arbitration continues providing services to hosts.
 FastWrite
FastWrite uses the First Burst Enabled function of the SCSI protocol to optimize data
transmission between storage devices, reducing the number of interactions in a data
write process by half.
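The effect of halving the interaction count can be quantified. The sketch below models the link latency of one remote write, assuming a standard SCSI write over distance needs two round trips (Write Command → Transfer Ready, then Data → Status) while First Burst sends the data together with the command; these interaction counts are the commonly described protocol behavior, not product-specific figures.

```python
def write_link_latency_ms(rtt_ms, fastwrite=False):
    """Link latency contributed by one remote write.

    Standard SCSI write: two round trips. With First Burst enabled
    (FastWrite), data is sent with the command, halving the count.
    """
    round_trips = 1 if fastwrite else 2
    return round_trips * rtt_ms

print(write_link_latency_ms(0.8))                  # 1.6
print(write_link_latency_ms(0.8, fastwrite=True))  # 0.8
```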

3.3.2.2 HyperMetro I/O Processing Mechanism


3.3.2.2.1 Write I/O Process
Dual-write and locking mechanisms are essential for data consistency between storage
systems.
1) Dual-write and DCL technologies synchronize data changes while services are running.
Dual-write enables hosts' I/O requests to be delivered to both local and remote caches,
ensuring data consistency between the caches. If the storage system in one DC
malfunctions, the DCL records data changes. After the storage system recovers, the
data changes are synchronized to the storage system, ensuring data consistency across
DCs.
2) Two HyperMetro storage systems can process hosts' I/O requests concurrently. To
prevent conflicts when different hosts access the same data on a storage system
simultaneously, a locking mechanism is used to allow only one storage system to write
data. The storage system denied by the locking mechanism must wait until the lock is
released and then obtain the write permission.
The following figure shows an example of the write I/O process in which a host delivers
an I/O request to the local storage system and dual-write is used to write the data to the
remote storage system.

1) A host delivers a write I/O to the HyperMetro I/O processing module.


2) The write I/O applies for write permission from the optimistic lock on the local storage
system. After write permission is obtained, the system records the address information
in the log but does not record the data content.
3) The HyperMetro I/O processing module writes the data to the caches of both the local
and remote LUNs concurrently. When data is written to the remote storage system, the
write I/O applies for write permission from the optimistic lock before the data can be
written to the cache.
4) The local and remote caches return the write result to the HyperMetro I/O processing
module.
5) The system determines whether dual-write is successful:
a) If writing to both caches is successful, the log is deleted.
b) If writing to either cache fails, the system:
i. Converts the log into a DCL that records the differential data between the
local and remote LUNs. After conversion, the original log is deleted.
ii. Suspends the HyperMetro pair. The status of the HyperMetro pair becomes
To be synchronized. I/Os are only written to the storage system on which
writing to its cache succeeded. The storage system on which writing to its
cache failed stops providing services for the host.
6) The HyperMetro I/O processing module returns the write result to the host.
NOTE: In the background, the storage systems use the DCL to synchronize data between
them. Once the data on the local and remote LUNs is identical, HyperMetro services are
restored.
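The write flow above can be sketched as a simplified model. The functions and data structures below are illustrative assumptions, not product code: each cache is modeled as a callable that returns True on success, and the pair holds its status, the in-flight log, and the DCL.

```python
def hypermetro_write(io, local_cache, remote_cache, pair):
    """Simplified model of the dual-write flow described above."""
    pair["log"].append(io["address"])       # step 2: record address only, not data
    ok_local = local_cache(io)              # step 3: write both caches
    ok_remote = remote_cache(io)
    if ok_local and ok_remote:              # step 5a: dual-write succeeded
        pair["log"].remove(io["address"])   # drop the log entry
        return "success"
    # step 5b: convert the log entry into a DCL record and suspend the pair
    pair["dcl"].append(io["address"])
    pair["log"].remove(io["address"])
    pair["status"] = "to_be_synchronized"
    return "success_on_one_side" if (ok_local or ok_remote) else "failure"

pair = {"status": "normal", "log": [], "dcl": []}
r = hypermetro_write({"address": 0x1000, "data": b"x"},
                     lambda io: True, lambda io: False, pair)
print(r, pair["status"], pair["dcl"])  # success_on_one_side to_be_synchronized [4096]
```

Background synchronization would later walk the DCL, copy the recorded addresses to the recovered array, and return the pair to normal.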

3.3.2.2.2 Read I/O Process


The data of LUNs on both storage systems is synchronized in real time. Both storage
systems are accessible to hosts. If one storage system malfunctions, the other one
continues providing services for hosts.
The following figure shows an example of the read I/O process.

1) A host delivers a read I/O to the HyperMetro I/O processing module.


2) The HyperMetro I/O processing module enables the local storage system to respond to
the read request of the host.
3) If the local storage system is operating properly, it returns data to the HyperMetro I/O
processing module.
4) If the local storage system is not operating properly, the HyperMetro I/O processing
module enables the host to read data from the remote storage system. Then the
remote storage system returns data to the HyperMetro I/O processing module.
5) The HyperMetro I/O processing module returns the requested data to the host.
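Steps 2 to 5 amount to a local-first read with transparent failover, which can be sketched as follows. The read callables are illustrative stand-ins for the two storage systems; a failed system is modeled by raising IOError.

```python
def hypermetro_read(address, local_read, remote_read):
    """Local-first read with transparent failover (simplified model)."""
    try:
        return local_read(address)    # step 2/3: prefer the local storage system
    except IOError:
        return remote_read(address)   # step 4: fall back to the remote system

def broken(addr):
    raise IOError("local array down")

print(hypermetro_read(7, broken, lambda a: f"remote:{a}"))  # remote:7
```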

3.3.2.3 Arbitration Mechanism


3.3.2.3.1 Static Priority Mode
If no quorum server is configured or the quorum server is inaccessible, HyperMetro works
in static priority mode. When an arbitration occurs, the preferred site wins the arbitration
and provides services.
1) If links between the two storage systems are down or the non-preferred site of a
HyperMetro pair breaks down, LUNs of the storage system at the preferred site
continue providing HyperMetro services and LUNs of the storage system at the non-
preferred site stop.
2) If the preferred site of a HyperMetro pair breaks down, the non-preferred site does not
take over HyperMetro services automatically. As a result, the services stop. You must
forcibly start the services at the non-preferred site.

3.3.2.3.2 Quorum Server Mode (Recommended)


In this mode, an independent physical server or VM is used as the quorum server. You are
advised to deploy the quorum server at a dedicated quorum site that is in a different
fault domain from the two DCs.
In the event of a DC failure or disconnection between the storage systems, each storage
system sends an arbitration request to the quorum server, and only the winner continues
providing services. The preferred site takes precedence in arbitration.
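Both arbitration modes can be summarized in one decision function. This is an illustrative simplification of the rules above, not product logic: an empty set of quorum-reachable sites models static priority mode, and the preferred site takes precedence whenever it is eligible.

```python
def arbitrate(preferred_alive, nonpreferred_alive, quorum_reachable_by):
    """Decide which site keeps serving after a replication-link failure.

    `quorum_reachable_by`: set of alive sites that can reach the quorum
    server; an empty set models static priority mode. Returns the
    surviving site, or None when services stop and a forced start is needed.
    """
    alive = {s for s, up in [("preferred", preferred_alive),
                             ("non-preferred", nonpreferred_alive)] if up}
    if quorum_reachable_by:
        candidates = alive & quorum_reachable_by        # quorum server mode
    else:
        candidates = alive & {"preferred"}              # static priority mode
    if not candidates:
        return None                   # manual "force start" required
    # the preferred site takes precedence when both could win
    return "preferred" if "preferred" in candidates else "non-preferred"

print(arbitrate(True, True, {"preferred", "non-preferred"}))  # preferred
print(arbitrate(False, True, set()))                          # None (static priority)
print(arbitrate(False, True, {"non-preferred"}))              # non-preferred
```

The second call illustrates the static-priority caveat from 3.3.2.3.1: if the preferred site fails, the non-preferred site does not take over automatically.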

3.3.3 Planning
3.3.3.1 Information Collection
Collecting information about the upper-layer application environment helps determine
whether applications run as expected on the active-active platform, and identify and
optimize application deployment as early as possible.
The future data growth trend can be predicted based on current data amount and
historical data growth records.

3.3.3.2 DR Solution Selection and Compatibility Confirmation


When selecting an active-active solution, perform a feasibility analysis to determine
whether the active-active data center solution is necessary.
Compatibility confirmation checks the compatibility between hardware components (such
as computing, network, and storage devices), software components (such as active-active
DR and storage management software), and features.

3.3.3.3 Bandwidth and Capacity Calculation


3.3.3.3.1 Capacity Calculation
1. Determine the services for which HyperMetro should be deployed and check the
current storage space configuration.
2. Obtain the daily capacity change and the required additional capacity based on the
used capacity and current production services.
3. Predict the capacity growth in the next three to five years based on the business
development requirements.
4. Estimate the active-active bandwidth based on the current data amount and future
business development trends.
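The steps above can be turned into a simple projection. The model below (observed daily growth plus an assumed compound annual rate) is an illustrative planning assumption; real sizing should use measured workload data and the vendor's planning tools.

```python
def projected_capacity_tb(used_tb, daily_growth_tb, years, annual_growth_rate=0.0):
    """Estimate capacity needed after `years`.

    Combines linear daily growth with an optional compound annual
    growth rate applied at the end of each year (a simplification).
    """
    capacity = used_tb
    for _ in range(years):
        capacity = (capacity + daily_growth_tb * 365) * (1 + annual_growth_rate)
    return round(capacity, 1)

# 50 TB used today, ~20 GB/day of new data, sized for 3 years:
print(projected_capacity_tb(50, 0.02, 3))   # 71.9
```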
3.3.3.3.2 Bandwidth Calculation
I/Os are synchronized to the storage arrays in both data centers at all times. Therefore,
the peak active-active bandwidth must be calculated accurately so that the replication
links do not become a bottleneck.
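A rough sizing sketch: because every host write is mirrored to the remote array, the replication links must carry the full peak write throughput. The 30% headroom factor and the sample workload figures below are assumptions for illustration only.

```python
def replication_bandwidth_gbps(peak_write_iops, io_size_kb, headroom=1.3):
    """Peak bandwidth the HyperMetro replication links must sustain.

    `headroom` (30% here) is an assumed margin for protocol overhead
    and bursts; adjust to the actual environment.
    """
    bits_per_second = peak_write_iops * io_size_kb * 1024 * 8
    return round(bits_per_second * headroom / 1e9, 2)

# e.g. 20,000 peak write IOPS at 8 KB per I/O:
print(replication_bandwidth_gbps(20000, 8))   # 1.7 (Gbit/s)
```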

3.3.3.4 Networking Design


3.3.3.4.1 Networks Between Hosts and Storage Arrays
Each host must be connected to each storage array using two switches, that is, two
independent networks are deployed between each host and each storage array. This
ensures that in the event of a network fault, services remain running properly.

Each host must have logical connectivity to both storage arrays. In the active-active
data center solution, a host in data center A is physically connected only to the switches
in data center A, not to those in data center B. Therefore, when planning zones or VLANs
for the switches, ensure that the host is connected by logical links to the storage arrays
in both data center A and data center B.
The same type of networks (Fibre Channel or IP network) must be deployed between
each host and the two storage arrays.
The Fibre Channel network is recommended for networks between hosts and storage
arrays because the Fibre Channel protocol provides better performance than the iSCSI
protocol.
3.3.3.4.2 HyperMetro Replication Networks
The HyperMetro replication networks between storage arrays must use two switches,
that is, two independent networks are deployed between the two storage arrays. In this
way, if either network experiences a fault, the other network keeps running properly.
Each controller on each storage array must have two ports configured to connect
HyperMetro replication networks for link redundancy and load balancing. For easy
network management, controller A on storage array A is connected only to controller A
on storage array B.
For convenient network management, it is recommended that the HyperMetro
replication networks and the networks between hosts and storage arrays be the same
type of networks. For example, if Fibre Channel networks are deployed between hosts
and storage arrays, the HyperMetro replication networks should also be Fibre Channel
networks.
The Fibre Channel network is recommended for the HyperMetro replication networks as
the Fibre Channel protocol provides better performance than the iSCSI protocol.
A write success is returned to the host only when I/Os are successfully written to both
storage arrays through the HyperMetro replication links. To ensure service performance,
the solution requires a low RTT (less than 1 ms) on the HyperMetro replication networks.
3.3.3.4.3 Fibre Channel Switches
If a large number of devices are connected to switches or the service planning is complex,
hundreds of or even thousands of small zones may be required. In this case, you are
advised to configure zones based on the single HBA principle. That is, if a zone has only
one HBA or initiator, multiple targets are allowed in the zone.
3.3.3.4.4 Quorum Network Planning
Storage array access: Each controller on each storage array provides one GE or 10GE port
dual-homed to the quorum network VLANs of two core switches. The quorum network IP
addresses of the two storage arrays in data centers A and B must be set in different
network segments. In this way, data center A and data center B can advertise different
routes in the WAN to isolate the path between the quorum server and data center A
from that between the quorum server and data center B.
Quorum server access: The quorum server uses the TCP/IP protocol to communicate with
the storage arrays in the two data centers and is dual-homed to the quorum VLANs of
the two Ethernet switches using two GE/10GE links. In this way, the quorum networks are
logically isolated from other networks.
Quorum network between data centers: The storage arrays in the two data centers
periodically access the third-place quorum site over the quorum networks, but there is no
quorum network traffic between the two data centers. Therefore, no route needs to be
advertised for the quorum network between the two data centers. Quorum networks
require high reliability, and therefore, it is recommended that different paths or private
links be used between the third-place quorum site and the two data centers. When the
path or link to one data center is disconnected, the path or link to the other data center
must be normal. Quorum networks do not require Layer 2 interconnection, but the
quorum server and storage arrays must properly communicate with each other via Layer
3 routing. It is recommended that the RTT from the quorum server to the controllers
on the storage arrays in the two data centers be less than 50 ms and that the bandwidth
be greater than 2 Mbit/s.
3.3.3.4.5 WDM Network Planning
Huawei WDM products use advanced low-latency processing technologies to optimize
latency and are ideal for Data Center Interconnect (DCI) scenarios requiring low latency.
Therefore, it is recommended that Huawei WDM products be used to construct intra-city
transmission networks. The following describes the recommended WDM network
deployment modes:
1. In large-scale networking scenarios, separate WDM channels should carry the various
networks. At least the SAN, public, and private networks must be physically
separated. The public network may share a WDM channel with other service traffic
(such as traffic between servers) depending on the bandwidth utilization.
2. WDM devices provide three redundancy modes: line-side 1 + 1 protection, intra-
board 1 + 1 protection, and client-side protection. In ascending order of reliability:
line-side 1 + 1 protection < intra-board 1 + 1 protection < client-side protection.
3. The minimum adaptive bandwidth of the optical module on a Fibre Channel switch is
2 Gbit/s. Therefore, you are advised to configure optical modules with a bandwidth
greater than or equal to 2 Gbit/s on WDM devices for interconnection with Fibre
Channel switches.
3.3.3.4.6 Overall Network Planning
Global Server Load Balance (GSLB): GSLB distributes traffic amongst servers dispersed
across multiple geographies in the WAN (including the Internet) so that the best
available server serves the closest user, ensuring quick and reliable access.
Server Load Balance (SLB): SLB distributes traffic amongst servers in the same geography.
The GSLB and SLB are two functions provided by F5 BIG-IP Global Traffic Manager
(GTM) and Local Traffic Manager (LTM).
The GSLB function performs Domain Name System (DNS) resolution and directs the user
to the SLB of the appropriate data center. The SLB in each data
center is deployed in an HA cluster in AP mode. Only one F5 LTM provisions services at a
time. The user accesses the virtual server IP address of the SLB. In the SLB, the virtual
server IP address is mapped to several pools. A pool is a logical unit and contains
multiple web servers.

The GSLB and SLB perform comprehensive health checks to identify the health state of
the target server to which data is to be sent. If the target server is faulty, data will be
sent to a server in another resource pool instead.
The SAN network planning involves planning the networks between database servers and
storage arrays, HyperMetro replication networks between storage arrays, and Fibre
Channel switches.
The IP network planning involves planning the GSLB access networks, SLB access
networks, web/app access networks, database public/private networks, storage quorum
networks, access switches, and core switches.
The intra-city transmission networks connect the SAN networks and IP networks between
two data centers.
3.3.3.4.7 Overall Port and IP Address Planning
The teaching material only illustrates the networking mode for data center A. In practice,
the networking mode of data center A also applies to data center B.
 IP address planning rules
Consecutiveness: Consecutive IP addresses must be allocated to the same network
area.
Scalability: The IP addresses allocated to a network area must have certain
redundancy so that IP addresses can remain consecutive when more devices are
added.
Security: A network should be divided into different network segments (subnets) for
easy management.
 Device naming rules
Storage array naming rule: DC ID_array model_SN, for example, DC1_5500_1.
Server naming rule: DC ID_service type_SN, for example, DC1_OrcleDB_1.
Switch naming rule: DC ID_switch model_SN, for example, DC1_2248_1.
 Precautions
Configure the IP addresses of iSCSI host ports and service network ports on
application servers in the same network segment. Do not connect host ports and
service network ports via routing gateways.
Do not plan the IP addresses of service ports (such as HyperMetro, mirroring, and
network service ports) and management ports in the same network segment.
Do not overlap IP addresses with the IP addresses of heartbeat network ports and the
network segment allocated for connecting switches to controllers. Otherwise, routing
will fail.
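Two of the precautions above (keep service and management ports in different segments, and never overlap segments) can be checked mechanically with the standard `ipaddress` module. The role names and subnets below are hypothetical examples, not recommended values.

```python
import ipaddress

def check_segments(plan):
    """Report any pair of planned network segments that overlap.

    `plan` maps a role name (e.g. hypermetro, quorum, management)
    to a subnet in CIDR notation.
    """
    nets = {role: ipaddress.ip_network(cidr) for role, cidr in plan.items()}
    problems = []
    roles = list(nets)
    for i, a in enumerate(roles):
        for b in roles[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems

plan = {"hypermetro": "192.168.10.0/24",
        "quorum":     "192.168.20.0/24",
        "management": "192.168.10.128/25"}   # bad: inside the hypermetro segment
print(check_segments(plan))  # ['hypermetro overlaps management']
```

Running such a check against the IP plan before deployment catches the routing failures described above early.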
3.3.3.4.8 Function Design
The function design involves service-related planning and design.

3.4 Application Practices of the Storage Data Migration Solution
3.4.1 Overview
3.4.1.1 Application Scenarios
3.4.1.1.1 Applicable Requirements
1. The existing data in the IT system needs to be transferred to the newly purchased
equipment.
28. Rapid service expansion and storage of massive volumes of data place higher
requirements on devices.
29. To meet load balancing and performance optimization requirements or service
requirements, data can be migrated between different storage media and between
different RAID levels.
30. Online service data migration is required to avoid losses caused by service
interruption during the migration.
31. During service data migration, data changes generated on the host must be
synchronized in a timely manner to the two LUNs between which the data migration
relationship is established in order to ensure data consistency and prevent data loss.
3.4.1.1.2 Typical Scenarios
1. Upgrading storage systems: As the service data volume increases, higher capacity
and performance requirements are generated. As a result, new storage systems need
to be purchased for upgrade. In this scenario, the data migration solution can
completely and reliably migrate service data from the original storage system to the
new storage system.
32. Configuring the reuse of storage resources: After the storage system is upgraded,
cold data that is seldom used remains in the system. If a large amount of such data
is stored in the new storage system, the storage resource utilization will be reduced,
resulting in a waste of storage space. To reduce operation costs and improve
resource utilization, the data migration solution can migrate cold data back to the
original storage system.
Note: In the migration solution based on SmartMigration and SmartVirtualization,
SmartVirtualization takes over external LUNs in offline or online mode. If the online
mode is used, host services are not interrupted. If offline takeover is used, host services
may be suspended for a short period of time.

3.4.1.2 Highlights
1. Reliable business continuity
2. Stable data consistency
3. Adjustable performance
4. Heterogeneous compatibility

Note: Huawei provides all-around data migration solutions at the host layer, SAN
network layer, and storage layer, and provides mature tools to help customers solve data
migration challenges. This process is implemented using rigorous technologies, rich
experience, and developed tools.

3.4.2 Technologies
3.4.2.1 SmartVirtualization
3.4.2.1.1 Basic Concepts
 External LUN
An external LUN is a LUN in a heterogeneous storage system that has been mapped to
the local storage system. It is displayed as an external LUN on DeviceManager.
 eDevLUN
In the storage pool of a local storage system, the mapped external LUNs are reorganized
as raw storage devices based on a certain data organization form. A raw device is called
an eDevLUN. The physical space occupied by an eDevLUN in the local storage system is
merely the storage space needed by the metadata. The service data is still stored on the
heterogeneous storage system. Application servers can use eDevLUNs to access data on
external LUNs in the heterogeneous storage system, and the SmartMigration feature can
be configured for the eDevLUNs.
 Hosting
LUNs in a heterogeneous storage system are mapped to a local storage system for use
and management.
3.4.2.1.2 Relationship Between an eDevLUN and an External LUN
An eDevLUN consists of data and metadata. A mapping relationship is established
between data and metadata.
1) The physical space needed by data is provided by the external LUN from the
heterogeneous storage system. Data does not occupy the capacity of the local storage
system.
2) Metadata is used to manage storage locations of data on an eDevLUN. The space used
to store metadata comes from the metadata space in the storage pool created in the
local storage system. Metadata occupies merely a small amount of space. Therefore,
eDevLUNs occupy a small amount of space in the local storage system. (If no value-
added feature is configured for eDevLUNs, each eDevLUN occupies only dozens of KBs
in the storage pool created in the local storage system.)
The following figure illustrates the relationship between an eDevLUN created in the local
storage system and an external LUN created in the heterogeneous storage system. An
application server accesses an external LUN by reading data from and writing data to the
corresponding eDevLUN.
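The relationship can be modeled minimally in code. This is an illustrative sketch, not the storage system's implementation: metadata (the location mapping) lives in the local pool, while data I/O is forwarded to the external LUN, modeled here as a plain dict.

```python
class EDevLUN:
    """Toy model of an eDevLUN fronting an external LUN."""

    def __init__(self, external_lun):
        self.metadata = {}            # small, stored in the local storage pool
        self.external = external_lun  # actual data stays on the remote array

    def write(self, block, data):
        self.metadata[block] = "mapped"   # track the data location locally
        self.external[block] = data       # data lands on the external LUN

    def read(self, block):
        return self.external[block]       # served from the external LUN

ext = {}                    # stands in for the heterogeneous array's LUN
edev = EDevLUN(ext)
edev.write(0, b"app-data")
print(edev.read(0), ext[0])  # b'app-data' b'app-data'
```

The key property the model shows: the local system stores only the mapping, so the eDevLUN consumes almost no local capacity.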

3.4.2.1.3 Online or Offline Takeover


 Online takeover
During the online takeover process, services are not interrupted, ensuring service
continuity and data integrity. In this mode, the critical identity information about
heterogeneous LUNs is masqueraded so that multipathing software can automatically
identify new storage systems and switch I/Os to the new storage systems. This
remarkably simplifies data migration and minimizes time consumption.
When a Huawei heterogeneous storage system is taken over in online mode, the
masquerading property for eDevLUNs is Basic masquerading or Extended
masquerading. The selection of basic masquerading or extended masquerading depends
on the vendor and version of the multipathing software and the versions of Huawei
heterogeneous storage systems.
 Offline takeover
During the offline takeover process, connections between heterogeneous storage systems
and application servers are down and services are interrupted temporarily. This mode is
applicable to all compatible Huawei and third-party heterogeneous storage systems. In
this mode, services running on the related application servers are stopped temporarily
and the masquerading property for eDevLUNs is No masquerading.
 Selecting a takeover mode

3.4.2.2 SmartMigration
3.4.2.2.1 Basic Concepts
 Data organization
The storage system uses a virtualization storage technology. Virtual data in the storage
pool consists of meta volumes and data volumes.
1) Meta volume: records the data storage locations, including LUN IDs and data volume
IDs. LUN IDs are used to identify LUNs and data volume IDs are used to identify physical
space of data volumes.
2) Data volume: stores user data.
 Source LUN
LUN from which service data is migrated.
 Target LUN
LUN to which service data is migrated.
 LM module
Manages SmartMigration in the storage system.
 Pair
In SmartMigration, a pair indicates the data migration relationship between the source
LUN and target LUN. A pair can have only one source LUN and one target LUN.
 Dual-write
The process of writing data to the source and target LUNs at the same time during
service data migration.
 Log
Records data changes on the source LUN and is used to determine whether the data has
also been written to the target LUN. With the dual-write technology, data can be written
to both LUNs simultaneously.
 Data change log (DCL)
Records differential data that fails to be written to the target LUN during the data
change synchronization.
 Splitting
The process of stopping service data synchronization between the source LUN and target
LUN, exchanging LUN information, and then removing the data migration relationship
between the source LUN and target LUN.
3.4.2.2.2 Service Data Synchronization
 Initial synchronization
After service data synchronization starts on the source LUN, all initial service data is
copied to the target LUN, as shown in the following figure.

 Data change synchronization


During the synchronization, host services do not need to be interrupted. When data is
changed on the host, the host sends an I/O write request to the storage system. Then,
the storage system starts data change synchronization and writes the changed service
data to both the source LUN and target LUN using the dual-write technology.
If data fails to be written to the target LUN, the storage system records the data
differences in the DCL and copies the data that fails to be written from the source LUN
to the target LUN according to the DCL. After the copy is complete, the storage system
returns a write success acknowledgment to the host.
If data fails to be written to the source LUN, the storage system returns a write failure.
Upon receiving the write failure, the host re-sends the data to the source LUN only, but
not to the target LUN. This mechanism ensures data consistency on the source and target
LUNs during migration.
The following figure shows how changed data is synchronized.

Process of synchronizing changed data during service migration:


1) The host delivers an I/O write request to the LM module of the storage system.
2) The LM module writes the data to the source LUN and target LUN and records this
write operation to the log.
3) The source LUN and target LUN return the data write result to the LM module.
4) The LM module determines the result:
a) If the data fails to be written to the target LUN, the log is saved to the DCL that
records the data changes. The storage system copies the changed data from the
source LUN to the target LUN according to the DCL and then clears the DCL
records.
b) If the data fails to be written to the source LUN, a write I/O failure is returned to
the host. Then, the host re-sends the data only to the source LUN. After the data
is successfully written, log records will be cleared.
c) If the data is successfully written to the source LUN and target LUN using dual-
write technology, log records will be cleared automatically.
5) A write success acknowledgment is returned to the host.
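Steps 1 to 5 can be sketched as follows. The callables and the DCL list are illustrative assumptions; the point is the asymmetric failure handling: a target-side failure is absorbed via the DCL, while a source-side failure is surfaced to the host.

```python
def lm_dual_write(io, write_source, write_target, dcl):
    """Simplified model of the LM module's dual-write decision.

    `write_source`/`write_target` return True on success; `dcl`
    collects addresses to re-copy from source to target later.
    """
    ok_src = write_source(io)
    ok_tgt = write_target(io)
    if not ok_src:
        return "write_failure"        # step 4b: host retries to the source only
    if not ok_tgt:
        dcl.append(io["address"])     # step 4a: record the difference for re-copy
    return "write_success"            # the source always holds good data

dcl = []
print(lm_dual_write({"address": 42}, lambda io: True, lambda io: False, dcl), dcl)
# write_success [42]
```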
3.4.2.2.3 Splitting
Splitting is performed on a single pair. The splitting process is as follows: Service data
synchronization between the source LUN and the target LUN in a pair is stopped. Then,
the two LUNs exchange LUN information. After that, the data migration relationship is
canceled. During the split, host services are suspended. After the information exchange,
services are delivered to the target LUN. The process is invisible to users. The following
figure illustrates the principle of splitting.



LUN information exchange is the prerequisite for the target LUN to take over services
from the source LUN after service information is synchronized.
1) Before LUN information is exchanged, the host uses the source LUN ID to identify the
source LUN and uses the source data volume ID to identify the source LUN's physical
space. A mapping relationship is established between the source LUN ID and the source
data volume ID so that the host can read the physical space of the source LUN. The
mapping relationship also exists between the target LUN ID and target data volume
ID.
2) During LUN information exchange, the source and target data volume IDs are
exchanged while the source and target LUN IDs remain unchanged. In this way, the
source LUN ID points to the physical space identified by the target data volume ID.
3) After LUN information exchange, the host can still identify the source LUN using the
source LUN ID but read the target LUN's physical space. In this way, services are
migrated without awareness of users.
The following figure illustrates the principle of LUN information exchange.
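The exchange in step 2 can be demonstrated with a few lines of code. The dict representation of a LUN is a hypothetical simplification: only the data volume IDs are swapped, so the host keeps addressing the same LUN IDs while the source LUN ID now points at the target's physical space.

```python
def exchange_lun_info(source, target):
    """Swap only the data volume IDs; LUN IDs stay unchanged."""
    source["data_volume_id"], target["data_volume_id"] = (
        target["data_volume_id"], source["data_volume_id"])

src = {"lun_id": "LUN-A", "data_volume_id": "vol-1"}
tgt = {"lun_id": "LUN-B", "data_volume_id": "vol-2"}
exchange_lun_info(src, tgt)
print(src)  # {'lun_id': 'LUN-A', 'data_volume_id': 'vol-2'}
```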

 Pair splitting
Pair splitting means that the data migration relationship between the source LUN and
target LUN is removed after LUN information is exchanged. After the pair is split, if the
host delivers an I/O request to the storage system, data is only written to the source LUN
(the physical space to which the source LUN ID points is the target data volume). The
target LUN will store all data of the source LUN at the pair splitting point in time. After
the pair splitting, no connections can be established between the source LUN and target
LUN.

 Consistent splitting
Consistent splitting enables multiple pairs to exchange LUN information simultaneously
and splits all pairs after the information exchange is complete, ensuring data consistency
at any point in time before and after the pairs are split.
In scenarios where multiple pairs are used, such as in medium- and large-size database
applications, data, logs, records, and other files are stored on LUNs that are associated
with one another in the storage system. Splitting cannot ensure that information in one
LUN is always associated with that in another. If data in a LUN is unavailable, data in the
other LUNs may become invalid. Consistent splitting can resolve this problem.
The following figure illustrates the differences in processes and results between splitting
and consistent splitting in a scenario where multiple pairs are used.

3.4.3 Planning
3.4.3.1 Huawei Migration Solutions
Currently, Huawei supports the following migration solutions:
1. Migration based on application software functions: This type of migration uses
functions of upper-layer applications to migrate data, such as migration functions of
Oracle databases and replication functions of file systems. This type of migration is
smooth and has less downtime. However, it has strict requirements on the
application scenarios and is not universally applicable.
36. Migration based on volume management software functions: This type of migration
is implemented using the mirroring or migration function of the volume
management software. For example, AIX, HP-UX LVM, and VxVM have the mirroring
function. Data can be migrated by mirroring and then splitting the mirroring. This
method is smooth, and downtime is minimal. However, customers must use the
operating system and volume management software that support this feature to
implement this function.
3. Migration based on network functions including VIS, heterogeneous virtualization,
and MigrationDirector for SAN: VIS is a Huawei-developed product. Heterogeneous
virtualization uses this function of storage devices to migrate data. MigrationDirector
for SAN is a non-disruptive migration tool developed by Huawei; however, the
supported source storage and target storage are limited.
4. Migration based on storage functions including internal LUN migration and remote
replication: This type of migration occupies few host resources. However, online
migration can be used only between homogeneous storage systems. Otherwise,
offline migration must be performed after the system is shut down.
3.4.3.1.1 Migration Based on Application Software Functions
 VM software functions
Common virtualization platforms, such as VMware and Hyper-V, provide basic online
storage replacement and migration technologies. For example, the Storage vMotion
function of VMware and the live migration function of Hyper-V can implement online
storage replacement, which requires simple operations and achieves non-disruptive
replacement and migration. However, this solution is not applicable to scenarios such
as RDM and VDI. Therefore, you need to check whether the requirements of the
application scenario are met before starting a VM migration. If they are, the preferred
solution is the storage replacement and migration function provided by the VM
software. The current version mainly provides the migration solution using the VM's
built-in function.
 Database software functions
This solution migrates data at the database layer. Although databases can also be
migrated at the external storage or network layer, in certain scenarios migrating at
the database layer achieves a better result because device types, data changes, and
shortened downtime are involved. Such scenarios include cross-platform database
migration, cross-media migration, and migrations with strict downtime requirements.
3.4.3.1.2 Migration Based on Volume Management Software Functions
If you need to replace the host multipathing software, stop the customer's services. Then,
uninstall the multipathing software for the source storage from the server, install the
multipathing software for the target storage, and restart the host for the new software to
take effect. After the new software takes effect, start the customer's service system.
3.4.3.2 Policy for Designing a Data Migration Solution
When designing a solution, recommend the delivery solution with the longest downtime
that the customer can still accept, as this reduces delivery complexity.
In VM scenarios, if online migration conditions for using VM functions are met, use the
VM online migration method.
In UNIX midrange computer scenarios, if only a short period of service downtime is
allowed, recommend the host volume mirroring-based migration solution.
Generally, if compatibility requirements are met, the heterogeneous storage virtualization
solution is preferred in order to reduce the downtime and operation steps. If the VIS has
been deployed on the live network, the VIS is preferred.
Migration without service interruption places high requirements on solution accuracy and
operation personnel. Therefore, you are advised to exercise caution and confirm with the
HQ and R&D personnel before selecting this solution.
When a data migration project would otherwise require multiple solutions, use a single
common solution where possible.
If the migration project is delivered to the HQ or R&D department, it indicates that the
solution is complicated and requires comprehensive analysis. In this case, contact the HQ
or R&D department for a final solution.
3.4.3.3 Risk Assessment
1. Compatibility assessment
2. Service interruption duration assessment: Make sufficient preparations before data
migration, specify the data migration responsibility matrix, and ensure that all
personnel and resources are in place. The service interruption period depends on the
customer's cooperation conditions, preparations before migration, and proficiency of
the operation personnel. The time required for removing mappings from the source
storage depends on the operation complexity of the source storage. The
multipathing software configuration includes uninstalling the multipathing software
on the source storage array and installing it on the new storage array. In some
scenarios, the multipathing software configuration can be performed together with
the source storage array takeover. If the multipathing software configuration is not
required, ignore the operations related to it. The time required for taking over the
source storage (excluding multipathing configurations) is up to five minutes
multiplied by the number of LUNs; that is, each LUN takes about one to five minutes.
In addition, mapping eDevLUNs and restarting services may take some time.
3. Data security assessment: Data in the source storage is not damaged. After the
migration is complete, a copy of the original data is retained in the source storage
and all service data is delivered to the target storage. You are advised to back up
service data before data migration and reserve sufficient backup windows. It is
recommended that all onsite and remote support personnel be in place and the
specific implementation time be determined before data migration. After selecting a
data migration solution, refer to the data migration specification list to check the
restrictions and requirements.
4. Assessment of the overall operation risks
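The takeover-time figure quoted in the assessment above (roughly one to five minutes per LUN for source-storage takeover, excluding multipathing work) can be turned into a rough planning calculation. The function name and the fixed overhead term below are illustrative assumptions, not part of any Huawei tool:

```python
def estimate_takeover_minutes(num_luns, minutes_per_lun=5, fixed_overhead=0):
    """Rough source-storage takeover estimate: up to about five minutes
    per LUN (excluding multipathing configuration), per the assessment
    above. fixed_overhead is a hypothetical extra term for eDevLUN
    mapping and service restart, which "may take some time"."""
    return num_luns * minutes_per_lun + fixed_overhead

print(estimate_takeover_minutes(12))                    # -> 60 (worst case for 12 LUNs)
print(estimate_takeover_minutes(12, fixed_overhead=30)) # -> 90 (with extra overhead)
```

Such a worst-case estimate is useful when negotiating the service interruption window with the customer.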
4 Storage Maintenance and Troubleshooting
4.1 Storage System O&M Management
4.1.1 Overview
Rapid development of technologies combined with explosive growth of information has
meant customers everywhere need faster, better, and more storage devices. However,
this in turn makes storage O&M increasingly difficult.
As data center services advance, storage devices are increasingly deployed and
maintained centrally in a unified environment. Dynamic scaling is now supported, and
capacity changes frequently to suit service demands. However, many enterprises run
numerous devices from multiple vendors, and faulty components are common. These
changes require O&M personnel to master the configuration and management of storage
devices, replace faulty components, and meet the requirements set out in the Service
Level Agreement (SLA).
In addition, projects are expected to be delivered by channel partners in the future, even
despite the limited, inexperienced delivery capability of channel partners. To address this
issue, the delivery process needs to be standardized to improve remote delivery support.
The following describes the two major trends of storage management technologies to
solve common storage O&M challenges:
 Artificial Intelligence for IT Operations (AIOps) helps identify and solve problems.
AIOps combines big data and machine learning to automate IT operations, including
event correlation, exception detection, and cause-and-effect judgment.
The software enhances prediction capability and system stability and reduces
operations costs, improving the product competitiveness of enterprises.
Back in 2016, Gartner noted the rise of AIOps and predicted that by 2020 half of all
enterprises worldwide would adopt a form of this intelligent software. Huawei is one
of the first vendors in China to participate in formulating the AIOps white paper and
related standards.
 Closed-loop autonomy based on automation
Control theory (cybernetics) is an interdisciplinary approach that explores the
structure, constraints, and possibilities of regulatory systems.
The fundamental goal of control theory is to understand and define the functions
and processes of systems. These systems have objectives and participate in causal
cycles, performing operations such as perception, comparison, and action.

Figure 4-1 Control theory

Control theory focuses on how systems (digital, mechanical, or biological) process
information, react to it, and adapt to changes to complete their tasks.
O&M provides technical assurance for products to deliver quality services. Its
definition varies between companies and business stages.
In essence, O&M refers to the operation and maintenance of networks, servers, and
services in each phase of the life cycle, achieving high stability and efficiency at a
reasonable cost.
ITIL: Information Technology Infrastructure Library
PRR: Problem Reporting and Resolution
4.1.2 Tools
Huawei provides premium storage O&M tools that are tailored to common demanding
scenarios.
Users can query, set, manage, and maintain the storage system on DeviceManager and
the command-line interface (CLI). Serviceability management tools, such as SmartKit and
eService, help improve O&M efficiency. The following is a list of popular premium
Huawei tools:
DeviceManager: single-device O&M software. DeviceManager is an integrated storage
management platform designed for all Huawei storage systems, enabling you to
configure, manage, and maintain storage devices with ease.
SmartKit: a professional-grade tool for Huawei technical support. SmartKit includes
compatibility evaluation, planning and design, one-click fault information collection,
inspection, upgrade, and field replaceable unit (FRU) replacement.
eSight: multi-device maintenance suite that provides fault monitoring and visualized
O&M.
DME: unified management platform for storage resources, offering service catalog
orchestration, on-demand supply of storage services, and data application services.
eService client: deployed in a customer's equipment room. It discovers exceptions of
storage devices in real time and reports them to the Huawei maintenance center.
eService cloud platform: deployed on the Huawei maintenance center. It monitors devices
in the network in real time, offering proactive maintenance.
With artificial intelligence (AI), storage management will develop towards the following
trends to improve management and O&M efficiency:
1. Single- to multi-device management
2. Single dimension to full lifecycle management
3. Manual management to automation
It is under this condition that Huawei developed the DME Storage platform, a full-
lifecycle automatic management platform.
DME Storage adopts a service-oriented architecture built on automation, AI analysis, and
policy supervision. It integrates all phases of the storage lifecycle, covering planning,
construction, O&M, and optimization, to implement integrated, full-lifecycle storage
management and control.
The platform is designed to simplify storage management and improve data center
operation efficiency by allowing you to manage your resources over a unified
management environment. It provides open APIs, cloud-based AI enablement, and multi-
dimensional intelligent risk prediction and intelligent tuning, allowing you to integrate
multiple types of storage resources on demand without changing the existing data
channels or adding new storage functions to a single storage array. In addition, users can
connect hosts and Fibre Channel (FC) switches to the system for end-to-end O&M and
management. This solution supports full lifecycle management of storage devices,
including planning, construction, maintenance, and optimization. The continuous
application of AI technologies automates storage management.
Key characteristics of the DME:
It is a distributed microservice architecture with high reliability and 99.9% availability.
It manages large-scale storage devices. A single node can manage 16 storage devices,
each with 1500 volumes.
Northbound and southbound interfaces are fully open. Northbound interfaces provide
RESTful APIs, Simple Network Management Protocol (SNMP), and ecosystem plug-ins
(such as Ansible) to interconnect with upper-layer systems, whereas southbound
interfaces interconnect with OpenStack storage. Standard interface protocols, such as
SNMP and RESTful, are used to interconnect with storage devices.
DME implements proactive O&M based on AI and a policy engine. In the maintenance
phase, it automatically analyzes the following problems:
Based on preset policies and AI algorithm models, the system automatically detects
potential problems throughout the lifecycle from multiple dimensions, including capacity,
performance, configuration, availability, and optimization.
It allows users to customize check policies, such as capacity thresholds.
For planning, it covers infrastructure management and service level management.
For construction, it provides automatic allocation of storage resources and FC switch
resources.
During O&M, the system offers problem identification and analysis.
Optimization involves multi-dimensional reports, dashboard customization, and
policy-based problem self-healing.
Another premium Huawei tool is SmartKit, an on-site toolkit for IT products that
supports field maintenance.
The common types of tools are as follows:
 Routine maintenance tools ensure long-term, efficient, and stable running of devices.
They provide inspection covering deployment, renewal, disks, and preventive checks.
 Information collection tools collect device information, such as system logs,
configuration data, alarms, and disk logs, in the event of a fault.
 Disk health analysis tools analyze disk log packages, identify risky disks, and prompt
you to replace the risky disks.
 Device archive collection tools help you collect device configuration and deployment
information, and generate device archive files in IBMS format.
 Upgrade evaluation tools check the real-time running environment of a device, and
generate an upgrade evaluation report to evaluate whether the current device meets
the upgrade requirements.
 Upgrade tools integrate pre-upgrade check, package import, configuration data
backup, upgrade, and verification operations, to implement one-click device upgrade.
 Capacity expansion helps quickly add controllers or engines to improve system
performance.
eService is an intelligent storage device O&M product developed by Huawei. It uses big
data analysis and AI technologies to provide services such as automatic fault reporting,
capacity performance prediction, and disk risk prediction, preventing potential risks and
providing a basis for capacity planning. It aims to reduce O&M costs and improve O&M
efficiency.
Figure 4-2 eService design ideas
 eService cloud system: provides centralized O&M capability expansion for IT
storage, cloud management, intelligent O&M, remote maintenance, and prediction
and prevention.
 eService client: uploads alarms, performance data, configurations, system logs, and
disk information to the eService cloud. The eService client software must be installed
on physical machines, virtual machines (VMs), or service processors (SVPs).
 DeviceManager: After the Call Home service is enabled on DeviceManager, alarms,
performance, configuration, system logs, and disk information can be uploaded to
the eService cloud. No additional physical machines or VMs are required for
installing the eService client.
 eSight: allows you to upload alarms to the eService cloud by email. The eSight
software needs to be installed on physical machines or VMs.
With traditional service support, technical support personnel provide local services only.
Faults may not be detected in a timely manner and information may not be delivered
correctly.
The eService proactive troubleshooting function, also called Call Home, reports alarms,
configuration, performance data, and system logs. In the event of a device fault, all fault
information is uploaded to the eService cloud system, shortening the fault detection and
handling time.
Call Home provides the following functions:
 Proactive 24/7 alarm monitoring (storage, server, and cloud computing). When
an alarm is generated, the system automatically notifies the Huawei technical
support center, creates a trouble ticket in minutes, and dispatches the trouble ticket
to the corresponding engineer for handling. This helps customers locate and resolve
problems as and when needed.
 AI technology analysis (storage). Device alarms, configuration, disks, performance
data, and system logs are encrypted and periodically uploaded to the eService cloud
system over the Internet. AI technologies are used to analyze potential risks of
devices and quickly find solutions. If no matching solution is found, the system
immediately sends the fault information to Huawei technical support and provides
possible risks for them to quickly locate and rectify the fault. In addition, an
algorithm is used to automatically extract key characteristics. After manual
confirmation, the key characteristics are added to the fault sample model library as
new data for diagnosis reference for future problems.
 Visualized troubleshooting (storage and servers). Alarms are associated with
trouble tickets on eService. You can view the trouble ticket processing progress on
eService.
eService adopts big data analysis and AI-based management to identify faults in advance,
reducing the workload of on-site personnel. In addition, it accurately predicts future
conditions, makes plans, and implements intelligent prediction and analysis. Highlights of
eService Call Home are as follows:
 Disk fault prediction. AI technologies are used to dynamically analyze data indicator
changes of disks and analyze disk performance load characteristics. This helps predict
disk faults in advance, implementing proactive fault prevention and significantly
improving system reliability.
 Capacity trend prediction. eService collects historical performance and capacity
data of storage systems on the live network, uses machine learning for training, and
selects the most suitable prediction model for future capacity needs, perfectly
adapting to service requirements on storage systems.
 Performance fluctuation analysis. eService identifies the peak and off-peak hours
of the service by analyzing the historical performance fluctuations to help formulate
a reliable change plan.
 Performance exception detection. eService learns service characteristic changes
based on historical service performance, identifies performance exceptions, and
quickly locates performance problems.
 Performance bottleneck analysis. eService evaluates the load of key components in
the I/O path, extracts the device load, identifies performance bottlenecks, and
provides optimization suggestions.
 Device health evaluation. eService evaluates device health in real time from the
aspects of system, hardware, configuration, capacity, and performance, identifies
device risks in advance, and accurately scores device health status.
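The idea behind the capacity trend prediction highlight above can be sketched in a few lines. This is only a minimal illustration with a straight-line least-squares fit; eService's actual model selection and machine-learning pipeline are far more sophisticated, and none of the names below come from the product:

```python
# Minimal sketch of capacity trend prediction: fit a straight line to
# historical used-capacity samples and extrapolate. Illustrative only.

def linear_trend(samples):
    """Least-squares fit y = a*x + b over samples indexed 0..n-1."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def forecast(samples, steps_ahead):
    """Extrapolate the fitted line steps_ahead points past the last sample."""
    a, b = linear_trend(samples)
    return a * (len(samples) - 1 + steps_ahead) + b

used_tb = [10.0, 10.5, 11.0, 11.5, 12.0]   # weekly used capacity, TB
print(forecast(used_tb, 4))                 # -> 14.0 (TB, four weeks out)
```

Even this naive extrapolation shows how historical capacity data can drive "when will the pool fill up" planning decisions.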
DeviceManager is storage management software designed by Huawei for a single
storage system. It can help you easily configure, manage, and maintain storage devices.
Before using DeviceManager, ensure that the maintenance terminal meets the following
requirements:
 Browser compatibility with the maintenance terminal OS: DeviceManager supports
multiple OSs and browsers (for details about the compatibility, see Huawei Storage
Interoperability Navigator).
 The maintenance terminal can communicate with the storage system.
 A super administrator can only log in to the storage system as a local user.
 Before logging in to DeviceManager as a Lightweight Directory Access Protocol
(LDAP) domain user, configure the LDAP domain server external to the storage
system, configure parameters on the storage system to add it to the LDAP domain,
and then create an LDAP domain user.
By default, DeviceManager allows a maximum of 32 logged-in users.
The storage system provides two types of roles: preset roles and custom roles.
 Preset roles in the system have specific permissions, including the super
administrator, administrator, and read-only user.
 Permissions of custom roles can be flexibly configured by users based on site
requirements.
To support permission control in multi-tenant scenarios, the storage system divides the
preset roles into the system and tenant groups. The differences are as follows:
 Tenant group: The roles are used only when a user logs in to DeviceManager with a
tenant account.
 System group: The roles are used only when a user logs in to DeviceManager with a
system account.

4.1.3 Management
4.1.3.1 Routine Management of a Storage System
DeviceManager helps to monitor the performance of a storage system.
Figure 4-3 Performance monitoring process on DeviceManager
In the event of a fault on the storage system, you can collect and report basic, fault,
network, application, server, and storage device information to help maintenance
personnel immediately locate and rectify the fault.
Log in to DeviceManager and query the serial number and version of the storage device
in the Basic Information area.
Huawei storage devices support SNMP and SMI-S interfaces, which allow a third-party
network management system (NMS) to manage them.
The following five interconnection modes are available:
 SMI-S: After the SMI-S provider is installed on a third-party Windows or Linux server,
users can manage a Huawei storage system using the SMI-S provider.
 SNMP: Users can use a third-party NMS that supports SNMP to view information
about storage devices, including LUNs, ports, and storage pools.
 OceanStor VMware vCenter plug-in (vCenter plug-in for short): a storage
management plug-in developed based on the vSphere Web Services Software
Development Kit (SDK). You can manage a Huawei storage device through a
vSphere client.
 System Center: a storage plug-in developed by Huawei based on Microsoft System
Center Operations Manager (SCOM). After a Huawei storage device is imported to
SCOM, System Center monitors it.
 Representational State Transfer (REST): OceanStor DeviceManager provides
RESTful APIs based on the REST standard. Third-party developers can use RESTful
APIs to access OceanStor DeviceManager open resources, including alarms,
performance data, and resource allocation information.
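As a sketch of the REST interconnection mode, the snippet below composes request URLs and headers for querying device resources. The base path, port, resource names, and token header are hypothetical placeholders, not the documented DeviceManager API; always consult the product's RESTful API reference for the real endpoints and authentication flow:

```python
# Hypothetical sketch of a RESTful client for storage management.
# The base path, port, resource names, and auth header are placeholders,
# NOT the documented OceanStor DeviceManager API.

class RestStorageClient:
    def __init__(self, host, token):
        # Hypothetical base path and port for illustration only.
        self.base = "https://{}:8088/api/v1".format(host)
        self.headers = {"X-Auth-Token": token,            # hypothetical auth header
                        "Content-Type": "application/json"}

    def url_for(self, resource):
        """Compose the URL for a resource such as 'alarms' or 'luns'."""
        return "{}/{}".format(self.base, resource)

client = RestStorageClient("192.168.128.101", "example-token")
print(client.url_for("alarms"))
# -> https://192.168.128.101:8088/api/v1/alarms
```

An HTTP library such as requests would then issue GETs against these URLs with the session headers; the point here is only the URL/header structure of a REST integration.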
After the interconnection component between a storage system and an application server
is changed, both must be reconfigured to allow the application server to use storage
space through the new interconnection channels. This section describes how to change
configurations after replacing an HBA.
HBAs can be replaced online or offline. The following figures describe the configuration
process after an HBA is replaced online or offline.

Figure 4-4 Online and offline configuration for replacing an HBA
After the HBA is replaced, if an exception or fault occurs during the configuration and
operation on the host or storage system, you can perform an emergency rollback.
4.1.3.2 CLI Management Operations
The CLI provides command lines to manage and maintain the storage system.
Configuration commands are entered using a keyboard and are parsed and executed by
the system. The command output is displayed in text or graphic format on the CLI.
After logging in to the CLI of a storage system, you can query, set, manage, and maintain
the storage system. On any maintenance terminal connected to a storage system, you
can log in to the CLI by using PuTTY to access the IP address of the management
network port on the controller of a storage system through the SSH protocol. After a
successful login, the CLI displays the device information about the storage system,
including the device name, health status, running status, and system capacity.
The SSH protocol supports two authentication modes: username/password and public
key.
You can log in to the storage system by either of the following methods:
 Logging in to the CLI through a serial port
After the controller enclosure is connected to the maintenance terminal using a serial
cable, you can log in to the CLI of the storage system using a terminal program (such
as PuTTY).
 Logging in to the CLI through a management network port
You can log in to the CLI using an IPv4 or IPv6 address.
After the controller enclosure is connected to the maintenance terminal using a
network cable, you can log in to the CLI of the storage system using any type of
remote login software that supports the SSH protocol, for example, PuTTY.
For a 2 U controller enclosure, the default IP addresses of the management network
ports on controller A and controller B are 192.168.128.101 and 192.168.128.102,
respectively. The default subnet mask is 255.255.0.0. For a 3 U or 6 U controller
enclosure, the default IP address of the management network port on management
module 0 is 192.168.128.101, the default IP address of the management network
port on management module 1 is 192.168.128.102, and the default subnet mask is
255.255.0.0.
The IP address of the controller enclosure's management network port must be on
the same network segment as that of the maintenance terminal. Otherwise, you
need to modify the IP address of the management network port through a serial port
by running the change system management_ip command.
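The same-network-segment requirement above can be checked programmatically before attempting to log in. A minimal sketch using Python's standard ipaddress module, with the default management address and mask quoted above (the function name is an illustrative assumption):

```python
import ipaddress

def same_segment(ip_a, ip_b, netmask):
    """Return True if two IPv4 addresses fall in the same subnet."""
    net_a = ipaddress.ip_network("{}/{}".format(ip_a, netmask), strict=False)
    return ipaddress.ip_address(ip_b) in net_a

# Default management port address and subnet mask from the text above.
print(same_segment("192.168.128.101", "192.168.1.50", "255.255.0.0"))  # -> True
print(same_segment("192.168.128.101", "10.0.0.5", "255.255.0.0"))      # -> False
```

If the check fails, set the maintenance terminal's address into the management segment, or change the management port's address through the serial port as described above.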
CLI commands are classified into the following types:
Configuration and operation commands: These commands are used to create, delete,
and modify configuration data. The configuration data is still stored after the device
is powered off or restarted. These commands are associated with the DIM and
complex classes.
Query commands: These commands are used to query configuration data or running
data in the system. These commands are associated with the configuration, dynamic,
and complex classes.
Maintenance commands: These commands are used to perform operations and report
the current operation results. These commands are associated with the dynamic and
complex classes.
Mining Information Modeling (MIM):
 Data Information Model (DIM): defines data storage table structures, table
relationships, and table fields. DIM is actually a table in the database. Data is
stored in a relational database (RDB).
 Data Operation Model (DOM): DOM can be regarded as a middleman of data
and does not store data.
 COMPLEX (Data Dynamic): combines DIM and DOM; both configuration and
maintenance operations are supported.
 Data Application Model (DAM): data subscription oriented to apps. DAM data
can only be obtained from the DIM table.
4.1.3.3 Storage System Upgrade
Fast upgrade mode (recommended for Dorado V6 series)
Fast upgrade is implemented by replacing processes, without resetting controllers. During
the upgrade, host links are not interrupted and hosts are unaware of the upgrade.
This upgrade method is recommended for OceanStor Dorado V6 series storage systems.
This mode has minimal impact on services. Fast upgrade of controller software will not
restart the controllers, avoiding service switchover between the controllers. However,
specific processes will be reset and restarted. The connection keepalive technique limits
the I/O suspension time within 1s and the read and write performance will restore to
100% within 2s. Even so, you are advised to perform the upgrade during off-peak hours.
Perform the upgrade when the service pressure is light, that is, the CPU pressure of each
controller must be less than 80% and the remaining memory must be greater than 100
MB.
With this mode:
 During the hot upgrade of front-end interface modules, links are not interrupted,
and the I/O blocking duration is less than 1s (in most cases, front-end interface
modules are not upgraded).
 During the hot upgrade of other firmware (back-end interface modules and
enclosure firmware), I/Os are not interrupted.
 The software is quickly upgraded by component. I/O-related upgrades take under 1s,
during which I/Os are not suspended. The total software upgrade takes less than 10s.
Rolling upgrade mode (recommended for OceanStor V5 series)
Rolling upgrade is a highly reliable and available batch controller upgrade mode and
does not interrupt ongoing services. This upgrade mode is perfect for scenarios where it is
essential that services are not interrupted.
When controller software is upgraded in batches, services on the controller to be
upgraded are switched to other normal controllers. Then, the system automatically
checks the firmware that needs to be upgraded on the controller, and the firmware is
upgraded in sequence. After the upgrade is complete, the controller is restarted. Once the
controller is powered on, services that belong to it are switched back. The other
controllers are upgraded in the same way.
This rolling upgrade method restarts controllers, and all of their services are taken over
by the remaining normal controllers. As a result, read and write performance may
decrease by 10% to 20%. Therefore, you are advised to perform the upgrade during
off-peak hours.
Parallel upgrade mode (used when no service is configured)
Parallel upgrade is recommended when a storage system is not associated with any host.
In a parallel upgrade, all controllers are upgraded at the same time, which takes only
half of the time required by a rolling upgrade.
This mode has a higher impact as host services must be stopped before the parallel
upgrade of controller software.
In terms of the network, during the controller restart, the connection between OceanStor
DeviceManager and the storage system is interrupted for about 15 minutes.
By contrast, fast upgrade starts components quickly, and the hot upgrade of firmware
ensures that host links are not interrupted, controllers are not reset, host services are not
affected, and I/Os are not suspended. Therefore, host compatibility does not need to be
evaluated.
4.2 Storage System Troubleshooting
4.2.1 Overview
Storage system faults are classified into minor, major, and critical faults. In terms of their
location, faults can be divided into storage faults and environment faults.
 Storage faults
A storage fault is a storage system fault caused by hardware or software. The fault
information can be obtained using the alarm platform of the storage system.
 Environment faults
An environment fault is a software or hardware fault that occurs when data is
transferred from the host to the storage system over a network. Such faults are
caused by network links. The fault information can be obtained from operating
system logs, application program logs, and switch logs.
Mean time between failures (MTBF) refers to the average time that a new product
runs under specified working environment conditions before it encounters the first
fault. It includes the fault duration and the time for detecting and maintaining the
system.
A longer MTBF indicates higher system reliability.
Mean time to repair (MTTR) refers to the average time that a device is recovered
from a fault and is the expected value of the recovery time of a fault. It includes the
time required to confirm the occurrence of a failure, the time required for
maintenance, the time to obtain the spare parts, the response time of the
maintenance team, the time to record all tasks, and the time to put the device back
into use.
A shorter MTTR indicates better system recoverability.
Availability (usually represented by A) refers to the capability of a repairable product
to perform or maintain its functions at a given time when used under specified
conditions. It can be calculated from MTBF and MTTR: A = MTBF/(MTBF + MTTR)
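The relationship between MTBF, MTTR, and availability can be illustrated with a short Python sketch (the function name and sample figures are illustrative, not taken from any Huawei tool):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example: a system with an MTBF of 9,999 hours and an MTTR of 1 hour
# reaches "four nines" availability.
a = availability(9999, 1)
print(f"A = {a:.4f}")  # → A = 0.9999
```

Note that raising MTBF and lowering MTTR both increase A, which is why the guide treats reliability and recoverability as two sides of availability.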
The fault diagnosis principles can help you quickly exclude useless information and
quickly diagnose faults.
The principles are as follows:
 Analyze external factors first and then internal factors.
When locating faults, prioritize the external factors (optical fibers, optical cables,
power supplies, and customer devices) before internal factors (disks, controllers, and
interface modules).
 Analyze high-severity alarms first and then low-severity alarms.
The alarm severity sequence from high to low is critical, major, and warning.
 Analyze common alarms first and then uncommon alarms.
When analyzing an alarm, confirm whether it is a common or uncommon alarm,
assess its impact, and determine whether the fault affects one component or
multiple components.
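The severity-first principle amounts to a simple triage sort. The sketch below assumes each alarm carries a severity field; the data structure is illustrative, not the storage system's actual alarm format:

```python
# Severity ranking: handle critical alarms first, then major, then warning.
SEVERITY_ORDER = {"critical": 0, "major": 1, "warning": 2}

def triage(alarms):
    """Return alarms sorted so higher-severity ones are handled first."""
    return sorted(alarms, key=lambda a: SEVERITY_ORDER[a["severity"]])

alarms = [
    {"id": 102, "severity": "warning"},
    {"id": 100, "severity": "critical"},
    {"id": 101, "severity": "major"},
]
print([a["id"] for a in triage(alarms)])  # → [100, 101, 102]
```

Because Python's sort is stable, alarms of equal severity keep their arrival order, so within each severity level the earliest alarm is still handled first.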
To improve the emergency handling efficiency and reduce losses caused by emergency
faults, emergency handling must comply with the following principles:
 If an emergency fault causes data loss, stop host services or switch services to a
standby host, and back up the data in a timely manner.
 During emergency handling, record all operations performed.
 Emergency handling personnel must participate in dedicated training courses and
understand the related technologies.
 Recover core services before recovering other services.

4.2.2 Preparations
In the event of a fault, collect and report basic information, fault symptoms, storage
device and system information, application server information, and network information
to help maintenance personnel quickly locate and rectify the fault.
The fault information to be collected includes: file system fault information, volume
management fault information, database fault information, storage system fault
information (mandatory), switch information (mandatory), host information, and HBA
information.
Before collecting the fault information, evaluate the impact of the fault on services, back
up data if necessary, and obtain related authorization information.
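The collection items above, with the two mandatory ones flagged, can be captured in a simple checklist structure (a hypothetical sketch; the item names follow the list above, not any tool's actual schema):

```python
# Fault information to collect; True marks items that are mandatory.
FAULT_INFO_CHECKLIST = {
    "file system fault information": False,
    "volume management fault information": False,
    "database fault information": False,
    "storage system fault information": True,   # mandatory
    "switch information": True,                 # mandatory
    "host information": False,
    "HBA information": False,
}

mandatory = [item for item, required in FAULT_INFO_CHECKLIST.items() if required]
print(mandatory)
# → ['storage system fault information', 'switch information']
```

A checklist like this helps ensure the two mandatory items are never skipped when reporting a fault.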

4.2.3 Case Analysis


4.2.3.1 Hardware Module
If a hardware module is faulty, the indicators on that module show an abnormal state.
Common hardware faults include failures of controllers, interface modules, disk
modules, fan modules, BBUs, power modules, and expansion modules.

Figure 4-5 Interface module troubleshooting flowchart



Figure 4-6 BBU troubleshooting flowchart

4.2.3.2 Management Software


Management software faults are failures in managing and maintaining the storage
system.
Common faults include:
 Failed to activate login through a serial port.
 Failed to log in to DeviceManager.
 DeviceManager operation exceptions
Common troubleshooting methods:
 The preceding faults are typically caused by incorrect serial cable connection or serial
port parameter settings. You can reinsert the serial cable or reset serial port
parameters.
 If a browser incompatibility issue occurs, select a different browser or reset the
browser.

4.2.3.3 Basic Storage Service


Alarms are generated when basic storage service faults occur. You can handle the alarms
according to the alarm handling suggestions.
Common faults include:

 Failed to add an iSCSI link for a remote device.


 Failed to discover LUNs by an application server.
 Deleting a LUN times out.
 The first interconnection between the storage system and an AIX application server is
abnormal due to a special AIX mechanism.
 The storage system fails to detect initiators provided by HP-UX servers.

4.2.3.4 Storage Service Configuration


Value-added service faults are usually caused by improper manual configuration, which
results in system running errors or performance problems.
Common faults include:
 Basic service configuration errors, such as RAID level planning, hot spare policy
planning, LUN parameter planning, and disk domain planning errors
 Advanced function and feature configuration errors, such as remote replication and
SmartPartition configuration errors
 Service network configuration errors, such as host, switch, and service interface
module configuration errors
 Performance optimization errors, such as errors related to Oracle database
performance optimization and VMware VDI performance optimization
 Maintenance errors, such as routine array inspection and array fault diagnosis and
maintenance errors
 Host configuration errors, such as host file system, host operating system
compatibility, and host multipathing configuration errors
Common troubleshooting methods:
Alarms are generated when value-added service faults occur. The system provides
troubleshooting recommendations to help clear the alarms.

4.2.3.5 Multipathing Software


A multipathing software malfunction typically causes storage performance to deteriorate.
Common faults include:
 Failed to load the multipathing software when an application server is
restarted.
 Failed to discover multiple paths on a SUSE application server.
 Blue screen during multipathing software installation on a Windows application
server
Common troubleshooting methods:
The typical cause is that the multipathing software is blocked, either because the server
startup items do not include the multipathing software or because the HBA driver
provides its own failover function. To resolve the problem, unblock the multipathing
software. Also check whether a link, switch, or controller is faulty.
