The NVIDIA GPU Query Properties

The document provides information on running Docker containers to query NVIDIA GPU properties and run the AlexNet model. It then describes various NVIDIA GPU properties that can be queried, including manufacturer settings, user settings, setup properties, runtime properties, and properties not supported by the OS or device.

Running the Collector Container

docker run --name collector -it -d --runtime=nvidia nvidia/cuda:latest \
  nvidia-smi --query-gpu=uuid,power.draw,pcie.link.gen.current,pcie.link.width.current,memory.used,memory.free,utilization.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.memory \
  --format=csv --loop-ms=2000
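
Because the collector runs detached (-d), its CSV samples go to the container log. A minimal sketch for checking that samples are being produced and for following them into a file, assuming the container name "collector" from the command above (the output path is only illustrative):

docker logs --tail 5 collector
docker logs -f collector > /tmp/gpu-metrics.csv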

Running the AlexNet Container

docker run -it --rm -v /opt/ml-benchmarks/input/mlfull:/tmp/input \
  -v /home/wellington/projects/ml-benchmarks/output:/tmp/output \
  --runtime=nvidia mlbench/apps/alexnet/tensorflow/cuda
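
When the AlexNet run finishes, the detached collector keeps sampling. A sketch of stopping it, saving its samples, and cleaning up (the log file name is only illustrative):

docker stop collector
docker logs collector > /home/wellington/projects/ml-benchmarks/output/gpu-metrics.csv
docker rm collector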

The NVIDIA GPU Query Properties

Manufacturer Setting Properties

name or gpu_name: The official product name of the GPU. This is an alphanumeric
string. For all products.

serial or gpu_serial: This number matches the serial number physically printed on each
board. It is a globally unique immutable alphanumeric value.

uuid or gpu_uuid: This value is the globally unique immutable alphanumeric identifier of
the GPU. It does not correspond to any physical label on the board.

memory.total: Total installed GPU memory.

clocks.default_applications.graphics or clocks.default_applications.gr: Default frequency of applications graphics (shader) clock.

clocks.default_applications.memory or clocks.default_applications.mem: Default frequency of applications memory clock.

clocks.max.graphics or clocks.max.gr: Maximum frequency of graphics (shader) clock.

clocks.max.sm: Maximum frequency of SM (Streaming Multiprocessor) clock.

clocks.max.memory or clocks.max.mem: Maximum frequency of memory clock.


pcie.link.gen.max: The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports then this reports the system PCIe generation.

pcie.link.width.max: The maximum PCI-E link width possible with this GPU and system configuration. For example, if the GPU supports a wider PCIe link than the system supports then this reports the system PCIe link width.
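
As a sketch, the manufacturer setting properties listed above can be pulled in a single query, either on the host or inside a CUDA container as in the collector example:

nvidia-smi --query-gpu=name,serial,uuid,memory.total,clocks.max.graphics,clocks.max.sm,clocks.max.memory,pcie.link.gen.max,pcie.link.width.max --format=csv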

User Setting Properties

Setup Properties

inforom.oem: Version for the OEM configuration data.

power.default_limit: The default power management algorithm's power ceiling, in watts. Power Limit will be set back to Default Power Limit after driver unload.

power.min_limit: The minimum value in watts that power limit can be set to.

power.max_limit: The maximum value in watts that power limit can be set to.

display_mode: A flag that indicates whether a physical display (e.g. monitor) is currently
connected to any of the GPU's connectors. "Enabled" indicates an attached display.
"Disabled" indicates otherwise.

pci.bus_id or gpu_bus_id: PCI bus id as "domain:bus:device.function", in hex.

pci.domain: PCI domain number, in hex.

pci.bus: PCI bus number, in hex.

pci.device: PCI device number, in hex.

pci.device_id: PCI vendor device id, in hex.

pci.sub_device_id: PCI SubSystem id, in hex.


driver_version: The version of the installed NVIDIA display driver. This is an
alphanumeric string.

count: The number of NVIDIA GPUs in the system.

index: Zero based index of the GPU. Can change at each boot.

display_active: A flag that indicates whether a display is initialized on the GPU (e.g. memory is allocated on the device for display). Display can be active even when no monitor is physically attached. "Enabled" indicates an active display. "Disabled" indicates otherwise.

persistence_mode: A flag that indicates whether persistence mode is enabled for the
GPU. Value is either "Enabled" or "Disabled". When persistence mode is enabled the
NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi,
exist. This minimizes the driver load latency associated with running dependent apps,
such as CUDA programs. Linux only.
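
Persistence mode is toggled with nvidia-smi rather than only queried; a minimal sketch (Linux only, typically requires root):

nvidia-smi -pm 1
nvidia-smi --query-gpu=persistence_mode --format=csv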

accounting.mode: A flag that indicates whether accounting mode is enabled for the GPU. Value is either "Enabled" or "Disabled". When accounting is enabled statistics are calculated for each compute process running on the GPU. Statistics are available for query after the process terminates. See --help-query-accounted-apps for more info.

accounting.buffer_size: The size of the circular buffer that holds a list of processes that can be queried for accounting stats. This is the maximum number of processes that accounting information will be stored for before information about the oldest processes is overwritten by information about new processes.
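
A sketch of enabling accounting and reading back per-process statistics afterwards (the accounted-apps field names below should be verified against --help-query-accounted-apps on the installed driver):

nvidia-smi -am 1
nvidia-smi --query-accounted-apps=pid,gpu_util,mem_util,max_memory_usage,time --format=csv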

vbios_version: The VBIOS version of the GPU board.

inforom.img or inforom.image: Global version of the infoROM image. The image version, like the VBIOS version, uniquely describes the exact version of the infoROM flashed on the board, in contrast to the infoROM object version, which is only an indicator of supported features.

inforom.pwr or inforom.power: Version for the power management data.

gom.current or gpu_operation_mode.current: The GOM currently in use. GOM allows reducing power usage and optimizing GPU throughput by disabling GPU features. Each GOM is designed to meet specific user needs. In "All On" mode everything is enabled and running at full speed. The "Compute" mode is designed for running only compute tasks; graphics operations are not allowed. The "Low Double Precision" mode is designed for running graphics applications that don't require high bandwidth double precision. GOM can be changed with the (--gom) flag.

gom.pending or gpu_operation_mode.pending: The GOM that will be used on the next reboot.
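
On boards that support GPU Operation Mode, the pending GOM can be set and then confirmed; a hedged sketch (the change takes effect after the next reboot):

nvidia-smi --gom=COMPUTE
nvidia-smi --query-gpu=gom.current,gom.pending --format=csv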

clocks_throttle_reasons.supported: Bitmask of supported clock throttle reasons. See nvml.h for more details. Retrieves information about factors that are reducing the frequency of clocks. If all throttle reasons are returned as "Not Active" it means that clocks are running as high as possible.

clocks_throttle_reasons.active: Bitmask of active clock throttle reasons. See nvml.h for more details.

clocks_throttle_reasons.gpu_idle: Nothing is running on the GPU and the clocks are dropping to Idle state. This limiter may be removed in a later release.

clocks_throttle_reasons.applications_clocks_setting: GPU clocks are limited by applications clocks setting. E.g. can be changed by nvidia-smi --applications-clocks=

clocks_throttle_reasons.sw_power_cap: SW Power Scaling algorithm is reducing the clocks below requested clocks because the GPU is consuming too much power. E.g. SW power cap limit can be changed with nvidia-smi --power-limit=

clocks_throttle_reasons.hw_slowdown: HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of:
* temperature being too high
* External Power Brake Assertion is triggered (e.g. by the system power supply)
* Power draw is too high and Fast Trigger protection is reducing the clocks
* May be also reported during PState or clock change
* This behavior may be removed in a later release

clocks_throttle_reasons.unknown: Some other unspecified factor is reducing the clocks.
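
The throttle reasons above are ordinary query fields, so an active power cap or HW slowdown can be spotted with something like:

nvidia-smi --query-gpu=clocks_throttle_reasons.active,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown --format=csv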

compute_mode: The compute mode flag indicates whether individual or multiple compute applications may run on the GPU.
*"Default" means multiple contexts are allowed per device.
*"Exclusive_Thread" means only one context is allowed per device, usable from one thread at a time.
*"Exclusive_Process" means only one context is allowed per device, usable from multiple threads at a time.
*"Prohibited" means no contexts are allowed per device (no compute apps).

power.management: A flag that indicates whether power management is enabled. Either "Supported" or "[Not Supported]". Requires Inforom PWR object version 3.0 or higher or Kepler device.

power.limit: The power management algorithm's power ceiling, in watts. Total board
power draw is manipulated by the power management algorithm so that it stays under
this value. On Kepler devices Power Limit can be adjusted using [-pl | --power-limit=]
switches.

clocks.applications.graphics or clocks.applications.gr: User specified frequency of graphics (shader) clock; the frequency at which applications will run. Can be changed with [-ac | --applications-clocks] switches.

clocks.applications.memory or clocks.applications.mem: User specified frequency of memory clock.
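
The user setting properties above map onto the switches they mention; a sketch of setting a power limit and application clocks and reading them back (the 200 W and 3004,875 MHz values are only examples and must fall within the limits reported for the board):

nvidia-smi -pl 200
nvidia-smi -ac 3004,875
nvidia-smi --query-gpu=power.limit,clocks.applications.memory,clocks.applications.graphics --format=csv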

Runtime Properties

timestamp: The timestamp of when the query was made, in format "YYYY/MM/DD HH:MM:SS.msec".

pcie.link.gen.current: The current PCI-E link generation. This may be reduced when the GPU is not in use.

pcie.link.width.current: The current PCI-E link width. This may be reduced when the GPU is not in use.

fan.speed: The fan speed value is the percent of maximum speed that the device's fan is
currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the
intended fan speed. If the fan is physically blocked and unable to spin, this output will
not match the actual fan speed. Many parts do not report fan speeds because they rely
on cooling via fans in the surrounding enclosure.
pstate: The current performance state for the GPU. States range from P0 (maximum
performance) to P12 (minimum performance).

memory.used: Total memory allocated by active contexts.

memory.free: Total free memory.

utilization.gpu: Percent of time over the past second during which one or more kernels
was executing on the GPU.

utilization.memory: Percent of time over the past second during which global (device)
memory was being read or written.

temperature.gpu: Core GPU temperature, in degrees C.

power.draw: The last measured power draw for the entire board, in watts. Only available
if power management is supported. This reading is accurate to within +/- 5 watts.

clocks.current.graphics or clocks.gr: Current frequency of graphics (shader) clock.

clocks.current.sm or clocks.sm: Current frequency of SM (Streaming Multiprocessor) clock.

clocks.current.memory or clocks.mem: Current frequency of memory clock.
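
These runtime properties are what the collector container samples; the same data can be watched directly on the host, for example once per second:

nvidia-smi --query-gpu=timestamp,pstate,temperature.gpu,fan.speed,utilization.gpu,utilization.memory,memory.used,memory.free,power.draw,clocks.current.graphics,clocks.current.sm,clocks.current.memory --format=csv -l 1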

Not Supported by the OS or Device

inforom.ecc: Version for the ECC recording data.

ecc.mode.current: The Error Correcting Code (ECC) mode that the GPU is currently
operating under.

ecc.mode.pending: The ECC mode that the GPU will operate under after the next reboot.

ecc.errors.corrected.volatile.device_memory: Errors detected in global device memory.

ecc.errors.corrected.volatile.register_file: Errors detected in register file memory.

ecc.errors.corrected.volatile.l1_cache: Errors detected in the L1 cache.


ecc.errors.corrected.volatile.l2_cache: Errors detected in the L2 cache.

ecc.errors.corrected.volatile.texture_memory: Parity errors detected in texture memory.

ecc.errors.corrected.volatile.total: Total errors detected across the entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.

ecc.errors.corrected.aggregate.device_memory: Errors detected in global device memory.

ecc.errors.corrected.aggregate.register_file: Errors detected in register file memory.

ecc.errors.corrected.aggregate.l1_cache: Errors detected in the L1 cache.

ecc.errors.corrected.aggregate.l2_cache: Errors detected in the L2 cache.

ecc.errors.corrected.aggregate.texture_memory: Parity errors detected in texture memory.

ecc.errors.corrected.aggregate.total: Total errors detected across the entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.

ecc.errors.uncorrected.volatile.device_memory: Errors detected in global device memory.

ecc.errors.uncorrected.volatile.register_file: Errors detected in register file memory.

ecc.errors.uncorrected.volatile.l1_cache: Errors detected in the L1 cache.

ecc.errors.uncorrected.volatile.l2_cache: Errors detected in the L2 cache.

ecc.errors.uncorrected.volatile.texture_memory: Parity errors detected in texture memory.

ecc.errors.uncorrected.volatile.total: Total errors detected across the entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.

ecc.errors.uncorrected.aggregate.device_memory: Errors detected in global device memory.

ecc.errors.uncorrected.aggregate.register_file: Errors detected in register file memory.

ecc.errors.uncorrected.aggregate.l1_cache: Errors detected in the L1 cache.

ecc.errors.uncorrected.aggregate.l2_cache: Errors detected in the L2 cache.

ecc.errors.uncorrected.aggregate.texture_memory: Parity errors detected in texture memory.

ecc.errors.uncorrected.aggregate.total: Total errors detected across the entire chip. Sum of device_memory, register_file, l1_cache, l2_cache and texture_memory.

retired_pages.single_bit_ecc.count or retired_pages.sbe: The number of GPU device memory pages that have been retired due to multiple single bit ECC errors.

retired_pages.double_bit.count or retired_pages.dbe: The number of GPU device memory pages that have been retired due to a double bit ECC error.

retired_pages.pending: Checks if any GPU device memory pages are pending retirement
on the next reboot. Pages that are pending retirement can still be allocated, and may
cause further reliability issues.

driver_model.current: The driver model currently in use. Always "N/A" on Linux.

driver_model.pending: The driver model that will be used on the next reboot. Always
"N/A" on Linux.
