Updated: Mar 9
Memory and CPU configuration on AMD EPYC Rome series CPUs onward has become relevant enough that I have been waxing lyrical on the subject a tad, hence this blog!
Understanding the relationship between a server CPU and its memory subsystem is critical when tuning a Fred for overall server performance.
Every CPU generation has a unique architecture with volatile controllers, channels and slot population guidelines that must be satisfied to attain high memory bandwidth and low memory access latency.
2nd Generation AMD EPYC Rome CPUs offer a total of eight memory channels with up to two memory slots per channel.
This presents numerous possible permutations for configuring the memory subsystem with traditional Dual In-Line Memory Modules (DIMMs), yet there are only a couple of balanced configurations that will achieve peak memory performance on most of the OEM EPYC server fare out there right now.
Memory that has been incorrectly populated is referred to as an unbalanced configuration, and with EPYC Rome CPUs this can mean up to 20% CPU overhead to deal with it.
From a functionality standpoint, an unbalanced configuration will operate adequately but introduces said significant additional CPU overhead that will slow down data transfer speeds and generally piss the administrator of said Fred off considerably.
Similarly, a near balanced configuration does not yield fully optimized data transfer speeds but it is only sub-optimal compared to that of a balanced configuration.
Conversely, memory that has been correctly populated is referred to as a balanced configuration and will secure both optimal functionality and sweet data transfer speeds.
This blog covers how to balance memory configured for AMD EPYC Rome processors onward. Milan is being released now and next year it will be the turn of the Genoa fare.
To understand the relationship between the CPU and memory, terminology illustrated in the diagram above must first be defined:
Memory controllers are digital circuits that manage the flow of data going from the computer’s main memory to the corresponding memory channels. Rome processors have eight memory controllers in the processor I/O die, with one controller assigned per channel.
Memory channels are the physical layer on which the data travels between the CPU and memory modules. As seen in the diagram above, Rome processors have eight memory channels designated A, B, C, D, E, F, G and H. These channels were intended to be organized into groups such as two-way (AB, CD, EF, GH), four-way (ABCD, EFGH) or eight-way (ABCDEFGH).
The memory slots are internal ports that connect the individual DIMMs to their respective channels. Rome processors have two slots per channel, so there are a total of sixteen slots per CPU for memory module population. DIMM 1 slots are the first eight slots to be populated while DIMM 0 slots are the last eight. In the illustrations ahead, DIMM 1 slots will be represented with black text marked A1-A8 and DIMM 0 slots will be represented with white text marked A9-A16.
The memory subsystem is the combination of all the independent memory functions listed above.
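For those who prefer to see the terminology as data, here is a minimal Python sketch of the layout described above. The names (CHANNELS, slots_per_cpu, channel_groups) are my own for illustration and do not come from any AMD or OEM tool.

```python
# Illustrative model of one Rome CPU's memory layout, as described above.
CHANNELS = "ABCDEFGH"       # eight memory channels, one controller each
SLOTS_PER_CHANNEL = 2       # DIMM 1 slot and DIMM 0 slot


def slots_per_cpu(channels=CHANNELS, slots_per_channel=SLOTS_PER_CHANNEL):
    """Total DIMM slots available on a single CPU."""
    return len(channels) * slots_per_channel


def channel_groups(way):
    """Split the channels into two-way, four-way or eight-way groups."""
    if way not in (2, 4, 8):
        raise ValueError("way must be 2, 4 or 8")
    return [CHANNELS[i:i + way] for i in range(0, len(CHANNELS), way)]


if __name__ == "__main__":
    print(slots_per_cpu())      # total slots per CPU
    print(channel_groups(2))    # two-way groups
    print(channel_groups(4))    # four-way groups
```

Eight channels times two slots gives the sixteen slots per CPU mentioned above, and the grouping function reproduces the AB/CD/EF/GH and ABCD/EFGH sets.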
Memory interleaving allows a CPU to efficiently spread memory accesses across multiple DIMMs.
When memory is put in the same interleave set, contiguous memory accesses go to different memory banks.
Memory accesses no longer need to wait until the prior access is completed before initiating the next memory operation!! Try doing that on a Xeon and it will promptly melt all the DIMMs concurrently!!
For most workloads, performance is maximized when all DIMMs are in one interleave set creating a single uniform memory region that is spread across as many DIMMs as possible.
Multiple interleave sets create disjointed memory regions.
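To make the interleaving idea concrete, here is a toy Python model. It is a deliberately simplified sketch: it rotates through the channels in fixed-size chunks, whereas the real address mapping in the silicon is more involved. The 256-byte granule and the function name are assumptions picked purely for illustration.

```python
# Toy model of an interleave set: consecutive chunks of the address space
# land on consecutive channels. NOT the real hardware mapping; the
# 256-byte granule is an assumed value for illustration only.
def channel_for_address(addr, channels="ABCDEFGH", granule=256):
    """Return the channel a byte address maps to in this toy model."""
    return channels[(addr // granule) % len(channels)]


if __name__ == "__main__":
    # Eight contiguous chunks in one eight-way set hit eight different channels.
    print([channel_for_address(i * 256) for i in range(8)])
    # The same accesses in a two-channel set keep bouncing between A and B.
    print([channel_for_address(i * 256, channels="AB") for i in range(8)])
```

The wider the interleave set, the more channels a contiguous run of accesses is spread across, which is why one big set usually wins.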
Rome processors achieve memory interleaving by using Non-Uniform Memory Access (NUMA) in Nodes Per Socket (NPS).
There are four NPS options available in most OEM BIOS offerings from the usual suspects (Dell/HPE/Lenovo/Cisco/Fujitsu et al):
NPS 0 – One NUMA node per system (on two-processor systems only). This means all channels in the system are using one interleave set.
NPS 1 – One NUMA node per socket (on one-processor systems). This means all channels in the socket are using one interleave set.
NPS 2 – Two NUMA nodes per socket (one per left/right half). This means each half containing four channels is using one interleave set; a total of two sets.
NPS 4 – Up to four NUMA nodes per socket (one per quadrant). This means each quadrant containing two channels is using one interleave set; a total of four sets.
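The four options above can be sketched in a few lines of Python. This is only a model of the descriptions in the list, with names of my own choosing (interleave_sets, the "CPU0:A" labels); it does not query a BIOS or any real hardware.

```python
# Model of which channels share an interleave set under each NPS option.
CHANNELS = "ABCDEFGH"


def interleave_sets(nps, sockets=2):
    """Return the interleave sets as lists of 'CPUn:channel' labels."""
    if nps == 0:
        if sockets != 2:
            raise ValueError("NPS 0 applies to two-processor systems only")
        # One set spanning every channel on both sockets.
        return [[f"CPU{s}:{c}" for s in range(sockets) for c in CHANNELS]]
    if nps not in (1, 2, 4):
        raise ValueError("nps must be 0, 1, 2 or 4")
    width = len(CHANNELS) // nps   # channels per interleave set
    return [
        [f"CPU{s}:{c}" for c in CHANNELS[i:i + width]]
        for s in range(sockets)
        for i in range(0, len(CHANNELS), width)
    ]


if __name__ == "__main__":
    for nps in (1, 2, 4):
        print(nps, interleave_sets(nps, sockets=1))
```

Note that NPS 4 is described as "up to" four nodes per socket, so a given CPU model may expose fewer sets than this sketch returns.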
The simplest visual aid for understanding the NPS system is to divide the CPU into four quadrants.
We see in the diagram below that each quadrant contains two paired DIMM channels, each of which can host up to two DIMMs.
The paired DIMM channels in each quadrant were designed to group and minimize the travel distance for interleaved sets.
NPS 1 would correlate to all four quadrants being fully populated. NPS 2 would correlate to having either the left or right half quadrant being fully populated. NPS 4 would correlate to having any one quadrant being fully populated.
NPS 0 and NPS 1 will typically yield the best memory performance, followed by NPS 2 and then NPS 4.
The server OEM's default setting for BIOS NUMA NPS is NPS 1, and it may need to be manually adjusted to match an NPS option that the CPU model supports.
As seen in the table below, some EPYC CPUs do not support NPS 2 or NPS 4, so it pays to know which memory configurations are optimized for each CPU.
This is not something to worry about for Nutanix AOS systems on the HCL but be aware it can be set on DX series stuff as well as the DL325 series from HPE by the astute tinkerer.
The Lenovo HX EPYC stuff for Nutanix AOS also features this in its BIOS, as does the Dell EPYC fare.
The table below shows the Dell recommended NPS setting for each number of DIMMs per CPU:
If the NPS setting for a memory configuration does limit performance as seen in the table below, Dell EMC BIOS (and most EPYC BIOS) will return the following informative prompts to the user:
UEFI0391: Memory configuration supported but not optimal for the enabled NUMA node Per Socket(NPS) setting. Please consider the following actions:
Changing NPS setting under System Setup>System BIOS>Processor Settings>NUMA Nodes Per Socket, if such is supported.
For optimized memory configurations please refer to the General Memory Module Installation Guidelines section in the OEM Installation and Service Manual, of the respective server model available on their support sites. In layman’s terms, a different NPS setting or memory configuration will result in better memory performance. The system is fully functional when this message appears, but it is not fully optimized for best performance.
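If you want to sanity-check from the OS side what the BIOS actually handed over, a rough sketch for Linux follows. It counts the node directories under /sys/devices/system/node, which is where the Linux kernel exposes NUMA nodes. The helper names are mine, and the expected count is just NPS times sockets: CPUs that expose fewer nodes than the NPS setting implies, or other BIOS options that change the NUMA layout, will not match this simple arithmetic.

```python
import os
import re

_NODE_DIR = re.compile(r"^node(\d+)$")


def numa_node_ids(names):
    """Pick NUMA node ids out of directory names such as 'node0', 'node1'."""
    ids = []
    for name in names:
        match = _NODE_DIR.match(name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def count_numa_nodes(sysfs_dir="/sys/devices/system/node"):
    """Count the NUMA nodes the Linux kernel sees (Linux only)."""
    return len(numa_node_ids(os.listdir(sysfs_dir)))


def expected_numa_nodes(nps, sockets):
    """Node count implied by the NPS setting: NPS 0 is one node per system."""
    return 1 if nps == 0 else nps * sockets


if __name__ == "__main__":
    print("kernel sees", count_numa_nodes(), "NUMA node(s)")
```

On a two-socket box set to NPS 1 you would expect two nodes; if the kernel reports something else, go and have a look at the BIOS.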
Memory Population Guidelines
DIMMs must be populated into a balanced configuration to yield the highest memory bandwidth and lowest memory access latency. Various factors will dictate whether a configuration is balanced or not.
Please follow the guidelines below for best results:
Memory Channel Population
Balanced Configuration – All memory channels must be populated with one or two DIMMs each for best performance; a total of eight or sixteen DIMMs per CPU
Near Balanced Configuration
- Populate four or twelve DIMMs per socket
- Populate DIMMs in sequential order (A1-A8)
CPU and DIMM parts must be identical
Each CPU must be identically configured with memory
To achieve a balanced configuration, populate either eight or sixteen DIMMs per CPU.
By loading each channel with one or two DIMMs, the configuration is balanced and has data traveling across channels most efficiently on one interleave set.
Following this guideline will yield the highest memory bandwidth and the lowest memory latency.
If a balanced configuration of sixteen or eight DIMMs per CPU cannot be implemented, then the next best option is a near balanced configuration.
To obtain a near balanced population, populate four or twelve DIMMs per CPU in sequential order.
When any number of DIMMs other than 4, 8, 12 or 16 is populated, disjointed memory regions are created making NPS 4 the only supported BIOS option to select.
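The counting rule above boils down to a small lookup. Here is a hedged Python sketch of it; the function name and the three labels are mine, and it only looks at the DIMM count per CPU, not at which slots are used or whether the DIMMs match.

```python
def classify_population(dimms_per_cpu):
    """Classify a per-CPU DIMM count using the 4/8/12/16 rule above."""
    if not 1 <= dimms_per_cpu <= 16:
        raise ValueError("a Rome CPU has between 1 and 16 populated slots")
    if dimms_per_cpu in (8, 16):
        return "balanced"
    if dimms_per_cpu in (4, 12):
        return "near balanced"
    return "unbalanced"


if __name__ == "__main__":
    for count in (4, 6, 8, 12, 16):
        print(count, classify_population(count))
```

Anything that comes back "unbalanced" is the case where disjointed memory regions appear.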
The last guideline is that DIMMs must be populated in an assembly order because Rome processors have an organized architecture for each type of CPU core count.
To simplify this concept, the lowest core count was used as a common denominator, so the assembly order below will apply across all Rome processor types.
Populating in this order ensures that for every unique Rome processor, any DIMM configuration is guaranteed the lowest NPS option, therefore driving the most efficient interleave sets and data transfer speeds.
The diagram below denotes the assembly order in which individual DIMMs should be populated, starting with A1 and ending with A16:
Identical DIMMs must be used across all DIMM slots (i.e. the same OEM part number).
No OEM server manufacturer supports DIMM mixing in AMD EPYC Rome systems.
This means that only one rank, speed, capacity and DIMM type shall exist within the system.
This principle applies to the processors as well; multi-socket Rome based systems shall be populated with identical CPUs, always!!
Every CPU socket within a server must have identical memory configurations.
When only one unique memory configuration exists across both CPU sockets within a server, memory access is further optimized.
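The two rules above (one DIMM type everywhere, every socket populated the same way) are easy to check against an inventory list. Below is a minimal sketch, assuming you have already pulled the DIMM details from somewhere; the Dimm fields and the part numbers are hypothetical, and it compares DIMM types and counts only, not slot positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    part_number: str    # OEM part number (hypothetical values below)
    capacity_gb: int
    ranks: int
    speed_mts: int


def check_population(sockets):
    """sockets maps a socket name to the list of Dimm objects installed on it.

    Returns a list of problems; an empty list means both rules hold.
    """
    problems = []
    kinds = {dimm for dimms in sockets.values() for dimm in dimms}
    if len(kinds) > 1:
        problems.append(f"mixed DIMMs: {len(kinds)} distinct types installed")
    counts = {name: len(dimms) for name, dimms in sockets.items()}
    if len(set(counts.values())) > 1:
        problems.append(f"sockets populated unevenly: {counts}")
    return problems


if __name__ == "__main__":
    d32 = Dimm("HYPOTHETICAL-32G", 32, 2, 3200)
    print(check_population({"CPU0": [d32] * 8, "CPU1": [d32] * 8}))
    print(check_population({"CPU0": [d32] * 8, "CPU1": [d32] * 4}))
```

An empty list back means the server passes both rules; anything else is worth fixing before benchmarking.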
The diagram below for a Dell R6525 depicts the expected memory bandwidth curve when these rules are strictly followed:
Note how performance INCREASES with bigger and more DIMMs.
I have seen this on ALL EPYC Rome OEM server variants as well.
I had a Lenovo SR665 with dual 64-core 7742 CPUs in it with 32 x 128GB DIMMs and saw the same thing.
I also saw a 20% hit on CPU performance with some DIMM types, i.e. smaller DIMMs.
As such, I suggest the smallest DIMM size for these CPUs be 32GB. If giving up 20% CPU performance makes your boat float, well hell, go for it!!
Balanced configurations satisfy NPS 0/1 conditions by requiring each memory channel to be populated with one or two identical DIMMs.
By doing this, one interleave set can optimally distribute memory access requests across all the available DIMM slots, thereby maximizing performance.
EPYC Rome memory controller logic was designed around fully populated memory channels, so it should come as no surprise that eight or sixteen populated DIMMs are recommended.
Having eight DIMMs will reap the highest memory bandwidth while having sixteen DIMMs will yield the highest memory capacity.
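For a back-of-the-envelope feel of why populated channels matter, the theoretical peak is simply channels times transfer rate times bus width. The sketch below assumes DDR4-3200 (the fastest speed Rome supports) and the standard 64-bit, i.e. 8-byte, DDR4 data bus per channel; these numbers are my assumptions, not figures from the Dell diagram, and real measured bandwidth will come in below the theoretical peak.

```python
def peak_bandwidth_gbs(populated_channels, transfer_rate_mts=3200, bus_bytes=8):
    """Theoretical peak memory bandwidth per CPU in GB/s.

    transfer_rate_mts: DIMM speed in mega-transfers per second (assumed DDR4-3200).
    bus_bytes: bytes moved per transfer on one channel (64-bit DDR4 data bus).
    """
    if not 0 <= populated_channels <= 8:
        raise ValueError("a Rome CPU has eight memory channels")
    return populated_channels * transfer_rate_mts * bus_bytes / 1000


if __name__ == "__main__":
    print(peak_bandwidth_gbs(8))    # all eight channels populated
    print(peak_bandwidth_gbs(4))    # only four channels populated
```

If a platform runs two DIMMs per channel at a lower transfer rate, pass that rate as transfer_rate_mts to see the effect.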
I saw this across all tested OEM servers that host AMD EPYC Rome and later CPUs.
8 is sweet!!
Sweet 16 is better!!
Thanks to Dell, HPE, Lenovo and AMD folks who contributed as well as the Intel folks at the Tel Aviv office in Israel for their input.
To the Inspur guys, hey, what's taking so long for an EPYC box for Nutanix??? Giddy-up already!! That NF5488A5 thang looks like the business?!