CPU core ordering/numbering on a 2-chip Intel Westmere system
I am using an Intel Westmere processor. The machine has 12 CPU cores arranged on 2 chips, so each chip contains 6 cores.
I don't know how the CPU cores are ordered or numbered. My guess is that it is one of the following:
- cores 0, 1, 2, 3, 4, and 5 are on one chip and cores 6, 7, 8, 9, 10, and 11 are on the second chip
- cores 0, 2, 4, 6, 8, and 10 are on one chip and cores 1, 3, 5, 7, 9, and 11 are on the second chip
Does anyone know the ordering/numbering of the CPU cores?
For more information you can try to use this tool: http://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
It is the official tool to determine that.
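If you are on Linux and just want a quick look without that tool, the kernel exposes the same mapping under sysfs. The following is only a minimal sketch, assuming the usual /sys/devices/system/cpu/cpuN/topology/physical_package_id and core_id files are present; the function name read_topology is made up for illustration.

```python
import os
import re
from collections import defaultdict


def read_topology(sysfs_root="/sys/devices/system/cpu"):
    """Map physical package id -> sorted list of (core_id, os_cpu_number).

    Assumes the Linux sysfs layout cpuN/topology/{physical_package_id,core_id}.
    Returns an empty dict if the directory does not exist (e.g. not Linux).
    """
    if not os.path.isdir(sysfs_root):
        return {}
    packages = defaultdict(list)
    for entry in os.listdir(sysfs_root):
        match = re.fullmatch(r"cpu(\d+)", entry)
        if not match:
            continue  # skip non-CPU entries such as cpufreq or cpuidle
        topo = os.path.join(sysfs_root, entry, "topology")
        try:
            with open(os.path.join(topo, "physical_package_id")) as f:
                package = int(f.read())
            with open(os.path.join(topo, "core_id")) as f:
                core = int(f.read())
        except OSError:
            continue  # CPUs without readable topology files are skipped
        packages[package].append((core, int(match.group(1))))
    return {pkg: sorted(cpus) for pkg, cpus in sorted(packages.items())}


if __name__ == "__main__":
    for package, cpus in read_topology().items():
        for core, cpu in cpus:
            print(f"package {package}  core {core}  OS cpu {cpu}")
```

Two OS CPU numbers that share both the package and the core id are hyper-threads of the same physical core.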
Here is an example run of the tool on a machine with two physical Intel X5560 CPUs (4 cores + Hyper-Threading each, as the output shows) running CentOS 5.3 (might be a bit old).
Package 0 Cache and Thread details
Box Description:
Cache is cache level designator
Size is cache size
OScpu# is cpu # as seen by OS
Core is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache
CmbMsk will differ from AffMsk if > 1 hw_thread/cache
Extended Hex replaces trailing zeroes with 'z#'
where # is number of zeroes (so '8z5' is '0x800000')
L1D is Level 1 Data cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 4
L1I is Level 1 Instruction cache, size(KBytes)= 32, Cores/cache= 2, Caches/package= 4
L2 is Level 2 Unified cache, size(KBytes)= 256, Cores/cache= 2, Caches/package= 4
L3 is Level 3 Unified cache, size(KBytes)= 8192, Cores/cache= 8, Caches/package= 1
      +-----------+-----------+-----------+-----------+
Cache |  L1D      |  L1D      |  L1D      |  L1D      |
Size  |  32K      |  32K      |  32K      |  32K      |
OScpu#|    0     8|    1     9|    2    10|    3    11|
Core  |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|
AffMsk|    1   100|    2   200|    4   400|    8   800|
CmbMsk|  101      |  202      |  404      |  808      |
      +-----------+-----------+-----------+-----------+
Cache |  L1I      |  L1I      |  L1I      |  L1I      |
Size  |  32K      |  32K      |  32K      |  32K      |
      +-----------+-----------+-----------+-----------+
Cache |  L2       |  L2       |  L2       |  L2       |
Size  |  256K     |  256K     |  256K     |  256K     |
      +-----------+-----------+-----------+-----------+
Cache |  L3                                           |
Size  |  8M                                           |
CmbMsk|  f0f                                          |
      +-----------------------------------------------+
Combined socket AffinityMask= 0xf0f
Package 1 Cache and Thread details
Box Description:
Cache is cache level designator
Size is cache size
OScpu# is cpu # as seen by OS
Core is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache
CmbMsk will differ from AffMsk if > 1 hw_thread/cache
Extended Hex replaces trailing zeroes with 'z#'
where # is number of zeroes (so '8z5' is '0x800000')
      +-----------+-----------+-----------+-----------+
Cache |  L1D      |  L1D      |  L1D      |  L1D      |
Size  |  32K      |  32K      |  32K      |  32K      |
OScpu#|    4    12|    5    13|    6    14|    7    15|
Core  |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|
AffMsk|   10   1z3|   20   2z3|   40   4z3|   80   8z3|
CmbMsk|  1010     |  2020     |  4040     |  8080     |
      +-----------+-----------+-----------+-----------+
Cache |  L1I      |  L1I      |  L1I      |  L1I      |
Size  |  32K      |  32K      |  32K      |  32K      |
      +-----------+-----------+-----------+-----------+
Cache |  L2       |  L2       |  L2       |  L2       |
Size  |  256K     |  256K     |  256K     |  256K     |
      +-----------+-----------+-----------+-----------+
Cache |  L3                                           |
Size  |  8M                                           |
CmbMsk|  f0f0                                         |
      +-----------------------------------------------+
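To read the masks in that output, it can help to expand the "extended hex" notation and list the set bits. The snippet below is just a sketch based on the legend printed by the tool ('8z5' is '0x800000'); the helper names parse_extended_hex and cpus_in_mask are made up here and are not part of the Intel tool.

```python
import re


def parse_extended_hex(text):
    """Convert 'extended hex' such as '8z5' (= 0x800000) to an integer mask."""
    match = re.fullmatch(r"(?:0x)?([0-9a-fA-F]+)(?:z(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"not an extended-hex mask: {text!r}")
    digits, zero_count = match.groups()
    # 'z#' stands for # trailing hex zeroes, so append them before parsing.
    return int(digits + "0" * int(zero_count or 0), 16)


def cpus_in_mask(mask):
    """List the OS CPU numbers whose bits are set in an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if mask >> bit & 1]


if __name__ == "__main__":
    print(cpus_in_mask(parse_extended_hex("f0f")))   # [0, 1, 2, 3, 8, 9, 10, 11]
    print(cpus_in_mask(parse_extended_hex("f0f0")))  # [4, 5, 6, 7, 12, 13, 14, 15]
    print(cpus_in_mask(parse_extended_hex("1z3")))   # [12]
```

So on this machine the socket mask f0f means OS CPUs 0-3 and 8-11 sit on package 0, and f0f0 means OS CPUs 4-7 and 12-15 sit on package 1, which matches the OScpu# rows.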
Core numbers are supposed to be interleaved so that taking successive cores spreads the load as much as possible. If 0 and 1 were on the same chip, then naive code that only used two cores would be wasting half the cache.
So numbered cores should first alternate physical CPUs. They should next alternate dies, if possible. They should then go through the cores on a single die. They should then include virtual cores, if possible.
So if you had two physical CPUs (P1, P2), each dual core (C1, C2) and each hyper-threaded (V1, V2), the cores should go: P1C1V1, P2C1V1, P1C2V1, P2C2V1, P1C1V2, P2C1V2, P1C2V2, P2C2V2.
The rationale is to allow code that doesn't understand the CPU topology to just grab as many cores as it knows how to use and get optimal performance. If you could only support two cores, you want P1C1V1 and P2C1V1, not P1C1V1 and P1C1V2, or you'd be massively wasting cache and execution units.
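The numbering scheme described above can be sketched as a nested enumeration in which the physical CPU varies fastest, then the core, then the virtual core. This is only an illustration of the rule as stated here (the die level is left out, as in the example), not a statement of what any particular BIOS or OS actually does; interleaved_order is a made-up name.

```python
from itertools import product


def interleaved_order(packages, cores, threads):
    """Enumerate logical CPUs with the package index varying fastest,
    then the core index, then the hardware-thread (virtual core) index."""
    return [
        f"P{p}C{c}V{v}"
        # product() varies its last argument fastest, so packages go last.
        for v, c, p in product(range(1, threads + 1),
                               range(1, cores + 1),
                               range(1, packages + 1))
    ]


if __name__ == "__main__":
    order = interleaved_order(packages=2, cores=2, threads=2)
    print(", ".join(order))
    # P1C1V1, P2C1V1, P1C2V1, P2C2V1, P1C1V2, P2C1V2, P1C2V2, P2C2V2
    print(order[:2])  # the first two logical CPUs land on different packages
```

Code that simply takes the first N entries of this list gets N distinct physical cores spread across both packages before it ever shares a core through hyper-threading.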