Computer Architecture Lecture Notes - Nguyễn Kim Khánh



Figure 18.8 Multicore Organization Alternatives: (a) dedicated L1 cache,
(c) shared L2 cache, (d) shared L3 cache. Each panel shows CPU cores 1 to n
with L1-D and L1-I caches, main memory, and I/O.
Jan2014 Computer Architecture 525
NKK-HUST
Intel - Core Duo
• 2006
• Two x86 superscalar processors, shared L2 cache
• Dedicated L1 cache per core
• 32 KB instruction and 32 KB data
• 2 MB shared L2 cache
676 CHAPTER 18 / MULTICORE COMPUTERS
18.4 INTEL x86 MULTICORE ORGANIZATION
Intel has introduced a number of multicore products in recent years. In this section,
we look at two examples: the Intel Core Duo and the Intel Core i7-990X.
Intel Core Duo
The Intel Core Duo, introduced in 2006, implements two x86 superscalar processors
with a shared L2 cache (Figure 18.8c).
The general structure of the Intel Core Duo is shown in Figure 18.9. Let us
consider the key elements starting from the top of the figure. As is common in
multicore systems, each core has its own dedicated L1 cache. In this case, each
core has a 32-kB instruction cache and a 32-kB data cache.
Each core has an independent thermal control unit. With the high transistor
density of today’s chips, thermal management is a fundamental capability,
especially for laptop and mobile systems. The Core Duo thermal control unit is
designed to manage chip heat dissipation to maximize processor performance
within thermal constraints. Thermal management also improves ergonomics with a
cooler system and lower fan acoustic noise. In essence, the thermal management
unit monitors digital sensors for high-accuracy die temperature measurements.
Each core can be defined as an independent thermal zone. The maximum
temperature for each
Figure 18.9 Intel Core Duo Block Diagram. (Diagram labels, per core: thermal
control, APIC, 32-kB L1 caches, execution resources, arch. state. Shared:
power management logic, 2 MB L2 shared cache, bus interface, front-side bus.)
Intel Core i7-990X
The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each
core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache.
One mechanism Intel uses to make its caches more effective is prefetching, in
which the hardware examines memory access patterns and attempts to fill the
caches speculatively with data that’s likely to be requested soon. It is
interesting to compare the performance of this three-level on-chip cache
organization with a comparable two-level organization from Intel. Table 18.1
shows the cache access latency, in terms of clock cycles, for two Intel
multicore systems running at the same clock frequency. The Core 2 Quad has a
shared L2 cache, similar to the Core Duo. The Core i7 improves on L2 cache
performance with the use of the dedicated L2 caches, and provides relatively
high-speed access to the L3 cache.
The Core i7-990X chip supports two forms of external communications to
other chips. The DDR3 memory controller brings the memory controller for the
DDR main memory (see footnote 2) onto the chip. The interface supports three
channels that are 8 bytes wide for a total bus width of 192 bits, for an
aggregate data rate of up to 32 GB/s. With the memory controller on the chip,
the Front Side Bus is eliminated.
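The bus width and aggregate data rate quoted for the DDR3 interface follow from
channels × bytes per transfer × transfer rate. The sketch below is only a check
of that arithmetic; it takes the 1.33 GT/s transfer rate from the label in
Figure 18.10, and the function name is illustrative rather than taken from the
text.

```python
def aggregate_bandwidth_gb_s(channels: int, bytes_per_transfer: int,
                             gigatransfers_per_s: float) -> float:
    """Peak data rate in GB/s = channels x bytes per transfer x GT/s."""
    return channels * bytes_per_transfer * gigatransfers_per_s


# Three channels, each 8 bytes wide, as stated in the text.
ddr3_bus_width_bits = 3 * 8 * 8
# 1.33 GT/s is the rate printed in Figure 18.10.
ddr3_peak_gb_s = aggregate_bandwidth_gb_s(3, 8, 1.33)

print(ddr3_bus_width_bits)          # 192
print(round(ddr3_peak_gb_s, 2))     # 31.92, i.e. "up to 32 GB/s"
```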
Figure 18.10 Intel Core i7-990X Block Diagram. (Diagram labels: Core 0 to
Core 5, each with 32 kB L1-I, 32 kB L1-D, and a 256 kB L2 cache; 12 MB L3
cache; DDR3 memory controllers, 3 × 8B @ 1.33 GT/s; QuickPath Interconnect,
4 × 20B @ 6.4 GT/s.)
Table 18.1 Cache Latency (in clock cycles)
CPU           Clock Frequency   L1 Cache   L2 Cache    L3 Cache
Core 2 Quad   2.66 GHz          3 cycles   15 cycles   n/a
Core i7       2.66 GHz          4 cycles   11 cycles   39 cycles
Footnote 2: The DDR synchronous RAM memory is discussed in Chapter 5.
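Because both processors in Table 18.1 run at 2.66 GHz, the cycle counts can be
converted directly into time. The following sketch does only that conversion
(time = cycles / clock frequency); the dictionary layout and names are
illustrative, and the numbers are the ones in the table.

```python
CLOCK_GHZ = 2.66  # both CPUs in Table 18.1 run at this frequency

# Cache access latencies in clock cycles, copied from Table 18.1.
LATENCY_CYCLES = {
    "Core 2 Quad": {"L1": 3, "L2": 15},           # no L3 cache listed
    "Core i7": {"L1": 4, "L2": 11, "L3": 39},
}


def cycles_to_ns(cycles: int, clock_ghz: float = CLOCK_GHZ) -> float:
    """One cycle lasts 1/clock_ghz nanoseconds."""
    return cycles / clock_ghz


latency_ns = {
    cpu: {level: cycles_to_ns(cycles) for level, cycles in levels.items()}
    for cpu, levels in LATENCY_CYCLES.items()
}

for cpu, levels in latency_ns.items():
    for level, ns in levels.items():
        print(f"{cpu} {level}: {ns:.2f} ns")
```

The Core i7 L2 hit comes out faster than the Core 2 Quad L2 hit, at the cost of
a slightly slower L1, which matches the comparison drawn in the text.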
ARM11 MPCore
Interrupt Handling
The Distributed Interrupt Controller (DIC) collates interrupts from a large number
of sources. It provides
• Masking of interrupts
• Prioritization of the interrupts
• Distribution of the interrupts to the target MP11 CPUs
• Tracking the status of interrupts
• Generation of interrupts by software
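The five functions listed above can be pictured with a toy software model. This
is a sketch for intuition only: the class, its method names, and the rule that
a lower number means higher priority are assumptions made for the example, and
it does not reflect the actual ARM11 MPCore register interface. As a further
simplification, a masked interrupt is simply dropped.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInterruptController:
    """Hypothetical model of a distributed interrupt controller."""
    num_cpus: int
    enabled: set = field(default_factory=set)      # unmasked interrupt IDs
    priority: dict = field(default_factory=dict)   # ID -> priority (lower = more urgent)
    targets: dict = field(default_factory=dict)    # ID -> list of target CPUs
    status: dict = field(default_factory=dict)     # (ID, CPU) -> "pending" or "active"

    def configure(self, irq, priority, cpus):
        if any(cpu not in range(self.num_cpus) for cpu in cpus):
            raise ValueError("target CPU out of range")
        self.priority[irq] = priority
        self.targets[irq] = list(cpus)
        self.enabled.add(irq)

    def mask(self, irq):
        self.enabled.discard(irq)

    def raise_irq(self, irq):
        """Raised by a hardware line or by software; masked IDs are ignored."""
        if irq not in self.enabled:
            return
        for cpu in self.targets[irq]:
            self.status[(irq, cpu)] = "pending"

    def acknowledge(self, cpu):
        """Hand the most urgent pending interrupt to this CPU, or None."""
        pending = [i for (i, c), s in self.status.items()
                   if c == cpu and s == "pending"]
        if not pending:
            return None
        irq = min(pending, key=lambda i: self.priority[i])
        self.status[(irq, cpu)] = "active"
        return irq

    def end_of_interrupt(self, irq, cpu):
        del self.status[(irq, cpu)]                # back to inactive


dic = ToyInterruptController(num_cpus=4)
dic.configure(irq=40, priority=2, cpus=[0])
dic.configure(irq=41, priority=1, cpus=[0, 1])
dic.configure(irq=42, priority=0, cpus=[0])
dic.mask(42)                                       # masking
for irq in (40, 41, 42):
    dic.raise_irq(irq)                             # software-generated here

first_on_cpu0 = dic.acknowledge(0)                 # 41: most urgent unmasked
second_on_cpu0 = dic.acknowledge(0)                # 40
first_on_cpu1 = dic.acknowledge(1)                 # 41 was distributed to CPU 1 too
```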
The DIC is a single functional unit that is placed in the system alongside
MP11 CPUs. This enables the number of interrupts supported in the system to
Figure 18.11 ARM11 MPCore Processor Block Diagram. (Diagram labels: four
CPU/VFP units, each with an L1 cache, timer, watchdog, and CPU interface;
distributed interrupt controller with a configurable number of hardware
interrupt lines, an IRQ line to each CPU, and per-CPU private fast interrupt
(FIQ) lines; snoop control unit (SCU) connected to each CPU by an instruction
and data 64-bit bus and coherency control bits; read/write 64-bit bus and
optional 2nd R/W 64-bit bus.)
Computer Architecture Lecture Notes, Jan2014
Nguyễn Kim Khánh, DCE-HUST 133
9.3. Distributed-memory multiprocessing
• Large-scale computers (Warehouse Scale Computers or Massively Parallel
Processors, MPP)
• Cluster computers (clusters)
SEC. 8.4 MESSAGE-PASSING MULTICOMPUTERS 617
As a consequence of these and other factors, there is a great deal of interest
in building and using parallel computers in which each CPU has its own private
memory, not directly accessible to any other CPU. These are the multicomputers.
Programs on multicomputer CPUs interact using primitives like send and receive
to explicitly pass messages because they cannot get at each other’s memory with
LOAD and STORE instructions. This difference completely changes the programming
model.
Each node in a multicomputer consists of one or a few CPUs, some RAM
(conceivably shared among the CPUs at that node only), a disk and/or other I/O
devices, and a communication processor. The communication processors are
connected by a high-speed interconnection network of the types we discussed in
Sec. 8.3.3. Many different topologies, switching schemes, and routing
algorithms are used. What all multicomputers have in common is that when an
application program executes the send primitive, the communication processor is
notified and transmits a block of user data to the destination machine
(possibly after first asking for and getting permission). A generic
multicomputer is shown in Fig. 8-36.
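The send/receive programming model can be imitated in a few lines. In the
sketch below, threads stand in for nodes and a queue stands in for each node's
communication processor; the Node class and its method names are assumptions
made for the example. Threads do share one address space, so the privacy of
each node's memory is a convention of the model here, not something the
runtime enforces.

```python
import queue
import threading


class Node:
    """Illustrative multicomputer node: private memory plus an inbox."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.memory = {}                  # by convention, touched only by this node
        self.inbox = queue.Queue()        # stands in for the communication processor
        self.network = network
        network[node_id] = self

    def send(self, dest_id, payload):
        # Explicit message passing: no LOAD/STORE into the other node's memory.
        self.network[dest_id].inbox.put((self.node_id, payload))

    def receive(self):
        return self.inbox.get()           # blocks until a message arrives


def worker(node):
    source, numbers = node.receive()
    node.memory["partial"] = sum(numbers)
    node.send(source, node.memory["partial"])


network = {}
master = Node(0, network)
workers = [Node(i, network) for i in (1, 2)]
threads = [threading.Thread(target=worker, args=(w,)) for w in workers]
for t in threads:
    t.start()

master.send(1, [1, 2, 3])
master.send(2, [10, 20])
replies = dict(master.receive() for _ in workers)   # {sender id: partial sum}
for t in threads:
    t.join()

total = sum(replies.values())
print(total)  # 36
```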
Figure 8-36. A generic multicomputer. (Diagram labels: nodes, each with CPUs,
memory, disk and I/O, and a communication processor on a local interconnect;
high-performance interconnection network.)
8.4.1 Interconnection Networks
In Fig. 8-36 we see that multicomputers are held together by interconnection
networks. Now it is time to look more closely at these interconnection
networks. Interestingly enough, multiprocessors and multicomputers are
surprisingly similar in this respect because multiprocessors often have
multiple memory modules that must also be interconnected with one another and
with the CPUs. Thus the material in this section frequently applies to both
kinds of systems.
The fundamental reason why multiprocessor and multicomputer interconnection
networks are similar is that at the very bottom both of them use message
Interconnection networks
Figure 8-37. Various topologies. The heavy dots represent switches. The CPUs
and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree.
(d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
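Two of the topologies in Figure 8-37 are easy to describe in code. In the
sketch below a hypercube node's neighbors are found by flipping one bit of its
index, and a ring node's neighbors are the indices on either side. The diameter
(the longest shortest path, in hops) is then measured by breadth-first search.
The function names and the choice of a 16-node ring for comparison are
assumptions made for the example.

```python
from collections import deque


def hypercube_neighbors(node, dimensions):
    """Neighbors differ from `node` in exactly one bit of the index."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def ring_neighbors(node, size):
    return [(node - 1) % size, (node + 1) % size]


def diameter(num_nodes, neighbors):
    """Longest shortest path over all node pairs, via BFS from every node."""
    worst = 0
    for start in range(num_nodes):
        dist = {start: 0}
        frontier = deque([start])
        while frontier:
            u = frontier.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        worst = max(worst, max(dist.values()))
    return worst


cube4_diameter = diameter(16, lambda n: hypercube_neighbors(n, 4))
ring16_diameter = diameter(16, lambda n: ring_neighbors(n, 16))

print(hypercube_neighbors(0, 4))   # [1, 2, 4, 8]
print(cube4_diameter)              # 4
print(ring16_diameter)             # 8
```

With the same 16 switches, the 4D hypercube halves the worst-case hop count of
the ring, at the price of four links per switch instead of two.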
Interconnection networks can be characterized by their dimensionality. For
our purposes, the dimensionality is determined by the number of choices there
are to get from the source to the destination. If there is never any choice
(i.e., there is only one path from each source to each destination), the
network is zero dimensional. If there is one dimension in which a choice can be
made, for example, go
Massively Parallel Processors
n Hệ thống qui mô lớn
n Đắt tiền: nhiều triệu USD
n Dùng cho tính toán khoa học và các bài
toán có số phép toán và dữ liệu rất lớn
n Siêu máy tính
IBM Blue Gene/P
624 PARALLEL COMPUTER ARCHITECTURES CHAP. 8
coherency between the L1 caches on the four CPUs. Thus when a shared piece of
memory resides in more than one cache, accesses to that storage by one processor
will be immediately visible to the other three processors. A memory reference that
misses on the L1 cache but hits on the L2 cache takes about 11 clock cycles. A
miss on L2 that hits on L3 takes about 28 cycles. Finally, a miss on L3 that has to
go to the main DRAM takes about 75 cycles.
The four CPUs are connected via a high-bandwidth bus to a 3D torus network,
which requires six connections: up, down, north, south, east, and west. In
addition, each processor has a port to the collective network, used for
broadcasting data to all processors. The barrier port is used to speed up
synchronization operations, giving each processor fast access to a specialized
synchronization network.
At the next level up, IBM designed a custom card that holds one of the chips
shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are
shown in Fig. 8-39(a)–(b) respectively.
Figure 8-39. The BlueGene/P: (a) chip. (b) card. (c) board. (d) cabinet.
(e) system.
Level         Built from     Cards    Chips    CPUs      Memory
(a) Chip      4 processors, 8-MB L3 cache
(b) Card      1 chip         1        1        4         2 GB (DDR2 DRAM)
(c) Board     32 cards       32       32       128       64 GB
(d) Cabinet   32 boards      1024     1024     4096      2 TB
(e) System    72 cabinets    73728    73728    294912    144 TB
The cards are mounted on plug-in boards, with 32 cards per board for a total of
32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of
DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c).
At the next level, 32 of these boards are plugged into a cabinet, packing 4096
CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d).
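The packaging numbers multiply up level by level, which can be checked with a
short calculation. The sketch below uses only figures given in the text and in
Fig. 8-39 (4 CPUs per chip, one chip and 2 GB per card, 32 cards per board,
32 boards per cabinet, up to 72 cabinets per system); the constant and function
names are illustrative.

```python
CPUS_PER_CHIP = 4
CHIPS_PER_CARD = 1
DRAM_GB_PER_CARD = 2
CARDS_PER_BOARD = 32
BOARDS_PER_CABINET = 32
CABINETS_PER_SYSTEM = 72   # the maximum configuration in Fig. 8-39(e)


def totals(cards: int) -> dict:
    """CPU count and DRAM capacity for a given number of cards."""
    return {
        "cards": cards,
        "cpus": cards * CHIPS_PER_CARD * CPUS_PER_CHIP,
        "dram_gb": cards * DRAM_GB_PER_CARD,
    }


board = totals(CARDS_PER_BOARD)
cabinet = totals(CARDS_PER_BOARD * BOARDS_PER_CABINET)
system = totals(CARDS_PER_BOARD * BOARDS_PER_CABINET * CABINETS_PER_SYSTEM)

print(board)                      # 32 cards, 128 CPUs, 64 GB
print(cabinet)                    # 1024 cards, 4096 CPUs, 2048 GB (2 TB)
print(system["cpus"])             # 294912
print(system["dram_gb"] // 1024)  # 144 (TB)
```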
Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is
depicted in Fig. 8-39(e). A PowerPC 450 can issue up to 6 instructions/cycle, thus
Cluster
n Nhiều máy tính được kết nối với nhau bằng
mạng liên kết tốc độ cao (~ Gbps)
n Mỗi máy tính có thể làm việc độc lập (PC
hoặc SMP)
n Mỗi máy tính được gọi là một node
n Các máy tính có thể được quản lý làm việc
song song theo nhóm (cluster)
n Toàn bộ hệ thống có thể coi như là một máy
tính song song
n Tính sẵn sàng cao
n Khả năng chịu lỗi lớn
Google's PC cluster
hold exactly 80 PCs and switches can be larger or smaller than 128 ports; these are
just typical values for a Google cluster.
Figure 8-44. A typical Google cluster. (Diagram labels: two 128-port Gigabit
Ethernet switches, two gigabit Ethernet links, 80-PC rack, OC-48 fiber, OC-12
fiber.)
Power density is also a key issue. A typical PC burns about 120 watts, or about
10 kW per rack. A rack needs about 3 m² so that maintenance personnel can
install and remove PCs and for the air conditioning to function. These
parameters give a power density of over 3000 watts/m². Most data centers are
designed for 600–1200 watts/m², so special measures are required to cool the
racks.
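The power-density figure can be reproduced from the numbers in the text: 80 PCs
per rack, about 120 watts per PC, and about 3 m² of floor space per rack. The
sketch below is only that arithmetic; the variable names are illustrative.

```python
PCS_PER_RACK = 80
WATTS_PER_PC = 120
RACK_AREA_M2 = 3
TYPICAL_DESIGN_RANGE_W_PER_M2 = (600, 1200)   # range quoted for most data centers

rack_watts = PCS_PER_RACK * WATTS_PER_PC      # 9600 W, i.e. about 10 kW
density_w_per_m2 = rack_watts / RACK_AREA_M2  # 3200 W/m^2, "over 3000"

# How far the rack exceeds the upper end of the typical design range.
excess_factor = density_w_per_m2 / TYPICAL_DESIGN_RANGE_W_PER_M2[1]

print(rack_watts)                  # 9600
print(density_w_per_m2)            # 3200.0
print(round(excess_factor, 2))     # 2.67
```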
Google has learned three key things about running massive Web servers that
bear repeating.
1. Components will fail so plan for it.
2. Replicate everything for throughput and availability.
3. Optimize price/performance.
9.4. General-purpose graphics processing units
• SIMD architecture
• Derived from the graphics processor, or GPU (Graphic Processing Unit), which
supports 2D and 3D graphics processing: data-parallel processing
• GPGPU: General purpose Graphic Processing Unit
• Hybrid CPU/GPGPU systems
• The CPU is the host: sequential execution
• GPGPU: parallel computation
Graphics processors in a computer
GPGPU: NVIDIA Tesla
n Streaming
multiprocessor
n 8 × Streaming
processors
GPGPU: NVIDIA Fermi
Hardware Execution
CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes
one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks;
and CUDA cores and other execution units in the SM execute threads. The SM executes
threads in groups of 32 threads called a warp. While programmers can generally ignore warp
execution for functional correctness and think of programming one thread, they can greatly
improve performance by having threads in a warp execute the same code path and access
memory in nearby addresses.
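The advice about keeping the threads of a warp on the same code path can be
illustrated without a GPU. The sketch below groups the thread indices of a
one-dimensional block into warps of 32 consecutive threads and counts how many
warps would split across a branch. It is a plain Python model, not CUDA code;
the function names and the 128-thread block are assumptions made for the
example.

```python
WARP_SIZE = 32


def warps(block_size):
    """Split thread indices 0..block_size-1 into consecutive groups of 32."""
    return [range(start, min(start + WARP_SIZE, block_size))
            for start in range(0, block_size, WARP_SIZE)]


def divergent_warps(block_size, takes_branch):
    """Count warps whose threads do not all follow the same code path."""
    count = 0
    for warp in warps(block_size):
        paths = {bool(takes_branch(thread)) for thread in warp}
        if len(paths) > 1:
            count += 1
    return count


# Branching on the thread index parity splits every warp in two.
per_thread_branch = divergent_warps(128, lambda t: t % 2 == 0)
# Branching on the warp index keeps each warp on a single path.
per_warp_branch = divergent_warps(128, lambda t: (t // WARP_SIZE) % 2 == 0)

print(len(warps(128)))      # 4
print(per_thread_branch)    # 4
print(per_warp_branch)      # 0
```

Both branches do the same amount of work in total; only the second arrangement
lets every warp execute one path instead of two.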
An Overview of the Fermi Architecture
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA
cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The
512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory
partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM
memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread
global scheduler distributes thread blocks to SM thread schedulers.
Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical
rectangular strip that contains an orange portion (scheduler and dispatch), a
green portion (execution units), and light blue portions (register file and L1
cache).
NVIDIA Fermi
Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make
it not only the most powerful SM yet built, but also the most programmable and
efficient.
512 High Performance CUDA cores
Each SM features 32 CUDA processors, a fourfold increase over prior SM designs.
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU)
and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point
arithmetic. The Fermi architecture implements the new IEEE 754-2008
floating-point standard, providing the fused multiply-add (FMA) instruction for
both single and double precision arithmetic. FMA improves over a multiply-add
(MAD) instruction by doing the multiplication and addition with a single final
rounding step, with no loss of precision in the addition. FMA is more accurate
than performing the operations separately. GT200 implemented double precision
FMA.
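The benefit of a single final rounding can be seen on an ordinary CPU. The
sketch below compares a separate multiply and add in double precision with an
emulated fused operation that computes a×b+c exactly using rational arithmetic
and rounds once. The emulation via fractions.Fraction and the chosen operands
are assumptions made for the example; this is not how the GPU hardware
implements FMA.

```python
from fractions import Fraction


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded, then the sum is rounded."""
    return a * b + c


def fma_emulated(a: float, b: float, c: float) -> float:
    """One rounding: exact rational a*b + c, converted to a double at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exactly, a*b = 1 - 2**-60, which is not representable as a double and
# rounds to 1.0, so the separate multiply and add lose the whole result.
two_roundings = multiply_then_add(a, b, c)
one_rounding = fma_emulated(a, b, c)

print(two_roundings)                 # 0.0
print(one_rounding == -2.0 ** -60)   # True
```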
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result,
multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly
designed integer ALU supports full 32-bit precision for all instructions, consistent with standard
programming language requirements. The integer ALU is also optimized to efficiently support
64-bit and extended precision operations. Various instructions are supported, including
Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population
count.
16 Load/Store Units
Each SM has 16 load/store units, allowing source and destination addresses to be calculated
for sixteen threads per clock. Supporting units load and store the data at each address to
cache or DRAM.
Figure: Fermi Streaming Multiprocessor (SM). (Diagram labels: instruction
cache; two warp schedulers, each with a dispatch unit; register file
(32,768 x 32-bit); 32 cores; 16 LD/ST units; 4 SFUs; interconnect network;
64 KB shared memory / L1 cache; uniform cache. CUDA core detail: dispatch port,
operand collector, FP unit, INT unit, result queue.)
n Có 16 Streaming
Multiprocessors (SM)
n Mỗi SM có 32 CUDA
cores.
n Mỗi CUDA core
(Cumpute Unified
Device Architecture) có
01 FPU và 01 IU
GPGPU: NVIDIA Kepler

Streaming Multiprocessor (SMX) Architecture
Kepler GK110’s new SMX introduces several architectural innovations that make
it not only the most powerful multiprocessor we’ve built, but also the most
programmable and power-efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special
function units (SFU), and 32 load/store units (LD/ST).

An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the
highest performing parallel computing microprocessor in the world. GK110 not
only greatly exceeds the raw compute horsepower delivered by Fermi, but it does
so efficiently, consuming significantly less power and generating much less
heat output.
A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory
controllers. Different products will use different configurations of GK110. For
example, some products may deploy 13 or 14 SMXs.
Key features of the architecture that will be discussed below in more depth
include:
• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more
bandwidth at each level of the hierarchy, and a fully redesigned and
substantially faster DRAM I/O implementation.
• Hardware support throughout the design to enable new programming model
capabilities

Figure: Kepler GK110 full chip block diagram
End