Lecture Notes on Computer Architecture - Nguyễn Kim Khánh
Summary of Lecture Notes on Computer Architecture - Nguyễn Kim Khánh:
...operands, then the result is correct.
- If the result has the opposite sign, overflow has occurred and the result is wrong.
- Overflow occurs when the sum lies outside the representable range: $[-2^{n-1},\ +(2^{n-1}-1)]$
Jan2014 Computer Architecture 174 NKK-HUST
Example of signed integer addition without overflow
- (+70) = ...its in rt
- lbu rt, offset(rs)    lhu rt, offset(rs)
  - Zero-extend to 32 bits in rt
- sb rt, offset(rs)    sh rt, offset(rs)
  - Store only the rightmost byte/halfword
Lecture Notes on Computer Architecture, Jan2014, Nguyễn Kim Khánh, DCE-HUST
Example ...
[Figure 6.5 Components of a Disk Drive; Figure 6.6 Tracks and Cylinders. Recoverable labels: surfaces 0 to 9, platter, spindle, boom, read-write head (1 per surface), direction of arm motion] ...
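The overflow rule stated in the summary (same-sign operands, opposite-sign result) can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `add_signed`, the default width of 8 bits, and the operand values other than +70 are assumptions chosen for the example.

```python
def add_signed(a, b, n=8):
    """Add two n-bit two's-complement integers.

    Returns (wrapped_result, overflow_flag). Hypothetical helper for illustration.
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operand outside the n-bit signed range")
    mask = (1 << n) - 1
    raw = (a + b) & mask                      # keep only the low n bits, as an n-bit adder does
    # Reinterpret the n-bit pattern as a signed value.
    result = raw - (1 << n) if raw & (1 << (n - 1)) else raw
    # Overflow: both operands share a sign, but the result's sign differs.
    overflow = (a < 0) == (b < 0) and (result < 0) != (a < 0)
    return result, overflow


print(add_signed(70, 42))    # sum fits in [-128, 127]
print(add_signed(70, 75))    # true sum 145 lies outside the range
print(add_signed(-70, -75))  # true sum -145 lies outside the range
```

Operands of opposite sign can never overflow, which is why the check only fires when the two input signs match.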
[Figure 18.8 Multicore Organization Alternatives. Recoverable panels: (a) Dedicated L1 cache, (c) Shared L2 cache, (d) Shared L3 cache. Each panel shows CPU Core 1 to CPU Core n with L1-D and L1-I caches, main memory, and I/O]

Intel - Core Duo
- 2006
- Two x86 superscalar processors, shared L2 cache
- Dedicated L1 cache per core
  - 32KB instruction and 32KB data
- 2MB shared L2 cache

18.4 INTEL x86 MULTICORE ORGANIZATION
Intel has introduced a number of multicore products in recent years. In this section, we look at two examples: the Intel Core Duo and the Intel Core i7-990X.

Intel Core Duo
The Intel Core Duo, introduced in 2006, implements two x86 superscalar processors with a shared L2 cache (Figure 18.8c).
The general structure of the Intel Core Duo is shown in Figure 18.9. Let us consider the key elements starting from the top of the figure. As is common in multicore systems, each core has its own dedicated L1 cache. In this case, each core has a 32-kB instruction cache and a 32-kB data cache.
Each core has an independent thermal control unit. With the high transistor density of today's chips, thermal management is a fundamental capability, especially for laptop and mobile systems. The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise. In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements. Each core can be defined as an independent thermal zone. The maximum temperature for each ...
[Figure 18.9 block labels, per core: thermal control, APIC, 32-kB L1 caches, execution resources, arch. state]
[Figure 18.9 block labels, shared: power management logic, 2 MB L2 shared cache, bus interface, front-side bus]
Figure 18.9 Intel Core Duo Block Diagram

Intel Core i7-990X
The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache and the six cores share a 12-MB L3 cache. One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that's likely to be requested soon. It is interesting to compare the performance of this three-level on-chip cache organization with a comparable two-level organization from Intel. Table 18.1 shows the cache access latency, in terms of clock cycles, for two Intel multicore systems running at the same clock frequency. The Core 2 Quad has a shared L2 cache, similar to the Core Duo. The Core i7 improves on L2 cache performance with the use of the dedicated L2 caches, and provides relatively high-speed access to the L3 cache.
The Core i7-990X chip supports two forms of external communications to other chips. The DDR3 memory controller brings the memory controller for the DDR main memory (footnote 2) onto the chip. The interface supports three channels that are 8 bytes wide for a total bus width of 192 bits, for an aggregate data rate of up to 32 GB/s. With the memory controller on the chip, the Front Side Bus is eliminated.
[Figure 18.10 block labels: Core 0 to Core 5, each with 32 kB L1-I, 32 kB L1-D, and a 256 kB L2 cache; 12 MB L3 cache; DDR3 memory controllers (3 × 8B @ 1.33 GT/s); QuickPath Interconnect (4 × 20B @ 6.4 GT/s)]
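The DDR3 figures quoted for the Core i7-990X (three 8-byte channels, 192-bit total width, up to 32 GB/s) can be checked with simple arithmetic. The sketch below assumes the 1.33 GT/s transfer rate shown on the block diagram label; the function name is hypothetical.

```python
def aggregate_bandwidth_gb_s(channels, bytes_per_transfer, gigatransfers_per_s):
    """Peak data rate in GB/s: channels x bytes per transfer x transfers per second."""
    return channels * bytes_per_transfer * gigatransfers_per_s


CHANNELS = 3            # from the text
BYTES_PER_CHANNEL = 8   # from the text
TRANSFER_RATE = 1.33    # GT/s, taken from the figure label (assumed to apply per channel)

bus_width_bits = CHANNELS * BYTES_PER_CHANNEL * 8
ddr3_gb_s = aggregate_bandwidth_gb_s(CHANNELS, BYTES_PER_CHANNEL, TRANSFER_RATE)

print(bus_width_bits)        # total bus width in bits
print(round(ddr3_gb_s, 2))   # peak rate, which the text rounds to "up to 32 GB/s"
```

This is a peak figure; sustained bandwidth in practice is lower because of refresh, bank conflicts, and read/write turnaround.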
Figure 18.10 Intel Core i7-990X Block Diagram

Table 18.1 Cache Latency (in clock cycles)
CPU            Clock Frequency    L1 Cache    L2 Cache     L3 Cache
Core 2 Quad    2.66 GHz           3 cycles    15 cycles    -
Core i7        2.66 GHz           4 cycles    11 cycles    39 cycles

Footnote 2: The DDR synchronous RAM memory is discussed in Chapter 5.

ARM11 MPCore
Interrupt Handling
The Distributed Interrupt Controller (DIC) collates interrupts from a large number of sources. It provides
- Masking of interrupts
- Prioritization of the interrupts
- Distribution of the interrupts to the target MP11 CPUs
- Tracking the status of interrupts
- Generation of interrupts by software
The DIC is a single functional unit that is placed in the system alongside MP11 CPUs. This enables the number of interrupts supported in the system to ...
[Figure 18.11 ARM11 MPCore Processor Block Diagram. Recoverable labels: four CPU/VFP units, each with an L1 cache, timer, watchdog, and CPU interface; distributed interrupt controller; configurable number of hardware interrupt lines; per-CPU private fast interrupt (FIQ) lines; IRQ; snoop control unit (SCU); instruction and data 64-bit buses; coherency control bits; read/write 64-bit bus; optional 2nd R/W 64-bit bus]

9.3. Distributed-memory multiprocessors
- Large-scale computers (Warehouse Scale Computers or Massively Parallel Processors, MPP)
- Cluster computers (clusters)
As a consequence of these and other factors, there is a great deal of interest in building and using parallel computers in which each CPU has its own private memory, not directly accessible to any other CPU. These are the multicomputers. Programs on multicomputer CPUs interact using primitives like send and receive to explicitly pass messages because they cannot get at each other's memory with LOAD and STORE instructions. This difference completely changes the programming model.
Each node in a multicomputer consists of one or a few CPUs, some RAM (conceivably shared among the CPUs at that node only), a disk and/or other I/O devices, and a communication processor. The communication processors are connected by a high-speed interconnection network of the types we discussed in Sec. 8.3.3. Many different topologies, switching schemes, and routing algorithms are used. What all multicomputers have in common is that when an application program executes the send primitive, the communication processor is notified and transmits a block of user data to the destination machine (possibly after first asking for and getting permission). A generic multicomputer is shown in Fig. 8-36.
[Figure 8-36. A generic multicomputer. Recoverable labels: node (CPU, memory, communication processor), local interconnect, disk and I/O, high-performance interconnection network]

8.4.1 Interconnection Networks
In Fig. 8-36 we see that multicomputers are held together by interconnection networks. Now it is time to look more closely at these interconnection networks. Interestingly enough, multiprocessors and multicomputers are surprisingly similar in this respect because multiprocessors often have multiple memory modules that must also be interconnected with one another and with the CPUs. Thus the material in this section frequently applies to both kinds of systems.
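The send/receive programming model described for multicomputers can be sketched as follows. This is only an analogy under stated assumptions: threads inside one Python process stand in for nodes, a `queue.Queue` stands in for the communication processor and network, and the `Node` class and its method names are hypothetical. A real multicomputer would use a message-passing library over an actual interconnect.

```python
import queue
import threading


class Node:
    """Hypothetical node: private memory plus an inbox fed by the 'network'."""

    def __init__(self, name):
        self.name = name
        self.memory = {}            # private: no other node reads or writes this
        self.inbox = queue.Queue()  # stand-in for the communication processor

    def send(self, dest, payload):
        # Explicitly pass a message; there is no shared LOAD/STORE access.
        dest.inbox.put((self.name, payload))

    def receive(self):
        # Block until a message arrives, then return (sender, payload).
        return self.inbox.get()


def worker(node, master):
    _sender, numbers = node.receive()
    node.memory["partial"] = sum(numbers)   # computed in private memory
    node.send(master, node.memory["partial"])


master = Node("master")
workers = [Node(f"w{i}") for i in range(2)]
threads = [threading.Thread(target=worker, args=(w, master)) for w in workers]
for t in threads:
    t.start()

master.send(workers[0], [1, 2, 3])
master.send(workers[1], [4, 5, 6])
total = sum(master.receive()[1] for _ in workers)

for t in threads:
    t.join()
print(total)
```

The point of the sketch is that the master never touches a worker's `memory`; all data crosses node boundaries through `send` and `receive`.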
The fundamental reason why multiprocessor and multicomputer interconnection networks are similar is that at the very bottom both of them use message ...

Interconnection networks
[Figure 8-37. Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.]
Interconnection networks can be characterized by their dimensionality. For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination. If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero dimensional. If there is one dimension in which a choice can be made, for example, go ...

Massively Parallel Processors
- Large-scale systems
- Expensive: many millions of USD
- Used for scientific computing and for problems with very large numbers of operations and volumes of data
- Supercomputers

IBM Blue Gene/P
...coherency between the L1 caches on the four CPUs. Thus when a shared piece of memory resides in more than one cache, accesses to that storage by one processor will be immediately visible to the other three processors. A memory reference that misses on the L1 cache but hits on the L2 cache takes about 11 clock cycles. A miss on L2 that hits on L3 takes about 28 cycles. Finally, a miss on L3 that has to go to the main DRAM takes about 75 cycles.
The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west. In addition, each processor has a port to the collective network, used for broadcasting data to all processors.
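The six connections of a 3D torus, and the neighbor structure of the hypercube from Figure 8-37(h), can be made concrete with a short sketch. The coordinate scheme, the 4 × 4 × 4 dimensions, and the function names are assumptions for illustration; they do not describe the actual Blue Gene/P addressing.

```python
def torus_neighbors(node, dims):
    """Return the six neighbors of `node` in a 3D torus with wraparound links."""
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),   # east, west
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),   # north, south
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),   # up, down
    ]


def hypercube_neighbors(node, dimension):
    """Neighbors in a hypercube differ from `node` in exactly one address bit."""
    return [node ^ (1 << k) for k in range(dimension)]


print(torus_neighbors((0, 0, 0), (4, 4, 4)))
print(hypercube_neighbors(0b0000, 4))
```

The modulo arithmetic is what distinguishes a torus from a plain grid: a node on the edge wraps around to the opposite edge, so every node has the same number of links.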
The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network.
At the next level up, IBM designed a custom card that holds one of the chips shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are shown in Fig. 8-39(a)-(b) respectively.

Figure 8-39. The BlueGene/P: (a) chip. (b) card. (c) board. (d) cabinet. (e) system.
Level          Contents                                  CPUs      Memory
(a) Chip       4 processors, 8-MB L3 cache               4
(b) Card       1 chip, 2-GB DDR2 DRAM                    4         2 GB
(c) Board      32 cards, 32 chips                        128       64 GB
(d) Cabinet    32 boards, 1024 cards, 1024 chips         4096      2 TB
(e) System     72 cabinets, 73728 cards, 73728 chips     294912    144 TB

The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d). Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e).
A PowerPC 450 can issue up to 6 instructions/cycle, thus ...

Cluster
- Many computers connected to one another by a high-speed interconnection network (~ Gbps)
- Each computer can work independently (PC or SMP)
- Each computer is called a node
- The computers can be managed to work in parallel as a group (cluster)
- The whole system can be regarded as a single parallel computer
- High availability
- High fault tolerance

Google's PC cluster
...hold exactly 80 PCs and switches can be larger or smaller than 128 ports; these are just typical values for a Google cluster.
[Figure 8-44. A typical Google cluster. Recoverable labels: OC-12 fiber, OC-48 fiber, two 128-port Gigabit Ethernet switches, two gigabit Ethernet links, 80-PC rack]
Power density is also a key issue. A typical PC burns about 120 watts or about 10 kW per rack. A rack needs about 3 m² so that maintenance personnel can install and remove PCs and for the air conditioning to function. These parameters give a power density of over 3000 watts/m². Most data centers are designed for 600–1200 watts/m², so special measures are required to cool the racks.
Google has learned three key things about running massive Web servers that bear repeating.
1. Components will fail so plan for it.
2. Replicate everything for throughput and availability.
3. Optimize price/performance.

9.4. General-purpose graphics processing units
- SIMD architecture
- Derived from the graphics processing unit (GPU), which supports 2D and 3D graphics processing: data-parallel processing
- GPGPU: General-Purpose Graphics Processing Unit
- Hybrid CPU/GPGPU system
  - The CPU is the host: it executes sequentially
  - GPGPU: parallel computation

Graphics processors in a computer

GPGPU: NVIDIA Tesla
- Streaming multiprocessor
  - 8 × Streaming processors

GPGPU: NVIDIA Fermi
Hardware Execution
CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp.
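The grouping of threads into warps of 32 can be illustrated with host-side arithmetic. The sketch below is plain Python rather than GPU code, and the helper names are hypothetical. It relies on the documented CUDA convention that a warp holds threads with consecutive linear indices within a block.

```python
WARP_SIZE = 32  # from the text: the SM executes threads in groups of 32


def warps_per_block(threads_per_block):
    """Number of warps needed for a block (ceiling division)."""
    return -(-threads_per_block // WARP_SIZE)


def warp_and_lane(thread_index):
    """Map a linear thread index within a block to (warp number, lane within warp)."""
    return divmod(thread_index, WARP_SIZE)


print(warps_per_block(256))  # a block of 256 threads fills whole warps
print(warps_per_block(100))  # the last warp is only partly occupied
print(warp_and_lane(70))     # which warp and lane thread 70 lands in
```

A block size that is not a multiple of 32 still occupies a whole warp for the remainder, so the idle lanes in the last warp represent wasted execution slots.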
While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.

An Overview of the Fermi Architecture
The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).

NVIDIA Fermi
Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.

512 High Performance CUDA cores
Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition.
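The effect of a single final rounding step can be demonstrated on ordinary double-precision values. This sketch does not call any hardware FMA instruction: it emulates one by computing the exact result with Python's `fractions.Fraction` and rounding once, and it models the separate multiply-add as two round-to-nearest operations. The function names and input values are assumptions chosen to make the difference visible.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Two roundings: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """One rounding: compute a*b + c exactly, then round to a double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)   # int/int true division in CPython is correctly rounded


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exact product is 1 - 2**-60, which rounds to 1.0 in double precision,
# so the separate version loses the small term entirely.
print(multiply_then_add(a, b, c))
print(fused_multiply_add(a, b, c))
```

The inputs are picked so that the product sits just below 1.0; cancellation against `c` then exposes the bits that the intermediate rounding discarded.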
FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

16 Load/Store Units
Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.

[Figure: Fermi Streaming Multiprocessor (SM). Recoverable labels: instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 x 32-bit); cores; LD/ST units; SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache. CUDA core detail: dispatch port, operand collector, FP unit, INT unit, result queue]
- There are 16 Streaming Multiprocessors (SMs)
- Each SM has 32 CUDA cores.
- Each CUDA (Compute Unified Device Architecture) core has one FPU and one integer unit (IU)

GPGPU: NVIDIA Kepler
Streaming Multiprocessor (SMX) Architecture
Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.
SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.
A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.
Key features of the architecture that will be discussed below in more depth include:
- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation.
- Hardware support throughout the design to enable new programming model capabilities
Kepler GK110 Full chip block diagram

End
Attached file:
- bai_giang_kien_truc_may_tinh_nguyen_kim_khanh.pdf