COMPUTER SYSTEMS

Lecture notes: Computer Systems, Aug 2015, NKK-HUST

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY (Trường Đại học Bách khoa Hà Nội)

COMPUTER SYSTEMS (Hệ thống máy tính)
Nguyễn Kim Khánh
Department of Computer Engineering (DCE)
School of Information and Communication Technology (SoICT)
Version: CS-HEDSPI2015

Contact Information
- Address: 502-B1
- Mobile: 091-358-5533
- e-mail: khanhnk@soict.hust.edu.vn, khanh.nguyenkim@hust.edu.vn

Course objectives
- Two linked courses:
  - Computer Architecture (Kiến trúc máy tính)
  - Computer Systems (Hệ thống máy tính)
- Objective: students are equipped with knowledge of instruction set architecture and computer organization.
- After completing both courses, students are able to:
  - Study the instruction set architecture of specific processors
  - Program in assembly language
  - Evaluate computer performance
  - Operate and administer computer systems effectively
  - Analyze and design computer components

Computer Architecture
- Instruction set architecture
- How is a source program translated into machine code?
- How does the hardware execute a machine-code program?
Computer Systems
- Evaluating the performance of computer systems
- Organization of the components of a computer system
- Parallel computer architectures

Learning materials
- Lecture notes, Computer Systems: CS-HEDSPI2015, download at ftp://dce.hust.edu.vn/khanhnk/CS-HEDSPI/
- Textbook:
  [1] David A. Patterson, John L. Hennessy. Computer Organization and Design, Revised 4th edition, 2012
- References:
  [2] William Stallings. Computer Organization and Architecture, 9th edition, 2013
  [3] David Money Harris, Sarah L. Harris. Digital Design and Computer Architecture, 2nd edition, 2013
  [4] Andrew S. Tanenbaum. Structured Computer Organization, 6th edition, 2013

Course contents
- Chapter 1. Overview of computer systems
- Chapter 2. Computer memory
- Chapter 3. Input/output system
- Chapter 4. Parallel architectures

Chapter 1
OVERVIEW OF COMPUTER SYSTEMS
Nguyễn Kim Khánh, Hanoi University of Science and Technology

Chapter contents
1.1 Basic components of a computer
1.2 Basic operation of a computer
1.3 Computer buses
1.4 Computer performance

1.1 Basic components of a computer
- Central Processing Unit (CPU): controls the operation of the computer and processes data
- Main memory: holds the programs being executed
- Input/output system: exchanges information between the computer and the outside world
- System bus: connects the components and carries information between them
[Figure: CPU, main memory, and input/output system connected by the system bus]

Central Processing Unit (CPU)
- Functions:
  - Controls the operation of the computer
  - Processes data
- Basic operating principle: the CPU operates according to the program stored in main memory.
- It is the fastest component in the system.

Basic components of the CPU
- Control Unit (CU): controls the operation of the computer according to a predefined program
- Arithmetic and Logic Unit (ALU): performs arithmetic and logic operations
- Register File (RF): a set of registers holding the information needed for CPU operation
[Figure: control unit, arithmetic and logic unit, and register file inside the CPU, connected to the system bus]

Computer memory
- Function: stores programs and data (in binary form)
- Basic memory operations:
  - Write
  - Read
- Main components:
  - Main memory
  - Cache memory
  - Storage devices
[Figure: CPU, cache memory, main memory, storage devices]

Main memory
- Present in every computer
- Holds the instructions and data of the program being executed
- Built from semiconductor memory
- Organized as addressed memory locations (usually each byte has its own address)
- The contents of a location can change, but its physical address is fixed
- To read or write a location, the CPU must know its address

  Contents     Address
  0100 1101    00...0000
  0101 0101    00...0001
  1010 1111    00...0010
  0000 1110    00...0011
  0111 0100    00...0100
  1011 0010    00...0101
  0010 1000    00...0110
  1110 1111    00...0111
  ...          ...
  0110 0010    11...1110
  0010 0001    11...1111

Cache memory
- A fast memory placed between the CPU and main memory to speed up CPU memory accesses
- Smaller capacity than main memory
- Built from fast semiconductor memory
- Usually divided into several levels (L1, L2, L3)
- Usually integrated on the processor chip
- A cache may or may not be present

Storage devices
- Also called external memory
- Functions and characteristics:
  - Hold the software resources of the computer
  - Connected to the system as input/output devices
  - Large capacity
  - Slow
- Types of storage devices:
  - Magnetic: hard disk drives (HDD)
  - Semiconductor: solid-state drives (SSD), flash drives, memory cards
  - Optical: CD, DVD

Input/output system
- Function: exchanges information between the computer and the outside world
- Basic operations:
  - Input
  - Output
- Main components:
  - I/O devices
  - I/O modules
[Figure: I/O devices attached to I/O modules, which are attached to the system bus]

I/O devices
- Also called peripherals
- Function: convert data between the inside and the outside of the computer
- Types of I/O devices:
  - Input devices
  - Output devices
  - Storage devices
  - Communication devices

I/O modules
- Function: interface the I/O devices to the computer
- Each I/O module has one or several I/O ports
- Each I/O port is assigned a fixed address
- I/O devices connect to and exchange data with the computer through I/O ports
- To exchange data with an I/O device, the CPU must know the address of the corresponding I/O port

1.2 Basic operation of a computer
- Program execution
- Interrupt operation
- Input/output operation

Program execution
- This is the basic activity of a computer
- The computer repeats the instruction cycle, which has two steps:
  - Fetch the instruction
  - Execute the instruction
- Program execution stops if:
  - An instruction fails during execution
  - A halt instruction is encountered
  - The machine is powered off

Instruction fetch
- At the start of each instruction cycle, the CPU fetches an instruction from main memory
- The Program Counter (PC) is a CPU register that holds the address of the instruction to be fetched
- The CPU issues the address from the PC to locate the memory location containing the instruction
- The instruction is read from memory into the Instruction Register (IR)
- After the instruction is fetched, the PC is automatically incremented to point to the next instruction

Illustration of the instruction fetch
[Figure: memory locations 300 to 304 hold instructions, with instruction i at 302, instruction i+1 at 303, and instruction i+2 at 304. Before instruction i is fetched, PC = 302. After it is fetched, PC = 303 and IR holds instruction i.]

Instruction execution
- The processor decodes the fetched instruction and issues the control signals that carry out the operation the instruction requires
- Kinds of operations:
  - Data transfer between the CPU and main memory, or between the CPU and an I/O module
  - Arithmetic or logic operations on data
  - Transfer of control within the program: branch or jump to another location

Interrupts
- General concept: an interrupt is a mechanism that lets the CPU suspend the program being executed and switch to another program already present in memory, the interrupt handler
- Types of interrupts:
  - Exception: caused by an error while executing the program (for example arithmetic overflow, invalid opcode, ...)
  - External interrupt: an I/O device (through its I/O module) sends an interrupt signal to the CPU to request a data transfer

Operation with external interrupts
- After completing each instruction, the processor checks for an interrupt signal
- If there is no interrupt, the processor fetches the next instruction of the current program
- If there is an interrupt signal, the processor:
  - Suspends the program being executed
  - Saves the context (the information related to the interrupted program)
  - Sets the PC to point to the corresponding interrupt handler
  - Switches to executing the interrupt handler
  - Restores the context and resumes the suspended program

Interrupt operation (cont.)
[Figure: the running program executes its instructions up to instruction i, where the interrupt occurs; control passes to the interrupt handler, which ends with RETURN; execution then resumes at instruction i+1.]

Handling multiple interrupt requests
Sequential interrupt processing:
- While one interrupt is being handled, other interrupts are disabled
- The processor ignores further interrupt requests for the time being
- Those requests remain pending and are checked once the current interrupt has been handled
- Interrupts are therefore handled one after another
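The instruction cycle with an interrupt check, as described in section 1.2, can be sketched as a small simulation. This is an illustrative model only, not something taken from the lecture: the toy machine, the opcodes `ADD`, `RETURN`, and `HALT`, the function `run`, and the handler address 500 are all assumptions made for the example. Requests that arrive while a handler is running stay pending and are served afterwards, which mirrors the sequential processing described above.

```python
def run(memory, start, handler_addr, interrupt_steps):
    """Toy instruction cycle: fetch, execute, then check for interrupts.

    memory maps address -> (opcode, operand); interrupt_steps is the set of
    step numbers during which an external interrupt request arrives.
    All names here are hypothetical and exist only for this sketch.
    """
    pc, acc = start, 0
    saved_context = None   # (pc, acc) of the suspended program, if any
    pending = 0            # interrupt requests waiting to be served
    trace = []
    step = 0
    while True:
        # Fetch: read the instruction at PC into IR; PC then points to the next one.
        ir = memory[pc]
        pc += 1
        # Execute: decode IR and perform the requested operation.
        op, arg = ir
        trace.append(op if arg is None else f"{op} {arg}")
        if op == "ADD":
            acc += arg
        elif op == "RETURN":
            pc, acc = saved_context   # restore context, resume the program
            saved_context = None
        elif op == "HALT":
            return acc, trace
        # Interrupt check after the instruction completes.
        if step in interrupt_steps:
            pending += 1
        if pending and saved_context is None:
            pending -= 1
            saved_context = (pc, acc)   # save context
            pc = handler_addr           # PC now points to the handler
        step += 1


PROGRAM = {
    300: ("ADD", 1),
    301: ("ADD", 2),
    302: ("ADD", 3),
    303: ("HALT", None),
    500: ("ADD", 100),      # handler body (changes acc on purpose)
    501: ("RETURN", None),
}

result, trace = run(PROGRAM, start=300, handler_addr=500, interrupt_steps={1})
```

Because the context is saved and restored, the handler's change to the accumulator does not leak into the interrupted program: the program still computes 1 + 2 + 3.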
Handling multiple interrupt requests (cont.)
Prioritized interrupt processing:
- Interrupts are assigned different priority levels
- A lower-priority interrupt can itself be interrupted by a higher-priority interrupt
- Nested interrupts occur
[Figure 3.13 Transfer of Control with Multiple Interrupts: (a) Sequential interrupt processing, (b) Nested interrupt processing; each panel shows the user program, interrupt handler X, and interrupt handler Y]

Input/output operation
- Input/output operation: the exchange of data between an I/O module and the inside of the computer
- Kinds of I/O operation:
  - The CPU exchanges data with the I/O module through I/O instructions in the program
  - The CPU hands over control so that the I/O module exchanges data directly with main memory (DMA, Direct Memory Access)

1.3 Computer buses
Information flow in the computer
- The modules of a computer:
  - CPU
  - Memory modules
  - I/O modules
  -> They need to be connected to one another

Connecting a memory module
[Figure: the memory module receives an address, data, a read control signal, and a write control signal; it outputs data or instructions]

Connecting a memory module (cont.)
- The address supplied selects the memory location
- Data is supplied on a write
- Data or an instruction is delivered on a read
- Memory does not distinguish between instructions and data
- The module receives control signals:
  - Read control
  - Write control

Connecting an I/O module
[Figure: the I/O module receives an address, data from the inside, data from the outside, a read control signal, and a write control signal; it outputs data to the inside, data to the outside, device control signals, and interrupt control signals]

Connecting an I/O module (cont.)
- The address supplied selects the I/O port
- Output:
  - Receives data from the inside (CPU or main memory)
  - Delivers data to the I/O device
- Input:
  - Receives data from the I/O device
  - Delivers data to the inside (CPU or main memory)
- Receives control signals from the CPU
- Issues control signals to the I/O devices
- Issues interrupt signals to the CPU

Connecting the CPU
[Figure: the CPU receives instructions, data, and interrupt control signals; it outputs addresses, data, and memory and I/O control signals]

Connecting the CPU (cont.)
- Issues addresses to the memory modules or I/O modules
- Reads instructions from memory
- Reads data from memory or from I/O modules
- Sends data (after processing) to memory or to I/O modules
- Issues control signals to the memory modules and I/O modules
- Receives interrupt signals

Basic bus structure
- Bus: a set of connecting lines that carry information between the modules of the computer
- Functional buses:
  - Address bus
  - Data bus
  - Control bus
- Bus width: the number of lines of the bus that can carry bits at the same time (used only for the address bus and the data bus)

Diagram of the basic bus structure
[Figure: CPU, memory modules, and I/O modules all attached to the address bus, the data bus, and the control bus]

Address bus
- Function: carries the address that identifies a memory location or an I/O port
- Address bus width:
  - N bits: A(N-1), A(N-2), ..., A2, A1, A0
  -> The maximum number of usable addresses is 2^N (called the address space)
  - Lowest address: 00...000 (binary)
  - Highest address: 11...111 (binary)
- Example: a computer uses a 32-bit address bus (A31-A0) and memory is byte-addressed
  -> It can address 2^32 bytes of memory = 4 GiB

Data bus
- Functions:
  - Carries instructions from memory to the CPU
  - Carries data between the components of the computer
- Data bus width: the number of bits transferred at the same time
  - M bits: D(M-1), D(M-2), ..., D2, D1, D0
  - M is usually 8, 16, 32, or 64 bits
- Example: a computer whose data bus between the CPU and memory is 64 bits wide
  -> It can transfer 8 bytes of memory at a time

Data input: as seen by the I/O module
- The I/O module receives a read control signal from the CPU
- The I/O module obtains the data from the I/O device while the CPU does other work
- When the data is ready -> the I/O module sends an interrupt signal to the CPU
- The CPU requests the data
- The I/O module transfers the data to the CPU

Data input: as seen by the CPU
- Issues the read control signal
- Does other work
- At the end of each instruction cycle, checks for an interrupt request signal
- If interrupted:
  - Saves the context (the contents of the relevant registers)
  - Executes the interrupt handler to input the data
  - Restores the context of the program that was running

Design issues
- How can the CPU determine which I/O module issued the interrupt signal?
- What does the CPU do when several interrupt requests occur at the same time?

Interrupt interfacing methods
- Multiple interrupt request lines
- Software poll
- Hardware poll (daisy chain)
- Interrupt controller (PIC)

Multiple interrupt request lines
[Figure: CPU with an interrupt request register; lines INTR3, INTR2, INTR1, and INTR0 each connect to one I/O module]
- Each I/O module is connected to its own interrupt request line
- The CPU must have several interrupt request lines
- This limits the number of I/O modules
- The interrupt lines are assigned priority levels

Software poll
[Figure: CPU with an interrupt flag; a single INTR line is shared by the I/O modules]
- The CPU runs software that asks each I/O module in turn
- Slow
- The order in which the modules are polled is the priority order

Hardware poll
[Figure: CPU with an interrupt flag, connected to the I/O modules by the data bus and a shared INTR line; the INTA line is chained from the CPU through the I/O modules]
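The software-poll method described above can be sketched as follows: when the shared INTR line is raised, the CPU runs a routine that asks each I/O module in a fixed order whether it has a pending request, and the first module that answers yes is served. This is a minimal illustration, not the lecture's implementation; the class `IOModule`, the flag `interrupt_pending`, and the function `software_poll` are hypothetical names.

```python
class IOModule:
    """Hypothetical I/O module with a status flag the CPU can read."""

    def __init__(self, name):
        self.name = name
        self.interrupt_pending = False


def software_poll(modules):
    """Ask each module in turn; return (name of module served, polls made).

    The list order is the polling order, so it also acts as the priority order.
    """
    for polls, module in enumerate(modules, start=1):
        if module.interrupt_pending:
            module.interrupt_pending = False   # the request is now being served
            return module.name, polls
    return None, len(modules)                  # nobody had a pending request


modules = [IOModule("disk"), IOModule("keyboard"), IOModule("printer")]
modules[1].interrupt_pending = True
modules[2].interrupt_pending = True

first = software_poll(modules)    # keyboard is earlier in the list than printer
second = software_poll(modules)
third = software_poll(modules)
```

The number of polls grows with the position of the requesting module in the list, which is one way to see why the slide calls this method slow.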
Hardware poll (cont.)
- The CPU sends an interrupt acknowledge signal (INTA) to the first I/O module
- If that module did not cause the interrupt, it passes the signal on to the next module, and so on until the module that caused the interrupt is found
- The order in which the I/O modules are connected in the chain determines the priority order

Programmable interrupt controller
[Figure: the PIC is connected to the CPU by the data bus and the INTR and INTA lines; its inputs INTR0, INTR1, ..., INTRn-1 come from the I/O modules]
- PIC: Programmable Interrupt Controller
- The PIC has several interrupt request inputs with assigned priority levels
- The PIC selects the enabled (not masked) interrupt request with the highest priority and forwards it to the CPU

Characteristics of interrupt-driven I/O
- Combines hardware and software:
  - Hardware: interrupts the CPU
  - Software: transfers the data between the CPU and the I/O module
- The CPU controls the I/O directly
- The CPU does not have to wait for the I/O module, so the CPU is used more efficiently

DMA (Direct Memory Access)
- Programmed I/O and interrupt-driven I/O are controlled directly by the CPU:
  - They take up CPU time
- DMA is used to overcome this:
  - A dedicated I/O control module, called the DMAC (DMA Controller), controls the data transfer between the I/O module and main memory

Structure of the DMAC
[Figure: the DMAC contains a data counter, a data register, an address register, and control logic. It connects to the data lines and the address lines. Control signals: bus request, bus grant, interrupt, read, write, DMA request, DMA acknowledge.]

Components of the DMAC
- Data register: holds the data being transferred
- Address register: holds the address of the memory location for the data
- Data counter: holds the number of data words still to be transferred
- Control logic: controls the operation of the DMAC

DMA operation
- The CPU "tells" the DMAC:
  - Whether the transfer is input or output
  - The address of the I/O device (the corresponding I/O port)
  - The start address of the memory block holding the data -> loaded into the address register
  - The number of data words to transfer -> loaded into the data counter
- The CPU does other work
- The DMAC controls the data transfer
- After each data word is transferred:
  - The address register is incremented
  - The data counter is decremented
- When the data counter reaches 0, the DMAC sends an interrupt signal to the CPU to report the end of the DMA transfer

Kinds of DMA
- Block-transfer DMA: the DMAC uses the bus until the whole block of data has been transferred
- Cycle-stealing DMA: the DMAC forces the CPU to suspend for one bus cycle at a time, takes the bus, and transfers one data word
- Transparent DMA: the DMAC detects the cycles in which the CPU is not using the bus and takes the bus to transfer one data word

DMA configuration (1)
[Figure: CPU, DMAC, I/O modules, and memory all attached to the system bus]
- For each data item transferred, the DMAC uses the bus twice:
  - Between the I/O module and the DMAC
  - Between the DMAC and memory

DMA configuration (2)
[Figure: CPU, DMACs, and memory attached to the system bus; the I/O modules are attached directly to the DMACs]
- A DMAC controls one or several I/O modules
- For each data item transferred, the DMAC uses the bus once:
  - Between the DMAC and memory

DMA configuration (3)
[Figure: CPU, DMAC, and memory attached to the system bus; the I/O modules are attached to the DMAC through a separate I/O bus]
- A separate I/O bus supports all DMA-capable devices
- For each data item transferred, the DMAC uses the bus once:
  - Between the DMAC and memory

Characteristics of DMA
- The CPU does not take part in the data transfer
- The DMAC controls the transfer between main memory and the I/O module (entirely in hardware) -> high speed
- Suitable for transferring large blocks of data

I/O processor
- I/O is controlled by a dedicated I/O processor
- The I/O processor runs its own program
- The I/O processor's program can reside in main memory or in a separate memory
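The DMA operation described above (the CPU loads the address register and the data counter, the address register is incremented and the data counter decremented after each word, and an interrupt is raised when the counter reaches 0) can be sketched as a small model. This is only an illustration under stated assumptions: the class `DMAC`, its method names, and the Python list standing in for main memory are hypothetical, and only the input direction is modelled.

```python
class DMAC:
    """Hypothetical model of a DMA controller for input transfers."""

    def __init__(self, memory):
        self.memory = memory            # a list standing in for main memory
        self.address_register = 0
        self.data_counter = 0
        self.interrupt_raised = False

    def program(self, start_address, word_count):
        """What the CPU 'tells' the DMAC before it goes off to do other work."""
        self.address_register = start_address
        self.data_counter = word_count
        self.interrupt_raised = False

    def transfer_word(self, word):
        """Move one word from the I/O module into memory."""
        if self.data_counter == 0:
            raise RuntimeError("no transfer programmed")
        self.memory[self.address_register] = word
        self.address_register += 1      # point to the next memory location
        self.data_counter -= 1          # one word fewer left to transfer
        if self.data_counter == 0:
            self.interrupt_raised = True   # report end of DMA to the CPU


memory = [0] * 8
dmac = DMAC(memory)
dmac.program(start_address=2, word_count=3)
for word in [10, 20, 30]:               # words arriving from the I/O module
    dmac.transfer_word(word)
```

The CPU appears only in `program`; every later step is carried out by the controller, which matches the point that the CPU does not take part in the transfer itself.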
3.3 Interfacing I/O devices
Types of I/O interfacing
- Parallel interfacing
- Serial interfacing

Parallel interfacing
[Figure: parallel I/O module, with one side to the system bus and the other side to the peripheral device]
- Transfers several bits in parallel
- Fast
- Needs many data lines

Serial interfacing
[Figure: serial I/O module, with one side to the system bus and the other side to the peripheral device]
- Transfers one bit at a time
- Needs a converter from parallel data to serial data and/or back
- Slower
- Needs few data lines

Interfacing configurations
- Point to point:
  - One I/O port connects to one device
- Point to multipoint:
  - One I/O port allows several devices to be connected
  - Examples:
    - USB (Universal Serial Bus): 127 devices
    - IEEE 1394 (FireWire): 63 devices
    - Thunderbolt

Thunderbolt
[Figure 7.17 Example Computer Configuration with Thunderbolt: processor, memory, graphics subsystem, platform controller hub (PCH), DisplayPort, PCIe x4, Thunderbolt controller (TC), Thunderbolt connector, Thunderbolt 20 Gbps (max), daisy chain of TCs]
Excerpt shown on the slide (7.7 The External Interface: Thunderbolt and InfiniBand):
"THUNDERBOLT PROTOCOL ARCHITECTURE. Figure 7.18 illustrates the Thunderbolt protocol architecture. The cable and connector layer provides transmission medium access. This layer specifies the physical and electrical attributes of the connector port. The Thunderbolt protocol physical layer is responsible for link maintenance including hot-plug detection and data encoding to provide highly efficient data transfer. The physical layer has been designed to introduce very minimal overhead and provides full-duplex 10 Gbps of usable capacity to the upper layers. The common transport layer is the key to the operation of Thunderbolt and what makes it attractive as a high-speed peripheral I/O technology. Some of the features include:
- A high-performance, low-power, switching architecture
- A highly efficient, low-overhead packet format with flexible quality of service (QoS) support that allows multiplexing of bursty PCI Express transactions ..."
Footnote to the excerpt: "The term hot plug is defined as pulling out a component from a system and plugging in a new one while the main power is still on. It allows an external drive, network adapter, or other peripheral to be plugged in without having to power down the computer."

End of chapter

Chapter 4
PARALLEL ARCHITECTURES
Nguyễn Kim Khánh, Hanoi University of Science and Technology

Chapter contents
4.1 Classification of computer architectures
4.2 Shared-memory multiprocessing
4.3 Distributed-memory multiprocessing
4.4 General-purpose graphics processors

4.1 Classification of computer architectures
Classification of computer architectures (Michael Flynn, 1966)
- SISD: Single Instruction Stream, Single Data Stream
- SIMD: Single Instruction Stream, Multiple Data Stream
- MISD: Multiple Instruction Stream, Single Data Stream
- MIMD: Multiple Instruction Stream, Multiple Data Stream

SISD
[Figure: CU sends the instruction stream (IS) to PU; PU exchanges the data stream (DS) with MU]
- CU: Control Unit
- PU: Processing Unit
- MU: Memory Unit
- One processor
- A single instruction stream
- Data is stored in one memory
- This is the (sequential) von Neumann architecture

SIMD
[Figure: one CU sends the instruction stream (IS) to PU1, PU2, ..., PUn; each PU exchanges a data stream (DS) with its own local memory LM1, LM2, ..., LMn]

SIMD (cont.)
- A single instruction stream controls all the processing units (PUs) at the same time
- Each processing unit has its own local data memory (LM)
- Each instruction is executed on a different set of data
- SIMD models:
  - Vector computer
  - Array processor

MISD
- One data stream is sent to a set of processors
- Each processor executes a different instruction sequence
- No such computer exists in practice
- It may exist in the future

MIMD
- A set of processors
- The processors simultaneously execute different instruction sequences on different data
- MIMD models:
  - Multiprocessors (shared memory)
  - Multicomputers (distributed memory)

MIMD - Shared Memory
Shared-memory multiprocessors
[Figure: CU1, CU2, ..., CUn each send an instruction stream (IS) to PU1, PU2, ..., PUn; all PUs exchange data streams (DS) with one shared memory]

MIMD - Distributed Memory
Distributed-memory multiprocessors, or multicomputers
[Figure: CU1, CU2, ..., CUn each send an instruction stream (IS) to PU1, PU2, ..., PUn; each PU exchanges a data stream (DS) with its own local memory LM1, LM2, ..., LMn; the nodes are joined by a high-performance interconnection network]

Classification of parallel techniques
- Instruction-level parallelism:
  - pipeline
  - superscalar
- Data-level parallelism:
  - SIMD
- Thread-level parallelism:
  - MIMD
- Request-level parallelism:
  - Cloud computing

4.2 Shared-memory multiprocessing
- Symmetric multiprocessor systems (SMP, Symmetric Multiprocessors)
- Non-symmetric multiprocessor systems (NUMA, Non-Uniform Memory Access)
- Multicore processors

SMP or UMA (Uniform Memory Access)
[Figure 8-26 Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.]
Excerpt shown on the slide (Parallel Computer Architectures, Chap. 8):
"Memory consistency is not a done deal. Researchers are still proposing new models (Naeem et al., 2011, Sorin et al., 2011, and Tu et al., 2010).
8.3.3 UMA Symmetric Multiprocessor Architectures
The simplest multiprocessors are based on a single bus, as illustrated in Fig. 8-26(a). Two or more CPUs and one or more memory modules all use the same bus for communication. When a CPU wants to read a memory word, it first checks to see whether the bus is busy. If the bus is idle, the CPU puts the address of the word it wants on the bus, asserts a few control signals, and waits until the memory puts the desired word on the bus.
If the bus is busy when a CPU wants to read or write memory, the CPU just waits until the bus becomes idle. Herein lies the problem with this design. With two or three CPUs, contention for the bus will be manageable; with 32 or 64 it will be unbearable. The system will be totally limited by the bandwidth of the bus, and most of the CPUs will be idle most of the time.
The solution is to add a cache to each CPU, as depicted in Fig. 8-26(b). The cache can be inside the CPU chip, next to the CPU chip, on the processor board, or some combination of all three. Since many reads can now be satisfied out of the local cache, there will be much less bus traffic, and the system can support more CPUs. Thus caching is a big win here. However, as we shall see in a moment, keeping the caches consistent with one another is not trivial.
Yet another possibility is the design of Fig. 8-26(c), in which each CPU has not only a cache but also a local, private memory which it accesses over a dedicated (private) bus. To use this configuration optimally, the compiler should place all the ..."

SMP (cont.)
- One computer with n >= 2 identical processors
- The processors share memory and the I/O system
- Memory access time is the same for all processors
- The processors can perform the same functions
- The system is controlled by a distributed operating system
- Performance: tasks can be executed in parallel
- Fault tolerance

NUMA (Non-Uniform Memory Access)
[Figure 8-32 A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design. Each node has a CPU, an MMU, and memory on a local bus; the nodes are joined by a system bus.]
- There is a single address space shared by all CPUs
- Each CPU can remotely access the memory of another CPU
- Remote memory access is slower than local memory access
Excerpt shown on the slide:
"One of the first NC-NUMA machines (although the name had not yet been coined) was the Carnegie-Mellon Cm*, illustrated in simplified form in Fig. 8-32 (Swan et al., 1977). It consisted of a collection of LSI-11 CPUs, each with some memory addressed over a local bus. (The LSI-11 was a single-chip version of the DEC PDP-11, a minicomputer popular in the 1970s.) In addition, the LSI-11 systems were connected by a system bus. When a memory request came into the (specially modified) MMU, a check was made to see if the word needed was in the local memory. If so, a request was sent over the local bus to get the word. If not, the request was routed over the system bus to the system containing the word, which then responded. Of course, the latter took much longer than the former. While a program could run happily out of remote memory, it took 10 times longer to execute than the same program running out of local memory.
Memory coherence is guaranteed in an NC-NUMA machine because no caching is present. Each word of memory lives in exactly one location, so there is no danger of one copy having stale data: there are no copies. Of course, it now matters a great deal which page is in which memory because the performance penalty for being in the wrong place is so high. Consequently, NC-NUMA machines use elaborate software to move pages around to maximize performance.
Typically, a daemon process called a page scanner runs every few seconds. Its job is to examine the usage statistics and move pages around in an attempt to improve performance. If a page appears to be in the wrong place, the page scanner unmaps it so that the next reference to it will cause a page fault. When the fault occurs, a decision is made about where to place the page, possibly in a different memory. To prevent thrashing, usually there is some rule saying that once a page is placed, it is frozen in place for a time ∆T. Various algorithms have been studied, but the conclusion is that no one algorithm performs best under all circumstances (LaRowe and Ellis, 1991). Best performance depends on the application."

Multicore processors
- Evolution of the processor:
  - Sequential
  - Pipelined
  - Superscalar
  - Multithreaded
  - Multicore: several CPUs on one chip
[Figure 18.1 Alternative Chip Organizations: (a) Superscalar, (b) Simultaneous multithreading, (c) Multicore. Labels include program counter, instruction fetch unit, issue logic, register file, execution units and queues, L1 instruction cache, L1 data cache, L2 cache.]
Excerpt shown on the slide (Chapter 18, Multicore Computers):
"For each of these innovations, designers have over the years attempted to increase the performance of the system by adding complexity. In the case of pipelining, simple three-stage pipelines were replaced by pipelines with five stages, and then many more stages, with some implementations having over a dozen stages. There is a practical limit to how far this trend can be taken, because with more stages, there is the need for more logic, more interconnections, and more control signals. With superscalar organization, increased performance can be achieved by increasing the number of parallel pipelines. Again, there are diminishing returns as the number of pipelines increases. More logic is required to manage hazards and to stage instruction resources. Eventually, a single thread of execution reaches the point where hazards and resource dependencies prevent the full use of the multiple ..."

Organizations of multicore processors
[Figure 18.8 Multicore Organization Alternatives: (a) Dedicated L1 cache, (b) Dedicated L2 cache, (c) Shared L2 cache, (d) Shared L3 cache. Each panel shows CPU cores with L1-D and L1-I caches, the L2 (and L3) caches, main memory, and I/O.]
Excerpt shown on the slide:
"Interprocessor communication is easy to implement, via shared memory locations. The use of a shared L2 cache confines the cache coherency problem to the L1 cache level, which may provide some additional performance advantage. A potential advantage to having only dedicated L2 caches on the chip is that each core enjoys more rapid access to its private L2 cache. This is advantageous for threads that exhibit strong locality. As both the amount of memory available and the number of cores grow, the use of a shared L3 cache combined with either a shared L2 cache or dedicated per-core L2 caches seems likely to provide better performance than simply a massive shared L2 cache. Another organizational design decision in a multicore system is whether the individual cores will be superscalar or will implement simultaneous multithreading ..."

Intel Core Duo
[Figure 18.9 Intel Core Duo Block Diagram. Labels include thermal control, APIC, bus interface, front-side bus.]
Excerpt shown on the slide:
"Intel has introduced a number of multicore products in recent years. In this section, we look at two examples: the Intel Core Duo and the Intel Core i7-990X."

Intel Core i7-990X
[Figure 18.10 Intel Core i7-990X Block Diagram. Each core has 32 kB L1-I and 32 kB L1-D caches and a 256 kB L2 cache; the cores share a 12 MB L3 cache; the chip also contains DDR3 memory controllers and the QuickPath Interconnect.]
Excerpt shown on the slide:
"The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache and the four cores share a 12-MB L3 cache. One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that's likely to be requested soon. It is interesting to compare the performance of this three-level on-chip cache organization with a comparable two-level organization from Intel. Table 18.1 shows the cache access latency, in terms of clock cycles, for two Intel multicore systems running at the same clock frequency. The Core 2 Quad has a shared L2 cache, similar to the Core Duo. The Core i7 improves on L2 cache performance with the use of the dedicated L2 caches, and provides a relatively high-speed access to the L3 cache.
The Core i7-990X chip supports two forms of external communications to other chips. The DDR3 memory controller brings the memory controller for the DDR main memory onto the chip. The interface supports three channels that are 8 bytes wide for a total bus width of 192 bits, for an aggregate data rate of up to 32 GB/s. With the memory controller on the chip, the Front Side Bus is eliminated."

4.3 Distributed-memory multiprocessing
Excerpt shown on the slide (Sec. 8.4, Message-Passing Multicomputers):
"As a consequence of these and other factors, there is a great deal of interest in building and using parallel computers in which each CPU has its own private memory, not directly accessible to any other CPU. These are the multicomputers. Programs on multicomputer CPUs interact using primitives ..."
[Figure: nodes, each with its own disk and I/O, connected by an interconnection network]
like send and receive to explicitly pass messages because they cannot get at each other’s memory with LOAD and STORE instructions Computer This difference CS-HEDSPI2015 Systems completely changes the programming model Each node in a multicomputer consists of one or a few CPUs, some RAM (conceivably shared among the CPUs at that node only), a disk and/or other I/O devices, and a communication processor The communication processors are connected by a high-speed interconnection network of the types we discussed in Sec 8.3.3 Many different topologies, switching schemes, and routing algorithms are used What all multicomputers have in common is that when an application proNKK-HUST gram executes the send primitive, the communication processor is notified and transmits a block of user data to the destination machine (possibly after first asking for and getting permission) A generic multicomputer is shown in Fig 8-36 Local interconnect Intel Core i7-990X 678 MB L2 shared cache MESSAGE-PASSING MULTICOMPUTERS Memory NKK-HUST Power management logic 2MiB shared L2 cache CPU Aug2015 32-kB L1 Caches Thermal control 32KiB instruction and 32KiB data SEC 8.4 Execution resources n  2006 Two x86 superscalar, shared L2 cache Dedicated L1 cache per core Arch state n  Execution resources Intel - Core Duo Arch state NKK-HUST The Intel Core Duo, introduced in 2006, implements two x86 superscalar processors with a shared L2 cache (Figure 18.8c) The general structure of the Intel Core Duo is shown in Figure 18.9 Let us consider the key elements starting from the top of the figure As is common in multicore systems, each core has its own dedicated L1 cache In this case, each core has a 32-kB instruction cache and a 32-kB data cache Each core has an independent thermal control unit With the high transistor density of today’s chips, thermal management is a fundamental capability, especially for laptop and mobile systems The Core Duo thermal control unit is designed to manage chip heat 
dissipation to maximize processor performance within thermal constraints Thermal management also improves ergonomics with a cooler system and lower fan acoustic noise In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements Each core can be defined as an independent thermal zone The maximum temperature for each 32-kB L1 Caches Bài giảng Hệ thống máy tính Disk and I/O (a) (b) Table 18.1 Cache Latency (in clock cycles) Communication processor CPU High-performance interconnection network Figure 8-36 A generic multicomputer n  n  Máy tính qui mô lớn (Warehouse Scale Computers Interconnection NetworksProcessors – MPP) or 8.4.1 Massively Parallel In Fig 8-36 we see that multicomputers are held together by interconnection Máy tínhNow cụm (clusters) networks it is time to look more closely at these interconnection networks Interestingly enough, multiprocessors and multicomputers are surprisingly similar in this respect because multiprocessors often have multiple memory modules that must also be interconnected with one another and with the CPUs Thus the materCS-HEDSPI2015 ial in this section frequently appliesComputer to bothSystems kinds of systems The fundamental reason why multiprocessor and multicomputer interconnection networks are similar is that at the very bottom both of them use message Nguyễn Kim Khánh DCE-HUST 219 Clock Frequency L1 Cache L2 Cache Core Quad 2.66 GHz cycles 15 cycles — Core i7 2.66(c)GHz cycles 11 cycles 39 cycles (d) L3 Cache The DDR synchronous RAM memory is discussed in Chapter CS-HEDSPI2015 (e) (f) (g) (h) Figure 8-37 Various topologies The heavy dots represent switches The CPUs Computer Systems and memories are not shown (a) A star (b) A complete interconnect (c) A tree (d) A ring (e) A grid (f) A double torus (g) A cube (h) A 4D hypercube 220 Interconnection networks can be characterized by their dimensionality For our purposes, the dimensionality is determined by the number of choices 
there are to get from the source to the destination If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero dimensional If there is one dimension in which a choice can be made, for example, go 55 Bài giảng Hệ thống máy tính 624 NKK-HUST Massively Parallel Processors n  n  n  n  PARALLEL COMPUTER ARCHITECTURES Aug2015 CHAP coherency between the L1 caches on the four CPUs Thus when a shared piece of memory resides in more than one cache, accesses to that storage by one processor will be immediately visible to the other three processors A memory reference that misses on the L1 cache but hits on the L2 cache takes about 11 clock cycles A miss on L2 that hits on L3 takes about 28 cycles Finally, a miss on L3 that has to go to the main DRAM takes about 75 cycles The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west In addition, NKK-HUST each processor has a port to the collective network, used for broadcasting data to all processors The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network At the next level up, IBM designed a custom card that holds one of the chips shown in Fig 8-38 along with GB of DDR2 DRAM The chip and the card are shown in Fig 8-39(a)–(b) respectively IBM Blue Gene/P Hệ thống qui mô lớn Đắt tiền: nhiều triệu USD Dùng cho tính toán khoa học toán có số phép toán liệu lớn Siêu máy tính 2-GB DDR2 DRAM Chip: processors 8-MB L3 cache (a) Card Chip CPUs GB Board 32 Cards 32 Chips 128 CPUs 64 GB Cabinet 32 Boards 1024 Cards 1024 Chips 4096 CPUs TB System 72 Cabinets 73728 Cards 73728 Chips 294912 CPUs 144 TB (b) (c) (d) (e) Figure 8-39 The BlueGene/P: (a) chip (b) card (c) board (d) cabinet (e) system The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board 
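The packaging hierarchy of Figure 8-39 can be checked with a few multiplications. The following is a minimal sketch: the per-level counts come from the figure, while the constant and function names are illustrative and not part of the lecture. It assumes 1 TB = 1024 GB, which is the convention that reproduces the 144 TB figure.

```python
# Per-level counts taken from the Blue Gene/P packaging hierarchy (Figure 8-39).
CPUS_PER_CHIP = 4
CHIPS_PER_CARD = 1
DRAM_GB_PER_CARD = 2
CARDS_PER_BOARD = 32
BOARDS_PER_CABINET = 32
CABINETS_PER_SYSTEM = 72


def blue_gene_p_totals(cabinets=CABINETS_PER_SYSTEM):
    """Return card, chip, CPU and DRAM totals for a system with `cabinets` cabinets."""
    cards = cabinets * BOARDS_PER_CABINET * CARDS_PER_BOARD
    chips = cards * CHIPS_PER_CARD
    cpus = chips * CPUS_PER_CHIP
    dram_gb = cards * DRAM_GB_PER_CARD
    return {
        "cards": cards,
        "chips": chips,
        "cpus": cpus,
        "dram_gb": dram_gb,
        "dram_tb": dram_gb / 1024,  # assumption: 1 TB = 1024 GB
    }


if __name__ == "__main__":
    print("one cabinet:", blue_gene_p_totals(1))
    print("full system:", blue_gene_p_totals())
```

Calling `blue_gene_p_totals(1)` gives the single-cabinet row of the figure, and the default call gives the full 72-cabinet system.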
Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d). Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e). A PowerPC 450 can issue up to ... instructions/cycle, thus ...

Cluster

- Many computers are connected by a high-speed interconnection network (~ Gbps).
- Each computer can work independently (PC or SMP).
- Each computer is called a node.
- The computers can be managed to work in parallel as a group (cluster).
- The whole system can be regarded as one parallel computer.
- High availability
- High fault tolerance

Google's PC cluster

[Figure 8-44. A typical Google cluster: OC-12 and OC-48 fiber links feed two 128-port Gigabit Ethernet switches; each 80-PC rack connects to the switches through two gigabit Ethernet links.]

Excerpt (Sec. 8.4, Message-Passing Multicomputers): A rack does not have to hold exactly 80 PCs, and switches can be larger or smaller than 128 ports; these are just typical values for a Google cluster.

Power density is also a key issue. A typical PC burns about 120 watts, or about 10 kW per rack. A rack needs about 3 m² so that maintenance personnel can install and remove PCs and for the air conditioning to function. These parameters give a power density of over 3000 watts/m². Most data centers are designed for 600-1200 watts/m², so special measures are required to cool the racks.

Google has learned three key things about running massive Web servers that bear repeating:
1. Components will fail, so plan for it.
2. Replicate everything for throughput and availability.
3. Optimize price/performance.

4.4 General-purpose graphics processing units

- SIMD architecture
- Derived from the graphics processing unit (GPU), which supports 2D and 3D graphics processing: data-parallel processing
- GPGPU: General-Purpose Graphics Processing Unit
- Hybrid CPU/GPGPU systems:
  - CPU as the host: executes sequentially
  - GPGPU: parallel computation

Graphics processors in a computer

Hardware Execution

Excerpt (NVIDIA Fermi architecture overview): CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.

GPGPU: NVIDIA Tesla

- Streaming multiprocessors
- Multiple streaming processors within each streaming multiprocessor

GPGPU: NVIDIA Fermi

An Overview of the Fermi Architecture

Excerpt (NVIDIA Fermi architecture overview): The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of ... GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.

[Figure. Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).]

NVIDIA Fermi

- 16 Streaming Multiprocessors (SM)
- Each SM has 32 CUDA cores
- Each CUDA core (CUDA: Compute Unified Device Architecture) has one floating-point unit (FPU) and one integer unit (IU)

[Figure. Fermi Streaming Multiprocessor (SM): instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 x 32-bit); 32 CUDA cores; 16 LD/ST units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache. Each CUDA core contains a dispatch port, an operand collector, an FP unit, an INT unit and a result queue.]

Third Generation Streaming Multiprocessor

Excerpt (NVIDIA Fermi architecture overview): The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.

512 High Performance CUDA cores

Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.

In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.

16 Load/Store Units

Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.
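The difference between a fused multiply-add and a separate multiply followed by an add can be illustrated on a CPU. The sketch below is an emulation, not GPU code: it assumes Python floats are IEEE 754 double precision (true on common platforms) and uses exact rational arithmetic from the standard `fractions` module to model the single final rounding step. The function names and the operand values are chosen for illustration only.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Unfused multiply-add: the product is rounded, then the sum is rounded again."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: evaluate a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Operands chosen so that the exact product, 1 - 2**-60, is not representable
# in double precision and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

if __name__ == "__main__":
    print("separate rounding:", multiply_then_add(a, b, c))
    print("single rounding:  ", fused_multiply_add(a, b, c))
```

With separate rounding the small term is lost when the product is rounded, so the sum comes out as zero; with a single final rounding the result keeps the term -2**-60.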

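The advice that threads in a warp should follow the same code path can be made concrete with a small host-side model. This is a sketch under stated assumptions, not CUDA code: only the warp size of 32 comes from the text, threads are assumed to be assigned to warps by consecutive thread index, and the function names and the two example branch conditions are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the Fermi description


def warp_of(thread_id):
    """Warp index of a thread, assuming consecutive thread ids share a warp."""
    return thread_id // WARP_SIZE


def paths_per_warp(num_threads, branch):
    """Map each warp index to the number of distinct branch outcomes its threads take."""
    outcomes = {}
    for tid in range(num_threads):
        outcomes.setdefault(warp_of(tid), set()).add(branch(tid))
    return {warp: len(taken) for warp, taken in outcomes.items()}


def branch_on_parity(tid):
    """Neighbouring threads disagree, so every warp contains both outcomes."""
    return tid % 2 == 0


def branch_on_warp(tid):
    """All threads of a warp agree, so each warp follows a single path."""
    return warp_of(tid) % 2 == 0


if __name__ == "__main__":
    print(paths_per_warp(256, branch_on_parity))
    print(paths_per_warp(256, branch_on_warp))
```

A block of 256 threads forms 8 warps. Branching on thread parity leaves two code paths in every warp, which is the situation the text advises against, whereas a condition that is uniform within each warp leaves one path per warp.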