Parallel Processing on Matrix Multiplying Units (MMUs): The Need for Matrix Multiplication


Originally written on April 15th, 2019.
Updated: December 2019

In this tutorial, we will take a look at parallel processing on MMUs and the need for matrix multiplication.


Parallel Processing on MMUs

Typical RISC processors provide instructions for simple calculations such as multiplying or adding two numbers. These are scalar processors: each instruction processes a single operation (a scalar operation). Even though CPUs run at clock speeds in the gigahertz range, executing a large matrix operation as a long sequence of scalar operations can still take a long time. One effective and well-known way to improve the performance of such large matrix operations is vector processing, where the same operation is performed concurrently across a large number of data elements. CPUs incorporate instruction set extensions that express such vector operations, and the streaming multiprocessors (SMs) of a GPU are effectively vector processors, with many SMs on a single die. Machines with vector processing support can process hundreds to thousands of operations in a single clock cycle.
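
To make the scalar-versus-vector distinction concrete, here is a minimal sketch in Python/NumPy (the array size is arbitrary, chosen only for illustration) that contrasts an element-by-element loop with a single vectorized operation:

```python
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)

# Scalar-style processing: one addition per loop iteration,
# analogous to issuing one scalar instruction per data element.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# Vector-style processing: the same operation expressed once over the
# whole array; NumPy dispatches it to optimized, SIMD-friendly native code.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)
```

On typical hardware the vectorized form runs orders of magnitude faster than the loop, which is exactly the gap that CPU vector extensions and GPU streaming multiprocessors are built to exploit.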



A CPU is a scalar machine, which means it processes instructions one step at a time. A CPU can perform matrix operations reasonably well, but it has to work through them a few elements at a time.

A CPU is composed of just a few cores with lots of cache memory, and it can handle only a few software threads at a time. Luckily, GPUs (Graphics Processing Units) can perform matrix operations orders of magnitude better than CPUs.





A GPU is composed of hundreds of cores that can handle thousands of threads simultaneously. That is because GPUs were designed for 3D game rendering, which often involves parallel operations. The ability of a GPU with 100+ cores to process thousands of threads can accelerate some software by 100x over a CPU alone. What's more, the GPU achieves this acceleration while being more power- and cost-efficient than a CPU. So when neural networks run on GPUs, they run much faster than on CPUs.



A GPU is a vector machine. You can give it a long list of data (a 1D vector) and run computations on the entire list at the same time. This way we can perform far more computations per second, but we have to perform the same computation on a whole vector of data in parallel. GPUs are general-purpose chips: they don't just perform matrix operations, they can handle almost any kind of computation. Above all, GPUs are optimized for taking huge batches of data and performing the same operation over them again and again, very quickly.
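
As a small illustration of "the same operation over huge batches of data", the sketch below (with made-up sizes) contrasts transforming vectors one at a time with transforming the entire batch in a single call; the batched form is the pattern GPUs are built for:

```python
import numpy as np

W = np.random.rand(256, 256)          # one fixed linear transformation
batch = np.random.rand(10_000, 256)   # a large batch of input vectors

# One vector at a time: many small, independent operations.
out_loop = np.stack([W @ v for v in batch])

# The whole batch at once: the same operation expressed as a single
# large matrix multiplication.
out_batched = batch @ W.T

assert np.allclose(out_loop, out_batched)
```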

Why Matrix Multiplication?


Google claimed that its TPU is orders of magnitude better in performance and energy efficiency than CPUs and GPUs, and attributed this success to its domain-specific design. In the blog post AI Drives the Rise of Accelerated Computing in Data Centers, NVIDIA responded to these claims with its own performance metrics and concluded that TPUs and GPUs share a common theme: domain-specific acceleration of tensor computations. Tensors are high-dimensional data arrays used to represent the layers of Deep Neural Networks, and a Deep Learning task can be described as a Tensor Computation Graph:
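
As a concrete (and deliberately tiny) example, a two-layer fully connected network can be read as a tensor computation graph: each line below is a node (an operation on tensors), and the intermediate arrays are the edges. All shapes here are invented for illustration:

```python
import numpy as np

x  = np.random.rand(32, 784)   # a batch of 32 input vectors
W1 = np.random.rand(784, 128)  # layer-1 weights
b1 = np.random.rand(128)
W2 = np.random.rand(128, 10)   # layer-2 weights
b2 = np.random.rand(10)

h = x @ W1 + b1                # node: matrix multiply + bias
h = np.maximum(h, 0.0)         # node: ReLU activation
logits = h @ W2 + b2           # node: matrix multiply + bias
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)   # node: softmax
```

Most of the work in this graph sits in the two matrix multiplications, which is the point the rest of this section builds on.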




A TPU is domain-specific to matrix computations in Deep Learning. A GPU has evolved from being domain-specific to 3D graphics into a general-purpose parallel computing machine. What makes a GPU domain-specific to Deep Learning is its highly optimized matrix library.

When a GPU is used for Deep Learning, tensors are unfolded into 2-dimensional matrices, and matrix computations are handled by calling matrix kernels from the host CPU; matrix kernels are GPU programs that implement different types of matrix computations. Matrix multiplication consists of many MAC (multiply-accumulate) operations, and these MAC operations are the most time-consuming part of Deep Learning. Even though the GPU environment allows programmers to write their own matrix kernels, they predominantly use the pre-built matrix multiplication kernels as black boxes.
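
The sketch below shows, under assumed shapes, what "unfolding" a 4-dimensional activation tensor into a 2-dimensional matrix can look like before a matrix-multiplication kernel is invoked; real frameworks use more elaborate layouts (e.g. im2col for convolutions), so treat this purely as an illustration:

```python
import numpy as np

# A batch of feature maps: (batch, channels, height, width).
activations = np.random.rand(8, 64, 14, 14)

# Unfold the 4D tensor into a 2D matrix: one row per example,
# one column per feature.
matrix = activations.reshape(8, 64 * 14 * 14)   # shape (8, 12544)

# A fully connected layer then becomes one big matrix multiplication.
weights = np.random.rand(64 * 14 * 14, 256)
output = matrix @ weights                        # shape (8, 256)

# This single call performs 8 * 12544 * 256 multiply-accumulate (MAC)
# operations, roughly 25.7 million MACs.
```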

Matrix Machine

The data of a neural network is arranged in matrices, i.e., 2D arrays, so we'll need to build a matrix machine. And since we really only care about multiply-accumulate, we'll prioritize it over the other instructions a processor would normally support: we'll devote most of our chip to the MACs that perform matrix multiplication and mostly ignore other operations.
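
To make "we really only care about multiply-accumulate" explicit, here is a naive, pedagogical matrix multiplication in which the MAC step is spelled out; this is not how an MMU or a GPU kernel is actually implemented, it only shows where all the work goes:

```python
import numpy as np

def matmul_naive(A, B):
    """C = A @ B, written so the multiply-accumulate (MAC) step is explicit."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]   # one MAC: multiply, then accumulate
            C[i, j] = acc
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(matmul_naive(A, B), A @ B)
```

Multiplying an m x k matrix by a k x n matrix costs m * n * k MACs, which is why a matrix machine spends most of its silicon on MAC units and comparatively little on anything else.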

In the next tutorial, we are going to talk about Matrix Machines and Systolic Arrays in depth.


Citations
1) Google Cloud.
