Tensor Processing Units: Architecture

Originally written on April 15th, 2019.
Updated in December 2019.


In this second part, we take a look at TPUs and their architecture.
Continuing from the first tutorial: a neural network model consists of matrix multiplies of various sizes. A fully connected layer is one big matrix multiply, while in a CNN the work tends to break down into smaller matrix multiplies. The TPU architecture is about doing exactly those things: once all the partial sums have been accumulated and read out of the accumulators, everything passes through an activation pipeline. That non-linearity is what makes it a neural network, even though the bulk of the work is linear algebra.
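To make that concrete, here is a minimal NumPy sketch of a fully connected layer: one matrix multiply followed by a non-linearity. The function and variable names are my own, for illustration only.

```python
import numpy as np

def dense_layer(x, w, b):
    """A fully connected layer: one matrix multiply, then a non-linearity.

    x: (batch, in_features) input activations
    w: (in_features, out_features) weight matrix
    b: (out_features,) bias vector
    """
    z = x @ w + b            # the matrix multiply (plus bias): pure linear algebra
    return np.maximum(z, 0)  # ReLU activation: the non-linearity

# A batch of 4 inputs through an 8 -> 3 layer.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 3))
b = np.zeros(3)
print(dense_layer(x, w, b).shape)  # (4, 3)
```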

Neural networks are just a series of matrix operations applied to input data, and if there is a lot of data to input, that is a lot of matrix operations to compute. Like, a lot. Matrices full of numbers, all being multiplied in parallel. Most of the math is just 'multiply a bunch of numbers, and add the results'. We can connect these two steps into a single operation called a multiply-accumulate (MAC), and if we don't need to do anything else, we can multiply-accumulate really, really fast. Let's take an in-depth look at Google's TPU architecture.
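First, though, a toy illustration of the MAC itself (my own code, not TPU internals): a dot product is nothing but a chain of MACs.

```python
def multiply_accumulate(acc, a, b):
    """One MAC: multiply two numbers, add the product to a running total."""
    return acc + a * b

def dot(xs, ws):
    """A dot product is just a chain of MACs."""
    acc = 0.0
    for a, b in zip(xs, ws):
        acc = multiply_accumulate(acc, a, b)
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Every entry of a matrix product is one of these dot products, which is why hardware that does MACs fast does matrix multiplies fast.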


The TPU



  1. Google took 15 months for the TPUv1, and that was astonishingly fast for an ASIC.
  2. ASICs are initially expensive: they require specialized engineers, and manufacturing costs start at around a million dollars.
  3. And they are inflexible: there’s no way to change the chip once it’s finished. But if you know you’ll be doing one particular job in enough volume, the recurring benefits can make up for the initial drawbacks.
  4. ASICs are generally the fastest and most energy-efficient way to accomplish a task.





  1. The data of a neural network is arranged in matrices: 2D arrays of numbers.
  2. Google wanted to design a chip specifically for the matrix operations that neural networks require, so that it would run them even more efficiently.
  3. So Google decided they needed to build a matrix machine (the Tensor Processing Unit, or TPU). And since they really only care about multiply-accumulate, they prioritized it over the other instructions that a processor would normally support.



  1. TPU hardware consists of four independent chips.
  2. The following block diagram describes the components of a single chip.
  3. Each chip consists of two compute cores called Tensor Cores.
  4. A Tensor Core consists of scalar, vector and matrix units (MXU).
  5. In addition, 8 GB of on-chip memory (HBM) is associated with each Tensor Core.
  6. The bulk of the compute horsepower in a Cloud TPU is provided by the MXU.
  7. Each MXU is capable of performing 16K multiply-accumulate operations in each cycle.
  8. While the MXU's inputs and outputs are 32-bit floating point values, the MXU performs multiplies at reduced bfloat16 precision.
  9. Bfloat16 is a 16-bit floating point representation that provides better training and model accuracy than the IEEE half-precision representation. (A short sketch of bfloat16's reduced precision follows this list.)
  10. From a software perspective, each of the 8 cores on a Cloud TPU can execute user computations (XLA ops) independently. XLA is a just-in-time compiler that takes as input the High Level Optimizer (HLO) operations produced by the TensorFlow server, and generates binary code to be run on the Cloud TPU, including orchestration of data from on-chip memory to the hardware execution units, as well as inter-chip communication.
  11. High-bandwidth interconnects allow the chips to communicate directly with each other.
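Two of the numbers above are easy to sanity-check in plain NumPy. First, the MXU is a 128 x 128 grid of MAC cells, which gives the 16K multiply-accumulates per cycle quoted above. Second, bfloat16 is essentially a float32 with the lower 16 bits dropped: it keeps float32's 8 exponent bits (the same dynamic range) but only 7 mantissa bits, roughly 3 significant decimal digits. The emulation below is my own, for illustration; it truncates, whereas real hardware typically rounds to nearest.

```python
import numpy as np

# 16K MACs per cycle: a 128 x 128 array of MAC cells.
print(128 * 128)  # 16384

def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of a float32.

    Same exponent bits as float32, but only 7 mantissa bits survive.
    Truncation is used here for simplicity; hardware usually rounds.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

print(np.float32(np.pi))   # 3.1415927  (float32: ~7 decimal digits)
print(to_bfloat16(np.pi))  # 3.140625   (bfloat16: ~3 decimal digits)
```

The punchline: bfloat16 halves the memory and bandwidth per value while keeping float32's exponent range, which is why it behaves better for training than IEEE half precision, which trades exponent bits for mantissa bits.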






Two matrices can be multiplied when the number of columns in one is the same as the number of rows in the other; otherwise, they are incompatible. The idea behind the matrix unit is to take that structure literally in hardware: a matrix unit has the dimensions of a matrix, but with the entries 'scooped out' and replaced by a grid of multiply-accumulate cells that data flows through. More on systolic arrays later. In the next part, we are going to take a look at parallel processing on MXUs.
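Before we get there, here is a quick NumPy check of the compatibility rule above (my own example): the inner dimensions must match, or the multiply is undefined.

```python
import numpy as np

a = np.ones((2, 3))  # 2 rows, 3 columns
b = np.ones((3, 4))  # 3 rows, 4 columns

# Compatible: a has 3 columns and b has 3 rows, so the product is (2, 4).
print((a @ b).shape)  # (2, 4)

# Incompatible: b has 4 columns but a has only 2 rows.
try:
    b @ a
except ValueError as err:
    print("incompatible:", err)
```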

Citations:
1) Google Cloud.
