Tensor Processing Units: Architecture
Originally written on April 15th, 2019.
Updated: December 2019
In this 2nd part, we take a look at TPUs and their architecture.
Continuing from the 1st tutorial: neural network models consist of matrix multiplies of various sizes. That's what forms a fully connected layer; in a CNN, the multiplies tend to be smaller. The TPU architecture is built around doing exactly those operations: once all the partial sums have been accumulated and read out of the accumulators, everything passes through an activation pipeline. That non-linearity is what makes it a neural network, even though most of the work is plain linear algebra.
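As a quick illustration of that flow, here is a minimal sketch (NumPy, with made-up layer sizes) of a fully connected layer: a matrix multiply whose partial sums are accumulated, followed by a non-linear activation, here ReLU.

```python
import numpy as np

def fully_connected(x, W, b):
    """Fully connected layer: a matrix multiply (accumulated partial sums),
    then a non-linearity applied to the result (the "activation pipeline")."""
    z = W @ x + b            # each output is a row of W dotted with x, summed up
    return np.maximum(z, 0)  # ReLU non-linearity

# Illustrative sizes: 256 inputs, 128 outputs.
x = np.random.randn(256).astype(np.float32)
W = np.random.randn(128, 256).astype(np.float32)
b = np.zeros(128, dtype=np.float32)

print(fully_connected(x, W, b).shape)  # (128,)
```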
Neural networks are essentially a series of matrix operations applied to input data, and if there is a lot of data to feed in, that is a lot of matrix operations to compute. Like, a lot: matrices full of numbers, all being multiplied in parallel. Most of the math is just "multiply a bunch of numbers, and add the results". We can combine those two steps into a single operation called a multiply-accumulate (MAC), and if we don't need to do anything else, we can multiply-accumulate really, really fast. Let's take an in-depth look at Google's TPU architecture.
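Before diving in, here is a tiny sketch of that idea (plain Python with NumPy for comparison; the function name and sizes are just illustrative): a matrix multiply is nothing but a grid of multiply-accumulate operations, one running sum per output element.

```python
import numpy as np

def matmul_as_macs(A, B):
    """Compute C = A @ B using nothing but multiply-accumulate (MAC) steps."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(m):
        for j in range(n):
            for p in range(k):
                # One MAC: multiply a pair of numbers, add into the accumulator.
                C[i, j] += A[i, p] * B[p, j]
    return C

A = np.random.randn(4, 3).astype(np.float32)
B = np.random.randn(3, 5).astype(np.float32)
print(np.allclose(matmul_as_macs(A, B), A @ B))  # True
```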
The TPU
- Google took just 15 months to produce TPUv1, which was astonishingly fast for an ASIC.
- ASICs are expensive up front: they require specialized engineers, and manufacturing costs start at around a million dollars.
- And
they are inflexible: there’s no way to change the chip once it’s finished.
But if you know you’ll be doing one particular job in enough volume, the
recurring benefits can make up for the initial drawbacks.
- ASICs are generally the fastest and most energy-efficient way to accomplish a task.
- The data of a neural network is arranged in matrices, i.e. 2D arrays of numbers.
- So, Google decided they needed to build a matrix machine: the Tensor Processing Unit, or TPU. Since they really only care about multiply-accumulate, they prioritized it over the other instructions a general-purpose processor would normally support.
- Google
wanted to design a chip specifically for the matrix operations that neural
networks require so that it would run them even more efficiently.
- TPU hardware consists of four independent chips.
- The
following block diagram describes the components of a single chip.
- Each
chip consists of two compute cores called Tensor Cores.
- A
Tensor Core consists of scalar, vector and matrix units (MXU).
- In
addition, 8 GB of on-chip memory (HBM) is associated with each Tensor
Core.
- The
bulk of the compute horsepower in a Cloud TPU is provided by the MXU.
- Each
MXU is capable of performing 16K multiply-accumulate operations in each
cycle.
- While
the MXU's inputs and outputs are 32-bit floating point values, the MXU
performs multiplies at reduced bfloat16 precision.
- Bfloat16 is a 16-bit floating point representation that provides better training and model accuracy than the IEEE half-precision representation (see the sketch after this list for the bit layout and the mixed-precision multiply pattern).
- From a software perspective, each of the 8 cores on a Cloud TPU can execute user computations (XLA ops) independently. (XLA is a just-in-time compiler that takes as input High Level Optimizer (HLO) operations produced by the TensorFlow server. XLA generates binary code to be run on Cloud TPU, including orchestration of data from on-chip memory to the hardware execution units and inter-chip communication.)
- High-bandwidth
interconnects allow the chips to communicate directly with each other.
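To make the bfloat16 and MXU bullets above concrete, here is a minimal sketch in plain Python/NumPy (the helper names are made up, and simple truncation stands in for proper rounding). Bfloat16 keeps float32's 1 sign bit and 8 exponent bits but only 7 mantissa bits, whereas IEEE half precision has 5 exponent bits and 10 mantissa bits; the MXU-style pattern is to multiply reduced-precision inputs while accumulating the sums at full precision.

```python
import numpy as np

def to_bfloat16(x):
    """Simulate bfloat16 by keeping only the top 16 bits of a float32:
    1 sign bit + 8 exponent bits + 7 mantissa bits (truncation, no rounding)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def mxu_style_matmul(A, B):
    """Multiply (simulated) bfloat16 inputs, accumulate partial sums in float32,
    mirroring the reduced-precision multiply / full-precision accumulate pattern."""
    return to_bfloat16(A) @ to_bfloat16(B)

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)

exact = A @ B
approx = mxu_style_matmul(A, B)
# bfloat16 keeps the full float32 exponent range, so the result stays close:
print(np.max(np.abs(exact - approx)) / np.max(np.abs(exact)))
```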
Citations:
1) Google Cloud.