Making AI Efficient
Dr Janet Bastiman
@yssybyl
STORYSTREAM.AI
Neurons
86 Billion Neurons, 20 Watts
Multiple pathways
Visual cortex estimated at 13 billion neurons
Visual system ~ 3 Watts (2.3 × 10⁻¹⁰ W per neuron)
Computers…
Device      | GPU             | CUDA cores | Power (W) | W per CUDA core | Transistors | W per transistor
Dell XPS    | 1050 Ti         | 768        | 130       | 17 × 10⁻²       | 3.3 × 10⁹   | 3.9 × 10⁻⁸
Dell Server | 1080 Ti         | 3584       | 1100      | 31 × 10⁻²       | 1.2 × 10¹⁰  | 9.2 × 10⁻⁸
DGX-1       | 8 × Tesla P100  | 28672      | 6400      | 22 × 10⁻²       | 1.2 × 10¹¹  | 7.7 × 10⁻⁸
HGX-2       | 16 × Tesla V100 | 81920      | 12800     | 16 × 10⁻²       | 3.3 × 10¹¹  | 3.9 × 10⁻⁸
Neurons are 1000x more energy efficient than transistors
and a billion times more efficient than a single CUDA core
If we are to get the most out of machines
we need to recognise the cost of what we use
How did we get here?
• Abstractions
• Higher level languages
• Growing resources
• Laziness
But what if you can’t use the latest and greatest…?
What is efficient?
“Soon algorithms will be measured by the amount of intelligence they provide per Watt” – Taco Cohen, Qualcomm
Minimal memory requirements
Speed may be more important than accuracy in some cases
Every flop is sacred
Start with the basics
Learn how to code well in whatever language you choose
Understand the boundaries and the frameworks
Optimise the code flow
Discrete mathematics – know your computational linear algebra
Stop cutting and pasting from Stack Overflow without understanding
Optimise…
Calling a library that just does the looping does not count as optimisation
Simple Python Performance tricks
Pythonic code is more readable and usually faster by design:
- Know the basic data structures – dicts and sets have O(1) average lookup (a membership sketch follows the snippets below)
- Reduce memory footprint and avoid the + operator on strings
- Use built-in functions
- Move calculations outside of loops
- Keep code small
# Slow: the regex is recompiled on every pass through the loop
import re

for i in big_it:
    m = re.search(r'\d{2}-\d{2}-\d{4}', i)
    if m:
        ...

# Faster: compile once, outside the loop
date_regex = re.compile(r'\d{2}-\d{2}-\d{4}')
for i in big_it:
    m = date_regex.search(i)
    if m:
        ...

# Slow: explicit loop and repeated method lookups
newlist = []
for word in oldlist:
    newlist.append(word.upper())

# Faster: the built-in map runs the loop in C
# (wrap in list() if you need a list in Python 3)
newlist = map(str.upper, oldlist)

# Slow
msg = 'hello ' + my_var + ' world'
# Faster
msg = 'hello %s world' % my_var
# Or better:
msg = 'hello {} world'.format(my_var)

# Slow: += builds a new string on every line
msg = 'line1\n'
msg += 'line2\n'
msg += 'line3\n'

# Faster: collect the parts, then join once
msg = ['line1', 'line2', 'line3']
'\n'.join(msg)
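On the data-structures bullet, a tiny illustration (runnable as-is) of why hash-based containers matter for membership tests:

# Membership tests: O(n) on a list, O(1) on average for a set
words = ['word%d' % i for i in range(100000)]
word_set = set(words)

'word99999' in words      # scans the list element by element
'word99999' in word_set   # a single hash lookup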
Hardware
We can’t all afford Nvidia’s latest and greatest
Most of us are restricted by in-house hardware or a budget for cloud services
Horse-sized duck vs duck-sized horses?
Performance on benchmarks is not necessarily indicative of your own networks.
Value?
Top down
Tight code will always run faster
Maximise CPU and GPU utilisation – split your program
• Parallelise I/O operations (a minimal sketch follows the config snippet below)
• Parallelise data transformations
Understand the real requirements of your GPU with allow_growth
Use multiple GPUs if necessary
Put your code on the most efficient part of the system
config = tf.ConfigProto()
# Allocate GPU memory on demand instead of claiming it all up front
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
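As a minimal sketch of parallelising I/O (the file names and the load_and_decode helper are hypothetical stand-ins):

from concurrent.futures import ThreadPoolExecutor

def load_and_decode(path):
    # Hypothetical I/O-bound step: read a file from disk
    with open(path, 'rb') as f:
        return f.read()

paths = ['batch_%03d.bin' % i for i in range(32)]  # hypothetical file list

# Overlap the disk reads instead of loading files one at a time,
# so the GPU is not left waiting on I/O
with ThreadPoolExecutor(max_workers=8) as pool:
    batches = list(pool.map(load_and_decode, paths))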
Remove unnecessary imports
Don’t import the world – use what you need
numpy adds 320 MB
Do you really need numpy, pandas, sklearn and tf?
Putting imports within rarely called functions may be beneficial (a sketch follows below)
Use the best tool for the job
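A minimal sketch of a function-local import – the heavy dependency is only loaded if the function is actually called (the plotting function here is illustrative):

def plot_confusion_matrix(matrix):
    # Imported here so the heavy matplotlib stack is only loaded
    # when someone actually asks for a plot
    import matplotlib.pyplot as plt
    plt.imshow(matrix)
    plt.show()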
Limit data in
This is what we do!
Large images mean large networks
Don’t learn the noise
Focussed data will mean better results with fewer examples (a cropping sketch follows below)
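As a small sketch of focusing the input before it reaches the network (the file name and crop box are illustrative):

from PIL import Image

img = Image.open('vehicle.jpg')                  # illustrative input image
roi = img.crop((120, 80, 520, 380))              # keep only the region of interest
small = roi.resize((224, 224), Image.BILINEAR)   # shrink to the network's input size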
Optimise Implementation
Not every problem needs tensorflow
PCA, naïve Bayes, SVM, Fourier transforms…
Pre-optimised networks – let someone else do the hard work…
Understand your problem, understand your data, pick the best tool for the job
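For example, a classic technique like PCA is a one-liner with scikit-learn (toy data shown; no deep learning framework required):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(1000, 50)     # toy data: 1000 samples, 50 features
pca = PCA(n_components=10)
X_small = pca.fit_transform(X)    # project down to the 10 strongest components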
Bayesian Optimisation
https://arxiv.org/abs/1502.03492
https://arxiv.org/abs/1712.02902
https://github.com/SheffieldML/GPyOpt
Established methodology, typically using a Gaussian process
Some scaling problems
Speeds up training
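A minimal GPyOpt sketch, assuming an objective f that would normally train a model and return a validation loss (the quadratic here is a stand-in):

import numpy as np
import GPyOpt

def f(x):
    # Stand-in objective; x arrives as a 2-D array of candidate points.
    # In practice: train with these hyperparameters, return validation loss.
    return np.sum((x - 0.03) ** 2, axis=1, keepdims=True)

domain = [{'name': 'learning_rate', 'type': 'continuous', 'domain': (1e-4, 1e-1)}]
opt = GPyOpt.methods.BayesianOptimization(f=f, domain=domain)
opt.run_optimization(max_iter=15)
print(opt.x_opt, opt.fx_opt)   # best point found and its objective value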
Pruning
https://arxiv.org/abs/1707.06168
https://github.com/yihui-he/channel-pruning
Aim to reduce channels in the feature map while minimising error
LASSO regression
The initial problem is NP-hard
Yihui He et al. added simplifying constraints that may not be relevant to your problem
5× speed increase
0.3% error increase
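A much-simplified sketch of the idea – ranking output channels by magnitude and dropping the weakest, rather than the paper's LASSO-based selection:

import numpy as np

def prune_channels(kernel, keep_ratio=0.5):
    # kernel: conv weights of shape (h, w, in_channels, out_channels).
    # Rank output channels by L1 norm and keep only the strongest.
    norms = np.abs(kernel).sum(axis=(0, 1, 2))
    n_keep = max(1, int(keep_ratio * kernel.shape[-1]))
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return kernel[..., keep], keep

kernel = np.random.randn(3, 3, 64, 128)
pruned, kept = prune_channels(kernel, keep_ratio=0.25)   # 128 -> 32 channels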
Quantisation
Trained models contain a lot of floats
Precision can often be reduced to 8 bits
Two floats store the maximum and minimum values
Quantised values are spread linearly between them, so the scheme can represent ranges of arbitrary magnitude
There is an impact on accuracy, so use with care
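A minimal sketch of that min/max scheme in NumPy (assumes the weights are not all identical):

import numpy as np

def quantise(w):
    # Map floats onto 256 evenly spaced levels between min and max
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantise(q, lo, scale):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(4, 4).astype(np.float32)
q, lo, scale = quantise(w)
w_hat = dequantise(q, lo, scale)   # each entry within scale/2 of the original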
Priority
1. Be mindful of resources
2. Code well, use correct data structures
3. Use the right libraries for the right tasks
4. Use structures other people have already optimised
5. Minimise inputs
6. Optimise parameters
7. Prune your network
8. Quantise your variables
STORYSTREAM.AI
Thank You
Dr Janet Bastiman (@yssybyl)
janjanjan.uk
https://uk.linkedin.com/in/janetbastiman
Editor's Notes
• #3: The human brain is incredibly efficient – we have evolved to use our resources effectively and, at some point, pumping more energy into our brains gave us an advantage, but it isn’t that much in the scheme of things. The brain is 2% of your body weight but uses 20% of the body’s energy. For an average 2400 kcal daily consumption this is about 23 J/s. The brain has about 86 billion neurons, of which 26 billion are in the cerebral cortex, and about half of those are for the visual system, so our impressive visual system runs on about 3 Watts. Now, before we get into comparisons with artificial neural networks of any flavour, it’s important to note that we have immensely specialised neurons. These aren’t just multi-input transistors. There are also multiple redundant pathways that get the signals through. Also, although neurons signal with an action potential that is all or nothing, the connections between neurons are analogue, with the diffusion of neurotransmitters. At low signal-to-noise values, analogue switches perform in a far more energy-efficient manner than digital devices, and there’s a great review in the Proceedings of the IEEE, “Power Consumption During Neuronal Computation”, that studies this beautifully. Even if we consider the power consumption of the brain as a whole rather than just the visual system, it’s still pretty efficient. My Dell XPS laptop that I run some of my tensorflow models on takes 130 W and is nowhere near as capable as my brain.
• #4: So we are making a few unfair comparisons here – assuming full utilisation of all resources in the server, and I’m also not restricting to just the card, which may be a little unfair. But when you look at the fact that a neuron is a billion times more efficient than a single CUDA core – and even at the transistor level, which is possibly a fairer comparison, neurons are still 1000x more energy efficient. I’ve not added TPUs, mainly because they’re a bit trickier to compare, but based on the TPU die being less than half the size of the Intel Haswell CPU we can estimate it at ~332 mm² with about 2.5 billion transistors; with a 1600 W power supply this would give 4 × 10⁻⁷ W per transistor, which is still less energy efficient but gives faster performance by 1–2 orders of magnitude. Even going full-body comparison, we are still several orders of magnitude more efficient. So we are starting with an imperfect artificial system. Furthermore, we are not using these to capacity because we keep running inefficient applications on them.
• #5: Nobody writes machine code any more; we’ve moved away from low-level languages into interpreted high-level languages with bulky libraries, and now into frameworks. With each level of abstraction comes an overhead of processing, of time and of power. We carry things we don’t need. Every level we move away from the hardware instruction set adds inefficiency at runtime, but we gain so much more in ease and speed of development. So we’ve abstracted away from the hardware, adding layer upon layer of inefficiency, but we’ve got more powerful CPUs, RAM and GPUs, so we’ve not noticed. Because we’re getting a benefit from the hardware increases we don’t notice the bloat, and this is across the board, not just deep learning. Unless you’ve ever done any embedded-device programming, you have probably never considered the requirements of the software you write. We have got lazy. Then what do you do when you can’t afford the biggest and best hardware but still want to compete? What if you need to deploy to a resource-limited device? If you don’t know how to make your models efficient you’ll be stuck.
• #6: I’ve been talking about energy requirements, but let’s take a broader look at efficiency. I was at a conference in September and there was a great presentation on optimisation of networks by Taco Cohen of Qualcomm, including a technique that I’ll go through shortly – it was great to see that other people are thinking along the same lines as I am. Energy efficiency is great, but what about other resources? You may not have the on-board RAM for very large models – where do you make the compromise? Similarly, speed of classification may be more important than the number of resources needed to create it. We see all the headlines about how fast things are going, but the resources you need and the accuracy you get are appalling. Start with the premise that every flop is sacred – work out what is expendable. If you minimise wastage at every level then your AI will run faster with lower requirements. So hopefully you’re all on board with being a little less lazy – whether that’s to save you money, get real-time predictions, or come up with the next crazy AI wearable…
• #7: I can’t overemphasise this first one. If you’ve picked up Python through self-teaching or just by following a few examples then you probably don’t know what you don’t know. Do coding competitions; look at and understand other people’s code. If you’re coding in Python then don’t underestimate this – so much is out of your control that coding well is difficult. Every time you import a third-party library you are giving up control to someone else. Someone else who has made decisions about how to code things. Someone else who can change how functions work without your permission or knowledge. Don’t assume that these developers are holding themselves to the highest standards. These projects have all the same flaws as the rest of us. Just because Google developed Tensorflow, don’t assume that it’s the best code you’ll ever see. Similarly, TF will be updated for Google’s needs, meaning the underlying algorithms will be tuned and coded to optimise their solutions. This is true of all libraries. Use them knowing that they are inefficient and include lots of code you don’t need but will speed up your development. Take some time to work out what’s going on in your code and optimise the flow. Read The Pragmatic Programmer. Diving into code without thought is probably the worst way to get good code, unless you’re happy to rewrite… One of my better coding habits is to pseudocode everything first so I can rearrange it before I code it up properly. Some of the bigger projects I’ve worked on don’t get really good until they’re on their third full rewrite. A minor bad decision that’s been built on can become almost impossible to remove. Another generic point to consider is the mathematics of your algorithms. If you do not understand the fundamental mathematics behind what you are doing then you’ll be inefficient at best, or at worst, wrong.
  • #8: You’d’ve thought this was obvious but I cringe every time I see people cutting and pasting code. Understand what you are including. Use this amazing resource to solve your problems by all means, but unless your problem is exactly the same you could be adding all sorts of inefficiencies. I’ve seen people break their code completely having “changed nothing” only to find they’d pasted something in and trashed a very important local variable. Similarly minimum working examples posted on SO can in themselves be locally efficient but generally inefficient. This is particularly true if you’re trying to optimise algorithms
• #9: Let’s do a few maths examples, starting simply with the Towers of Hanoi problem. As the number of discs increases, the minimum number of moves also increases on a three-pin game. If you implement the first version of the algorithm it will naturally lend itself to what looks like a fairly tight recursion all the way down to M_1 = 1. This works fine with small numbers but you’ll quickly encounter problems when n starts to get large. A naïve solution would be to increase the Python recursion limit and just accept the run time… and yeah, it works, so they move on. If you understand the algorithm then you can simplify it to a single calculation – the closed form 2^n − 1 – that runs quickly and efficiently even for very large numbers (a sketch follows below). The second example is a bit more involved, but I’d like you to think about how you’d implement this in Python (or whatever language). You’re going to get a loop and it’s going to be annoying as n and therefore k get large. Using basic geometric series theory we can simplify this: first we split it into two series, then solve the inner and outer sums to get a nice non-loopy solution. Key point: understand the maths of what you are doing. Do you really need lots of loopy or recursive functions? A single call to a library that just implements a loop is not an optimisation unless these steps are built in.
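A minimal sketch of that closed form against the naive recursion:

def hanoi_moves_recursive(n):
    # Direct translation of the recurrence M_n = 2 * M_(n-1) + 1
    return 1 if n == 1 else 2 * hanoi_moves_recursive(n - 1) + 1

def hanoi_moves(n):
    # Closed form 2^n - 1: constant work however large n gets
    return 2 ** n - 1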
• #10: There are some great resources for understanding which data structures to use, but dicts use hash tables, so once you’ve set them up they’re super fast compared to searching. Simple things like concatenating strings should be done using join, and inserting variables with the string formatting functions. Built-in functions like map are pre-optimised in C and are much faster than anything else you can use in Python. Similarly, it might be obvious to do some transforms outside of loops, but definitions should also be done outside of the loop. Regexes might not feel like an overhead, but in the first example you’ll be redefining the regex with every iteration of the loop before you use it. In the second, you’re calling a local variable that holds a predefined regex, which is much faster.
  • #11: How you design and build your systems will be constrained by the hardware you have available both for training and deploy, the speed requirement of your network and the accuracy requirements. I’m assuming that you all know how to get accuracy so we’ll continue to focus on speed and size. At the risk of teaching you to suck eggs, if you have mechanical drives, make sure they’re defragged, but don’t defrag SSDs or the only thing you’ll speed up is their demise. Nobody has an infinite budget so consider how you will architect your services. If you think back to the energy requirements, the W per core and W per transistor were all pretty similar. So you can look at a combination of cost and on board memory . Rather than a behemoth of a system, you may find that many smaller systems will do the trick. If you keep your code tight then you will require fewer resources. Faster cards with lower RAM may be better value. Don’t buy large farms of physical servers unless you really need them, but make sure your team is not constrained - they’ll need as much as possible to do their research and having local devices that do the job is pretty much essential imho
  • #12: Small efficient code is always going to outperform larger codebases GPUs are great at all the things we need – large matrix operations for example, but don’t have the versatility we need for generic operations. This is why Google have developed the TPU - specialist chips have better performance. To be fair that’s exactly what’s happening with different neurons too. So we don’t want to have the GPU waiting to do the tasks it can do, we want it to be fully optimised. Similarly, we want to make as much use of our CPU as possible. By default TF maps all available GPU memory to the process to reduce memory fragmentation – set allow_growth and understand how much memory you actually need for final running of your network https://medium.com/@lisulimowicz/tensorflow-cpus-and-gpus-configuration-9c223436d4ef – nice example of playing with TF memory https://www.tensorflow.org/performance/performance_models – HPC tensorflow
• #13: Even if you don’t call the libraries they will extend the demands on your system as they are loaded in ready to be used. If you don’t use something, cut it. If you only need a subset of the features, only import that subset. Numpy adds 320 MB even if it’s not called. If you have some rare cases then you may want to put imports within the functions. This will make the function take longer but may be beneficial depending on use. Python caches, so it won’t reimport every time, but it will still need to check that the library has been imported. However, profile first – usually you get far faster speed-ups in other areas. That said, pure Python is not great at numerical analysis. https://realpython.com/numpy-tensorflow-performance/ Renato Candido did a great blog post for a simple linear regression problem – the timings are interesting for like-for-like code. Key here is to use the best tool for the job. Image copyright – Spaceballs, meme version https://funnyjunk.com/funny_pictures/4263235/Runnin+low+on+tp/ - covered by fair use
  • #14: We are actively discarding large amounts of our sensory inputs as we focus on what is critical in the moment. Our eyes see only about 10% of what is in front of us and our eyeballs are constantly moving – we make the rest up – this is why we’re fooled by optical illusions. This is not a video or gif it is a completely static jpeg. Your eyes are predicting movement because of the blur as your eyeballs move around. So we have far less real data coming into our brains than we think we do and our brains are making the rest of it up. Let’s apply the same techniques to our artificial brains. Your network will grow to accommodate the complexity of the data you are pushing through it. If you want speed and efficiency then simplify your problem and analyse the simpler problem. This requires deep knowledge of your problem space so you do not oversimplify and miss the nuances in your data. Images – reduce resolution, crop out the region of interest Time series – look at what encapsulates your patterns As was raised during questions, there is also dimensionality reduction as a limiter – for me this was beyond the scope of the talk as there are too many basic things that are not being done in AI. If you want to go further then look into dimensionality reduction, progressive networks and some of the trade-offs for size and accuracy. Image - www.ritsumei.ac.jp/~akitaoka/
  • #16: Tuning hyperparameters is one of the dark arts of ML. Unless you’ve got one of those brains that can see in hyperparameter space then you’ll pretty much pick something based on your gut feel. There are techniques you can use and Bayesian optimisation is one – you’ll get more accurate results faster. There are libraries you can bring in – GPyOpt has been around a while and is maintained. Amazon have been building on this for large scale deployment.
• #17: So you’ve made your network, and it works, but where’s the redundancy? Could you have created a better architecture? How would you know? Well, back in 2015 a group from Harvard led by Ryan Adams started publishing papers on optimising networks using Bayesian pruning. Their spin-out was bought out to stop this technology becoming easily available, but there are papers coming out now with similar ideas. Pruning channels is difficult because channels you remove from one layer will alter the input to the next layer, but you can get significant speed-ups with only a minor increase in error. Nice paper from Yihui He et al. and GitHub repo – they used the least absolute shrinkage and selection operator (LASSO) on a VGG-16 model and applied both single-channel and whole-model pruning techniques. There are other optimisation techniques… tensor factorisation, principal component iteration to determine the sub-tensors that are important, e.g. Accelerating Convolutional Neural Networks for Mobile Applications (tensor_optimisations.pdf)
  • #19: By far the biggest efficiency gains you will get will be from understanding your problem and coding well. These are the things I have to teach when I build teams. Following that, there are thousands of labs desperately trying to optimise the speed of training and speed of inference of a whole host of benchmarked networks and datasets – let them do the research and implement their techniques – you don’t need a fully fledged in house research team, just people who are capable of following the research. The most important thing is to treat your resources with care – be aware of when and why you are being wasteful