![title slide](img/title_slide.png)
## 🚀 Performance 🚀
Two primary performance targets:
* Task Completion (User-experience)
* Throughput (Scale)
## 🏗 Task Completion 🏗
"Work is done in a timely manner."
Often includes (and is dominated by) asynchronous I/O in Node.js
`task_end - task_start`
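A minimal sketch of measuring `task_end - task_start` for a callback-style task. `timeTask` and `fakeIo` are hypothetical names used only for illustration, and the `setTimeout` stands in for real asynchronous I/O:

```js
// Hypothetical helper: reports how long a callback-style task took, in ms.
function timeTask(task, done) {
  var start = process.hrtime()           // task_start
  task(function (err, result) {
    var diff = process.hrtime(start)     // [seconds, nanoseconds] since start
    var elapsedMs = diff[0] * 1e3 + diff[1] / 1e6
    done(err, result, elapsedMs)
  })
}

// Stand-in for real asynchronous I/O (a database query, a file read, ...).
function fakeIo(cb) {
  setTimeout(function () { cb(null, 'ok') }, 20)
}

timeTask(fakeIo, function (err, result, elapsedMs) {
  console.log('task took %s ms', elapsedMs.toFixed(1))
})
```

`process.hrtime()` is monotonic, so the difference is not affected by wall-clock adjustments.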
## 🏭 Throughput 🏭
"Work isn't blocking other work."
Blocking is usually JavaScript execution or Garbage Collection in Node.js
`requests-per-second`
## 🏭 Throughput 🏭
**IMPORTANT**:
Blocking operations can cause cascading blockage resulting in significant task completion delays.
Note:
A,B,C scheduled at the same time, each will take 10ms
A takes 10ms
B takes 10ms plus A's 10ms so 20ms
C takes 10ms plus A and B, so 30ms
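The A, B, C arithmetic above can be sketched as a tiny simulation. `completionTimes` is a hypothetical helper, and it assumes every task is queued at time 0 on a single thread and blocks for its full duration:

```js
// Completion time of each task when all are queued at t=0 on one thread
// and each one blocks until it is finished.
function completionTimes(durations) {
  var clock = 0
  return durations.map(function (ms) {
    clock += ms   // this task cannot start until the previous ones finish
    return clock
  })
}

console.log(completionTimes([10, 10, 10])) // [ 10, 20, 30 ]
```

The last task in the queue pays for every blocking task ahead of it, which is how a throughput problem turns into a task completion problem.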
## 🖼 Glitching .gif Images 🖼
### Case Study
* [readimage](http://npm.im/readimage) reads gif/png/jpg and converts to a common format
* [glitcher](http://npm.im/glitcher) various image glitches and manipulations
* [writegif](http://npm.im/writegif) writes that common format to an animated gif
## 📝 Workflow 📝
- ❔ Is it fast enough?
- 🔍 Identify the nature of the problem. (🏗 vs 🏭)
- 🔬 Select tools based on the problem.
- 📐 Measure.
- 📍 Identify the location of the problem.
- 👓 Make the slower parts faster.
- 📐 Measure again.
- 🔁 Go back to step 0.
## 🚨 WARNING 🚨
Do **NOT** skip straight to step 5. Randomly applied V8 compiler tricks are unlikely to have a significant impact.
Note:
Saving 20ms on a task that includes a 5 second database query won't provide a significant improvement.
## ❔ 0. Is it fast enough? ❔
The logs show that some images take well over a second to process with the `replaceBackground` manipulation, and this is a blocking operation.
## 🔍 1. Identify the nature of the problem 🔍
Based on the logs and the code, the bulk of the time is spent in the image manipulation itself. It makes no network calls and does no other asynchronous work.
## 🙋 What if it was async? 🙋
* 🛠 Tools
* Logging
* APM (NewRelic, Dynatrace, etc.)
* 💪 Solutions
* Caching
* Query optimization
* etc.
(out of scope for this presentation)
Note:
These operations are usually infrequent and slow enough that you can always log them. This is essentially what APM vendors do for you automatically.
## 💻 Our test harness 💻
```js
var THRESHOLD = 40
var image = /* read file, call readimage */
function fillBlack(frame) {
return frame.fill(0)
}
replaceBackground(image.frames, fillBlack, THRESHOLD)
/* call writegif on image, write to file */
```
Note:
replaceBackground calculates the median frame by comparing every pixel in every frame, then for each frame calculates the difference from that background frame
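As a rough illustration of the second half of that description (comparing each frame against the background frame), here is a simplified sketch. It is not glitcher's actual code: `replaceNearBackground` is a hypothetical name, frames are assumed to be flat arrays of grayscale values, and treating "within the threshold" as background is an assumption:

```js
// Hypothetical simplification: each frame is a flat array of grayscale values.
// Pixels within `threshold` of the background are treated as background and
// replaced with `fill`; everything else is kept as-is.
function replaceNearBackground(frame, background, fill, threshold) {
  var out = new Array(frame.length)
  for (var i = 0; i < frame.length; i++) {
    var isBackground = Math.abs(frame[i] - background[i]) <= threshold
    out[i] = isBackground ? fill : frame[i]
  }
  return out
}

var background = [100, 100, 100, 100]
var frame = [110, 30, 100, 250]
console.log(replaceNearBackground(frame, background, 0, 40)) // [ 0, 30, 0, 250 ]
```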
## ⏳ Before ⏳
![Hotline Bling](img/bling2.gif)
## ⌛ After ⌛
![Hotline Bling Manipulated](img/median_bling.gif)
## 🔬 2. Select tools 🔬
Most tools fall into one of three categories:
* How slow is it?
* Where is the slowness at?
* Why is it slow?
## 🛠 How slow is it? 🛠
* `/usr/bin/time`
* Benchmark tools (ab, siege, etc.)
## 🛠 Where is the slowness at? 🛠
* Kernel tools (perf, dtrace, etc.)
* V8 Instrumentation (v8-profiler, [NSolid](http://downloads.nodesource.com))
## 📐 3. Measure 📐
With our selected poor-performing image, we'll use `/usr/bin/time` to measure a baseline:

    $ \time node harness.js ~/Downloads/bling2.gif
    8.67user 0.06system 0:08.71elapsed 100%CPU (0avgtext+0avgdata 181988maxresident)k
    0inputs+752outputs (0major+41833minor)pagefaults 0swaps
## 📍 4. Identify the location 📍
## 🛠 Using `perf` 🛠
* The `perf` tool on Linux is a kernel-level CPU profiling tool
* Captures full stack, C++ & JS execution
* Follow [these instructions](https://gist.github.com/trevnorris/9616784)
(They are not up-to-date but the steps are all there)
## 🔥 How to read a flamegraph 🔥
* X axis is % of total time
* Y axis is stack depth
* Look for:
* plateaus
* fat pyramids
## 🔥 Perf Flamegraph 🔥
[![flamegraph from perf tool](img/median-flame.svg)](img/median-flame.svg)
## 🛠 v8-profiler 🛠
* See [v8-profiler README](http://npm.im/v8-profiler)
* Uses V8-provided instrumentation hooks
* Profiles JS only
* View by loading into Chrome Dev Tools
## 📈 v8-profiler results 📈
![v8-profiler results](img/median-cdt.png)
## 💢 Focus on avg() 💢
The `avg()` function averages every frame to make a simulated background frame.
Both `perf` and `v8-profiler` indicate we're spending the bulk of the time there.
## 🛠 Why is it slow? 🛠
If the problem is slow JavaScript execution:
* Compiler tracing (IRHydra, --prof, --trace-deopt)
* Code review
* Static analysis?
## 🙋 What if it was garbage collection? 🙋
* 🛠 Tools
* heap snapshots
* post-mortem analysis (mdb, lldb, etc.)
* 💪 Solutions
* fix errant closures
* hold less data/parts of objects vs whole objects
* code/static analysis
* etc.
(out of scope for this presentation)
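To make "fix errant closures" and "hold parts of objects vs whole objects" concrete, here is a minimal sketch. Both function names are hypothetical, and the 10 MB buffer is just a stand-in for any large object:

```js
var SIZE = 10 * 1024 * 1024

// The returned function references `big`, so the whole 10 MB buffer
// stays reachable for as long as the returned function does.
function makeReporterHoldingBuffer() {
  var big = Buffer.alloc(SIZE)
  return function () { return big.length }
}

// Copy out the one number that is needed. The closure no longer references
// the buffer, so it becomes eligible for collection once this call returns.
function makeReporterHoldingLength() {
  var big = Buffer.alloc(SIZE)
  var length = big.length
  return function () { return length }
}

console.log(makeReporterHoldingBuffer()(), makeReporterHoldingLength()())
```

Both reporters return the same value; the difference only shows up in what a heap snapshot reports as retained.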
## 👓 5. Make the slower parts faster 👓
## 💩 Reasons for Poor Performance 💩
* Wrong tool for the job
* Doing unnecessary things
* Poor algorithm choice
* Not cooperating with the runtime/compiler
Note:
* Node isn't great at everything: ssl termination, use nginx
* don't create functions in loops, avoid Promises
* decrease instruction count
* optimizing compiler makes assumptions, work with it, avoid invalidating assumptions
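As a small illustration of "don't create functions in loops", the two hypothetical functions below compute the same result; the second hoists the helpers so they are created once. Whether the difference is measurable depends on the engine and the workload, so measure before and after:

```js
// Creates two new function objects on every iteration.
function sumSquaresInLoop(rows) {
  var total = 0
  for (var i = 0; i < rows.length; i++) {
    total += rows[i]
      .map(function (n) { return n * n })
      .reduce(function (a, b) { return a + b }, 0)
  }
  return total
}

// Helpers are created once and reused by every iteration.
function square(n) { return n * n }
function add(a, b) { return a + b }

function sumSquaresHoisted(rows) {
  var total = 0
  for (var i = 0; i < rows.length; i++) {
    total += rows[i].map(square).reduce(add, 0)
  }
  return total
}

console.log(sumSquaresInLoop([[1, 2], [3]]), sumSquaresHoisted([[1, 2], [3]])) // 14 14
```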
## ⚙ Optimizing Compilers (Simplified) ⚙
* JavaScript is extremely flexible. Most code doesn't use that flexibility.
* Optimizing compilers make assumptions on how you use JS and cut corners with optimized assembly code.
* Deoptimization is when you invalidate the assumptions it made, forcing it to discard the assembly version.
Note:
* They parse and analyze your code as it executes, generating optimized assembly based on what they encounter, with guards against the cases they skipped.
* If you deoptimize a function too many times it will give up on optimization.
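A small sketch of the kind of code that can invalidate an assumption. `area` is a hypothetical function, and whether V8 actually deoptimizes here depends on the V8 version, so treat it as something to try under `node --trace-opt --trace-deopt` rather than a guaranteed outcome:

```js
function area(shape) {
  return shape.width * shape.height
}

var total = 0

// Warm-up: every call passes an object with the same layout, so the
// optimizing compiler can specialize `area` for that one layout.
for (var i = 0; i < 100000; i++) {
  total += area({ width: i, height: 2 })
}

// Same properties in a different order give the object a different internal
// layout. A guard in the optimized code may fail here and trigger a deopt.
total += area({ height: 2, width: 3 })

console.log(total)
```

The result is correct either way; deoptimization only costs time.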
## 🚂 V8 Compiler Output 🚂
* Lots of tips out there on how to optimize your code for V8.
* Nothing beats V8 telling you what it didn't like.
* My favorite tool: [IRHydra2](http://mrale.ph/irhydra/2/)
* Follow the instructions on that page
* Load them into IRHydra2 (it's a web app)
## 🌡 IRHydra Results 🌡
(Load the two files in assets/irhydra into IRHydra to explore)
![IRHydra Result preview](img/median-hydra.png)
## 🔎 Analysis 🔎
* the `avg` function had an eager deoptimization
* ... but it was the inlined Buffer constructor in node core ...
* (So I filed a [PR against node](https://github.com/nodejs/node/pull/4158), released today in 5.2.0)
* otherwise nothing too interesting ¯\\\_(ツ)_/¯
## 📝 Code Analysis 📝
`avg()` calls `medianPixel()`
The algorithm to calculate `median` requires a sort
Note:
We don't see medianPixel in the flamegraph or v8-profiler output because it was inlined.
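To show why the median needs a sort, here is a simplified sketch. It is not the case study's actual `avg()`/`medianPixel()` code: `medianOf` and `medianFrame` are hypothetical names, and frames are assumed to be flat arrays of grayscale values of equal length:

```js
// Hypothetical stand-in for a per-pixel median: `values` holds the same
// pixel position sampled from every frame.
function medianOf(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b })
  var mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

// Builds a background frame: one sort per pixel position.
function medianFrame(frames) {
  var length = frames[0].length
  var out = new Array(length)
  var column = new Array(frames.length)
  for (var p = 0; p < length; p++) {
    for (var f = 0; f < frames.length; f++) column[f] = frames[f][p]
    out[p] = medianOf(column)
  }
  return out
}

console.log(medianFrame([[10, 200], [12, 90], [250, 100]])) // [ 12, 100 ]
```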
## ⚖ Some Math ⚖
Our image is 800 x 450 pixels with 51 frames
800 * 450 = 360000 pixels per frame
So 360000 `sortPixels` calls, each sorting 51 pixels. 😨
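The same arithmetic as a quick script. The variable names are illustrative only, and one sort per pixel position is assumed:

```js
var width = 800
var height = 450
var frameCount = 51

var sortsNeeded = width * height             // one sort per pixel position
var valuesSorted = sortsNeeded * frameCount  // total values fed through sorts

console.log(sortsNeeded)  // 360000
console.log(valuesSorted) // 18360000
```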
## 📊 Median Frame 📊
![Median Frame](img/median_frame.gif)
## 🔔 Mean 🔔
Calculating `mean` doesn't require a sort. Maybe it will work?
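A sketch of a mean-based background frame, under the same simplifying assumptions as before (`meanFrame` is a hypothetical name; frames are flat arrays of grayscale values of equal length). It makes one pass over the data and never sorts:

```js
// Sums each pixel position across all frames, then divides by the frame count.
function meanFrame(frames) {
  var length = frames[0].length
  var sums = new Array(length)
  for (var p = 0; p < length; p++) sums[p] = 0
  for (var f = 0; f < frames.length; f++) {
    for (var q = 0; q < length; q++) sums[q] += frames[f][q]
  }
  return sums.map(function (sum) { return Math.round(sum / frames.length) })
}

console.log(meanFrame([[10, 200], [12, 90], [250, 100]])) // [ 91, 130 ]
```

The trade-off is that a mean is pulled toward outliers: for the values 10, 12, 250 the median is 12 but the mean is about 91, so the resulting background frame will look different.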
## 🔔 Mean Frame 🔔
![Mean Frame](img/mean_frame.gif)
## ⌛ After 2 ⌛
![Hotline Bling Manipulated (Mean)](img/mean_bling.gif)
## 📐 6. Measure again 📐
    $ \time node harness.js ~/Downloads/bling2.gif
    3.40user 0.04system 0:03.43elapsed 100%CPU (0avgtext+0avgdata 182048maxresident)k
    0inputs+704outputs (0major+44011minor)pagefaults 0swaps
8.67 seconds to 3.40 seconds!
## 🔥 Perf Flamegraph (Mean) 🔥
[![flamegraph from perf tool using mean](img/mean-flame.svg)](img/mean-flame.svg)
## 📉 v8-profiler results 📉
![v8-profiler results using mean](img/mean-cdt.png)
## 🔁 7. Go back to step 0 🔁
* Fix `copy`: remove call to it?
* Reduce instruction count
* Even better algorithms?
* Optimize `inxsearch` function?
## 📠 Let's remove the call to Buffer::Copy 📠
    $ \time node harness.js ~/Downloads/bling2.gif
    2.21user 0.06system 0:02.28elapsed 99%CPU (0avgtext+0avgdata 183944maxresident)k
    10584inputs+704outputs (0major+43860minor)pagefaults 0swaps
Another second saved!
## 💖 Check out NSolid! 💖
Capture flamegraphs of a production process with the click of a button!
Bryce B. Baril - [http://brycebaril.com](http://brycebaril.com)
Twitter: [@brycebaril](http://twitter.com/brycebaril)
NSolid: [Try NSolid!](http://downloads.nodesource.com)