Node.js Performance Optimization Case Study

# Node.js Performance Optimization ❔🔍🔬📐📍👓📐🔁 Bryce B. Baril -- [@brycebaril](https://twitter.com/brycebaril) Note: Just tweeted the slide deck location

## 🚀 Performance 🚀 Two primary performance concepts: * Task Completion (User-experience) * Throughput (Scale)

## 🏗 Task Completion 🏗 "Work is done in a timely manner." Often includes (and is dominated by) Asynchronous IO in Node.js `task_end - task_start`

## 🏭 Throughput 🏭 "Work isn't blocking other work." Blocking work is usually Javascript execution or Garbage Collection in Node.js `requests-per-second`

## 🏭 Throughput 🏭 JavaScript execution is single-threaded and on-CPU tasks will be serialized. **IMPORTANT**: This can cascade into significant task completion delays. Note: A,B,C scheduled at the same time, each will take 10ms A takes 10ms B takes 10ms plus A's 10ms so 20ms C takes 10ms plus A and B, so 30ms

## 🖼 Glitching .gif Images 🖼 ### Case Study Collaborative art project that did user-driven image glitching and manipulation. Core manipulation logic is in [glitcher](http://npm.im/glitcher).

📝 Workflow 📝

❔ Is it fast enough?
🔍 Identify the nature of the problem. (🏗 vs 🏭)
🔬 Select tools based on the problem.
📐 Measure.
📍 Identify the location of the problem.
👓 Make the slower parts faster.
📐 Measure again.
🔁 Go back to step 0.

## 🚨 WARNING 🚨 Do **NOT** skip straight to step 5. Randomly applied V8 compiler tricks are unlikely to have any significant impact. Note: Saving 20ms on a task that includes a 5 second database query won't provide a significant improvement.

## ❔ 0. Is it fast enough? ❔ The logs show some images take well over a second with the manipulation `replaceBackground`.

## 🔍 1. Identify the nature of the problem 🔍 Based on logs and the code, the bulk of the time is the image manipulation. It's not doing any network calls or other asynchronous work. Note: Feels like a cop-out... how to do this with tooling?

## 🙋 What if it was async? 🙋 * 🛠 Tools * Logging * APM (NewRelic, Dynatrace, etc.) * 💪 Solutions * Caching * Query optimization * etc. (out of scope for this presentation) Note: These are usually infrequent and slow enough you can always log. This is essentially what APM vendors do for you automatically. Caching: recurring theme--don't do things if you don't have to

## 💻 Our test harness 💻 ```js var THRESHOLD = 40 var image = /* read file, call readimage */ function fillBlack(frame) { return frame.fill(0) } replaceBackground(image.frames, fillBlack, THRESHOLD) /* call writegif on image, write to file */ ``` Note: replaceBackground calculates the median frame by comparing every pixel in every frame, then for each frame calculates the difference from that background frame

## ⏳ Before ⏳ ![Hotline Bling](img/bling2.gif)

## ⌛ After ⌛ ![Hotline Bling Manipulated](img/median_bling.gif)

## 🔬 2. Select tools 🔬 Most tools fall into one of three categories: * How slow is it? * Where is the slowness at? * Why is it slow?

## 🛠 How slow is it? 🛠 * `/usr/bin/time` * Benchmark tools (ab, siege, etc.)

## 🛠 Where is the slowness at? 🛠 * Kernel tools (perf, dtrace, etc.) * V8 Instrumentation (v8-profiler, [NSolid](http://downloads.nodesource.com))

## 🛠 Why is it slow? 🛠 If the problem is slow JavaScript execution: * Compiler tracing (IRHydra, --prof, --trace-deopt) * Code review

## 📐 3. Measure 📐 With our selected poor-performing image, we'll use `/usr/bin/time` to measure a baseline: $ \time node harness.js ~/Downloads/bling2.gif 8.67user 0.06system 0:08.71elapsed 100%CPU (0avgtext+0avgdata 181988maxresident)k 0inputs+752outputs (0major+41833minor)pagefaults 0swaps

## 🛠 Using `perf` 🛠 * The `perf` tool on Linux is a kernel-level CPU profiling tool * Captures full stack, C++ & JS execution * Followed [these instructions](https://gist.github.com/trevnorris/9616784) (not up to date!) * Or check out the new tool [0x](http://npm.im/0x) Note: 0x was not an option when this slide deck was created, haven't tried it yet.

## 🔥 How to read a flamegraph 🔥 * X axis is % of total time * Y axis is stack depth * Look for: * plateaus * fat pyramids

## 🔥 Perf Flamegraph 🔥 ![flamegraph from perf tool](img/median-flame.svg)]

## 🛠 v8-profiler 🛠 * See [v8-profiler README](http://npm.im/v8-profiler) * Uses V8-provided instrumentation hooks * Profiles JS only * View by loading into Chrome Dev Tools

## 📈 v8-profiler results 📈 (open assets/median.cpuprofile in Chrome DevTools) ![v8-profiler results](img/median-cdt.png) Note: We can see garbage collection in V8-Profiler output, can eliminate that concern

## 🙋 What if it was garbage collection? 🙋 * 🛠 Tools * heap snapshots * post-mortem analysis (mdb, lldb, etc.) * 💪 Solutions * fix errant closures * hold less data/parts of objects vs whole objects * code/static analysis * etc. (out of scope for this presentation)

## 📍 4. Identify the location 📍

## 💢 Focus on avg() 💢 The `avg()` function averages every frame to make a simulated background frame. Both `perf` and `v8-profiler` indicate we're spending the bulk of the time there.

## 👓 5. Make the slower parts faster 👓

## 💩 Reasons for Poor Performance 💩 * Wrong tool for the job * Doing unnecessary things * Poor algorithm choice * Not cooperating with the runtime/compiler Note: * Node isn't great at everything: ssl termination, use nginx * don't create functions in loops, avoid Promises * decrease instruction count * optimizing compiler makes assumptions, work with it, avoid invalidating assumptions

## 🆕 COOL NEW THINGS! 🆕 * 🆕 Transpile to ES7! * 🆕 ES6! * 🆕 ES5! (wait, what?) These things can absolutely make your code easier _**for you**_ to work with.

## 😭 However... 😭 Generally these: * Add additional code that will add to execution time * Are not yet optimizable by V8

## ⏪ Welcome to ES3! ⏪ ![Romancing The Jit](img/romancing_shoe.gif) Note: Even nice ES5 features such as Array.forEach Array.map, etc. are slower

## 😅 Oh right, focus on `avg()` 😅 Focus your effort on high return-on-investment. Don't sacrifice dev convenience by refactoring everything to speed up _fast enough_ code. Note: Some things could cause systemic slowdowns, hard to see if every function got slower.

## ⚙ Optimizing Compilers (Simplified) ⚙ * JavaScript is extremely flexible. * Most code doesn't use that flexibility. * Observations -> Assumptions -> Optimize to pure assembly * Assumption invalid -> Deoptimize -> Discard assembly Note: * Parse and analyze your code as it is executed making optimized assembly based on what it encounters with guards against cases it skipped. * "trust but verify" * If you deoptimize a function too many times it will give up on optimization.

## ⚙ Optimizing Compilers ⚙ The optimizing compiler doesn't care if your code is *GOOD* * Code that does things the wrong/unoptimal way can be "optimized" * Code that does the wrong thing can be "optimized" Note: Even if it's highly optimized assembly code, the optimizing compiler can't save you from doing stupid things. Unncecessary work always takes time

## 🚂 V8 Compiler Output 🚂 * Lots of tips out there on how to optimize your code for V8. * Nothing beats V8 telling you what it didn't like. * My favorite tool: [IRHydra2](http://mrale.ph/irhydra/2/) * Follow the instructions on that page * Load results into IRHydra2 (it's a web app)

## 🌡 IRHydra Results 🌡 (Load the two files in assets/irhydra into IRHydra to explore) ![IRHydra Result preview](img/median-hydra.png)

## 🔎 Analysis 🔎 * the `avg` function had an eager deoptimization * ... but it was the inlined Buffer constructor in node core ... * (So I fixed it in Node core...) * otherwise nothing too interesting ¯\\\_(ツ)_/¯

## 📝 Code Analysis 📝 `avg()` calls `medianPixel()` The algorithm to calculate `median` requires a sort Note: We don't see medianPixel in the flamegraph or v8-profiler output because it was inlined.

## ⚖ Some Math ⚖ Our image is 800 x 450 pixels with 51 frames 800 * 450 = 360000 pixels per frame So 360_000 `sortPixels` calls, each sorting 51 pixels. 😨

## 📊 Median Frame 📊 ![Median Frame](img/median_frame.gif)

## 🔔 Mean 🔔 Calculating `mean` doesn't require a sort. Maybe it will work?

## 🔔 Mean Frame 🔔 ![Mean Frame](img/mean_frame.gif)

## ⌛ After 2 ⌛ ![Mean Frame](img/mean_bling.gif)

## 📐 6. Measure again 📐 $ \time node harness.js ~/Downloads/bling2.gif 3.40user 0.04system 0:03.43elapsed 100%CPU (0avgtext+0avgdata 182048maxresident)k 0inputs+704outputs (0major+44011minor)pagefaults 0swaps 8.67 seconds to 3.40 seconds!

## 🔥 Perf Flamegraph (Mean) 🔥 ![flamegraph from perf tool using mean](img/mean-flame.svg)]

## 📉 v8-profiler results 📉 ![v8-profiler results using mean](img/mean-cdt.png)

## 🔁 7. Go back to step 0 🔁 * Fix `copy`: remove call to it? * Reduce instruction count * Even better algorithms? * Optimize `inxsearch` function?

## 📠 Let's remove the call to Buffer::Copy 📠 $ \time node harness.js ~/Downloads/bling2.gif 2.21user 0.06system 0:02.28elapsed 99%CPU (0avgtext+0avgdata 183944maxresident)k 10584inputs+704outputs (0major+43860minor)pagefaults 0swaps Another second saved!

## 💖 Check out NSolid! 💖 Capture flamegraphs of a production process with the click of a button! Bryce B. Baril - [http://brycebaril.com](http://brycebaril.com) Twitter: [@brycebaril](http://twitter.com/brycebaril) NSolid: [Try NSolid!](http://downloads.nodesource.com)