Wednesday, 12 April 2017

Facebook Tech Talk Melbourne - Mobile Performance at Scale / Scalable High Quality Animations For Mobile


I was lucky enough to be invited to a recent Facebook Tech Talk.

The first topic was Mobile Performance at Scale, and the speaker was Joel Pobar, a Director of Engineering in the client side performance team.

He started by saying there's a correlation between performance and engagement: obviously, the more responsive an app is, the more engaged you will be while using it. Through some analysis, they found that for the Facebook app, the biggest things that impacted user engagement were:


  • cold start (how long it took to open the app from "fresh", i.e. if you hadn't opened it for a while)
  • scroll performance
  • app interactions
  • how you compare against your neighbours (i.e. similar apps to yours)
In order to improve your app's performance, you first need to be able to measure it, and you need the right tools to do both. Joel recommends focusing on telemetry and tools.



Start simple: just run a profiler against your code. That will help you find the easy things that you can fix.

After that, the next thing they did was try to get a sample of performance data from production, so they added some start/stop markers in their code and measured the experience of their various users out in the wild. From this, they were also able to grab device information, such as phone type and OS version.
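
To make that concrete, here's roughly what I imagine those markers look like on Android - the names below are my own invention, not Facebook's actual instrumentation:

    import android.os.Build
    import android.os.SystemClock

    // Hypothetical marker API -- illustrative only, not Facebook's real instrumentation.
    object PerfMarkers {
        private val startTimes = mutableMapOf<String, Long>()

        fun start(name: String) {
            startTimes[name] = SystemClock.elapsedRealtime()
        }

        fun stop(name: String) {
            val start = startTimes.remove(name) ?: return
            val durationMs = SystemClock.elapsedRealtime() - start
            // Ship the measurement plus device context back for analysis.
            log(
                event = name,
                durationMs = durationMs,
                deviceModel = Build.MODEL,          // phone type
                osVersion = Build.VERSION.RELEASE   // OS version
            )
        }

        private fun log(event: String, durationMs: Long, deviceModel: String, osVersion: String) {
            // In practice this would be batched and uploaded; here we just print.
            println("$event took ${durationMs}ms on $deviceModel (Android $osVersion)")
        }
    }

    // Usage, e.g. around cold start work:
    // PerfMarkers.start("cold_start")
    // ... app initialisation ...
    // PerfMarkers.stop("cold_start")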

They found that the worst user experience was in developing countries, where people tended to have phones that were quite old, so they didn't have the same grunt as the phones in Facebook's labs or the ones their developers use. Some of their users in India had to wait over 60 seconds for the Facebook app to start up! I know that if I had to wait 60s for something to start up, I'd probably assume it had crashed.

Now that they had a better understanding of which users were experiencing the most problems, they could focus on targeting fixes towards them.

Below is a summary of their build process:



As part of the process, they run continuous performance tests - though they have too many commits to run a test per commit, so instead they run one test every ~30 commits, and if that test flags something, they'll break that chunk of 30 commits down into diffs and try to isolate the problem to a specific commit. Once they track down the problem commit, they raise an issue with the developer. Joel mentioned that when they do this, they try to include as much data as they can gather to make it easier for the developer to work out what went wrong. The more information they include at this stage, the higher the likelihood of the issue being fixed.
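
For illustration, narrowing a flagged chunk of ~30 commits down to a single commit is essentially a bisection. Something like this sketch, where isRegressed() stands in for "build at this commit and run the perf test" (my own stand-in, not their tooling):

    // Hypothetical bisection over a chunk of commits flagged as a regression.
    // The chunk's tip is known-bad (that's the test that flagged it); the commit
    // before the chunk is known-good. Each isRegressed() call is expensive (a full
    // build + perf run), so bisection keeps it to ~log2(30) = 5 runs.
    fun findFirstBadCommit(commits: List<String>, isRegressed: (String) -> Boolean): String {
        var lo = 0                       // earliest commit that could be the culprit
        var hi = commits.lastIndex       // known-bad commit (the chunk's tip)
        while (lo < hi) {
            val mid = (lo + hi) / 2
            if (isRegressed(commits[mid])) {
                hi = mid                 // regression is at mid or earlier
            } else {
                lo = mid + 1             // regression was introduced after mid
            }
        }
        return commits[lo]               // first commit where the perf test regresses
    }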



He showed us what the lab looks like for these performance tests. It started out as one iPhone plugged into a Mac Mini, but that soon wasn't enough, so they scaled out to this:


And eventually that wasn't enough, so they scaled out to multiple instances of this:


Each rack has a series of phones plugged in, running the performance tests. It's inside a Faraday cage to isolate the phones from external interference. There's a camera watching all of the phones, so that if one flags a problem, they can zoom in on it to investigate further. There are about 8,000 phones in total used just for running these tests.

One issue they do have with these automated tests is noise. An app may run slowly because other stuff is running in the background competing for CPU time, or the phone itself might be hot, so it slows down to regulate its temperature. There can also be differences in chip speed - differences of up to 20% have been measured between chips in the same model of iPhone! A test can also run slower just because of network latency or load. To try and rule these out, each test is run 10 times, and if there's too much variation between runs, that particular group of runs is discarded.
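
A rough sketch of that filtering step, assuming "too much variation" is judged by something like the coefficient of variation - the 10% threshold is my own guess, not a number from the talk:

    import kotlin.math.sqrt

    // Hypothetical noise filter: run the test N times and only keep the group if the
    // run-to-run variation is small enough. The 10% threshold is an assumption.
    fun acceptRuns(durationsMs: List<Double>, maxCoefficientOfVariation: Double = 0.10): Boolean {
        val mean = durationsMs.average()
        val variance = durationsMs.map { (it - mean) * (it - mean) }.average()
        val stdDev = sqrt(variance)
        return stdDev / mean <= maxCoefficientOfVariation
    }

    // val runs = (1..10).map { runPerfTest() }   // 10 runs per test, as described
    // if (acceptRuns(runs)) record(runs.average()) else discard(runs)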

You might have noticed "Employee Dogfooding" listed as one of the techniques. That's where employees must use the latest master version of the app. This can get frustrating at times when the app breaks, but I guess this has the side-effect of causing people to be a bit more vigilant about their commits!

As I mentioned before, they get stats from production. This is what they use in the dynamic instrumentation step. They use event tracing, with tools such as DTrace or ETW. They've also written a program called Loom (which is soon to be open-sourced), which allows them to measure things like:
  • garbage collection
  • disk I/O
  • background info on other tasks
  • page cache misses / hits
While every Facebook app user has the Loom config on their phone as part of the app, it isn't always on; they tend to activate it for particular demographics of users (e.g. users in India), so for everyone else it just stays inactive.
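
He didn't show code for this, but the "ships with everyone, only switched on for some" idea is basically server-driven config gating. A hypothetical sketch (none of these names are Loom's real API):

    import kotlin.random.Random

    // Hypothetical config gate for a tracing tool like Loom.
    // TraceConfig and startEventTracing() are invented names for illustration.
    data class TraceConfig(val enabled: Boolean, val sampleRate: Double)

    fun maybeStartTracing(config: TraceConfig) {
        // The config ships with the app for everyone, but tracing only runs when the
        // server has enabled it for this user's demographic (e.g. a specific country)
        // and the user falls inside the sample.
        if (config.enabled && Random.nextDouble() < config.sampleRate) {
            startEventTracing()   // would hook GC, disk I/O, page cache counters, etc.
        }
    }

    fun startEventTracing() { /* platform-specific event tracing goes here */ }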

They did find that one of the causes for a cold start taking a long time was a lot of cache misses. So they wrote ReDex (which I think he said has been released now), which you can run to re-order your Android APK so that code that is used together gets packaged together to reduce the number of cache misses.

Someone in the audience asked how he was able to get a team so large to be able to focus on performance. Joel mentioned that the culture at Facebook is very bottom-up, i.e. if someone has an idea and is able to sell it to management, then they will get backing for it. There are countless apps and things written by Facebook developers which improve their quality of life, and a lot of them started just because someone said something like, "It's really annoying to have to do ________, I think I could write something to fix that", and their bosses would be perfectly happy for them to go off and do it, even if it takes months. If it doesn't work out, they might be pulled aside and told, "Hey, that idea you had isn't really working out, we'd like you to go back to working on this other thing instead." But innovation is encouraged.

The other thing he mentioned is that after telling Mark Zuckerberg that performance is linked to engagement, Mark told all the engineers that it was an important facet, and so when the performance team does end up raising an issue, it isn't just shunted off into a corner and ignored, people take notice and try to fix it. Without that kind of buy-in, Joel thinks the team wouldn't be nearly as successful as it is.

------------------------------

The next speaker was Daniel Grech, a software engineer in the Android Messenger team. He spoke about Scalable High Quality Animations For Mobile. Similar to what Joel said, the Facebook team found that app size was correlated with user engagement. So one of the struggles they have in the Messenger team is the battle between the designers, who want to create fancy animations, and the engineers, who want to keep the app small and responsive.

Some of the goals they need to achieve when making animations:
  • resizeable / high fidelity (so you don't want to "resize" by simply bundling the same image / animation in different sizes)
  • reusable
  • small file size
  • performance 
They looked at a number of existing technologies:


The best candidate at the time was custom drawings; however, they were incredibly complex to create, and if you wanted to edit an image, you basically had to start from scratch.

What they ended up doing was creating their own framework called Keyframes (which is available on their public GitHub). It's a script for Adobe After Effects (a tool designers commonly use, so they are already familiar with it) that takes the AE animation, strips out all the things they don't need, and exports it as JSON.

An animation as a PNG was roughly 10 times the size of the equivalent minified Keyframes file. In fact, the file is so small that they don't always bundle it with the app itself, and can send it as part of a network request instead. This allows them to change animations dynamically, which is what they did for Halloween with their spooky-themed Facebook reactions.
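
He didn't walk through the client API, but the flow is roughly: take the exported JSON (bundled with the app or fetched over the network), deserialise it, and put the resulting drawable on a view. Here's a sketch of that flow with invented names - for the real classes, see the Keyframes repo on GitHub:

    import android.graphics.drawable.Drawable
    import android.widget.ImageView
    import java.io.InputStream

    // Invented stand-ins, not the library's real API: the point is the flow
    // (exported After Effects JSON -> animation model -> Drawable on a view).
    fun parseKeyframesJson(json: InputStream): Drawable =
        TODO("deserialise the stripped-down AE data into an animated Drawable")

    fun showReaction(imageView: ImageView) {
        // The JSON is tiny, so it can ship inside the APK or arrive in a network
        // response (which is how seasonal swaps like the Halloween reactions work).
        val json = imageView.context.assets.open("reaction_love.json")  // hypothetical asset name
        imageView.setImageDrawable(parseKeyframesJson(json))
        // The real library also exposes a way to start/loop the animation; omitted here.
    }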

Unfortunately for the designers who use Origami, there are no current plans to build something similar for it. The current Keyframes implementation also doesn't support all features of AE yet, such as masking or transparency, but they do hope to build those out in the future.

One of the issues he mentioned was their attempt to use hardware acceleration to improve performance. They found that it didn't really work that well, as it resulted in a lot of re-drawing. In addition, while there is an overall spec for Android phones, the implementation details are left up to the manufacturers, which can lead to a lot of variation between phones. They received a lot of bug reports about the crying emoticon missing an eye on some phones due to an attempt to use hardware acceleration.
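
He didn't say exactly how they worked around it, but for what it's worth, the standard Android escape hatch when hardware-accelerated rendering of a particular view misbehaves is to force that view onto a software layer:

    import android.view.View

    // Not necessarily what the Messenger team did -- just the usual knob for opting
    // a single view out of hardware-accelerated rendering.
    fun forceSoftwareRendering(view: View) {
        view.setLayerType(View.LAYER_TYPE_SOFTWARE, null)
    }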

Note: Their solution isn't perfect, as it doesn't cover the entire range of phones. I can't remember what he said the minimum OS version was, as it requires some of the newer features of the Android library. For older phones, they just display a static image instead.
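
Something like this is what I imagine the fallback looks like - the minimum SDK value and resource name below are placeholders, since I don't have the real numbers:

    import android.os.Build
    import android.widget.ImageView

    // Sketch of the fallback he described: animate on newer devices, show a static
    // image otherwise. The threshold is a placeholder, not the actual minimum from
    // the talk, and R.drawable.reaction_static is an invented resource name.
    const val MIN_SDK_FOR_ANIMATIONS = 21   // placeholder value

    fun showReactionOrFallback(imageView: ImageView) {
        if (Build.VERSION.SDK_INT >= MIN_SDK_FOR_ANIMATIONS) {
            playKeyframesAnimation(imageView)                        // animated Keyframes path
        } else {
            imageView.setImageResource(R.drawable.reaction_static)   // static image fallback
        }
    }

    fun playKeyframesAnimation(imageView: ImageView) { /* as sketched earlier */ }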


He also recommended looking at

Also, Joel suggested reading the book Masters of Doom, about id Software and John Carmack. Apparently John Carmack is still an avid developer and commits tons of code every day. Joel asked him how much of the book was true, and he said about 90% of it; he could even remember which chapters he felt were very accurate and which were not-so-accurate.
