Survivor: API - Bitmasks Get Voted Off the Island

Published on 12 May 2014

In January, I joined Smartsheet as the new developer for the API, replacing Kyan, who has moved on to help spearhead the new Platform team. It's been a fun first few months learning the ins and outs of the Smartsheet codebase, and working on my first set of API enhancements (see the release notes here).

The Tribes

The interesting thing about working on the API is that it's basically at the intersection of three different teams: the platform team, which owns the roadmap and product management responsibilities for the API and represents our developer community; the mobile client team (Android and iOS), the main internal "customer" for the API; and the core dev team, on whose server-side code the API is built. As a result, I've often found myself in the middle of impassioned (but friendly :) debates involving people across those teams, often with strong opinions and conflicting concerns and perspectives.

I had only been at Smartsheet for a week when I found myself in the middle of my first such debate. The topic was the adding of formatting information to columns, rows, and cells retrieved via the API. Formatting, as in "this cell's text is red, yellow-highlighted, bold, italic, 24 point Verdana" (you know, for when something is REALLY important!). The natural and obvious design for that would be to add a new JSON object to the Column, Row, and Cell objects, which would have a series of attributes representing the different format styles and their values, e.g. { "fontWeight": "bold", "fontSize": 24, "fontFamily": "Arial" (...and on and on...) }.

That was a no go for the mobile devs though. Unlike us server-side developers who have high speed Internet connections and truckloads of memory at our disposal, Android and iOS devs really need to count their bytes—both over the wire (or air, I should say—think slow cell networks), and in memory (the latest iPhone has a whopping 1GB of RAM). Thus, the thought of loading 20,000 heavily formatted cells with that weighty JSON object attached to each one gave them serious heebie jeebies.

The Challenge: Add Formatting to the API

A meeting was held and developers from platform, mobile, and core proposed solutions. It being only my second week on the job, I spent most of the time just listening and trying to keep up. At the end of the meeting, I was tasked with prototyping the solutions under consideration and running benchmarks against them.

There were a number of solutions proposed, of which these were the frontrunners:

  1. JSON object (described above): verbose and heavyweight, but easy to construct and consume, because if you're using the API, you know how to use JSON!
  2. Bitmask: similar to how we internally store format data on the back-end. Would basically look like a long numeric value like "3475892745870102". Each set of bits in the number would correspond to a format style (e.g. first bit might be 0 for normal and 1 for bold, bits 5 through 12 might represent the different foreground colors etc.). Compact, but somewhat painful to parse and assemble as it would require bitwise arithmetic to read and construct the value.
  3. Numeric index list: a list of numeric indexes into predefined lookup tables, with some optimizations to further reduce size (i.e. omitting an index value if it is unchanged from the default value). The format example from earlier in this post would look like this: "3,8,1,1,,,,,27,21,,,,,,", where the first index represents a value in the predefined font family lookup table, the second is a font size lookup table value, third is bold, fourth is italic, etc. Also compact; not as convenient to work with as a plain ol' JSON object, but less painful than a bitmask.

So the question to answer came down to the bandwidth savings between the proposed solutions. If size were no concern, we'd choose #1 because of its ease of use and consistency with the rest of the API. Similarly, #2 appeared to be the most difficult to work with, but we'd go with it if its size savings were significant enough over the other two. Our mobile devs could deal with it, and we'd hopefully hide the ugliness from external developers with our upcoming SDKs. #3 seemed to be a happy medium: compact, yet still somewhat human-readable.

(As an aside, you may be wondering why simply using gzip compression wouldn't solve our problem. The answer is that we were already using gzip, but the mobile clients still have to either keep all those bulky JSON objects in memory and/or parse all of them, neither of which is cheap. Also, as you'll see below, the benchmarks still showed significant savings for the compact formats over gzipped JSON.)

The Contestants Duke It Out

I coded up quick and dirty versions of the options and started collecting my benchmark data using some test sheets with differing amounts of data and formatting that I put together. Here were the results of getting a sheet using a heavily formatted large sheet (25 columns, 2000 rows, every cell formatted, 100 different unique styles):

Format Data Raw Size (in MB) Gzipped Size (in MB)
No formatting included 8.5 1.7
JSON 24.5 2.7
Bitmask 10.2 2.1
Index list 10.7 2.1

As expected, the JSON approach bloated the payload size by a significant amount, nearly tripling the raw size of the unformatted sheet's JSON. But the most interesting thing to note was that the bitmask and index list options resulted in pretty similar payload sizes. This might seem counterintuitive to a developer; a single number being roughly the same size as an array of numbers? But it makes perfect sense when you remember that we're not transferring raw data structures over the wire, we're transferring string representations of those data structures (i.e. JSON). As a string, the number "123456" is the same size as the comma-separated list of smaller numbers "12,3,4".

In my opinion, this was the clincher. I may be hurting my developer cred by admitting this, but I hate doing bitwise math. I dislike working with numbers that I can't read without the help of a binary calculator. Left and right shifting bits and applying XOR's hurt my poor base-10-centric head. I think I'm not alone. I think all things being equal, given the choice between a list of integers that are indexes into lookup tables, or a bitmask with pages of documentation describing what bits 1 through 4 and 5 through 14 represent, etc., most developers (the non-masochists) would choose the former.

I ran numbers on a few more sheets of differing size and formatting, and they provided similar results: bitmask was the most compact, followed closely by index list. I reported the results to the group.

Tribal Council: A Survivor Is Named

It took a long email thread and some further debates about other details of the implementation, but we eventually did settle on the index list option. It provided the best tradeoff between size and ease of use. As a developer convenience, we decided to provide all of the format lookup tables through a new API operation (GET /serverinfo), (a) to save developers the trouble of having to hand code them from documentation, and (b) to ensure that developers always have the most up-to-date tables (not that we intend to be changing them often).

After all that, I finally got down to writing the darned code! It proved smoother sailing than the design phase, and the fruits of my labor went out the door with the recent Smartsheet release. You can find more on formatting and the new server information resource in the API Documentation. I hope you find it useful!

- Kevin