Lessons From Optimizing the MindTouch Web Platform
Posted in: Monitoring   -   August 23, 2013

Here at MindTouch we’re focused on creating the ideal software help SaaS. As with any service, speed is a major factor in user satisfaction: How quickly are pages loading? Is every action, from searching to rating pages, happening smoothly?

As background, the MindTouch help platform is built on several layers:

  • Backend API: Our RESTful API handles all commands (page updates, attachment storage, searching, template rendering, etc.).
  • Static middle-tier: We have a PHP layer that performs API requests and assembles the HTML of a page based on the customer’s skin (including page contents, headers, footers, etc.).
  • Dynamic front-end: Dynamic user actions (editing, rating, commenting) happen in Javascript, often directly against our API.

We had a few moving parts in the mix. Here’s how we made improvements for the most user gain for the least development pain.

Step 1: Measure

Since we’re API-driven, we instrument every request that a page load initiates.

Now, it’s one thing to have mountains of logfiles to crawl through. We have them, and process them with Splunk, but it’s for relatively ad-hoc queries. It’s not something the average dev wants to get into when debugging performance issues.

The key is making things user-friendly. Here’s what we’ve done:

  • Instrument all API calls on the backend for total time taken, Memcache hits/misses, database requests generated, and so on.
  • Aggregate the performance information in an easy-to-query format. We have a CouchDB app that can query the performance metrics for all API calls in seconds. [Screenshot: MindTouch Log Viewer]
  • Display the relevant info in an easy-to-digest format. The CouchDB app is excellent for seeing the slowest/most intensive calls across the site. For a specific page, we have a custom Google Chrome extension that shows a performance summary for the page you’re on. [Screenshot: performance summary for the MindTouch Support homepage] You’ve seen sites with HTML comments similar to “rendered in x.xx seconds”. We’ve expanded the in-page info to a full, machine-readable JSON object. Our Chrome plugin converts the object into an easy-to-read summary table, right in the Dev Tools tab.
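
To make the idea concrete, here is a minimal sketch of per-request instrumentation that ends in a machine-readable JSON summary. It is written in Python purely for illustration; the class, field names, and the example API path are hypothetical and are not the MindTouch implementation.

```python
import json
import time
from contextlib import contextmanager


class RequestMetrics:
    """Collects counters for every API call made during one page load (hypothetical)."""

    def __init__(self):
        self.calls = []

    @contextmanager
    def api_call(self, name):
        # Each call gets its own record so counters from different calls stay separate.
        record = {"name": name, "cache_hits": 0, "cache_misses": 0, "db_queries": 0}
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the elapsed time even if the call raised an exception.
            record["elapsed_ms"] = round((time.perf_counter() - start) * 1000, 2)
            self.calls.append(record)

    def to_json(self):
        # Machine-readable summary that a browser extension could render as a table.
        payload = {
            "total_ms": round(sum(c["elapsed_ms"] for c in self.calls), 2),
            "calls": self.calls,
        }
        return json.dumps(payload, sort_keys=True)


metrics = RequestMetrics()
with metrics.api_call("pages/home/contents") as call:  # hypothetical API path
    call["cache_hits"] += 1
    call["db_queries"] += 2
print(metrics.to_json())
```

The summary string could be embedded in the page in place of a “rendered in x.xx seconds” comment, which is the kind of object a browser extension can parse and display.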

On the front end, Google PageSpeed is an excellent way to measure your performance.

For all tools, simplicity is key: make it simple and visual to monitor your backend and front-end performance metrics. Nobody wants to trawl through gigabytes of logfiles to see what items could be relevant to a performance issue.

Step 2: Prioritize

With measuring tools in place, we can decide which optimizations to work on first. Our general approach:

  • High-priority Google PageSpeed optimizations
  • CPU- and database-intensive API requests
  • Infrequent but high-latency API requests
  • PHP structural changes

The PageSpeed optimizations are pretty straightforward: minimize the number of requests, compress everything you can, and set long-lived cache expirations.

Some changes, like compression, are mostly web server configurations and are easy to deploy. For us, minimizing requests and having unique URLs with long-expiring caches required some refactoring. It was well worth it, though: we now have a single request for CSS and Javascript, at a unique hash-based URL.
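
A hash-based URL can be sketched in a few lines. The example below is an illustration in Python, not our PHP code; the function name, the `/assets` path, the bundle naming scheme, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def bundle_url(sources, extension, base_path="/assets"):
    """Concatenate source texts and name the bundle after a hash of its bytes."""
    bundle = "\n".join(sources).encode("utf-8")
    # A truncated digest keeps the URL short while still changing with the content.
    digest = hashlib.sha256(bundle).hexdigest()[:16]
    return f"{base_path}/bundle-{digest}.{extension}", bundle


# The URL changes only when the content changes, so the response can safely
# carry a very long cache lifetime (one year here, as an example value).
LONG_LIVED_HEADERS = {"Cache-Control": "public, max-age=31536000, immutable"}

url, body = bundle_url(["body { margin: 0; }", "h1 { color: #333; }"], "css")
print(url)
```

Because any edit to a stylesheet or script produces a new URL, browsers never serve a stale copy, and unchanged bundles stay cached across deployments.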

On the API side, we constantly examine our CouchDB reporting tool and prioritize the bottlenecks we see. In general, we prioritize by the total time spent in a function, which shows which calls are hammering the server the most, in aggregate. They’re an easy choice for optimization, followed by slow-running (but infrequent) calls which may give an individual user a bad experience.
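
The two rankings described above (total time in aggregate, then slow but infrequent calls) can be sketched as follows. This is an illustrative Python example with made-up sample data; the function name, the call names, and the 1000 ms threshold are assumptions, not our reporting tool.

```python
from collections import defaultdict


def rank_calls(samples, slow_threshold_ms=1000.0):
    """Rank API calls by aggregate time, and separately list slow outliers.

    `samples` is an iterable of (call_name, elapsed_ms) pairs.
    """
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "max_ms": 0.0})
    for name, elapsed_ms in samples:
        stats = totals[name]
        stats["count"] += 1
        stats["total_ms"] += elapsed_ms
        stats["max_ms"] = max(stats["max_ms"], elapsed_ms)

    # First priority: calls that consume the most server time in aggregate.
    by_total = sorted(totals, key=lambda n: totals[n]["total_ms"], reverse=True)
    # Second priority: calls whose worst case gives a single user a bad experience.
    slow_outliers = sorted(
        (n for n in totals if totals[n]["max_ms"] >= slow_threshold_ms),
        key=lambda n: totals[n]["max_ms"],
        reverse=True,
    )
    return by_total, slow_outliers


samples = (
    [("site/settings", 40.0)] * 500   # fast, but called constantly
    + [("pages/export", 2500.0)] * 2  # rare, but painfully slow
    + [("users/current", 15.0)] * 100
)
by_total, slow_outliers = rank_calls(samples)
print(by_total)
print(slow_outliers)
```

In this sample the cheap but frequent call tops the aggregate ranking, while the rare export call is the only one flagged as a slow outlier.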

We’ve found and removed several unnecessary API calls, and improved caching on many more. We’re planning to have our PHP layer check Memcache for commonly-accessed data (such as site settings), reducing internal traffic.
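
Checking the cache before calling the API is the usual cache-aside pattern. The sketch below is a Python illustration under stated assumptions: `TtlCache` is an in-memory stand-in for a Memcache client, and the key name, the 300-second lifetime, and `fetch_from_api` are hypothetical.

```python
import time


class TtlCache:
    """In-memory stand-in for a Memcache client (get/set with expiry)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries behave like misses.
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)


def get_site_settings(cache, fetch_from_api, ttl_seconds=300):
    """Return site settings from the cache, calling the API only on a miss."""
    settings = cache.get("site:settings")
    if settings is None:
        settings = fetch_from_api()
        cache.set("site:settings", settings, ttl_seconds)
    return settings


api_calls = []


def fetch_from_api():
    # Stands in for an internal API request; records that it was called.
    api_calls.append("site/settings")
    return {"language": "en-us"}


cache = TtlCache()
get_site_settings(cache, fetch_from_api)
get_site_settings(cache, fetch_from_api)
print(len(api_calls))  # the second lookup is served from the cache
```

A short lifetime keeps settings reasonably fresh while still removing most of the repeated internal requests.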

The last and most careful steps have been around refactoring the middle tier. Our PHP layer has evolved over 6 years and several product editions; it’s been due for refactoring to minimize unneeded code paths and modernize the patterns used.

Step 3: Optimize

Following the general priorities above, we moved on to optimizing the platform. Note: don’t try everything at once!

Our dev teams run in 2-week sprints. We chunk work items into small tasks (no more than 1-2 days each) and aim to make visible progress on a goal each sprint. So, instead of a complete backend overhaul, we might begin instrumenting certain functions in our API. Then, work on sending the results to CouchDB. And then emit per-request information in the page as JSON. Then, build a Chrome extension to render it.

After building a cadence and rhythm of continuous improvement, we pick off the low-hanging fruit as it arrives. Sure, sometimes a major overhaul is needed (e.g., architecture changes), but we avoid that work until the benefit is clear. Here are a few changes we’ve made:

  • API: Dramatically reduced the total number of unique API requests, and the time spent in heavily-used functions. Many API requests are now served directly from Memcache without hitting the database.
  • PHP Layer: We performed a major refactoring which made the following changes much easier:
    • Minimized (and cached) dozens of previous API requests
    • Created a single Javascript and CSS file for a user (depending on their role). This URL is unique, based on a hash of the file contents, and has a long-lasting expiration.
    • Ensured compression was enabled for all text resources
    • Optimized the location of non-essential Javascript
    • Eliminated unnecessary code paths, such as avoiding page renders on endpoints which only process POST requests
  • Dynamic Frontend: We’re continually moving user actions into Javascript to make the user experience responsive. These JS-driven actions also benefit from the API improvements above.

This iterative process has worked for us. The question is how to keep it going.

From a development side, we’re keeping optimizations surgical, reversible, and in a style that builds momentum. New optimization strategies are always developing; pick the ones you can complete in a 2-week sprint (having historical data on how much work can reasonably be completed in a sprint is very valuable here).

From a QA perspective, performance should be a release requirement. With your API instrumentation, you can verify there are no performance regressions as new features are introduced (monitor against a baseline). On the frontend, Google PageSpeed has an API to programmatically measure your site performance. (We have staging servers updated after every build, so this is simple to verify.)
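
Monitoring against a baseline can be as simple as comparing per-call timings from the new build with a stored set. The following Python sketch is an illustration only; the function name, the 10% tolerance, and the sample numbers are assumptions.

```python
def find_regressions(baseline_ms, current_ms, tolerance=0.10):
    """Return calls whose current time exceeds the baseline by more than `tolerance`.

    Both arguments map a call name to a representative time in milliseconds
    (for example, a median from the instrumentation data).
    """
    regressions = {}
    for name, base in baseline_ms.items():
        current = current_ms.get(name)
        # Calls missing from the current run are skipped rather than flagged.
        if current is not None and current > base * (1 + tolerance):
            regressions[name] = (base, current)
    return regressions


baseline = {"pages/contents": 100.0, "site/search": 50.0}
current = {"pages/contents": 105.0, "site/search": 80.0}
regressions = find_regressions(baseline, current)
print(regressions)
```

A release check could fail the build whenever the returned dictionary is non-empty. The tolerance absorbs normal run-to-run noise so that small fluctuations do not block a release.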

Lastly, make your wins public: we use Pingdom and other monitoring tools on our status page. Measure, Prioritize, Optimize: a few simple steps to a faster site. And note: Paralyze is not one of the steps. The good optimization done today is better than the perfect one that never ships.

Kalid Azad is a web developer for MindTouch, which focuses on providing the best online help platform for clients ranging from HTC to SAP. He blogs at BetterExplained.com, and writes intuition-first explanations of math and programming.

  • http://blog.justindorfman.com jdorfman

    Great post Kalid! The status page looks a little dated. I would check out statuspage.io. They have Pingdom support, so it is super easy to port everything over.

    • http://betterexplained.com/ kalid

      Great suggestion, will have to see if we can switch over.

  • http://www.MaxCDN.com/ Chris Ueland / MaxCDN

    Thanks, Kalid!

    • http://betterexplained.com/ kalid

      Thanks Chris!