All systems have limits. Some limits are hard design decisions, some are soft caps set by stress testing, and some are unknown even to a system's creators. Inevitably, users locate that third type of limit and will push any system until it breaks. As a result, I keep encountering situations in which pagination needs to be introduced after a system is already live in production. Some use cases are easy to paginate: reading a collection of existing records, writing a set of new records, and so on. These use cases have simple limits related to the amount of data in the request or response body. Pagination becomes much harder when a use case performs some action on a dataset of unknown size. Selecting data through a complex relationship, or deleting records when their parents are removed, can present a challenge when every operation needs to fit within a 30-second web request.
To help guide discussions around pagination in complex situations, I have found it helps to focus on a strongly bounded unit of work. The work can be anything: creating, reading, updating, or deleting records. What matters is committing to an amount of work that can be completed within the timeout provided. Let's take a look at a few examples to see how this works in practice.
Deleting a Collection of Records
Any collection of records that requires pagination to read will likely also take more than a single request to fully delete. There are many ways to overcome this, but the simplest is to delete as many records as you can within the 30 seconds. Once you approach the 30-second limit, return a next page token to be included in the next request. This bounds the amount of work done in any single request, ensuring it can complete in the time allocated. It will generally be slower than attempting to delete all records in one call, but it will always succeed and return a response to the user. When working with transactional databases especially, this is essential to building systems that work at any scale.
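As a rough illustration, here is a minimal Python sketch of that loop. The fetch_batch and delete_batch helpers, the batch size, and the 25-second budget are illustrative assumptions rather than a prescribed implementation:

```python
import time

REQUEST_BUDGET_SECONDS = 25  # leave headroom under a 30-second request timeout
BATCH_SIZE = 100


def delete_collection_page(collection_id, page_token, fetch_batch, delete_batch):
    """Delete batches until the time budget runs out, then return a cursor."""
    deadline = time.monotonic() + REQUEST_BUDGET_SECONDS
    cursor = page_token  # None means start at the beginning of the collection

    while time.monotonic() < deadline:
        record_ids = fetch_batch(collection_id, after=cursor, limit=BATCH_SIZE)
        if not record_ids:
            return {"next_page_token": None}  # collection fully deleted
        delete_batch(record_ids)
        cursor = record_ids[-1]  # advance the cursor past what was just deleted

    return {"next_page_token": cursor}  # more work remains; resume from here
```

The cursor only ever advances past records that have already been deleted, so a request that runs out of time simply hands the remaining work to the next call.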
Copying a Collection of Records
Like deleting a large collection of records, copying can be an expensive operation. I recently encountered a team attempting to copy folders containing a number of different asset types ranging in size from kilobytes to gigabytes. Paginating such a system so it works at any scale can seem daunting when you know so little about both the size and number of the records being copied. Nevertheless, by using a strongly bounded unit of work the team was able to build an endpoint that operated at essentially any scale.
The key was to dynamically calculate the page size before beginning each copy operation. A single request could only copy about 100MB of data, making it uncertain whether even a single item could be copied in full within one request. Using a compound next page token, the team was able to paginate both the copying of many small records and the copying of a single large record in one endpoint. When many small records were being copied, the next page token indicated the ID of the record in progress. When a single large record was being copied, the next page token indicated the record's ID along with the byte offset to resume the copy from.
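A sketch of how such a compound token might work is below. The list_items and copy_bytes helpers, the item fields, and the byte budgets are hypothetical; the point is how the same token covers both "record in progress" and "record plus byte offset":

```python
BYTE_BUDGET = 100 * 1024 * 1024  # roughly 100MB of copied data per request
CHUNK_SIZE = 8 * 1024 * 1024


def parse_token(token):
    """Split "record_id:offset" into its parts; None means start from scratch."""
    if token is None:
        return None, 0
    record_id, _, offset = token.rpartition(":")
    return record_id, int(offset)


def copy_page(folder_id, page_token, list_items, copy_bytes):
    record_id, offset = parse_token(page_token)
    copied = 0

    for item in list_items(folder_id, starting_at=record_id):
        if copied >= BYTE_BUDGET:
            # Budget exhausted on a record boundary: resume at this record's start.
            return {"next_page_token": f"{item.id}:0"}
        start = offset if item.id == record_id else 0
        while start < item.size:
            n = min(CHUNK_SIZE, item.size - start, BYTE_BUDGET - copied)
            copy_bytes(item, start, n)
            start += n
            copied += n
            if copied >= BYTE_BUDGET and start < item.size:
                # Stopped mid-record: the token carries both the ID and the offset.
                return {"next_page_token": f"{item.id}:{start}"}

    return {"next_page_token": None}  # everything in the folder has been copied
```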
Searching Multiple Datastores
Taking compound next page tokens to the extreme, I have designed systems that use next page tokens of unknown structure when paginating across dependent systems. The requirement was to scan an RDS cluster and a set of DynamoDB tables, and to query a downstream service, in order to locate potential matches for a query parameter. Why the data was spread across so many different systems is a story for another day, but creating a unified interface for this data required that all three be scanned. Using a compound next page token and a separate page size for each datastore allowed the endpoint to paginate correctly without exposing users to earlier questionable design decisions.
In this case, the next page token stored the datastore name, followed by a colon, followed by the inner next page token for that dependency, allowing a different code path to trigger based on the current request. Since each query was limited to a fixed number of records, the endpoint was guaranteed to never time out, regardless of how large the datastores being scanned actually got.
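A simplified sketch of that dispatch logic might look like the following, where the datastore names, page sizes, and the scanners mapping are illustrative placeholders:

```python
# Order in which the datastores are scanned; each gets its own page size.
DATASTORES = [("rds", 500), ("dynamodb", 200), ("downstream", 50)]


def search_page(query, page_token, scanners):
    """Handle one datastore per request based on a "store:inner_token" token."""
    default = f"{DATASTORES[0][0]}:"
    store, _, inner_token = (page_token or default).partition(":")

    for index, (name, page_size) in enumerate(DATASTORES):
        if name != store:
            continue
        matches, next_inner = scanners[name](query, inner_token or None, page_size)
        if next_inner is not None:
            # More pages remain in this datastore.
            return matches, f"{name}:{next_inner}"
        if index + 1 < len(DATASTORES):
            # This datastore is exhausted; move on to the next one.
            return matches, f"{DATASTORES[index + 1][0]}:"
        return matches, None  # all datastores scanned

    raise ValueError(f"unrecognized datastore in page token: {store}")
```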
Next Page Tokens and Internal State
As you may have noticed, this blog post could equally have been entitled The Power of Compound Next Page Tokens. Indeed, the next page token is what allows a strongly bounded unit of work to be executed without any concern for what work is actually being performed. In general, the next page token should store the current state of whatever operation is being paginated. That state could range from the IDs being scanned to the hosts being contacted. If the state is too large for a reasonable next page token, consider storing it in a short-lived datastore like Redis and returning only a reference to it. The types of pagination that can be achieved through a compound next page token are nearly limitless.
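As one possible shape for that last idea, pagination state that outgrows the token itself could be parked in Redis under a random key with a short TTL, roughly along these lines (the key prefix and TTL are arbitrary choices, not a prescription):

```python
import json
import secrets

import redis

r = redis.Redis()  # short-lived store for oversized pagination state
STATE_TTL_SECONDS = 15 * 60  # stale tokens simply expire


def save_pagination_state(state: dict) -> str:
    """Persist the operation's state and return an opaque token for the client."""
    token = secrets.token_urlsafe(16)
    r.set(f"page-state:{token}", json.dumps(state), ex=STATE_TTL_SECONDS)
    return token


def load_pagination_state(token: str):
    """Rehydrate the state for the next request, or None if it has expired."""
    raw = r.get(f"page-state:{token}")
    return json.loads(raw) if raw is not None else None
```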
The Pitfall
While it is an effective brute-force strategy for paginating any operation, paginating on a strongly bounded unit of work comes with a critical drawback: it is slow. Because each next page token depends on the result of the previous request, the client generally cannot parallelize its calls. This makes the complex compound next page token a last resort, reserved for cases where other pagination strategies have failed or latency is not a limiting factor.
DynamoDB offers an interesting solution by allowing users to scan using a segment ID in addition to the next page token. Since DynamoDB operates as a collection of fairly small nodes, each segment is able to scan a different node. This approach may not be possible for all applications, but the concept of allowing callers to provide both a partition ID and a next page token can overcome some latency issues using parallelization.
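With boto3, for example, one segment of a DynamoDB parallel scan looks roughly like this; the table name and segment count are placeholder values:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("records")  # hypothetical table name

TOTAL_SEGMENTS = 4  # one client or worker per segment


def scan_segment(segment, start_key=None):
    """Scan one segment of the table; each segment paginates independently."""
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS, "Limit": 100}
    if start_key is not None:
        kwargs["ExclusiveStartKey"] = start_key
    response = table.scan(**kwargs)
    # LastEvaluatedKey acts as this segment's next page token.
    return response["Items"], response.get("LastEvaluatedKey")
```

Each caller owns one segment and its own chain of LastEvaluatedKey tokens, which is what allows the overall scan to run in parallel.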
Another alternative is async processing. Using a queue and a status endpoint, clients can enqueue work and poll until that work is complete. While this doesn't require pagination or next page tokens, it is still worth considering how each work item will be bounded. Async processing has its own challenges, but it's a good option when next page tokens are getting too complex or clients are spending too much time paginating.
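The client's side of that pattern can be as simple as the following sketch, where the job endpoints and status fields are entirely hypothetical:

```python
import time

import requests

BASE_URL = "https://api.example.com"  # hypothetical service


def run_async_job(payload, poll_interval=2.0):
    """Enqueue a job, then poll its status endpoint until it finishes."""
    job = requests.post(f"{BASE_URL}/jobs", json=payload).json()

    while True:
        status = requests.get(f"{BASE_URL}/jobs/{job['id']}").json()
        if status["state"] in ("succeeded", "failed"):
            return status
        time.sleep(poll_interval)  # back off between polls
```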
Conclusion
In a perfect world, all limits would be well understood, and operations that might create unbounded work would be prohibited by the initial design. In reality, system designers inevitably need to extend legacy code with more scalable and reliable endpoints. The strongly bounded unit of work can be a guiding light when breaking complex operations up into smaller, more scalable, paginated requests.