Table partitioning to enhance query performance

By Luay Al Assadi, Software Engineer.

Note: All the numbers mentioned in this blog post are not real and are used for illustrative purposes only.

At Toters Delivery it's very important to ensure that orders made by our customers are delivered by our delivery staff (called shoppers) in a timely manner. In order to continuously optimize our delivery operations, we store many details of shoppers' order fulfillment activities, including details about the reliability of their internet connection. This has been very important lately due to the unreliability of the internet in some of our major countries of operation.

Clearly this is a lot of data, and storing such vast amounts of data into our database is bound to affect query time. Enter: database partitioning. 

Database Partitioning

Database partitioning refers to splitting data across multiple tables (and sometimes multiple databases). It is generally done in a transparent way, so the application still deals with one logical table. In our case we will partition the shopper logs table in order to decrease the volume of data the DB engine scans each time we run a query; it also lets the engine work on partitions in parallel, which increases performance. A sequential scan of one small partition is often cheaper than using an index on a big table, which requires random-access reads scattered across the whole table.

As we are using PostgreSQL, we looked at the partitioning forms it offers, which are the following:

Range Partitioning

The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. This approach is effective when we need to access the data by date or integer ranges.
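
For illustration, a minimal range-partitioned table in PostgreSQL could look like the sketch below (the table and column names here are hypothetical, not our real schema):

CREATE TABLE measurements (
    device_id  int         NOT NULL,
    logged_at  timestamptz NOT NULL,
    reading    numeric
) PARTITION BY RANGE (logged_at);

-- Each partition covers a non-overlapping range of dates
CREATE TABLE measurements_2023_q1 PARTITION OF measurements
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');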

List Partitioning

The table is partitioned by explicitly listing which key value(s) appear in each partition.
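
A hypothetical sketch, assuming we split rows by a country code column:

CREATE TABLE orders (
    order_id bigint NOT NULL,
    country  text   NOT NULL
) PARTITION BY LIST (country);

-- Each partition explicitly lists the key values it accepts
CREATE TABLE orders_lb    PARTITION OF orders FOR VALUES IN ('LB');
CREATE TABLE orders_iq_jo PARTITION OF orders FOR VALUES IN ('IQ', 'JO');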

Hash Partitioning

The table is partitioned by specifying a modulus and a remainder for each partition. Each partition holds the rows for which the hash value of the partition key, divided by the specified modulus, produces the specified remainder.
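
A hypothetical sketch, spreading rows across four partitions by a shopper id:

CREATE TABLE sessions (
    shopper_id bigint NOT NULL,
    payload    jsonb
) PARTITION BY HASH (shopper_id);

-- A row lands in the partition where hash(shopper_id) % 4 equals the remainder
CREATE TABLE sessions_p0 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sessions_p1 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sessions_p2 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sessions_p3 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);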

Since most of our current SQL statements that access the shopper logs table search within time ranges, we decided to partition the table data by time intervals based on the date column using range partitioning. The following section explains the analysis that helped us decide the time ranges for partitioning the table; the goal was to keep query planning time low. Note: bad partitioning is worse than no partitioning.

Capacity Estimation and Constraints

Let’s assume that we have 1,500 daily active shoppers, that on average each shopper works 10 hours a day accepting and delivering orders, and that we log a record roughly every 30 seconds (2 per minute). This means we have 1,500 × 10 × 60 × 2 = 1.8 million records per day.

Storage Estimation: Let’s assume that on average a record is 10 bytes, so to store all the logs for one day we would need about 17 MB of storage.
1.8 million records * 10 bytes => 17 MB/day
Then we need for 1 year:
17 * 365 => 6.3 GB/year

Let’s assume, for the sake of argument, that our shopper base grows at a rate of 150% a year. If we want to provision enough database capacity for the next t years, we run our current numbers through the exponential growth formula:

x(t) = x₀ * (1 + r/100)ᵗ

We plug in the numbers (growth factor 1 + 150/100 = 2.5):

6.3 GB * (1+150/100)⁵ = 615 GB in 5 years
6.3 GB * (1+150/100)¹⁰ = 58.6 TB in 10 years

This means that we will be storing a massive amount of data in the next 5 years!

Shopper Logs Table Partitioning 

Since range partitioning is the best choice for a table that is frequently scanned with a range predicate on a specific column such as a date, we will split this table into smaller tables based on the log record date. We will create a partition for every quarter, and the partition name will include the year and quarter it covers, like so:
shopper_logs_partitioned_child_{year}_{q}
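
In PostgreSQL this could look roughly like the following sketch (the column names are illustrative since the real schema isn’t shown here; we assume the parent table is called shopper_logs_partitioned and the log date lives in a created_at column):

CREATE TABLE shopper_logs_partitioned (
    shopper_id   bigint      NOT NULL,
    created_at   timestamptz NOT NULL,
    connectivity jsonb
) PARTITION BY RANGE (created_at);

-- One partition per quarter, named after the year and quarter it covers
CREATE TABLE shopper_logs_partitioned_child_2022_q1
    PARTITION OF shopper_logs_partitioned
    FOR VALUES FROM ('2022-01-01') TO ('2022-04-01');

CREATE TABLE shopper_logs_partitioned_child_2022_q2
    PARTITION OF shopper_logs_partitioned
    FOR VALUES FROM ('2022-04-01') TO ('2022-07-01');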

With this partitioning mechanism we will have about 200 partitions after 50 years, and according to the PostgreSQL documentation this will not affect query planning time:

“Partitioning using these techniques will work well with up to perhaps a hundred partitions; don't try to use many thousands of partitions.”

Continuing with our example, each partition is about 1.5 GB on average in the first year and will reach roughly 92 GB after 5 years.
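
If you want to see how big each partition actually gets, one quick (illustrative) way is to query the catalog for the child tables’ sizes, assuming the naming scheme above:

-- List the size of every quarterly partition
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class
WHERE relname LIKE 'shopper_logs_partitioned_child_%'
ORDER BY relname;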

Shopper Logs Partitioning Performance

For this exercise we benchmarked the query performance of our existing table vs the partitioned table and got the following results:

We can clearly see the efficiency of using partitioning in our case [logging shopper locations]. Although planning time is always more expensive on a partitioned table, execution time is always cheaper. This happens because the DB engine needs extra time on a partitioned table to decide which partitions a query should touch. That said, the overall time spent planning and executing queries is clearly much lower on a partitioned table than on a non-partitioned one, and the difference grows even bigger as the size of the data increases.
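
As a rough way to reproduce this kind of comparison, you can run the same range query through EXPLAIN ANALYZE against both tables and compare the planning and execution times it reports (the query below is only a sketch, reusing the hypothetical created_at column from the earlier example):

EXPLAIN ANALYZE
SELECT count(*)
FROM shopper_logs_partitioned
WHERE created_at >= '2022-01-01'
  AND created_at <  '2022-04-01';
-- Thanks to partition pruning, the plan only scans shopper_logs_partitioned_child_2022_q1,
-- while the same query on an unpartitioned table has to read far more data.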

That’s it! I hope you found this post useful. One of my own takeaways from this exercise is that while it’s fashionable to jump at the latest technologies introduced by the infrastructure vendors out there (e.g. DynamoDB or Aurora Serverless by AWS, services which in some cases turn out to be sexy sounding but not exactly production ready), there are still a lot of tricks present in the legacy and less sexy technologies such as PostgreSQL!

How small steps can help streamline your CI/CD pipeline

At Toters we pride ourselves on being one of the few companies in Lebanon capable of deploying live code that impacts the user experience of our customers, partners and internal staff on a daily basis. This is commonly known in the industry as a CI/CD pipeline. Achieving this CI/CD pipeline was the result of continuous small improvements in our processes, born out of discussions between various team members (product, backend, QA, etc.). This speedy delivery is especially important to us considering that we are an operational business working in markets that are undergoing many disruptions and challenges.

This post will be one of a series of posts, each discussing one simple change we made to help evolve our CI/CD pipeline and product development lifecycle. 

Protecting our staging environment using dynamic envs

Historically our engineers would do their development on their local environments, and then send Pull Requests (PRs) to their peers for review. Once the PRs were approved, they got merged into the dev branch. The dev branch had no corresponding testing environment. We agreed as a team to freeze all PR reviews by 4:00pm on a daily basis, at which point our “on call dude” (the backend member on rotation who is responsible for helping with the deploys) merged the dev branch into staging. Doing so triggered a Bitbucket pipeline which ran an AWS CodeDeploy deployment and published the latest code to our staging environment.

Once all the work was on staging (keep in mind that this was typically only one day’s worth of work, since we deployed every single day), the QA team started doing functional testing on each of the published tickets. Once all the tickets were approved, one QA did a happy path test and then marked the version clear for a prod release. We kept track of what would go on the release using Jira’s versioning system. We tracked which QA member was working on what by assigning the ticket to that QA member and changing its status from User Acceptance Testing Queue (UAT Queue) to User Acceptance Testing (UAT) (the Queue idea is a Kanban concept; vote up in the comments if you would like us to post about it).

Problems

This worked well for a while. However, at one point we noticed that our deploy frequency started slowing down, and that the number of tickets in the supposedly daily releases was increasing. According to Kanban principles, any increase in work in progress (i.e. in the todo list) increases lead time (i.e. the time it takes from the start of a project to its delivery). An increase in lead time, in turn, is known to increase defect rates, bugs in production and so on.

Historically we’ve always had only two environments to work with: staging and production (not to mention the localhost that engineers work on). At one point we created another environment called next gen, and used it to rehearse a major infrastructural change we were doing. After that, the product managers started taking turns using that environment to test major initiatives such as the loyalty feature and so on. At some point the tech team grew and got divided into squads, where each squad became an autonomous subteam with its own product manager, QA, backend/frontend/mobile devs, a designer and so on. This increased the demand for the next gen environment, so sometimes the teams used both next gen and staging for testing.

That last part was the problematic one. When a new, large, untested feature landed on staging, it got grouped with the other smaller changes such as bug fixes and minor adjustments. When that happened, all the minor bug fixes had to wait until the major feature was tested and released, which could take days at a time. This explained the delay that was happening to our releases.

Solution

We soon came up with a solution: the devops team created an AWS CloudFormation script that builds an entire environment from scratch and deploys it whenever a developer creates a branch with a feature/ prefix. Bitbucket Pipelines would then call an AWS CloudFormation CLI command that created an environment for that feature. So, for example, an engineer would create a branch that looks like this: feature/loyalty

And that would create a complete backend environment, with testing data and so on, for that feature specifically. From then on, once a product team started working on a big project, they simply created their own environment and did all their testing on it. Several product teams could create several environments all at once without any blockage or queues. Similar scripts were later added to create environments for the various apps and services we offer (e.g. the mobile team created a CI/CD pipeline as well, to rapidly release mobile apps specific to an environment and publish such builds to various business stakeholders for early feedback).

All of this kept staging nice and clean. Only small features made their way to it (those that didn’t merit creating their own environment), along with the major features that had already been thoroughly tested and gone through many iterations in their custom environments. At that point QA would do functional tests on the small tickets and a happy path test on everything that would be released that day, and that’s it.

Result

The team went back to releasing every day! The staging environment was no longer clogged by big features. This reduced our work-in-progress items and in turn reduced our lead times, resulting in lower defect rates and increasing the rate at which the product and teams can respond to the needs and demands of the business.

Abdullah Bakhach | CTO