With databases, size does matter!
I recently began working on a project where PostgreSQL had been chosen as the database. The project had been collecting data for a few months and by that point had some tables in the tens of millions of records. The basic purpose of the application was to let its users view this data in a more action-driven manner (think of the Google Analytics dashboard); the information was collected to help the app’s users make complex decisions.
I have worked with MySQL databases my whole career, along with technologies including SQLite, MSSQL, PostgreSQL, Redis, MongoDB and ElasticSearch. When I started working on the project we were encountering ridiculous query times. This had been partially foreseen: the app ran background jobs that cached query results, so web response times stayed quick. The code used some rather complex query building, with multiple joins and so on, and the caching system was smart, allowing slow queries to run behind the scenes. However, these queries were so slow that things crashed before the caches could be built. As a fan of simplicity, I started by reducing the need for the various joins, expecting this to improve query times and fix the caching. Reducing the joins meant performing a large update on a table of over 50 million rows. This is where my trauma began.
First I had to start exploring GUIs. I’ve never been the sort of developer who likes working with a database in the terminal. Kudos to those who do, but I prefer to see what I’m working with. After trying numerous tools I came across TablePlus, which proved to be an awesome app for working with the database. Once I was able to see what I was working with and actually test queries, I started seeing problems. I started with something simple:
select count(*) from jobs
It was pretty quick, and there were a few thousand jobs in the queue. These were Laravel jobs that were going to update rows in a table with over 50 million records. Examining the failed jobs, however, I noticed a trend of timeouts, which led me to test directly against the database (and yes, I had a backup). So I started with a VERY simple query: let’s update some records with a value (these examples are mocked).
UPDATE things SET ref_id = 99 WHERE other_ref_id = 666;
With things holding over 50 million rows, I expected the update to take a while. But with 218,916 rows matching the query, I gave up after letting it run for 47 minutes. I then tried to find the total number of records in the things table matching that other_ref_id, since I couldn’t understand why the update took so long. Running even that simple count was the equivalent of throwing my computer out the window. Everything ground to a halt, though it eventually gave me the number above, 218,916. The query:
select count(*) from things WHERE other_ref_id = 666;
This query bogged down the entire database and drove the whole system to a halt. What’s interesting is that there was no increase in resource usage from the query; while doing all this I was monitoring the system in New Relic. So what I understood was: updates take forever, counts break everything, and yet there is no increase in the resources used by the system. With this information I started reading… and read some more… and then some more…
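In hindsight, the quickest way to see what PostgreSQL is actually doing with a query like this is EXPLAIN ANALYZE. A sketch against the mocked things table from above:

```sql
-- Runs the query for real and reports the chosen plan, row counts,
-- timings, and buffer I/O for each plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM things WHERE other_ref_id = 666;
```

If the plan shows a sequential scan over the whole table instead of an index scan, that alone explains minutes-long counts on a 50-million-row table.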
Here is what I learned about these simple queries that all applications use and how the very architecture of PostgreSQL can cripple your application and leave you with no clues as to why.
This took extensive research, since it never occurred to me that the count aggregate could be slow on an indexed column. PostgreSQL’s documentation has a special note on this:
“Users accustomed to working with other SQL database management systems might be disappointed by the performance of the count aggregate when it is applied to the entire table. A query like this will require effort proportional to the size of the table: PostgreSQL will need to scan either the entire table or the entirety of an index which includes all rows in the table.”
Given that the application being built needed to provide counts as its main function, this was a damaging block of information. It meant that responses would be brutally slow as tables began to exceed 50 million rows.
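One workaround I came across in that reading (not something the project used at this point, so treat it as a sketch) is to ask PostgreSQL’s planner statistics for an estimate instead of an exact count:

```sql
-- Approximate row count for the whole table, read from planner stats.
-- Near-instant, but only as fresh as the last VACUUM or ANALYZE.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'things';
```

For dashboards where "about 69 million" is as useful as an exact figure, this avoids the full scan entirely; it can’t replace filtered counts like the other_ref_id query, though.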
As for updating multiple rows, I came across a blog entry from the Uber Engineering team which explained the reason for the incredibly slow mass updates. It turns out that write amplification occurs in PostgreSQL, and it gets even worse when changing a table with multiple indexes. This is due to the nature of PostgreSQL’s MVCC architecture: a change to a single field writes a new version of the row’s tuple, which means every index on the table must be updated to point at the new tuple. As mentioned, the application did not rely on mass updates of rows for its general use, but in its current situation they were necessary to finish a migration that needed to take place.
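One mitigation for this, which I’ll sketch here with the mocked columns from earlier, is to break the mass update into small batches so no single statement has to rewrite hundreds of thousands of tuples (and their index entries) in one transaction:

```sql
-- Update in batches of 10,000 rows; re-run until 0 rows are affected.
-- Each batch is a short transaction, keeping locks and WAL churn bounded.
UPDATE things
SET ref_id = 99
WHERE ctid IN (
    SELECT ctid
    FROM things
    WHERE other_ref_id = 666
      AND ref_id IS DISTINCT FROM 99
    LIMIT 10000
);
```

The `IS DISTINCT FROM` guard skips rows already updated, so repeated runs converge instead of rewriting the same tuples.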
Converting to MySQL
It became apparent that the architecture of PostgreSQL was crippling the project’s performance. Using AWS’s Database Migration Service, the database was converted to MySQL, and after confirming that all the indexes were set correctly I began working on the database updates again. The performance differences were shocking. An update query against the MySQL database touching a dataset of 610,976 rows completed in 12 minutes. So rather than something running for over 47 minutes with no sign of ending, MySQL finished a larger update in about a quarter of that incomplete PostgreSQL query’s time. Then I started to look at counts, which is where the architecture of MySQL proves its worth. To run the simplest query of:
select count(*) from things
We get a response time of 11 seconds for approximately 69 million rows. When we add a condition on another indexed column:
select count(*) from things WHERE other_ref_id = 666;
It took less than 10 seconds to count the 1.9 million matching rows.
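To confirm that MySQL is actually satisfying these counts from the index rather than from the table itself, a plain EXPLAIN is enough (again a sketch against the mocked schema):

```sql
-- If the index on other_ref_id is being used, the plan shows its name
-- in the `key` column and "Using index" in the Extra column.
EXPLAIN SELECT count(*) FROM things WHERE other_ref_id = 666;
```

That "Using index" marker means a covering index scan: the count is answered without touching the table rows at all.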
The application is moving along well, with a performance increase I cannot even begin to calculate. I have never read so much documentation on database architectures in my career before this, but I must say it’s paid off. I hope these snippets of information help you in the future when you’re thinking about which database may suit your application best. If there are tricks I don’t know about, help me learn by connecting with me. My next developer article will be about using ElasticSearch with this same database, and how the application could start doing real-time queries rather than caching the large queries on a regular basis.