In Technical Debt? Here’s How We Started Paying It Off
A startup story on introducing CI/CD pipelines and development processes
You often read about how scaling should be your startup’s last concern before reaching product-market fit. While I truly stand behind that statement, I feel that you do not often read about what to actually (technically) do once that product-market fit has been reached. This post addresses that and shares our story.
About a year ago, I joined a B2B SaaS startup company as one of the first employees. As with any SaaS startup, the first order of business is to find product-market fit and build the MVP. During this phase, it is essential to iterate with potential customers and deliver quickly in order to survive. The thing with quickly built MVPs is that they accumulate technical debt and do not scale. And that is fine. Trying to anticipate future needs and build a scalable product right off the bat, without paying customers, will just drain your valuable resources and might leave your company shattered.
At the time of my joining, the product-market fit phase had been completed, an MVP was in place, and the company had a handful of paying customers. This was all great; however, there was one slight problem… Since our product is a SaaS application, if the technology cannot keep up, it simply is not possible to grow the business in a scalable way. We would get a feature request from a customer, implement a solution, do a few manual integration tests, and make a manual deployment. The problem was that within the next few days, we would get a report that the feature implemented last week was now broken.
Here are some (very normal for a small startup) problems that we faced:
- Almost no unit tests, and some of the ones we had were failing
- Insufficient documentation
- No automatic integration tests
- The codebase was filled with binary files, making the repository about 10 GB in size.
- Manual deployments that took about one hour on average.
All companies’ technical debt looks different depending on the application and various choices made along the way. At the company I work for, iMatrics, we build tailor-made NLP auto-tagging solutions for the media industry. The tailor-made bit is the tricky part, as it makes the product very complex. Firstly, for each customer, we generate the required models trained on the customer’s own data. Secondly, we have customers with different languages, and due to the nature of natural languages, these require different resources. This means that each customer has unique resources that create unique builds. Thus, each customer build can be seen as an independent application.
Due to the product’s complexity, maintaining the different customer builds had become difficult and was consuming large amounts of time that could have been spent on developing the service. The time had come to scale up the technology and start our journey of paying back some of the technical debt.
Continuous Integration and Continuous Deployment
Being a startup, we need to be flexible, listen to our customers, and react to issues fast. However, the manual deployments we had to perform not only took valuable time but also made it a pain to release updates. We realized that we needed to regularly integrate code changes and ship them faster, without the hassle or the feeling that it was a ton of work. Enter CI/CD pipelines.
A CI/CD pipeline is effectively a bunch of steps that execute sequentially (or partly in parallel). A typical pipeline usually consists of the steps build, test, and deploy. However, you are free to insert other steps and expand the pipeline as you see fit.
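As a sketch, a minimal GitLab pipeline with those three steps could look like the following (job names and script commands are placeholders, not our actual configuration):

```yaml
# .gitlab-ci.yml: a minimal three-stage pipeline sketch.
stages:
  - build
  - test
  - deploy

build-job:
  stage: build
  script:
    - ./build.sh          # placeholder for your build command

test-job:
  stage: test
  script:
    - ./run-tests.sh      # placeholder for your test command

deploy-job:
  stage: deploy
  script:
    - ./deploy.sh         # placeholder for your deploy script
  rules:
    - if: $CI_COMMIT_BRANCH == "development"
```

Each stage only runs if the previous one succeeded, which is what turns the pipeline into a quality gate rather than just automation.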
Arguably, implementing a CI/CD pipeline is the most important step one can take to scale the technology. Not only was this our feeling, but it can also be backed up by data: according to GitLab’s 2020 DevSecOps survey, 83% of developers report releasing code faster and more often with CI/CD.
We dreamed about releasing well-tested builds and reducing the deployment time from 60 minutes to one, thus making deployment a satisfying task.
Reducing repository size
We started the journey by identifying that the repository size was abnormally large, mainly due to many checked-in large binary files. One may think that storage is cheap, so what is really the harm? The answer is simple: the bigger the repository, the longer git commands such as fetch, pull, and clone will take. Our roughly 10 GB repository often took well over an hour just to clone. This could be reduced quite a bit by utilizing shallow clones, but still. The main issue was not for the developers working with the codebase, since cloning the repository is more of a one-time event than an everyday task. The issue arises when you add a CI pipeline, since a runner then needs to make a (albeit shallow) clone regularly, possibly multiple times a day.
One of the most prominent purposes of a CI pipeline is to provide fast feedback to the developer. Making a developer wait an hour just for the repository to clone will tempt them to circumvent the pipeline, possibly by pushing less often, effectively removing one of the pipeline’s largest benefits. Therefore, the CI pipeline execution time must be kept to a minimum, and the repository size was blocking us from this.
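To illustrate the shallow-clone mitigation: a depth-1 clone only fetches the latest commit instead of the full history. The demo below builds a throwaway repository to show the difference; in GitLab CI, the same effect comes from setting the GIT_DEPTH variable on the project or pipeline.

```shell
# Build a throwaway repository with three commits, then compare
# a full clone against a shallow (depth 1) clone of it.
workdir=$(mktemp -d)
cd "$workdir"

git init -q origin-repo
cd origin-repo
git config user.email "ci@example.com"
git config user.name "ci"
for i in 1 2 3; do
  echo "change $i" > file.txt
  git add file.txt
  git commit -qm "commit $i"
done
cd ..

# file:// is needed for --depth to take effect on local clones.
git clone -q "file://$workdir/origin-repo" full
git clone -q --depth 1 "file://$workdir/origin-repo" shallow

echo "full clone:    $(git -C full rev-list --count HEAD) commits"
echo "shallow clone: $(git -C shallow rev-list --count HEAD) commits"
```

On a 10 GB repository the saving is not in commit count but in the gigabytes of history the runner no longer downloads on every pipeline run.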
So we prioritized reducing the repository size. We dedicated almost a whole sprint (about 3 weeks) to this, a seemingly costly investment, but one that turned out to be well worth it. Unused binary files were deleted, and binary files still in use were migrated to an appropriate database (in our case mostly Elasticsearch) or file storage (S3). We also removed legacy code and deleted unused dependencies. Moving resources to databases instead of binary files meant that we could now update the resources live, without needing to redeploy the application.
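A useful first step in this kind of cleanup is listing the largest blobs in the repository’s history, to decide which binaries to migrate. The git pipeline below is a common recipe; the demo wraps it in a throwaway repository with a fake 1 MiB binary, but you would run the same pipeline from your own repository’s root. Actually purging files from history afterwards requires a history-rewriting tool such as git-filter-repo, which is not run here.

```shell
# Demo setup: a throwaway repo containing one large "binary model".
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "ci@example.com"
git config user.name "ci"
head -c 1048576 /dev/zero > model.bin   # fake 1 MiB binary
echo "code" > app.py
git add .
git commit -qm "initial commit"

# The discovery recipe: every blob in history, sorted by size.
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' \
  | awk '$1 == "blob" { print $2, $4 }' \
  | sort -rn \
  | head -n 10
```

The `model.bin` entry dwarfs everything else in the output, which is exactly the signal you are looking for in a bloated repository.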
So besides getting a smaller repository, this investment also resulted in a whole new kind of agility in the service, since customers can affect their tagging results and see the effects immediately. Our codebase is now about 10 MB and takes only a few seconds to clone, making it a breeze for any CI pipeline to use.
Setting the CI baseline
With the repository size handled, it was time to address the next issue: the CI baseline. We defined the CI baseline as the most essential functionality needed right away. We identified this as an important intermediate step, since introducing all the desired CI/CD functionality at once can be too big a commitment, especially for a small startup.
Firstly, we needed to decide on a framework to use for our CI/CD pipelines. We considered many different options, weighing the pros and cons with respect to documentation, stability, overall feel, and of course, price. Ultimately, we decided to go with GitLab because we felt it was the best option for us, although there were many other viable options out there.
We started by rewriting or simply removing unit tests that were failing. If a test was too much work to fix, we simply deleted it. The philosophy was that we needed a baseline to use as a stepping stone. We also added some simple integration tests to verify that the endpoints were working for the simplest cases. The pipeline now consisted of two steps: unit tests and integration tests. Together, they resulted in a whopping (😉) 10% code coverage. But hey, it was a start on the right path.
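Sketched as a GitLab configuration, that baseline amounts to two jobs (stage names and commands here are illustrative assumptions, not our actual setup):

```yaml
stages:
  - unit-test
  - integration-test

unit-tests:
  stage: unit-test
  script:
    - ./run-unit-tests.sh            # placeholder unit-test command

integration-tests:
  stage: integration-test
  script:
    - ./run-integration-tests.sh     # placeholder: start the service and probe its endpoints
```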
If your startup was anything like ours, the Git history would (best case) look something like this:
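A hypothetical `git log --oneline` excerpt in that spirit (the hashes and messages are made up for illustration):

```
a1b2c3d fix
f4e5d6c more fixes
9c8b7a6 wip
2d3e4f5 update
7a8b9c0 changes
```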
See the issue? A bunch of commits with a brief explanation at best. We identified that consulting git blame (either from a terminal or inside the IDE) often did not give any insight as to why a specific piece of code was written the way it was and could not easily be mapped to any business requirements.
Therefore, a simple but effective step to improve this was to require the usage of feature branches and block direct commits onto the development branch. The only way to get code into the development branch then is through GitLab’s merge request feature. By using this, we could easily set requirements that we felt would help the process. We introduced the following mandatory requirements for merging into development:
- No commits can be made directly to the default branch (development); all changes must go through a merge request.
- All merge requests must pass the unit and integration tests.
- All merge requests must be approved by another team member.
- All feature branches must be squashed and the resulting commit message must reference a JIRA ticket.
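Most of these requirements are GitLab settings (protected branches, required approvals, squashing), but the JIRA-reference rule can also be enforced as a pipeline job. A hypothetical sketch, assuming ticket keys look like `IM-123`:

```yaml
check-jira-reference:
  stage: test
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    # Fail the merge-request pipeline when the commit title
    # contains no JIRA ticket key (e.g. IM-123).
    - echo "$CI_COMMIT_TITLE" | grep -Eq '[A-Z]+-[0-9]+' || (echo "Commit must reference a JIRA ticket" && exit 1)
```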
We also introduced visual warnings for the following. The approver must explicitly dismiss them to proceed, a tough call but doable if the merge request is critical and time-sensitive:
- No changes to tests.
- TODOs left in changed files.
- An increased bug count, as reported by a third-party tool (SpotBugs).
These simple requirements might not be sufficient for every team, but for us, they have provided real value with very little effort. We might need to revise this process in the future, and we are happy to do so when needed. That said, too much overhead is not good either; it will just slow down progress and waste resources.
Extending the CI Pipeline
After having used the baseline CI pipeline along with the new development process we identified that large parts of our QA process could also be run through the pipeline. By doing this, we would free up a ton of time to focus on other tasks.
Because of our system’s nature, where we tag news articles, there is not always a single correct answer, which makes certain aspects hard to test. Therefore, in our QA process, we have developed a few tools to help us. For example, we have a tool labeled Tagging version difference report, which takes the deployed version and the merge request’s version, tags a set of news articles with both, and compares the differences. Obviously, neither GitLab nor any similar service natively supports displaying such results as a widget in the merge request. We then found a great tool, Danger, which runs as a job in the pipeline and generates a markdown comment on the merge request. Of course, you can also save generated artifacts in case the markdown cannot hold all the information in a readable way. Utilizing this, we can now integrate any QA tool we can think of into the pipeline, coming closer and closer to our mission of releasing well-tested code often.
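A hedged sketch of how such a job can be wired up (the script name, report path, and use of danger-js are assumptions, not our exact setup):

```yaml
qa-report:
  stage: test
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - ./tagging-diff-report.sh > tagging-diff.md   # hypothetical QA tool wrapper
    - npx danger ci                                # Dangerfile posts tagging-diff.md as an MR comment
  artifacts:
    paths:
      - tagging-diff.md
```

For GitLab, Danger also needs an API token (the DANGER_GITLAB_API_TOKEN CI variable) so it can post the comment on the merge request.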
Seeing this type of information already in the merge request is great because we can more confidently merge into the development branch without the fear of having to return to deal with any emergencies.
We are nowhere near done extending the CI pipeline; it must be seen as an iterative process in which one continuously adds valuable features.
I think most would agree that premature scaling is a big startup killer. However, at a certain stage, adding an adequate development process to grow with (NOT into) is only healthy:
It’s easy to dismiss an increasing bug count as “the project is getting more complex” or “we’re moving fast and breaking things” but the truth of the matter is that if you have the option of “moving fast and not breaking things”, then “moving fast and breaking things” is irresponsible.
For us, adding a CI/CD pipeline has resulted in a less error-prone codebase, quicker deployments (non-deployed code does not make any money), and frankly a better working environment where we don’t have to worry about critical errors as much.
I hope this article has provided some insights into technical steps your team can implement for when your startup takes off.