Common Practices or Best Practices?

With the events of the last few days surrounding the SolarWinds breach, I wanted to give my views on how a breach this massive could have happened and gone unnoticed for so long. I have spent my career in private business, although I did once have a short part-time contract, six months, in a government setting. I have seen all sides of how private businesses run, and those experiences are what led me to these conclusions about how a breach this bad could have occurred.

It is best to start with some background on the development lifecycle. There are many common practices in the development lifecycle, and there are also some best practices. As odd as it may seem, the two are not necessarily the same. It is the leadership of the business that decides when, or if, best practices will be used. In general, best practices tend to come with a higher up-front cost but lower long-term costs, and that is not always an easy trade-off for leadership to make. It should be noted that the old way of development was to manage projects; the new way of development is to manage risk. This is what Barry Boehm described in his spiral model of software development, which led to today's agile development methodologies (http://www.cse.msu.edu/~cse435/Homework/HW3/boehm.pdf). The problem is that many businesses want to manage projects, like they always have, while appearing to use new methods. In these cases, businesses use the terms as buzzwords rather than as an actual thought process. When risk defines the development process, "what if" questions become tools of improvement, not stumbling stones on the timeline.

Some work should be done before implementation starts, such as system definition and requirements. However, in many of the businesses I have been around, little time, if any, is allotted to these tasks. This leaves the implementation process at a bad starting point. I have found it is not unusual to begin implementation with incomplete, or very few, requirements. It is also common to have developers who just want to finish fast so they can move on to the next thing. All of these conditions tend to lead to the same outcome: whatever is easiest wins. Rarely, if ever, do they lead to what is best for the system or the customer.

During implementation, automated unit tests should be developed. This is known as Test Driven Development, or TDD. I have been an advocate of TDD since the late 1990s and early 2000s. Back then it was called Test First Programming, TFP, and was a key part of eXtreme Programming, XP. It took a while to really see the reasoning behind TDD, but it came to light when I started seeing how little time it took to make changes to a system and still have confidence that the change would have no side effects. The controversy about whether TDD improves quality will probably always be around, since quality does not have a common definition. However, I have found that the improvement in the time to fix issues or add enhancements cannot be denied. This is why I was so excited when I first found the study out of China on testing process improvement. Like many other studies on TDD, it showed no real data of improvement from the practice itself. But there was a side note to the study, and that side note reinforced what I had been seeing for years: with unit tests in place, the time to fix issues dropped to a fraction of the time required without them (https://www.researchgate.net/publication/221592883_Test_Driven_Development_and_Software_Process_Improvement_in_China). Not only had I seen the timeline improvement, but there was now an empirical study showing the same outcome. Even in light of all of this, I still see development teams writing little or no unit tests. The larger issue is that this behavior can be inconsistent even within the same organization. For example, my team is very disciplined about unit tests, but none of the other teams in our organization are.
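
To make the test-first rhythm concrete, here is a minimal sketch in Python. The `parse_version` function and its behavior are hypothetical, invented purely for illustration; in TDD, the tests below are written first and fail, and only then is the implementation written to make them pass.

```python
import unittest

def parse_version(text):
    """Parse a 'major.minor.patch' string into a tuple of ints.
    Written *after* the tests below, just enough to make them pass."""
    major, minor, patch = text.strip().split(".")
    return (int(major), int(minor), int(patch))

class ParseVersionTests(unittest.TestCase):
    # In TDD these tests exist first and fail ("red") before
    # parse_version is implemented ("green").
    def test_parses_three_part_version(self):
        self.assertEqual(parse_version("2020.2.1"), (2020, 2, 1))

    def test_rejects_garbage(self):
        # A malformed string does not have three dot-separated parts,
        # so the unpacking in parse_version raises ValueError.
        with self.assertRaises(ValueError):
            parse_version("not-a-version")

if __name__ == "__main__":
    unittest.main()
```

The payoff comes later: when `parse_version` has to change, the same tests run in seconds and immediately show whether the change had side effects.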

As part of source code management, there should be a version control system. This is where changes to the code base can be tracked, and it also allows only certain users to check in code. The biggest risk here is code being checked in that has issues, which can occur when a developer is working with out-of-date code and then checks in their changes. The process I require of my team is to compare the code being checked in with the code already in version control. There have been many times we have spotted problems before the code was even checked in.
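
As a rough sketch of that pre-check-in comparison, the Python script below shells out to git. The git workflow, the `origin/main` upstream name, and the refuse-if-behind policy are all my assumptions for illustration; the post does not name a particular version control system.

```python
import subprocess
import sys

def git(*args):
    """Run a git command and return its stdout as text."""
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

def review_before_checkin(upstream="origin/main"):
    # Make sure we are comparing against the latest upstream code.
    git("fetch", "origin")

    # How far behind upstream is the working copy? Checking in from an
    # out-of-date copy is exactly how bad merges slip in.
    behind = git("rev-list", "--count", f"HEAD..{upstream}").strip()
    if behind != "0":
        sys.exit(f"Working copy is {behind} commit(s) behind {upstream}; "
                 "update before checking in.")

    # Show the full diff against upstream so a human can review it
    # before the code ever lands in version control.
    print(git("diff", upstream))

if __name__ == "__main__":
    review_before_checkin()
```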

At this point the system needs to be built. Following current DevOps practices, there should be a continuous integration, CI, system in place. This is where the system is built and the unit tests are run whenever code is checked in. In The Mythical Man-Month, Frederick Brooks wrote an essay about planning to build it twice. Years ago that made sense, since development would gather all the requirements up front before implementation even began, and the time delay between gathering requirements and completing implementation would lead to systems that did not meet the user's requirements. Brooks's thought was to build the system once, learn what was wrong when the implementation was complete, and then build it a second time to get it right. Needless to say, the cost of development makes this an unrealistic approach. Yet the agile thought process takes Brooks's idea and puts it on steroids: in the agile world view, the plan is to build it every day. This is the power of CI. It allows the whole development team to see the consequences of a change immediately. Again, at the heart of this success is unit testing. Without unit tests, CI is not a benefit; it only allows a business to look like it is following a best practice.
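
As a minimal sketch of what a CI step does on every check-in, here is a toy pipeline in Python. It assumes a git checkout, a pip-installable project, and tests run by pytest under `tests/`; all three are my assumptions, not details from the post.

```python
import subprocess
import sys

def run(step_name, command):
    """Run one pipeline step; fail the whole build if the step fails."""
    print(f"=== {step_name} ===")
    result = subprocess.run(command)
    if result.returncode != 0:
        sys.exit(f"CI failed at step: {step_name}")

if __name__ == "__main__":
    # 1. Pull the latest code from version control (assumes git).
    run("checkout", ["git", "pull", "--ff-only"])
    # 2. Build the system; a compiled project would invoke its compiler here.
    run("build", [sys.executable, "-m", "pip", "install", "-e", "."])
    # 3. Run the unit tests; without these, a green build proves very little.
    run("unit tests", [sys.executable, "-m", "pytest", "tests/"])
    print("Build passed; the whole team sees this change is safe.")
```

The point of the sketch is the sequence, not the tooling: every check-in triggers a fresh build and a full test run, so problems surface the same day they are introduced.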

Once the CI build is complete, the next step is a full QA verification. When that verification passes, the final step is a deployment build and a push to production. If a DevOps process is being used, this push will most likely go to a Blue/Green environment to allow a final look before flipping the switch to production. Many of the businesses I have been around do not do Blue/Green deployments; they just push straight to production, most of the time through a manual process.
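
A Blue/Green flip can be as simple as repointing traffic once the idle environment passes a final health check. The sketch below is a toy illustration in Python; the environment URLs, the `/health` endpoint, and the state file standing in for a load balancer are all hypothetical.

```python
import json
import urllib.request

# Hypothetical environments; in practice these would be load-balancer
# pools, DNS records, or router configs rather than plain URLs.
ENVIRONMENTS = {
    "blue": "http://blue.internal.example.com",
    "green": "http://green.internal.example.com",
}
STATE_FILE = "active_environment.json"

def healthy(base_url):
    """Return True if the environment answers its health endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def flip_the_switch():
    with open(STATE_FILE) as f:
        active = json.load(f)["active"]          # e.g. "blue"
    idle = "green" if active == "blue" else "blue"

    # The new release was deployed to the idle environment; this is the
    # "final look" before any customer traffic reaches it.
    if not healthy(ENVIRONMENTS[idle]):
        raise RuntimeError(f"{idle} failed its health check; not flipping.")

    with open(STATE_FILE, "w") as f:
        json.dump({"active": idle}, f)           # traffic now goes to idle
    print(f"Production traffic switched from {active} to {idle}.")
```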

The lifecycle overview I have given is just a reference point for what led to my conclusions. I do not believe the original bad actor(s) anticipated this level of success. In the Target breach, it was discovered that Target was not the original target; a heating and air conditioning firm was. Only after exploring that firm's network did the bad actors find the credentials for accessing Target (https://krebsonsecurity.com/2015/09/inside-target-corp-days-after-2013-breach/). I believe something similar happened in this case: a phishing campaign succeeded against a workstation, or workstations, and only after access was gained did the attackers discover what a high-value target they had. Given the sophistication of the attack, it is safe to assume it was a group and not a single bad actor.

What makes this attack a real concern is that it was a supply chain attack. A supply chain attack has a way of opening many other targets for the bad actors to access, due to the inherent trust relationship between a vendor and its customers. It also makes those other targets real victims, since these businesses truly rely on their vendor to look out for their best interests.

It has been stated that the compromised code was a SolarWinds DLL called SolarWinds.Orion.Core.BusinessLayer.dll (https://digitalguardian.com/blog/solarwinds-hacked-used-potentially-massive-supply-chain-attack). A public filing with the SEC shows that the malware-infected DLL was being delivered from March to June 2020 (https://d18rn0p25nwr6d.cloudfront.net/CIK-0001739942/6dd04fe2-7d10-4632-89f1-eb8f932f6e94.pdf). When I first heard about the DLL and the supply chain attack, I assumed it came from a successful attack that gained access to the development version control system and implanted malware into the code base. That attack vector would have good odds of success, since I have seen very few teams or organizations constantly verify that their code base contains expected code only. If a bad actor changed code in the version control system, the change would most likely never be seen by the development team. There are a few ways such changes might be noticed. One is updating third-party libraries. These updates sometimes cause what we call breaking changes, which force refactoring, or rewriting, of code; during that kind of update, developers can be forced to inspect code that is not commonly looked at. Most development teams do not keep up with updates to these third-party libraries, because the updates take time away from higher-priority items on the timeline. I have seen third-party libraries five years out of date or more. This lack of updating can also allow other security issues to exist in a system. These are the updates that many organizations miss in their security posture, because most security-based updates center on operating systems and applications, not on the organization's own developed applications.

In the same SEC filing, it is stated that the "vulnerability was not evident in the Orion Platform products' source code but appears to have been inserted during the Orion software build process". Going by the filing, having the vulnerability introduced during the build process is a much larger problem. By the name of the library (DLL), it seems to be an internal SolarWinds custom library. In DevOps operations, all builds pull everything from version control at build time, unless the item is a third-party library for which there is no source code. For this attack vector to be viable, the build system must not rebuild all of the code for every release of the system. That is not a good practice, since it relies on building only the items that changed. This approach to system builds has never aligned with most agile methodologies, which lean toward CI-style build processes. Another way this injection during the build process could succeed is if internal code wrapped a third-party library and the two were built together as a single library. The problem with that is that whenever the third-party library changes, the whole DLL has to be rebuilt. The whole goal of DLLs is to allow libraries to be updated without rebuilding system-specific code.

One way to reduce code injection would be simply to keep some sort of hash of every item in the system and have a nightly automated process verify all the items and report any verification failures. Of course, these hashes would have to be securely stored and updated with new hashes whenever expected changes are made. The cost of this solution is the development time and, of course, someone to review the nightly results. That is not that big of a problem over the long term, but it does cost more up-front to put in place. It seems that no verification system of this type was in place at SolarWinds.
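
A minimal sketch of such a verifier in Python, assuming SHA-256 hashes and a JSON manifest; both are my choices for illustration, and in a real deployment the manifest itself would need to be stored securely, as noted above. Recording runs whenever an expected change ships; the nightly job only verifies and reports.

```python
import hashlib
import json
import sys
from pathlib import Path

MANIFEST = Path("build_manifest.json")  # must itself be stored securely

def sha256_of(path):
    """Hash a file in chunks so large artifacts don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record(build_dir):
    """Run after each *expected* change: snapshot every item's hash."""
    hashes = {str(p): sha256_of(p)
              for p in Path(build_dir).rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(hashes, indent=2))

def verify():
    """The nightly job: recompute every hash and report mismatches."""
    expected = json.loads(MANIFEST.read_text())
    failures = [path for path, digest in expected.items()
                if not Path(path).is_file() or sha256_of(path) != digest]
    if failures:
        sys.exit("TAMPERING SUSPECTED in: " + ", ".join(failures))
    print(f"All {len(expected)} items verified.")

if __name__ == "__main__":
    # With no argument, verify; with a directory argument, record it.
    verify() if len(sys.argv) == 1 else record(sys.argv[1])
```

A script like this would have flagged a modified DLL the first night after it was swapped in, which is exactly the kind of low-cost, long-term control the up-front investment buys.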

The odd thing here is that the SEC filing states, "The vulnerability has only been identified in updates to the Orion Platform products delivered between March and June 2020". This information opens another possible process concern. Before the rise of agile methodologies, common practice was to have a release schedule, with regular releases or updates produced; some businesses would set a schedule of releasing an update every quarter. After a release was developed and published, it would sit on a hard drive and be sent out to customers as needed, and it is not uncommon to never look at that image again after it has been created. Given the timeline, it is possible that SolarWinds is still using this release pattern. If so, it would take very little effort for a bad actor to insert a modified library into the release image and just let the vendor distribute the malware on the bad actor's behalf. Software as a Service, SaaS, delivery moves from this scheduled release pattern to a more fluid pattern of constant delivery, and more and more systems seem to be delivered this way even when they are not SaaS-type systems. Notepad++ and Visual Studio are both good examples of desktop applications that use a constant delivery pattern.

Taking the SEC filing at face value, source code injection is a non-issue. However, the evidence indicates there may be some very common build and release problems. Where development methodologies stand at this point in time, it does not make sense not to be moving toward a DevOps process. Making that change takes time, but if the business leadership understands the advantages of DevOps and commits to it, then the time, money, and resources needed will be applied, and a successful DevOps process will come in time. A basic DevOps process requires the build system to pull the code from version control for every build, for each release (QA and final production). That behavior alone reduces the risk of a modified component being released, since the images are constantly being rebuilt. It is not clear whether a scheduled release pattern is being used, but if it is, it should be moved to a constant delivery pattern. Coupling a constant delivery pattern with builds that pull code from version control reduces the risk of modified code getting into the build as well.

To be clear, I have no idea what the internal processes are at SolarWinds; I am only looking through the lens of what I have seen before. Most issues of this type seem to trace back to the defined priorities. The actions seen from SolarWinds seem to point to practices that are common, but not best, practices. I cannot help but think that if some of those common practices had been replaced by best practices, SolarWinds could have avoided this whole issue and the impact it will have on the company's future.