TUELMaker | Daniel Holland

Approach

I tend to use a flexible-inverted-cone approach to any new project.

This serves three (3) purposes:

It ringfences the project at a high level so that we have a well-defined envelope
It provides retention and elastic points throughout the discovery and feedback phase
It enables a focused discussion that drills into the core value the solution will provide

Once we have defined the funnel, we move to the next phase: thought-prototyping. This does not merely consist of wiring up some code and pushing it to some resources. The prototyping phase consists of several pragmatic steps to evaluate the feasibility of a solution before any code is written or any physical resources are allocated. Many of these steps exist simply to provoke thought and to draw on not only the understanding of the key stakeholders, but also the knowledge of the team, the facilities currently available, and those that may become available in the future. Among my favorite methods are:

Expertise

Rather than focus on specific technologies, I look at what the need is from a contextual standpoint and what facilities are available to satisfy the need. Technology comes and goes, but a good architecture is readily adaptable and affords the customer a timely and consistent experience over time.

There are two primary routes to take when considering a solution:

Existing Solution
New Solution

However, both routes are bound by the environment in which they will be crafted, deployed, and maintained. An existing solution may require a more deft touch to integrate with the current environment, while a new solution may allow for more varied implementation opportunities; either way, it must still adhere to the business and engineering standards and practices. Indeed, the solutions must transcend the tone of the people creating them and reflect the needs of the customer and the business first.

Application Insights
Azure
Azure DevOps
Azure Key Vault
Azure SQL
Azure Storage
Azure Service Bus
Azure App Services
Azure Event Hub
Azure Functions

C#
JavaScript
PowerShell
Power BI
Kusto / KQL
REST
Selenium
Visual Studio
Many other tools/languages

Management

There are several parts to managing a service, including development, delivery, operation, and refinement.

Within the context of each element there are numerous processes which need to be implemented and measured to make sure that the service is satisfying its mandate.

I like the CI/CD process here, but I would modify it a bit. For example, there are several areas that CI/CD doesn't take into account, at least not directly; these include:

Feature Absorption Rate - releasing changes which cannot be readily absorbed by users can lead to increased friction and frustration on the part of the customer. Steps should be taken to alleviate this risk by providing canary environments, opt-in beta features, training content, and feedback mechanisms.
Policy Changes - this can be anything from N-1 version requirements to security policy adjustments to data retention requirements. Often these policy changes are known well beforehand, but there should always be a facility for testing whether the policies have changed to mitigate the risk of rolling out a change that conflicts with a new policy.

Personnel Changes - any time you have a project with more than 0 people on it, there must be a process to check the risk of a personnel change. The CI/CD process relies on people to perform certain actions, and absent those people or their knowledge the process can fail. For example, a person might become sick or have a family emergency; the list goes on. There must be a process to evaluate the risk to future CI/CD processes and determine how they should be handled. This isn't the "Plan" part of the process, but rather is evaluated within the flow at each critical stage of CI/CD. In many instances risk is mitigated through automated processes, but these can and do fail, so there must be a process to evaluate the risk of failure and how to handle it.
Disaster Recovery - last but not least, there must be some test performed to see whether the service can be recovered if a disaster happens. Regular CI/CD processes generally don't take disaster recovery into account because it is often relegated to a quarterly event, or weekly at most. However, a disaster can occur, and artifacts or implementations within the CI/CD process may break the disaster recovery process. At the very least there should be a discussion and some plan in place to safeguard the customer's data, which must not be lost.
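
The four checks above could be expressed as a gate that runs at each critical stage of the pipeline. The following is only a minimal sketch of that idea: the `ReleaseCandidate` fields, the `evaluate_release_gate` function, and the 90-day restore threshold are all hypothetical names and values chosen for illustration, not part of any CI/CD product.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseCandidate:
    """Hypothetical description of a pending release (all fields are illustrative)."""
    name: str
    policy_version_tested: str    # policy revision the build was validated against
    behind_feature_flag: bool     # True if users can opt in or the change is canaried
    backup_operators: int         # people besides the primary who can run the release
    days_since_restore_test: int  # age of the last successful recovery drill


@dataclass
class GateResult:
    passed: bool
    reasons: list = field(default_factory=list)


def evaluate_release_gate(rc, current_policy_version, max_restore_age_days=90):
    """Collect every reason the release should be held; an empty list means it passes."""
    reasons = []
    if not rc.behind_feature_flag:
        reasons.append("feature absorption: change is not opt-in or canaried")
    if rc.policy_version_tested != current_policy_version:
        reasons.append("policy: build was validated against a stale policy version")
    if rc.backup_operators < 1:
        reasons.append("personnel: no backup operator for this release")
    if rc.days_since_restore_test > max_restore_age_days:
        reasons.append("disaster recovery: restore drill is out of date")
    return GateResult(passed=not reasons, reasons=reasons)


if __name__ == "__main__":
    candidate = ReleaseCandidate("billing-api", "2023-12", False, 0, 120)
    result = evaluate_release_gate(candidate, current_policy_version="2024-01")
    for reason in result.reasons:
        print("HOLD:", reason)
```

Returning every failing reason, rather than stopping at the first, gives the team one complete picture of the risk before deciding how to handle it.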

Incident severity can vary; however, the goal should be to address all incidents with the same level of care. That level of care is highly correlated with the amount of automation you have in place and with how robustly that automation has been tested against FMEAs (failure mode and effects analyses). Having a process that is well understood and curated is important to provide the customer the best possible experience when something does occur, because it will. At a certain point, solutions must depend on things that are outside of the service's control. For example, all service owners must understand the importance of their service relative to the other services: just because your service is down doesn't mean that it is the most important one overall. Many times it is not, and that needs to be understood and communicated to the stakeholders. A mitigation may simply be to put your service in a holding pattern while the other services are brought back online, and understanding that provides a navigation point for resilience.
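
To make the FMEA reference concrete: a conventional FMEA rates each failure mode for severity, occurrence, and detection (commonly on a 1 to 10 scale) and multiplies the three into a risk priority number (RPN). The sketch below ranks failure modes that way so the riskiest ones get automated handling and testing first. The class and function names and the sample failure modes are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA worksheet; ratings use the common 1-10 scale."""
    name: str
    severity: int    # how bad the effect is for the customer
    occurrence: int  # how likely the failure is to happen
    detection: int   # how hard it is to detect (10 = nearly undetectable)

    def __post_init__(self):
        for label in ("severity", "occurrence", "detection"):
            value = getattr(self, label)
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")

    @property
    def rpn(self):
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


if __name__ == "__main__":
    # Illustrative failure modes only; real ratings come from the team's analysis.
    modes = [
        FailureMode("dependent service unavailable", 7, 5, 3),
        FailureMode("expired certificate", 8, 2, 6),
        FailureMode("queue backlog", 4, 6, 2),
    ]
    for mode in rank_failure_modes(modes):
        print(mode.rpn, mode.name)
```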

Below are some observations from the trenches that a service operator needs to be aware of and incorporate into their offering:

BRS (Big Red Switches) - these enable someone to make service-level changes in a controlled manner and may provide facilities for: disabling external access, disabling internal access, disabling a portion of the service, routing requests to a queue, overriding volume thresholds to increase throughput (use cautiously). There are many more examples, but the point is that you should not rely on a deployment or uncontrolled steps to mitigate some types of incidents.
Forensic Data - when an incident occurs there should be parallel operations to mitigate the issue and to gather forensic information. This will be a hierarchical process and will involve data recovery/freezing, forensic archiving, and ultimately mitigating the issue. Often the goal is to mitigate the issue as fast as possible, but this can reduce the amount of forensic data that you have available and may impact your team's ability to identify the root cause and subsequent fix.

Inter-Service Communication - not knowing when a service you depend on is deploying, or the context of its changes, can cause problems for your service. In a micro-service architecture there is an abundance of calls being made as well as changes across the service landscape. These changes carry a risk, and as such each service should publish a relative risk value for each release along with the time of the release. This risk value can then be evaluated by the consuming teams to determine whether they need to take further action to mitigate the risk.
White-Glove Treatment - during an incident you may have a customer that is highly sensitive to anything which impacts their processes, and the operations of the service should take this into consideration when building for and handling incidents. Customers have different expectation levels regarding communication, mitigation, and resolution. Rather than build for the issue, build for the customer and the issue. A customer who receives a remarkable level of service, and who knows they have options for the level of care provided, will be much more likely to continue the relationship.
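
As an illustration of the Big Red Switch idea from the list above, the sketch below keeps a small set of named switches that an operator can flip at runtime, without a deployment, and records who flipped each one and why. It is a minimal in-memory sketch under assumed names (`BigRedSwitches`, `handle_request`, and the switch names are all hypothetical); a real service would keep this state in a shared, access-controlled store.

```python
import threading


class BigRedSwitches:
    """In-memory registry of operator-controlled switches (hypothetical design)."""

    def __init__(self, names):
        self._lock = threading.Lock()
        self._state = {name: False for name in names}
        self._audit = []  # (switch, engaged, operator, reason) tuples

    def flip(self, name, engaged, operator, reason):
        """Engage or release a switch and record who did it and why."""
        with self._lock:
            if name not in self._state:
                raise KeyError(f"unknown switch: {name}")
            self._state[name] = engaged
            self._audit.append((name, engaged, operator, reason))

    def is_engaged(self, name):
        with self._lock:
            return self._state[name]

    def audit_log(self):
        with self._lock:
            return list(self._audit)


def handle_request(switches, request, queue):
    """Consult the switches before doing any work for a request."""
    if switches.is_engaged("disable_external_access") and request.get("external"):
        return "rejected"
    if switches.is_engaged("route_to_queue"):
        queue.append(request)  # hold the work until the incident is mitigated
        return "queued"
    return "processed"


if __name__ == "__main__":
    switches = BigRedSwitches(["disable_external_access", "route_to_queue"])
    held = []
    switches.flip("route_to_queue", True, operator="on-call", reason="dependency outage")
    print(handle_request(switches, {"external": False, "id": 1}, held))
    print(switches.audit_log())
```

Rejecting unknown switch names keeps the changes controlled, and the audit entries double as forensic data once the incident is reviewed.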

What would you rather be working on today?

Let me help you reduce friction and toil, and increase precision
