
The Rule of Nines: Accurate Assessment of Burn Percentage in Adults

How does the Rule of Nines help determine the percentage of a burn injury in adults? What are the key factors to consider when documenting burn injuries for successful treatment?


Understanding the Rule of Nines for Burn Assessment

The “Rule of Nines” is a widely used burn assessment tool that helps determine the percentage of the body surface area affected by a burn injury. The chart divides the body into distinct regions, each representing 9% (or a multiple of 9%) of the total body surface area (TBSA). By quickly assessing which regions of the body are affected, medical professionals can estimate the overall severity of the burn and guide appropriate treatment.

Factors Influencing Burn Severity

When evaluating a burn injury, several key factors must be considered beyond just the TBSA:

  • Burn Location: Burns to the hands, feet, face, or genitalia are considered critical areas that require immediate medical attention, even for small burns.
  • Age of the Injured: Infants, young children, and the elderly are at higher risk for complications from even minor burns.
  • Health Status: Underlying physical or mental conditions, such as diabetes, can make burn injuries more complex to treat.
  • Burn Source: Smoke inhalation, toxic fumes, electricity, or chemical exposure can further complicate a burn and require urgent care.

Thermal Burns: Flame, Flash, and Scalds

Thermal burns are the most common type, caused by exposure to open flames, flash fires, or hot liquids (scalds). The severity of a thermal burn depends on the temperature and duration of exposure. For example, burns from hot oils can reach much higher temperatures than those from hot water, leading to more severe tissue damage.

Electrical Burns: Current and Thermal Damage

Electrical burns are caused by the passage of electric current through the body. The severity depends on the intensity of the current and the duration of exposure. Electrical burns can cause both thermal damage from the heat generated, as well as deeper tissue injury from the current itself.

Chemical Burns: Acid and Alkali Exposure

Chemical burns result from exposure to strong acids or alkaline substances. The type of damage and required treatment varies based on the specific chemical involved and the duration of exposure. Emergency treatment for chemical burns typically involves thoroughly flushing the affected area with large amounts of water.

Accurate Documentation for Effective Burn Treatment

Proper documentation of burn injuries is crucial for successful treatment and management. Key elements to document include:

  • Type of burn (thermal, electrical, chemical)
  • Extent of the burn, expressed as a percentage of TBSA
  • Depth of the burn (superficial, partial-thickness, or full-thickness)
  • Location of the burn on the body
  • Any interventions performed, such as dressings or debridement

By ensuring comprehensive and accurate documentation, healthcare providers can better coordinate care, monitor progress, and make informed decisions to optimize outcomes for patients with burn injuries.

The Importance of Electronic Documentation Systems

Traditional paper-based documentation methods are no longer sufficient for modern burn care. Electronic documentation systems offer several advantages, including:

  • Improved accuracy and completeness of patient records
  • Increased efficiency in data entry and retrieval
  • Better coordination of care across healthcare providers
  • Enhanced ability to analyze patient data and identify trends

Many healthcare facilities are transitioning to electronic documentation systems to ensure the highest standards of burn injury management and patient outcomes.

Outsourcing Medical Transcription for Burn Care

To meet the growing demands of accurate and timely documentation, some emergency departments are turning to outsourced medical transcription services. These specialized providers can:

  • Capture detailed treatment recordings in a consistent, standardized format
  • Ensure prompt turnaround of completed records for immediate use
  • Free up clinical staff to focus on patient care rather than administrative tasks
  • Maintain strict confidentiality and compliance with healthcare regulations

By leveraging the expertise of medical transcription professionals, healthcare facilities can enhance the quality and efficiency of burn injury documentation, ultimately leading to better patient outcomes.

Severity of Burns


  • Size of burn: If the burn area is larger than a silver dollar, see a doctor. In young children, even a smaller burn can be serious.
  • Location of burn: Hands, feet, face or genitalia are critical areas. A doctor should treat even small burns in these areas.
  • Age of injured: For infants, young children and the elderly, even small burns can be fatal.
  • Health of injured: Physical and mental impairments and conditions such as diabetes can complicate the injury.
  • Source of burn: Smoke inhalation, toxic fumes, electricity or chemicals all complicate a burn injury and require immediate medical attention.

The percentage of the body burned is determined by using a burn chart,
such as the “rule of nines” burn chart.


This chart divides the body surface into areas, each of which represents 9 percent (or a multiple of 9 percent) of the total. The “rule of nines” is generally used for quick assessment of the total body surface area (TBSA) that has been injured. In infants and small children, the head and neck account for a proportionally larger share of the surface area and the lower extremities for a smaller share than in an adult, so the Lund-Browder chart determines the extent of a burn in these patients more accurately.

Thermal: Flame, Flash, Scalds

Scalds from viscous hot oils are more severe than scalds from hot water. Because oils remain liquid at much higher temperatures than water, they can reach the 200 °C range and deliver far more thermal energy to the skin. Steam is likewise far more damaging than hot air at the same temperature: it carries roughly 4,000 times the burning energy of air because of the latent heat of vaporization released as the water condenses from gas back to liquid.

Electrical:

The severity of an electrical burn is a function of the intensity of the current and the duration of exposure.

Chemical:

Acid burn:

Damage to the body caused by a strong acid. The type of burn depends on the kind of acid, the length of exposure, and the amount of tissue exposed. Emergency treatment includes washing the area with large amounts of water.

Alkali burn:

Tissue damage caused by an alkaline compound, such as lye. Treatment includes flushing with water to wash off the chemical. A mildly acidic substance, such as vinegar diluted with water, may then be applied to neutralize any remaining alkali and lessen the pain. The patient should be taken to a hospital or other treatment center right away if the damage is severe.

 




Documenting Burn Injuries – What to Know



Adequate documentation is a major concern for the successful treatment of burns, and keeping burn documentation up to date places growing demands on practices. Emergency departments can consider outsourcing medical transcription to obtain accurate reports of treatment recordings.

Studies have highlighted that although traditional paper-based documentation is still used in many practices, it no longer meets modern requirements; proper documentation is better ensured by electronic documentation systems.

Burn care mainly focuses on the six “Cs”: clothing, cooling, cleaning, chemoprophylaxis, covering and comforting. Factors that guide the evaluation and management of burns, and that lead to accurate documentation, include:

  • The type of burn (thermal, chemical, electrical or radiation)
  • The extent of the burn, usually expressed as a percentage of TBSA (%TBSA), and
  • The depth of the burn

Along with documenting an appropriate E/M encounter, other factors to document for patients with burns include the location of the burn, its size relative to total body surface area (TBSA), whether dressings or debridement were performed and by whom, specific patient characteristics such as age, any other medical or health problems, any associated injuries, and more.

Documented anatomical locations should include laterality (left or right) and the specific part of the body involved. The depth of the burn has to be documented as first degree, partial thickness, or full thickness.
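As a rough illustration only, the sketch below shows how these elements could be captured as a structured record; the field names and example values are hypothetical and do not represent any standard coding or EHR schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical structure for a burn encounter record; fields mirror the
# elements listed above and are illustrative, not a prescribed standard.
@dataclass
class BurnRecord:
    burn_type: str          # "thermal", "electrical", "chemical" or "radiation"
    tbsa_percent: float     # extent of the burn as % of total body surface area
    depth: str              # "first degree", "partial thickness" or "full thickness"
    location: str           # anatomical site, including laterality
    interventions: List[str] = field(default_factory=list)  # dressings, debridement, etc.
    performed_by: str = ""  # who performed the interventions
    patient_age: int = 0
    comorbidities: List[str] = field(default_factory=list)
    associated_injuries: List[str] = field(default_factory=list)

record = BurnRecord(
    burn_type="thermal",
    tbsa_percent=6.0,
    depth="partial thickness",
    location="left forearm",
    interventions=["dressing"],
    performed_by="ED physician",
    patient_age=34,
)
print(record)
```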

Experts have clearly defined the required medical details for burn documentation.

Advanced burn documentation should include:

  • Medical history and general status of the patient, documented in full detail
  • Recent and frequent photographic documentation to evaluate changes in the wound
  • Complete wound assessment, with all relevant findings
  • Course of healing
  • Documentation of therapeutic measures and their effects
  • Results of follow-ups
  • Traceability and verification of authors

Proper TBSA Estimation

Accurate burn size assessment is crucial: determining burn severity relies on the burned surface area, the depth of the burn and the body area involved. Several methods are currently available to estimate the percentage of total body surface area burned. The total body surface area (TBSA) affected by a burn can be estimated using the Lund-Browder chart, the rule of palms or the rule of nines.

Lund Browder Chart

The Lund-Browder chart corrects for differences in body proportions, which makes it the most accurate and widely used chart for calculating the total body surface area affected by a burn injury. It shows the boundaries of specific body regions to help evaluate the burned surface area; an adapted planimeter can be used for the calculation.

Rule of Nines

This is another clinically efficient method for estimating the total body surface area of a burn: it gives a quick idea of how much of the body's surface the burn takes up, so that treatment can be chosen based on the size and severity of the injury. It is the estimation method emergency department physicians use most.

Rule of Palms

It is another popular way to estimate the size of a burn. Physiopedia explains that the palm of the person who is burned represents about 1% of their body surface area. In this method, the patient's palm is used to measure the body surface area burned: when a quick estimate is required, the percentage of body surface area is the number of the patient's own palms it would take to cover the injury. It is important to use the patient's palm and not the provider's palm.
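As a rough illustration of these bedside estimates, here is a minimal sketch that applies the standard adult rule-of-nines percentages and the palm-as-1% approximation; the region names are invented for the example, and the output is only an estimate, not a clinical measurement.

```python
# Standard adult rule-of-nines values (each region is 9% or a multiple of 9%).
RULE_OF_NINES = {
    "head_and_neck": 9.0,
    "arm_left": 9.0,
    "arm_right": 9.0,
    "leg_left": 18.0,
    "leg_right": 18.0,
    "trunk_front": 18.0,
    "trunk_back": 18.0,
    "perineum": 1.0,
}

def tbsa_rule_of_nines(burned_regions):
    """Estimate %TBSA by summing the regions reported as burned."""
    return sum(RULE_OF_NINES[region] for region in burned_regions)

def tbsa_rule_of_palms(palm_count):
    """Estimate %TBSA as the number of the *patient's own* palms
    needed to cover the injury (each palm is roughly 1% TBSA)."""
    return palm_count * 1.0

print(tbsa_rule_of_nines(["arm_left", "trunk_front"]))  # 27.0 %TBSA
print(tbsa_rule_of_palms(4))                            # about 4 %TBSA
```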

Choosing a Perfect Documentation System

Even though critical care medical transcription services can do a great deal to document burns properly in the EHR, it is crucial for providers to use a comprehensive documentation system to ensure quality patient care.

Advanced documentation systems will cover many data sets, including the etiology of the burn, burn depth and size over time, surgical steps, first aid measures, preceding clinical treatment, the patient's condition on admission, previous illnesses, any additional injuries, healing progress and outcome, any complications, image documentation, and more.

Electronic documentation systems have proven to offer many qualitative and quantitative advantages, such as:

  • enhanced documentation quality
  • reduced documentation errors
  • faster availability and access to the collected data
  • direct exchange of information
  • creation of new medical knowledge

BurnCase 3D is a standardized documentation system that provides a library of 3D models to support and enhance the documentation and diagnosis of human burn injuries. 3D models created can be adapted to sex, age, height, and weight. This system enables full documentation of the entire treatment process from initial assessment to the outcome. Users can transfer burn wounds from photos to the 3D model. The 3D models created can be moved, rotated, and scaled for better evaluation.

Determination of burn area: rule of nines and palm

Table of contents

  1. Degrees
  2. Symptoms
  3. Determination of area
  4. Rule of hundreds
  5. Rule of nines
  6. Palm rule
  7. Postnikov method
  8. Dolinin method
  9. Conclusion


A burn is an injury to the soft tissues of the human body resulting from harmful thermal, electrical or chemical exposure. To provide first aid correctly and choose the subsequent treatment, it is necessary to determine the severity of the injury and the area it affects. There are many techniques that allow the area of a burn to be calculated accurately.

The surface area of the human body is approximately 21,000 square centimeters. Scientists have devised many schemes and formulas that help calculate the burn area in children and adults. If the size of the injured area is calculated correctly, the severity of the injury can be determined.

Degrees

There are several degrees of severity of this damage:

  • first degree burn – slight swelling and redness form on the skin;
  • second degree – accompanied by the formation of small blisters filled with a fluid that protects the wound from infection. With a burn of this type, the skin begins to exfoliate and pain is present;
  • third degree, type A – characterized by fairly deep damage to the skin, the formation of a brown crust, and pain;
  • third degree, type B – complete death of the skin occurs with a burn of this type;
  • fourth degree burn – the most serious damage, affecting blood vessels, muscles, joints and sometimes even bones. Pain is not observed because the skin is completely charred.

First degree, second degree and third degree type A burns are called superficial, while third degree type B and fourth degree burns are called deep. Superficial injuries are always associated with pain, whereas deep ones are not; the absence of pain is explained by the complete necrosis of the affected skin.

Symptoms

Signs of a burn depend on the type of burn surface and the nature of the injury, but there are a number of main symptoms that most often appear with such an injury:

  • a change in skin color, from reddish to black, depending on the nature and severity of the damage;
  • the appearance of blisters filled with fluid;
  • the formation of a dry crust in the injured area;
  • severe pain;
  • death of the skin;
  • charring of the skin.

Determining the area

Treatment is prescribed only after the nature of the injury has been determined accurately; to establish the depth of the injury and its severity, the area of the burn must be calculated.

Rule of hundreds

The simplest way to assess the injured surface in adults is the “rule of hundreds”: if the sum of the victim's age and the total area of the injury comes close to one hundred, the lesion is considered unfavorable and requires specialized treatment.

The Rule of Nines

In 1951, A. Wallace introduced a computational method that became known as the rule of nines for burns. This way of calculating the wounded surface is fast and easy; the figures it gives are not exact, but they are a reasonable approximation.

The method divides the human body into separate zones, each corresponding to nine percent of the body surface area or a multiple of nine: the head and neck account for 9%, each arm for 9%, each leg for 18%, the front and the back of the torso for 18% each (36% together), and 1% is allocated to the genital area.

This method is not suitable for determining burns in children, because the proportions of their bodies are slightly different.

Rule of the palm

In 1953, I. Glumov proposed an even simpler method for calculating the injured surface. According to the rule of the palm, the victim's palm is taken to represent approximately one percent of the entire surface of the body, and the burn area is measured in palms. This method is used as often as the rule of nines.

Postnikov’s method

Postnikov’s method is a rather old and laborious way of determining the burn area. A gauze sheet is applied to the wounded surface and the contour of the injury is traced onto it; the resulting shape is then superimposed on graph paper and the damaged area is calculated. Because of the effort this calculation requires, the method is practically no longer used.

Dolinin method

The Dolinin method was introduced in 1983. A special rubber stamp bearing the silhouettes of the front and back of the human body is divided into 100 sections: 51 on the front and 49 on the back, each representing 1% of the body surface. The affected areas are shaded on the diagram, and the shaded sections are then added up.

The Lund and Browder chart is used to calculate burn areas in young children. In a child under one year old, the head and neck account for 21% of the surface area, the front and the back of the torso for 16%, the femoral region for 5%, the lower leg and foot for 9%, and the perineum for 1%.

Conclusion

The complexity and effectiveness of treatment depend on the location of the injury on the body and the area of the burn. For example, if the face, hands or genital area is affected, the ability to work is often impaired, the skin may not recover, and complete disability or, in some cases, death is possible. A lethal outcome occurs mainly when the injured area reaches 40% of the body surface or more.


The main task of commercial (and non-commercial) services is to be available to users at all times. Failures happen to everyone; the question is what the IT team does to minimize them. We have translated the article “Calculating Service Reliability” by Ben Treynor, Mike Dahlin, Vivek Rau and Betsy Beyer, which explains, using Google as an example, why 100% is the wrong benchmark for a reliability indicator, what the “rule of additional nines” is, and how to mathematically predict the acceptable frequency and duration of major and minor outages of a service and its critical components: the expected amount of downtime, failure detection time, and service recovery time.

Service Reliability Calculation

Your system is only as reliable as its components

Ben Treynor, Mike Dahlin, Vivek Rau, Betsy Beyer

As described in Site Reliability Engineering: How Google Runs Production Systems (hereinafter the SRE book), Google's product and service development can sustain a high release rate of new features while maintaining aggressive SLOs (service-level objectives) for high reliability and responsiveness. SLOs require the service to be almost always up and almost always fast, and they also specify exactly what “almost always” means for a particular service. SLOs are based on the following observations:

In general, for any software service or system, 100% is the wrong benchmark for reliability, because no user can tell the difference between 100% and 99.999% availability. Between the user and the service there are many other systems (their laptop, home Wi-Fi, their ISP, the power grid, and so on), and taken together those systems are available far less than 99.999% of the time. Therefore, the difference between 99.999% and 100% is lost in the noise of other systems' unavailability, and the user gains nothing from the enormous effort spent on achieving that last fraction of a percent of availability. Serious exceptions to this rule are anti-lock braking systems and pacemakers!

For a detailed discussion of how SLOs relate to SLIs (service-level indicators) and SLAs (service-level agreements), see the “Service Level Objectives” chapter of the SRE book. That chapter also details how to select the metrics that matter for a particular service or system, which in turn determines the choice of an appropriate SLO for that service or system.

This article expands on the SLO topic to focus on the building blocks of services. In particular, we will look at how the reliability of critical components affects the reliability of a service, as well as how to design systems to mitigate the impact or reduce the number of critical components.

Most of the services offered by Google aim to provide 99.99 percent (sometimes called “four nines”) availability to users. For some services, a lower number is specified in the user agreement, but the 99.99% target is maintained internally. This higher bar is an advantage in situations where users express dissatisfaction with the performance of the service long before the agreement is breached, since the #1 goal of the SRE team is to keep users satisfied with the services. For many services, an internal goal of 99.99% is the sweet spot that balances cost, complexity, and reliability. For some others, notably global cloud services, the internal goal is 99.999%.

99.99% Reliability: Observations and Conclusions

Let’s look at a few key observations and conclusions about designing and operating a service with 99.99% reliability, and then move on to practice.

Observation #1: Causes of failures

Failures occur for two main reasons: problems with the service itself and problems with critical components of the service. A critical component is a component that, in the event of a failure, causes a corresponding failure in the operation of the entire service.

Observation #2: The Math of Reliability

Reliability depends on the frequency and duration of downtime. It is measured in terms of:

  • Downtime frequency, or its inverse: MTTF (mean time to failure).
  • Downtime duration: MTTR (mean time to repair). Downtime is measured from the user's perspective: from the onset of a malfunction to the resumption of normal operation of the service.
    Reliability is therefore mathematically defined as MTTF/(MTTF+MTTR), using the appropriate units; a small sketch of this calculation follows.
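
A minimal sketch of this definition, together with the downtime a given availability target allows over a year (the MTTF and MTTR figures below are made-up examples):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Reliability as defined above: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def allowed_downtime_minutes_per_year(slo: float) -> float:
    """Downtime budget implied by an availability target."""
    return (1 - slo) * 365 * 24 * 60

# Example: a failure roughly every 30 days (720 h) that takes 1 h to repair.
print(f"{availability(720, 1):.5f}")                       # ~0.99861, i.e. below three nines
print(f"{allowed_downtime_minutes_per_year(0.9999):.1f}")  # ~52.6 minutes per year for 99.99%
```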

Conclusion #1: The Rule of Additional Nines

A service cannot be more reliable than all of its critical components put together. If your service aims for 99.99% availability, then all of its critical components must be available significantly more than 99.99% of the time.
Inside Google, we use the following rule of thumb: critical components should provide one additional nine relative to your service's stated reliability – in the example above, 99.999 percent availability – because any service has several critical components as well as problems of its own. This is called the “rule of additional nines”.
If you have a critical component that does not deliver enough nines (a relatively common problem!), you have to minimize its negative impact.

Conclusion #2: The Math of Frequency, Detection Time, and Recovery Time

A service cannot be more reliable than its incident frequency multiplied by its detection and recovery time allows. For example, three complete outages of 20 minutes per year add up to 60 minutes of downtime. Even if the service worked perfectly during the rest of the year, 99.99 percent reliability (which allows no more than about 53 minutes of downtime per year) would be impossible.
This is a simple mathematical observation, but it is often overlooked.

Conclusion from Findings #1 and #2

If the level of reliability your service depends on cannot be achieved, efforts should be made to correct the situation, either by increasing the availability of the dependency or by minimizing the negative impact as described above. Lowering expectations (that is, the advertised reliability) is also an option, and often the best one: make it clear to the dependent service that it must either rebuild its system to compensate for the reliability gap or reduce its own service-level goals. If you do not eliminate the discrepancy yourself, a sufficiently long outage will inevitably force adjustments.

Practical application

Let’s look at an example of a service with a target reliability of 99.99% and work out the requirements for both its components and its failure handling.

The numbers

Assume your 99.99% available service has the following characteristics:

  • One major outage and three minor outages per year. It sounds intimidating, but note that the 99.99% reliability target implies one 20-30 minute full outage and several short partial outages per year. (The math assumes that (a) the failure of a single segment is not counted as a failure of the entire system for SLO purposes, and (b) the overall reliability is calculated from the sum of the reliabilities of the segments.)
  • Five critical components, in the form of other independent services, each with 99.999% reliability.
  • Five independent segments that do not fail together.
  • All changes are rolled out incrementally, one segment at a time.

The math for reliability would be:

Component requirements

  • The total error limit for the year is 0.01 percent of 525,600 minutes per year, or about 53 minutes (based on a 365-day year).
  • The limit allocated to critical-component outages: five independent critical components, each with a 0.001% limit, gives 0.005%; 0.005% of 525,600 minutes per year is about 26 minutes.
  • Your service's remaining error limit is 53 − 26 = 27 minutes.

Outage response requirements

  • Expected number of outages: 4 (one full outage and three outages affecting only one segment)
  • Cumulative impact of the expected outages: (1 × 100%) + (3 × 20%) = 1.6 full-outage equivalents
  • Time available to detect and recover from each failure: 27 / 1.6 ≈ 17 minutes
  • Time allotted for monitoring to detect and report a failure: 2 minutes
  • Time given to the on-call engineer to start analyzing the alert: 5 minutes. (The monitoring system must watch for SLO violations and page the on-call engineer each time the system fails. Many Google services are supported by on-call shifts of SR engineers who respond to urgent issues.)
  • Remaining time to effectively mitigate the impact: 10 minutes (the arithmetic is reproduced in the sketch below)
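
A minimal sketch that reproduces the arithmetic above; the figures are exactly the ones stated in this example, not general constants:

```python
MINUTES_PER_YEAR = 365 * 24 * 60                          # 525,600

# Error limit for a 99.99% service, rounded to whole minutes as in the text.
total_budget = round(0.0001 * MINUTES_PER_YEAR)           # 53 minutes per year
component_budget = round(5 * 0.00001 * MINUTES_PER_YEAR)  # five components at 0.001% each -> 26 minutes
own_budget = total_budget - component_budget              # 27 minutes left for the service itself

# Expected outages: one full outage plus three outages each touching one of five segments (20%).
cumulative_impact = 1 * 1.0 + 3 * 0.2                     # 1.6 full-outage equivalents
per_outage_window = own_budget / cumulative_impact        # ~17 minutes to detect and recover

detection = 2     # monitoring notices the SLO violation and pages
triage = 5        # the on-call engineer starts analyzing the alert
mitigation = per_outage_window - detection - triage       # ~10 minutes to actually mitigate

print(total_budget, component_budget, own_budget)         # 53 26 27
print(round(per_outage_window), round(mitigation))        # 17 10
```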

Conclusion: levers to increase service reliability

It’s worth taking a close look at these numbers because they highlight a fundamental point: there are three main levers to increase service reliability.

  • Reduce the frequency of outages through release policies, testing, periodic project design reviews, and more.
  • Reduce the average scope of an outage – through sharding, geographic isolation, graceful degradation, or customer isolation.
  • Reduce recovery time – through monitoring, one-button mitigation actions (e.g. rollback or adding spare capacity), operational readiness practices, etc.
    You can trade off between these three levers to make fault tolerance easier to achieve. For example, if reaching the 17-minute MTTR is difficult, focus your efforts on reducing the average scope of an outage instead. Strategies for minimizing negative impacts and mitigating the impact of critical components are discussed in more detail later in this article.

Refinement of the “Additional 9s Rule” for nested components

The casual reader may infer that each additional link in the dependency chain requires additional 9s, so second-order dependencies require two additional 9s, third-order dependencies require three additional 9s etc.

This is not a valid conclusion. It is based on a naive tree component hierarchy model with constant branching at each level. In such a model, as shown in Fig. 1, there are 10 unique first-order components, 100 unique second-order components, 1,000 unique third-order components, and so on, resulting in a total of 1,111 unique services, even if the architecture is limited to four layers. An ecosystem of highly reliable services with so many independent critical components is clearly unrealistic.

Fig. 1 – Component Hierarchy: Invalid Model

A critical component by itself can cause an entire service (or segment of a service) to fail, no matter where it is in the dependency tree. Therefore, if a given component X appears as a dependency of multiple first-order components, X should only be counted once, as its failure will eventually cause the service to fail, no matter how many intermediate services are also affected.

The correct reading of the rule is as follows:

  • If a service has N unique critical components, then each one contributes 1/N of the unreliability of the entire service, no matter how deep it sits in the component hierarchy.
  • Each component must only be counted once, even if it appears multiple times in the component hierarchy (in other words, only unique components are counted). For example, when counting the components of Service A in Fig. 2, Service B should only be counted once.

Fig. 2 – Components in the hierarchy

For example, consider a hypothetical service A with an error limit of 0.01 percent. The service owners are willing to spend half of this limit on their own errors and losses, and half on critical components. If the service has N such components, each of them receives 1/N of the remaining error limit. Typical services often have 5 to 10 critical components, so each one may only account for one tenth or one twentieth of Service A's error limit. As a general rule, therefore, the critical parts of a service should have one additional nine of reliability.
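
A minimal sketch of this “count each unique component once” reading of the rule; the dependency graph and the half-and-half budget split below are hypothetical illustrations of the Service A example:

```python
# Hypothetical dependency graph: Service A and Service B share the same database and auth backend.
deps = {
    "service_a": ["service_b", "db", "auth"],
    "service_b": ["db", "auth"],
    "db": [],
    "auth": [],
}

def unique_critical_components(service: str) -> set:
    """Walk the dependency graph and return each critical component exactly once,
    no matter how many paths in the hierarchy lead to it."""
    seen = set()
    stack = list(deps.get(service, []))
    while stack:
        component = stack.pop()
        if component not in seen:
            seen.add(component)
            stack.extend(deps.get(component, []))
    return seen

error_limit = 0.0001                  # 0.01% total for Service A
component_share = error_limit / 2     # half reserved for critical components
components = unique_critical_components("service_a")
per_component = component_share / len(components)
print(sorted(components))             # ['auth', 'db', 'service_b'] -- db and auth counted once
print(f"per-component limit: {per_component:.6%}")
```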

Error limits

The concept of error limits is covered in some detail in the SRE book, but it should be mentioned here as well. Google’s SR engineers use error limits to balance reliability against the pace of updates. The limit defines the acceptable failure rate for the service over a period of time (usually a month). The error limit is simply 1 minus the SLO of the service, so the previously discussed 99.99 percent available service has a 0.01% “limit” for unreliability. As long as the service has not used up its error limit within the month, the development team is free (within reason) to launch new features, updates, and so on.

If the error limit is used up, changes to the service are suspended (except for emergency security fixes and changes that target what caused the breach in the first place) until the service replenishes the error limit or until the month changes. Many services in Google use a sliding window method for SLO so that the error limit is restored gradually. For serious services with an SLO of more than 99.99%, it is advisable to apply a quarterly rather than a monthly reset of the limit, since the number of acceptable downtimes is small.
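
As an illustration of the sliding-window idea, here is a minimal sketch; the 30-day window, the SLO, and the outage log are all invented example values:

```python
from datetime import datetime, timedelta

SLO = 0.9999                      # 99.99% availability target
WINDOW = timedelta(days=30)       # length of the sliding window

# Hypothetical outage log: (start time, downtime in minutes).
outages = [
    (datetime(2023, 3, 1, 4, 10), 2.0),
    (datetime(2023, 3, 18, 22, 5), 1.5),
]

def remaining_error_limit(now: datetime) -> float:
    """Minutes of downtime still allowed within the trailing window."""
    window_minutes = WINDOW.total_seconds() / 60
    budget = (1 - SLO) * window_minutes          # ~4.3 minutes over 30 days
    spent = sum(minutes for start, minutes in outages if now - start <= WINDOW)
    return budget - spent

print(f"{remaining_error_limit(datetime(2023, 3, 25)):.2f} minutes left")   # ~0.82
```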

Error limits eliminate interdepartmental tensions that might otherwise arise between SR engineers and product developers by providing them with a common, data-driven mechanism for evaluating the risk of a product launch. They also give both SR engineers and development teams a common goal of developing methods and technologies that will allow them to innovate faster and launch products without “blowing the budget”.

Critical component reduction and mitigation strategies

So far in this article we have established what might be called the “golden rule of component reliability”: any critical component must be roughly ten times as reliable as the overall system's target (that is, have one additional nine) so that its contribution to the system's unreliability stays within the error limit. It follows that, ideally, the goal is to make as many components as possible non-critical, because non-critical components can be held to a lower level of reliability, giving developers the freedom to innovate and take risks.

The simplest and most obvious strategy to reduce critical dependencies is to eliminate single points of failure (SPOF) whenever possible. The larger system must be able to work acceptably without any specified component that is not a critical dependency or SPOF.
In fact, you most likely cannot get rid of all critical dependencies; but you can follow some system design guidelines to optimize reliability. While this is not always possible, it is easier and more efficient to achieve high system reliability if you build reliability into the design and planning stages, rather than after the system is running and impacting actual users.

Project Design Evaluation

When planning a new system or service, or redesigning or improving an existing system or service, an architecture or design review can reveal common infrastructure and internal and external dependencies.

Shared infrastructure

If your service uses a shared infrastructure (for example, a core database service used by multiple products available to users), consider whether that infrastructure is being used correctly. Clearly identify the owners of the shared infrastructure as additional contributors to the project. Also, beware of overloading components by carefully coordinating the launch process with the owners of those components.

Internal and external dependencies

Sometimes a product or service depends on factors outside your company’s control, such as third party software libraries or services and data. Identification of these factors will minimize the unpredictable consequences of their use.

Plan and design systems carefully

When designing your system, pay attention to the following principles:

Redundancy and isolation

You can try to reduce the impact of a critical component by creating multiple independent instances of it. For example, if storing data in a single instance provides 99.9 percent availability of that data, then storing three copies in three widely dispersed instances would, in theory, provide an availability level of 1 − 0.001³, or nine nines, if instance failures are independent with zero correlation.

In the real world, the correlation is never zero (consider backbone failures that affect many cells at the same time), so the actual reliability will never get close to nine nines, but will far exceed three nines.

Similarly, sending an RPC (remote procedure call) to one pool of servers in the same cluster may give 99% availability of results, while sending three simultaneous RPCs to three different pools of servers and accepting the first response received helps achieve an availability level far higher than three nines (see above). This strategy can also reduce tail latency if the server pools are equidistant from the RPC sender. (Because the cost of sending three RPCs at the same time is high, Google often staggers these calls strategically: most of our systems wait a fraction of the allotted time before sending a second RPC, and a little longer before sending a third.)
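
A minimal sketch of this hedged-request pattern using Python asyncio; the pool names, simulated latencies and the 100 ms hedge delay are invented for illustration, and a real client would issue actual RPCs instead of sleeping:

```python
import asyncio
import random

async def call_pool(pool: str, payload: str) -> str:
    """Stand-in for an RPC to one server pool, with variable latency."""
    await asyncio.sleep(random.uniform(0.05, 0.5))
    return f"{pool}: result for {payload!r}"

async def hedged_call(pools: list, payload: str, hedge_delay: float = 0.1) -> str:
    """Send the same request to several pools, staggering each send by
    hedge_delay, and return the first response that arrives."""
    tasks = []
    try:
        for pool in pools:
            tasks.append(asyncio.create_task(call_pool(pool, payload)))
            # Wait briefly before hedging to the next pool; stop early
            # if any request already finished.
            done, _ = await asyncio.wait(tasks, timeout=hedge_delay,
                                         return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # All hedged requests are in flight; take whichever finishes first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()   # drop the requests we no longer need

print(asyncio.run(hedged_call(["pool-a", "pool-b", "pool-c"], "lookup:42")))
```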

Fallback and its use

Set up launches and software migrations so that systems keep working when individual parts fail (fail safe) and isolate themselves automatically when problems occur. The basic principle is that by the time a human can be brought in to switch over to the fallback, you will probably already have exceeded your error limit.

Asynchrony

To prevent components from becoming critical, design them to be asynchronous wherever possible. If a service is waiting for an RPC response from one of its non-critical parts that exhibits a dramatic slowdown in response time, that slowdown will unnecessarily degrade the performance of the parent service. Setting the RPC for a non-critical component to asynchronous will free the parent service’s response times from being tied to those of that component. And although asynchrony can complicate the code and infrastructure of the service, it is still worth the trade-off.
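
A minimal sketch of this decoupling; the coroutine names and latencies are hypothetical stand-ins for a critical and a non-critical RPC:

```python
import asyncio

async def critical_lookup(key: str) -> str:
    await asyncio.sleep(0.05)        # stand-in for the critical RPC
    return f"value-for-{key}"

async def record_access(key: str) -> None:
    await asyncio.sleep(2.0)         # stand-in for a slow, non-critical RPC

async def handle_request(key: str) -> str:
    # Schedule the non-critical call without awaiting it, so its latency
    # (or failure) no longer shapes the parent service's response time.
    background = asyncio.create_task(record_access(key))   # keep a reference
    return await critical_lookup(key)

print(asyncio.run(handle_request("user:42")))
```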

Resource planning

Make sure all components are provisioned with the resources they need. When in doubt, it is better to keep some spare capacity, but be mindful of the additional cost.

Configuration

Standardize component configuration where possible to minimize discrepancies between subsystems and to avoid one-off failure and error handling modes.

Troubleshooting

Make error detection, troubleshooting and diagnosis of problems as easy as possible. Effective monitoring is essential for the timely identification of problems. Diagnosing a system with deeply nested components is extremely difficult, so always have at the ready a way to mitigate errors that does not require detailed intervention by the on-call engineer.

Fast and reliable rollback

Building the manual work of on-call engineers into the disaster recovery plan significantly reduces the chances of meeting hard SLO targets. Build systems that can easily, quickly and seamlessly return to a previous state. As your system improves and confidence in your monitoring grows, you can lower your MTTR further by developing automation that triggers safe rollbacks.

Systematically check for all possible failure modes

Examine each component and determine how a failure in its operation can affect the entire system. Ask yourself the following questions:

  • Can the system continue to operate in a degraded mode if one of its components fails? In other words, design for graceful degradation.
  • How do you cope with the unavailability of a component in different situations? When the service is starting up? While it is running?

Test extensively

Design and implement a rich testing environment that ensures that each component is covered by tests that include the main usage scenarios for that component by other components of the environment. Here are some recommended strategies for such testing:

  • Use integration testing to work out troubleshooting – make sure the system can survive if any of the components fail.
  • Perform crash testing to identify weaknesses or hidden/unplanned dependencies. Record the course of action to correct the identified deficiencies.
  • Do not test only under normal load. Intentionally overload the system to see how its functionality degrades. One way or another, your system's response to overload will become known; it is better not to leave load testing to your users but to test the system yourself in advance.

The way forward

Expect changes to scale: a service that started out as a relatively simple binary on a single machine can develop many obvious and non-obvious dependencies when deployed at a larger scale. Each order of magnitude of scale will reveal new constraints, not only for your service but for its dependencies. Think about what happens if your dependencies cannot scale as fast as you need them to.