Improved Incident Management through Swarming

How Swarming Improves Team Collaboration, Prioritization, Knowledge Sharing, and Incident Resolution

Written by David Crouch

Many of us deal with the familiar challenge of having too much work and not enough people to do it. Swarming is an effective prioritization technique that leverages the shared knowledge of our teams to quickly assess, prioritize, discuss, and resolve issues, and ultimately get more work done.

Swarming techniques are especially helpful to IT teams that need to rapidly resolve incidents and identify their underlying causes.  Swarming can help teams be much more efficient, getting them out of day-to-day, reactionary fire-fighting and focused on more meaningful work.  This is especially true when complex issues leave us scratching our heads wondering who is in the best position to run down the issue, all the while leaving customers without IT services.

This article covers the long history of swarming, how it has been used traditionally, and some new-ish concepts such as intelligent swarming.  I’ll also touch on the changes you will want to make to existing IT processes and tools, like having a robust IT Service Management platform, to help automate and support internal processes.  Ultimately, we’ll explore how to get started and get the most out of swarming in your own environment.

Origins of Swarming from Nature, the Military, and Sports

Swarming instances found in nature

Nature provides some of the purest forms of swarming.  We all know that bees swarm, but do you know why?  When a colony becomes too populated for the queen bee to be able to control the worker bees with pheromones, a group of worker bees starts to swarm in preparation for building a new hive and installing a new queen in it.  Same thing with termites, ants, and many other insects.  In this case, swarming helps them communicate and expand.

Birds swarm for reasons of protection – the idea is that there is strength in numbers since predators are less likely to attack individual birds when they are flying together.  Even bacteria swarm by pushing new cells to the edges of the colony to reduce cellular competition for resources (i.e., nutrition).  These natural manifestations of swarming are coordinated but unconscious (or, at least, as far as we know).

Swarming tactics in the military

Military Swarming Tactic Examples

Figure 1 – Military Swarm Command and Control Models from warontherocks.com by Paul Scharre

Although somewhat instructive, it can be hard to model conscious swarming techniques on the animal kingdom. By contrast, some of the earliest and best-known examples of conscious or planned swarming come to us from the military. For example, during the Siege of Samarkand (329 BCE), the forces of Spitamenes, vastly outnumbered by the experienced army of Alexander the Great, converged upon and surrounded a Greek relief unit with teams of archers.  By focusing a great number of resources for a time on the relief column, Spitamenes was able to pick off the opposing forces and briefly break through Alexander’s formations.

This tactic worked well since Spitamenes was able to maintain close communications with his troops, the archers themselves were remarkably mobile, and emphasis was placed on forces that were free to strike at will without waiting for orders from the top.  Throughout the history of warfare, multiple swarming tactics were often leveraged at the same time, and these techniques developed over time. There was a gradual advancement from the melee (uncoordinated, each-man-for-himself style battle) to massing (loosely coordinated strength-in-numbers tactics) to modern “Battleswarming” as pictured in Figure 1 above. (For the enthusiast, The Rand Corporation provides an interesting history of military swarming.)

Swarming techniques found in sports

We are all familiar with other forms of swarming . . . for example, in sports.  In American football, before the quarterback spirals the ball, defenders choose individual players to guard.  Once the ball is thrown and received by a runner, somewhere between four and six defenders rush the receiver from different directions since it can be difficult for a player to execute a solo tackle.  (Alas, if you’re into tennis like me, swarming will not help you here.)

Swarming applied to the Service Desk and Other IT Teams


Figure 2 – Tiered IT Support Model

In the IT ecosystem, swarming is often used in the context of a complex incident, major incident, or problem.   In these cases, not only is the incident complex, it is also difficult to identify the best team or subject matter expert (SME) to investigate and resolve it.  In some cases, resolving the incident may require the knowledge of more than one SME.

Shortcomings of a Tiered IT Support Model

Traditionally, the way teams have approached this condition is for a Service Desk agent to initially field the call with a customer and, when they’re unable to resolve it themselves, escalate to a Tier 2 technical team with deeper, specialized expertise.

If the first Tier 2 team cannot resolve the issue, the incident could be reassigned to another Tier 2 team or even escalated to a Tier 3 team of more senior experts.  The problem with tiered support is that these hand-offs from team to team take precious time; and it becomes especially problematic when a mission critical IT service is down for multiple customers or an entire organization, causing work to stop. In this scenario especially, it may be the case that multiple experts need to come together quickly to resolve the incident. It is often far more efficient and effective to involve experts from several teams simultaneously and up-front to discuss the potential causes of and solutions to the incident.

When to Collaborate Using Swarming Techniques

Whether we are talking about mother nature, combatants on the fields of war or arena, or even IT, swarming tends to work best when several conditions are met:

  • Communication amongst individuals and teams is close and effective.
  • Team size is small.
  • Teams are intermittent, ad hoc, and membership is flexible.
  • Knowledge is shared.
  • Decision-making and power are decentralized.

The evolution of swarming is marked by a trend away from unconscious, uncoordinated and instinctual cooperation towards thoughtful and flexible cooperation.  Another recent trend is to leverage new technologies (especially communication technologies) to devolve decision-making power and enable teams to work in small, sometimes ad hoc, and frequently changing re-combinations.

This allows teams to get more done without waiting unnecessarily for centralized decision makers (who may not have intimate knowledge of the situation on the ground) or suffering the endless back-and-forth routing that can happen with tiered escalations.  With this in mind, let’s talk about some common, traditional IT swarming techniques.

The Two Most Common Use Cases for IT Swarming

Traditionally, there are two primary situations where teams find significant benefit from swarming:

  • Solving Complex Issues: For complex incidents or problems, teams need to quickly understand and resolve the issue. To improve the time it takes to get the customer up and running again or decrease the mean time to resolution (MTTR), it’s critical to quickly know where SME ownership lies.  Since it is unclear which SME(s) to involve, it makes sense to bypass tiered escalation and involve the most relevant SMEs from different functional groups from the start. As the same complex issues are encountered repeatedly, the appropriate technical owner(s) should become apparent, which can often eliminate the need for swarming over time.
  • Managing Priorities: Swarming can also be used as a prioritization technique to control and reduce the backlog of work. This also helps teams decrease MTTR and, by doing so, improves the customer/user experience and employee satisfaction.

Using Swarming in Different Use Cases

As in the military examples above, IT teams that use swarming techniques often employ several types of swarming depending on the circumstance.

Swarming with Major Incidents

Major Incident Swarming (sometimes called Severity 1 Swarming) is used to quickly resolve incidents that have the highest level of impact and are complex.  Each team must define for themselves what constitutes highest impact.  Certainly, any incident that is likely to invoke disaster recovery and business continuity falls into this category.

Some of our clients define this as any incident that impacts an external customer-facing service or system or any incident that is likely to generate severe financial loss.  This can include, for example, a major retailer’s point of sale/cash register systems going down nationwide during business hours.

Conditions for Effective Severity 1 Swarming

Not every priority 1 incident will benefit from swarming.  In my experience, three out of four criteria should be met before using Major Incident Swarming:

  • The incident is considered priority 1.
  • The incident involves disaster recovery or business continuity.
  • The incident causes severe damage to customer perception and brand and/or major financial loss.
  • The incident is complex and resolving it will likely require the expertise of multiple individuals and teams (maybe even third-party vendors).

The last criterion is the most important consideration.  The main reason we swarm is because the incident is complex, and we do not know which experts to get involved.
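The checklist above can be sketched as a small decision helper. This is only an illustration, not part of any ITSM product: the `Incident` fields and the `should_major_incident_swarm` function are hypothetical names, and treating complexity as a mandatory criterion is an assumption drawn from the remark that it is the most important consideration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All field names are hypothetical; map them to your own incident records.
    is_priority_1: bool
    involves_dr_or_bc: bool                 # disaster recovery / business continuity
    severe_brand_or_financial_impact: bool
    is_complex_multi_team: bool             # likely needs several experts or teams


def should_major_incident_swarm(incident: Incident, threshold: int = 3) -> bool:
    """Return True when at least `threshold` of the four criteria are met."""
    # Assumption: complexity is required, since it is described as the
    # most important consideration for swarming.
    if not incident.is_complex_multi_team:
        return False
    criteria_met = sum([
        incident.is_priority_1,
        incident.involves_dr_or_bc,
        incident.severe_brand_or_financial_impact,
        incident.is_complex_multi_team,
    ])
    return criteria_met >= threshold


# A nationwide point-of-sale outage: priority 1, costly, and complex.
pos_outage = Incident(True, False, True, True)
# A priority 1 incident with a known owner and a known fix.
simple_p1 = Incident(True, True, True, False)
```

Here `should_major_incident_swarm(pos_outage)` is true because three criteria are met, while `simple_p1` is rejected because nobody needs to work out which experts to involve.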

Ad Hoc Swarming and the Incident Commander

Major Incident Swarming can be further decomposed into two sub-techniques: Ad hoc and Fixed Team.  With Ad hoc swarming, an incident is communicated to the Service Desk and an Incident Commander is appointed or alerted.

I have seen this work in different ways – sometimes the Service Desk Manager is the default Incident Commander.  I have also seen a senior Service Desk agent or senior technician serve in this role. It usually works best when the Incident Commander is known in advance of the incident occurring – either the same person always serves in the role, or somebody serves in this role on rotation (typically for one to two weeks at a time).  In any event, once the Incident Commander is alerted, they use their discretion to assemble the experts who are in the best position to investigate the incident (typically application specialists, network technicians, infrastructure and platform experts, etc. – but it can be any combination).

The Incident Commander gathers the experts on a conference bridge or teleconference and briefs them regarding key aspects of the incident.  In turn, the experts confer and offer their opinions on what is causing the incident.

In many cases, an expert will conduct several quick experiments to rule out causes.  As it becomes clear what is not causing the incident, experts will drop off the conference bridge but remain on standby in case their help is needed again.  The goal is to quickly determine which expert(s) can own the incident and resolve it.

This technique can be beneficial during major incidents.  But beware!  Speed is key.  Quickly identifying the SME(s) best situated to resolve the issue is essential.  Encumbering multiple SMEs for long periods of time is both inefficient and costly.

Fixed Team Swarming and Subject Matter Experts

The alternate version of this technique involves using a fixed team of SMEs who are on call or on rotation usually for one-to-two-week periods of time.  There is no magic number for how many SMEs comprise these teams, but most organizations tend to nominate three SMEs during each rotation.

The rotational version of major incident swarming tends to focus not only on major incidents but also on priority 1 incidents (thus the alternate name Severity 1).  The team of SMEs constantly monitors incoming tickets for major and priority 1 incidents and immediately begins to work on resolving them.  It is sometimes the case that the rotational team cannot resolve the incident themselves.  In these cases, they will consult with other SMEs in a manner similar to an Incident Commander convening a conference bridge.

In most cases, using Major Incident Swarming to handle all incidents would be costly and inefficient.  However, one of our clients, a global satellite company, does exactly this to great effect.  In their unique situation, there are no low priority incidents.  Every incident is complex, high-priority, and requires collaboration amongst multiple senior level engineers to diagnose and resolve.

Their Service Desk is unique in that it is staffed 24×7 and comprised of only six agents at any given time, each of whom is a senior engineer.  Although each engineer boasts a technical specialty, there is significant cross-functional knowledge on the team.  In a typical day, the team often fields as few as three or four incidents (usually simultaneously), though each one may take several hours to resolve.  When the team cannot resolve an incident, their only resort is to escalate to a third-party manufacturer of a specialized technology.  In this sense, the team’s investigation and resolution techniques closely mirror popular root cause analysis techniques used in Problem Management such as the Fault Tree Analysis, the Five Whys, Chronological Analysis, and Kepner-Tregoe.

Using Swarming to Prioritize a Team’s Work

Your decisions reveal your priorities. -Jeff Van Gundy

Although handling major incidents is the most common use case for employing a swarming technique, it can also be leveraged to prioritize work.  Small Team and Temporary Queuing can be of help here, and there are three types of specific techniques associated with this type of swarming.

Small Team and Temporary Queuing (Dispatch Swarms, Backlog Swarms, Drop-in Swarms)

While there always seems to be too much work, sometimes it just gets out of hand and results in a major backlog.  Maybe our Service Desk was down an agent or two because of holidays, and priority 3 and 4 incidents were temporarily neglected.  Swarming techniques can help alleviate the pain when, for whatever reason, unusually large backlogs have accumulated.  These swarms are usually staffed by temporary teams and appear in three varieties, each with its own pros and cons:

Dispatch Swarms – Small teams meet frequently throughout the day to review work in the queue (when dealing with non-complex work) and prioritize the work or select items in the queue that can be completed quickly. This type of swarm is sometimes called “cherry picking” since team members intentionally choose the easiest work items to complete.

  • Pro: Keeps the queue to a more manageable size since cherry picking prevents low priority incidents from piling up.
  • Con: Does not address high-priority incidents (another team or subset of the team should be addressing priority 1 and 2 incidents).
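As a rough sketch of what a dispatch swarm does in one sitting, the snippet below picks the quickest low-priority tickets that fit into a fixed time budget. The `Ticket` fields, the 15-minute cutoff, and the 60-minute budget are all assumptions made for illustration; a real team would set its own thresholds.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    priority: int            # 1 = highest, 4 = lowest
    estimated_minutes: int   # rough effort estimate from triage


def pick_quick_wins(queue, max_minutes=15, time_budget=60):
    """Select low-priority tickets that can be closed in one dispatch session."""
    # Priority 1 and 2 work is deliberately skipped; another team handles it.
    candidates = [
        t for t in queue
        if t.priority >= 3 and t.estimated_minutes <= max_minutes
    ]
    candidates.sort(key=lambda t: t.estimated_minutes)  # easiest first
    picked, used = [], 0
    for ticket in candidates:
        if used + ticket.estimated_minutes > time_budget:
            break  # remaining tickets are at least as long, so stop here
        picked.append(ticket)
        used += ticket.estimated_minutes
    return picked


queue = [
    Ticket("A", 3, 10),
    Ticket("B", 1, 5),    # high priority: not a dispatch-swarm item
    Ticket("C", 4, 30),   # too long to count as a quick win
    Ticket("D", 4, 5),
    Ticket("E", 3, 15),
]
quick_wins = pick_quick_wins(queue, time_budget=20)
```

With a 20-minute budget, the swarm would take tickets D and A and leave E for the next session.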

Backlog Swarms – When dealing with more complex work, small teams assemble on a routine or ad hoc basis to review complex work that has accumulated in the queue and requires the advice of multiple subject matter experts. The idea is to avoid escalation and reassignment of work (which takes time and is disruptive) by having the correct SME(s) address the issue from the start.  The primary difference between backlog swarms and the major incident swarm is priority, though each organization will create its own criteria.  Generally, backlog swarms do not address major incidents.  Instead, they are more likely to address priority 1 and 2 incidents or problems.

  • Pro: In cases of true high priority incidents, the backlog swarm can prevent or minimize tiered escalation, which can result in delays and “blackhole syndrome” (i.e. when nobody takes ownership of or responds to an incident or when an incident bounces from team to team).
  • Con: If the incident is not appropriately prioritized, involving deep SMEs from the start can be costly and inefficient.

Drop-In Swarms – In some cases, SMEs may continuously monitor the work backlog of their own teams as well as the work of other teams, and these SMEs become continuously available to work an issue. The SME decides whether they ought to get involved or not.

  • Pro: When the culture supports it, drop-in swarms can improve trust and lead to meaningful cross-functional collaboration.
  • Con: This approach can be a bit difficult to manage since, naturally, people tend to focus on work that has been directly assigned to them and are less likely to “look for trouble” by monitoring large cross-team queues. It is also easy to put the backlog on the back burner when other operational or project work needs to get done.  As above, this can also lead to the dreaded blackhole syndrome.  Additionally, when deep SMEs are involved too early in the resolution process or with low priority incidents, it can lead to the perception (or reality) of increased cost and inefficiency.

For those interested in learning more about these swarming techniques, they are discussed as part of a 3-day ITIL 4 Create, Deliver, and Support training course.

Intelligent Swarming and Agile Collaboration

A new take on swarming comes to us in the form of intelligent swarming, introduced to the world by the Consortium for Service Innovation.  This is the same group that created KCS (more on this later), so it comes as no surprise that intelligent swarming depends on embracing many KCS and knowledge sharing practices.  According to the Consortium for Service Innovation,

Intelligent Swarming is a set of practices that facilitates and optimizes a collaborative problem-solving process.

Intelligent swarming is also described as “collaboration on steroids” and “agile collaboration.”  As with other forms of swarming, intelligent swarming uses a collaboration-based process instead of tiered structures and escalation.  It goes even further by rejecting the terms “support agent” and “engineer” – replacing them with “knowledge worker.”

Intelligent swarming is best used with highly complex issues, whereas a tiered structure is most efficient when the “majority of issues are simple or known and when the first point of contact can resolve a majority of issues.”

Shifting the Ratio of Known to New Incidents with Intelligent Swarming

What makes intelligent swarming different from other forms of swarming is that organizations that use it rely on it as their only form of incident handling that involves human interaction. In other words, they have no tiered escalations at all.

In most organizations, there is a 30/70 split between new (and potentially complex) incidents and known incidents that are fielded through the Service Desk and tiered support – as shown in Figure 3 below.  This means that in addition to complex incidents that require SME support, 70% of the incidents are relatively simple to resolve, and it is inefficient and costly for service desk agents and SMEs to respond to them.

Incident ratios of human interaction vs. Suggested Ratio KCS

Figure 3 – Ratio of incidents requiring human intervention for most organizations (Left) vs. Suggested New/Known Ratio to support intelligent swarming (Right)

Instead, the goal is to shift this ratio from 30/70 to 70/30.  In other words, human intervention (in the form of swarming) would only be required for the roughly 30% of incidents that are new and complex.  With this shift, the demand coming from the 70% of known incidents can be fulfilled through customer self-service and automation, saving significant cost and time from having to involve IT staff.
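To make the arithmetic concrete, here is a minimal sketch of how staff workload changes as known incidents move to self-service. The monthly volume of 1,000 incidents and the partial deflection rate of 80% are made-up figures for illustration, and `human_handled` is a hypothetical helper, not a metric defined by KCS or any tool.

```python
def human_handled(total_incidents: int, known_pct: int, deflected_pct: int) -> dict:
    """Estimate how many incidents still reach IT staff.

    known_pct     -- share of incidents that are known/simple (e.g. 70)
    deflected_pct -- share of those known incidents resolved through
                     self-service or automation (an assumed figure)
    """
    known = total_incidents * known_pct // 100
    new = total_incidents - known
    deflected = known * deflected_pct // 100
    return {
        "new": new,
        "known_still_handled": known - deflected,
        "staff_total": new + known - deflected,
    }


today = human_handled(1000, known_pct=70, deflected_pct=0)
partial = human_handled(1000, known_pct=70, deflected_pct=80)
target = human_handled(1000, known_pct=70, deflected_pct=100)
```

Under these assumptions, staff handle all 1,000 incidents today, 440 once 80% of known incidents are deflected, and only the 300 new incidents if every known incident is resolved through self-service.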

A key aspect of intelligent swarming is moving from a linear escalation model to a dynamic collaboration model.

Intelligent swarming not only makes incident resolution more efficient, it also creates a shift-left culture.  In a shift-left culture, work is moved closer to the source and employees at all levels are encouraged to find solutions to their problems and make decisions that help them do their jobs more effectively.  This not only frees up time so expensive, highly technical SMEs can focus on the complex work they enjoy, it also creates a culture of employee engagement and empowerment.

Step 1: Creating Useful Knowledge Articles to Support Intelligent Swarming

KCS Double Loop Process

Figure 4 – KCS Process from https://www.serviceinnovation.org/kcs/

To make intelligent swarming work, teams need to first identify which incidents represent the low-hanging fruit – these incidents are so easy to resolve that, given the appropriate knowledge and tools, even a non-technical customer could do it for themselves.

Once identified, it is time to create or update knowledge articles, written from the customer’s point of view.  Of course, this is not as simple as it sounds.  Knowledge Centered Service (KCS) – pictured in Figure 4 – integrates the creation and maintenance of knowledge into how customers and IT staff interact with that knowledge.  In other words, as knowledge workers interact with customers, they learn from them and begin to document – during those interactions – resolutions to issues.

KCS also tries to capture knowledge as close to the source as possible to ensure that documented knowledge is relevant, accurate, and user-friendly.  If your organization is not currently rich in knowledge artifacts or if the knowledge is outdated, difficult to find, or simply not useful, it will take some time to fully leverage intelligent swarming.  Even if you have a number of high-quality knowledge articles, they are only valuable if our customers are regularly using them and finding them helpful.  Thus, a culture where customers support and even prefer self-service is an important factor that supports intelligent swarming.

You can learn much, much more about these concepts in our fabulous 3-day KCS v6 Practices training course.

Step 2: Using Online Self-Help and Automation to Support Intelligent Swarming

In parallel with creating knowledge content for simple incidents, looking at ways to automate the experience for customers through self-service incident resolution and service requests can be immensely helpful.

Using purpose-built ITSM platforms like ServiceNow (pictured below in Figure 6) allows teams to create wizards that ask a customer questions to guide them through incident analysis and self-service resolution.  Initially, it can be tricky to devise the right questions and pose them in the ideal order.  However, through trial and error and working closely with customers to get their feedback and pilot wizard functionality, teams will dramatically improve self-help functionality.

Step 3: Cross Training Teams to Develop Strong Swarms

It’s now time to turn attention to organizing the swarm itself.  Intelligent swarming works best when the swarm team is composed of knowledge workers with significant technical subject matter expertise, all the while working to develop complementary and cross-over skills (what is often called “comb” or “M shaped” people).

Team size works best when it is small, generally anywhere from three to six people.  I have seen it work best where the culture supports and incentivizes knowledge sharing and the swarm is entirely voluntary.  In other words, knowledge workers constantly monitor the queue for complex incidents and, on their own initiative, begin to work on incidents they are in the best position to resolve.  In this scenario, the initial worker who responds to the incident owns it until it is resolved.  The same worker involves other peers as needed.

Creating Small, Rotational Swarming Teams

Another version of the intelligent swarm is structured around small rotational teams.  Three to six SMEs who have responsibilities outside of incident management work in temporary swarms on one-to-two-week rotations.  During this time, their primary responsibility is to monitor and respond to the queue of complex incidents.  When there are no complex incidents in the queue, they continue to work on their normal duties.
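One simple way to plan such rotations is to cycle through the SME roster so that duty is spread evenly. The sketch below assumes a roster of five people and teams of three; the names and the `rotation_schedule` helper are hypothetical.

```python
from itertools import cycle, islice


def rotation_schedule(smes, team_size=3, rotations=4):
    """Return a list of swarm teams, one per rotation period.

    Teams are filled by walking the roster in order and wrapping around,
    so everyone serves about equally often.
    """
    if team_size > len(smes):
        # Otherwise the same person would appear twice in one team.
        raise ValueError("team_size cannot exceed the number of SMEs")
    roster = cycle(smes)
    return [list(islice(roster, team_size)) for _ in range(rotations)]


smes = ["Ana", "Ben", "Chi", "Dev", "Eli"]
schedule = rotation_schedule(smes, team_size=3, rotations=3)
```

Each entry in `schedule` covers one rotation period (for example, one to two weeks). With this roster the first three teams are Ana, Ben, Chi; then Dev, Eli, Ana; then Ben, Chi, Dev.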

Is Intelligent Swarming a Smart Choice for My Team?

Intelligent Swarming has provided significant benefit to organizations prepared to support it, but it is not for the faint of heart.  The table below describes several factors to consider in determining whether intelligent swarming will work well.  Note: With some work and focus, any of these factors can be developed within and across teams.

Table: When Swarming is a good fit for your team

Limitations of Swarming

For all of the benefits of swarming, it has its limitations.  After all, as we all know, Alexander eventually defeated the Bactrian archers by cutting off their supply lines.  In nature, whales and other top-of-the-food-chain predators take advantage of schools of fish and other animals for a quick, efficient, bounteous meal.

Foundational Factors for Intelligent Swarming

In the context of IT, swarming is primarily focused on day-to-day work and is rarely strategic (except, perhaps, for intelligent swarming).  It is a means to an end (or several ends) . . . namely, reducing queue size, resolving incidents more quickly, using the team’s time more efficiently, and improving customer satisfaction.  Swarming in any form, when done well, is dependent upon many factors.  The absence of the following three limits its effectiveness:

  • Collaborative Culture – Employees are amenable to using self-help and technicians are willing and able to contribute knowledge. Formal KCS training can be extremely helpful here.
  • Knowledge Management – Knowledge documentation is accurate, easy to understand, and accessible.
  • Crossover skills – The skillsets of technicians are continually developed and strengthened in several areas. Deep subject matter expertise is shared by all.

Intelligent Swarming and Priority Overload

Although swarming is often considered a prioritization technique, it can (ironically) lead to priority overload.  Essentially, if the team swarms too often, it may fall into the trap of treating all incidents and work with the same priority and/or making all work priority number one.  This can lead to teams becoming slower and less effective overall.

Although, generally, collaboration is considered a good thing, there are cases when extreme collaboration is counterproductive.  Swarming often relies on the ability and willingness to communicate through multiple channels – telephone, conference call, e-mail, Slack, kanban boards, SMS/text, etc.  This can often cause confusion and lead to relatively new phenomena such as “Zoom fatigue” and “Slack fatigue.”  Unless carefully managed and monitored, the end result can, unfortunately, be employee burnout.  (For an informative podcast on “Collaboration Overload,” listen to The Art of Manliness episode #773.) Swarming techniques (except for intelligent swarming) can rarely be applied continuously.  They often serve as a temporary solution to remove a queuing jam. As with anything, intentionality is key.

Retooling Technology to Support Swarming

Although tools themselves are rarely the full solution, robust technology can help overcome some of the limitations of swarming. Technology can provide that “single pane of glass” wherein all incident and problem records are stored alongside knowledge records (as shown in Figure 5 below) to ensure data is presented completely and contextually.

 


Figure 5 – ServiceNow Major Incident Workbench

The best platforms provide features that increase visibility into incidents and support artificial intelligence, which can rapidly assess, route, link, and even resolve tickets, saving additional time and cost.  ServiceNow, for example, has a Major Incident Workbench (pictured above), that provides a portal through which a major incident can be managed in a single location.   It can also quickly spin up a conference meeting for IT teams, allowing them to quickly swarm on a high priority issue.  The Agent Assist functionality within the workbench connects agents with knowledge documents and records, cutting down the time needed to resolve incidents, which is beneficial even if teams are not swarming.

Tips for Swarming Like a Pro

If you’ve decided you’re ready to take the plunge into swarming, here are some helpful tips that will help you as you get started:

  • Determine the scope of swarming efforts – You will first want to be clear about the scope. Answering questions like the following can help: “Will we fully commit to intelligent swarming?”  or “Will we focus on swarming around major incidents or to occasionally reduce the size of backlogs?”
  • Manage a few Communication Channels – Prevailing wisdom these days will tell you the more communication channels, the better. Proceed with caution here.  It can be difficult to manage multiple lines of communication (for example, with an omnichannel approach) from day one.  It’s better to perfect the management of a few channels before adding more.
  • Block off time to focus on swarming (for example, with drop-in swarms) – If your scope includes drop-in swarms, block off dedicated time for it. Otherwise, the urgency of day-to-day operational work will take precedence and it won’t get done. Practice makes perfect on these – without this dedicated time, the drop-in swarms will fall by the wayside and teams won’t get better at managing them over time.
  • Make it easy for customers and users – Wherever possible, encourage customers to use online self-help, so they can quickly get answers and resolve issues for themselves. For this to work, however, self-help tools must be useful.  They should be intuitive to use, contain accurate, up-to-date information, and be easier to navigate than a call to the Service Desk.  Otherwise, customers will quickly become frustrated.
  • Be intentional about how you use technology – It is much easier to swarm when you have the right tools in place. You will see a shift away from everyone having to call the Service Desk when you put state-of-the-art self-help tools in place and begin educating customers on how to best use them.  Once these tools are in place and happily used by customers, you can begin using advanced functionality such as AI-enabled monitoring and predictive intelligence to identify and even resolve incidents before they cause major disruption. This will get teams away from fire-fighting and lessen interruptions to customers’ work in the first place.

Make Your Swarming Knowledge Official

Our three-day ITIL 4 Create, Deliver, & Support class has everything you need to manage IT-enabled products and services, swarming included.

Originally published July 07 2022, updated May 05 2023