Tales from the Retro: Siloing
After 15 years of running agile retrospectives, I've noticed a few patterns. Today's theme is siloing.
Out of the 8 teams that I've run retrospectives for, 7 of them[1] have complained about knowledge silos within the first 2 retros. You really can't get much more consistent than that, so let's talk about why people complain about it.
What is siloing?
There are multiple ways of applying the idea of a silo to software engineering. The primary way we'll be talking about it today involves a team where knowledge of particular systems is focused in particular individuals.
One of the easier ways to determine if you have silos on your team is to think of each system you own. If you had a task for that system, who would you expect to implement it? Don't think about capacities or the current schedule, just who would actually end up taking the task if all other constraints were removed. If you are picking one or two people for each system, you have silos.
Now, the actual easiest way to determine if you have silos on your team is: yes, you do.
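If you want something slightly more empirical than a gut check, commit history can serve as a rough proxy. The sketch below is my own illustration, not a formal tool: it counts how many distinct people have touched each top-level directory of a repository in the last year, assuming (and this is an assumption) that directories map loosely to systems and that git history roughly reflects who actually works on what. One or two names next to a system is the same signal as the thought experiment above.

# Rough, illustrative sketch: count distinct committers per top-level
# directory over the last year, treating each directory as a "system"
# and commit history as a proxy for who actually works on it.
import subprocess
from collections import defaultdict

def committers_by_system(repo_path=".", since="1 year ago"):
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:AUTHOR:%ae"],
        capture_output=True, text=True, check=True,
    ).stdout
    authors = defaultdict(set)
    current_author = None
    for line in log.splitlines():
        if line.startswith("AUTHOR:"):
            current_author = line[len("AUTHOR:"):]
        elif line.strip() and current_author:
            # Use the top-level path component as the "system" name.
            authors[line.split("/", 1)[0]].add(current_author)
    return authors

if __name__ == "__main__":
    for system, people in sorted(committers_by_system().items()):
        flag = "  <-- possible silo" if len(people) <= 2 else ""
        print(f"{system}: {len(people)} committer(s){flag}")

Numbers like this only hint at the problem, of course; the thought experiment above is still the better test of who would actually pick up the work.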
Aside: other types of silos
There are other ways the term can be applied in software engineering. Often, silos come up when companies do not have a reasonable inner source model. When teams are restricted to working in their own space, it creates extra overhead for getting anything done.
The common feature of all types of silos is that they create risk[2].
Why are silos bad?
I just told you that silos create risk, so let's dig into why that is.
Redundancy
Let's get the hardest topic out of the way first and talk about people leaving the team or the company. Silos make engineer departures much riskier. Think of the experts on your team and how difficult life might become if they leave. It can be terrifying to consider.
Of all the reasons that people complain about silos, I think this is the most important one. Your product needs to be resilient to engineering team changes, because you cannot predict most of them.
Operational Excellence
When you have one or two engineers who understand a system, the size of your on-call rotation doesn't matter. Those engineers are always on call. Can the on-call engineer handle it if the system breaks in the middle of the night? It depends on the documentation (more on that later).
The reality of a silo-laden world involves extra stress, burnout, and risk for the experts. They answer phone calls on vacation to help troubleshoot issues, and they sometimes feel guilty about really disconnecting. Hopefully, it's clear why this is bad for the engineer as well as the team and the larger organization.
What can we do?
The number one request from retros when silos come up is simple, obvious, and completely ineffective: cross-training. It's easy to say: "Well, obviously, the answer lies in making sure that more people understand the system. That's just cross-training." Yeah...sure...but what IS it?
As you might have guessed, I don't really like the term "cross-training". It's vague and takes on the meaninglessness of a lot of other corporate-speak. Let's talk about actual practical things that do and don't work.
Brown Bags/Technical Talks
Managers, product managers, and anyone who loves schedules will gravitate towards brown bags as a way to spread knowledge. They like them because one engineer does a day or two of prep, and the rest of the team spends an hour being talked at.
It's hard to deny the value of this approach...if it worked. It doesn't work, though. Few (if any) people have ever gained a real appreciation for the operational characteristics of a system by listening to a talk.
Now, I won't say that talks are useless. They are often a useful precursor to more in-depth exploration of a system. It's good to get an idea of what a system does and how it's put together, but I have never seen a talk or even a series of talks provide the value of hands-on experience.
Bug Fixing
Hands-on experience really is the way. The first way an engineer can get hands-on experience with a system they don't know is through bug fixes. Bug fixes can provide a light introduction to where things are and how they interact. The value of this, though, really depends on the bug. Small, isolated bugs do not provide as much value as larger, more invasive bugs.
Ultimately, though, invasive bugs tend to lead to...
Refactoring
Here's where you really start to understand a system. Delegating a refactoring task away from an expert can provide great insight for an engineer new to the system. You can get even more value out of refactoring if the engineer learning the system gets to map out the refactor.
Expert-Excluded Implementation
I always like to start excluding the experts from work on a system when they start becoming too much of a focus of knowledge. In my experience, the expert should consult on the design of a new feature or refactor (see above), but that should be the limit. Let them work on something else. If things go off the rails, the expert can assist. The rest of the time, the team is learning and serving the roadmap.
What doesn't work?
I do want to address a couple of things that haven't seemed to work. They come up a lot as suggestions in retros. I don't like to steer the conversation too aggressively in retros that I facilitate, but this can be a case where it's necessary.
First, we have pair programming. I "grew up" as a software engineer during the XP hype. I have never seen an engineering team exercise XP ideas (like pair programming) consistently. I hear that they exist. I haven't seen them. Pairing usually means a brief session of sitting together to figure out a particular problem. That doesn't cause a lot of knowledge transfer.
Additionally, the whole idea feels kind of wasteful. We're asking the expert to sit with an engineer learning the system to work on a task. That expert is...what...expected to tell the other engineer what to do? Not tell the other engineer what to do? Some balance where the expert knows the solution but shouldn't say it? Long-term pair programming requires a more level playing field. This is probably why pair programming, in practice, tends to be targeted and brief.
The other suggestion coming from retros goes something like this: "Just try to avoid having the expert pick up tasks for that system." It's well-meaning, but we get back into schedule pressure. If it takes another engineer longer, the buffer in delivery schedules can disappear. It also pins too much hope on good intentions rather than on a structured approach to spreading the knowledge.
Documentation
Let's take a moment to talk about documentation. Documentation, as in something written down already, doesn't provide a lot of useful knowledge transfer. Documents provide, in my experience, less concrete benefit than brown bags.
On the other hand, writing documentation can transfer a lot of knowledge. For complex systems, I find a dual benefit to having non-experts write the documentation. First, someone else learns the inner depths of the system. As they say, the best way to learn is to teach.
Second, the documentation will be better. How? Experts know too much. They hold too many assumptions and expectations. Documents written by the system expert often leave out important context. On the other hand, documents written by someone new to the system will include the details that they had to learn in order to write them. I don't think this is as effective at knowledge transfer as refactoring or implementing a feature, but it does give you better documentation for when you do need to communicate some of this expertise in written form.
Wrapping Up
Look, we all know the best way to learn is through doing. Product delivery timelines are tight, so the best place for sharing knowledge and breaking down silos is engineering-driven work. Technical debt reduction, bug fixes, and internal documentation provide much more timeline flexibility, which makes them more effective and consistent vehicles for knowledge sharing.
You have to remember that silos are not broken down overnight. In reality, they aren't even broken down, just widened. You don't need your entire team to be equally knowledgeable about every system. You just need enough distribution to mitigate the worst risks.