In this blog post we will cover the basics of building a truly resilient network where throughput isn’t always important, but reliability and redundancy are. We will look at this from the operators’ standpoint.
Why must we be resilient? There are two broad reasons:
1. If it is critical infrastructure (transportation, power grids, oil rigs etc.), people die if it fails.
2. If it is the bulk of the remaining SCADA infrastructure, such as a factory, there is a financial impact. If you are a manufacturing company and your factory goes down, you are quite literally losing money. There is a hard ROI and you know exactly how much you lose.
So the first thing we need to do is understand the requirements and what redundancy means. Do not assume your traditional understanding is acceptable. I think it is important to consider what the non-IT engineers (construction, civil, electrical etc.) are doing in their environments that sit alongside yours. For example, if you look at some of the newer nuclear power plants, they have redundancy for almost all events including catastrophic loss of power and systems. Massive water cooling tanks are installed directly above and/or below the chambers where nuclear fuel is stored. This is an example of non-traditional redundancy, and it puts the onus on the IT professional to have redundant links on top of redundant links.
Broad Design Concepts:
So how do you think differently about designing your environment? Understand what the business or legal parameters around the design are. What happens if you lose redundancy? Do you have to shut down? Will the risk of going down while the network is in a non-redundant state force a shutdown? The point here is that you will need to make a physical design that is appropriate: in some cases two devices and two links, in other cases three by three, and so on. Take the business and statutory risk into account when making this design, and most importantly get rid of your assumptions about what is acceptable risk in a corporate network, because that level of risk is not acceptable in a SCADA environment.
Use the simplest design that works. Yes, things like a fully meshed fabric sound awesome (and simple to manage), but do the rewards outweigh the risks? When you use a single logical unit to do all your routing/switching, you run the risk of a single small fault causing a cascading catastrophic failure. When it goes wrong in one place, it can go wrong everywhere. Troubleshooting these types of problems can be a nightmare, especially if it’s the controller that’s gone haywire (in which case all your redundancy would likely be useless). Simplicity is always best in SCADA/PCD environments.
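To put rough numbers on the "two by two" versus "three by three" choice, here is a minimal sketch of the standard series/parallel availability arithmetic. The per-unit availability figures are made up for illustration, and the model assumes failures are independent, which is optimistic in practice (shared power, shared conduit and shared software faults all break that assumption).

```python
from math import prod

HOURS_PER_YEAR = 8760


def parallel(availability: float, n: int) -> float:
    """Availability of n redundant units when any single one is enough."""
    return 1 - (1 - availability) ** n


def series(*availabilities: float) -> float:
    """Availability of a chain in which every stage must be up."""
    return prod(availabilities)


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of outage per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical per-unit availabilities, for illustration only.
DEVICE = 0.999
LINK = 0.995

for n in (1, 2, 3):
    # Simplified model: a group of n devices in series with a group of n links.
    path = series(parallel(DEVICE, n), parallel(LINK, n))
    hours = annual_downtime_hours(path)
    print(f"{n} x {n}: availability {path:.9f}, about {hours:.4f} h/yr down")
```

Multiplying the downtime figure by what an hour of outage costs the business gives a number you can weigh against the price of the extra devices and links.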
Connectivity and other Gotchas:
Once we have an idea of the high-level physical layout, we need to consider a few other things we probably don’t think about on a regular basis. How will we connect to these SCADA devices? Believe it or not, they often do not have Ethernet, and they can and do still run over synchronous (Sync) or asynchronous (Async) serial links. As mentioned before, throughput is seldom a factor, but consistent low latency is a requirement.
Today there is a growing number of manufacturers who make technology specifically for SCADA environments. Make sure you are dealing with someone who makes enterprise/SCADA-class products. The company needs to be reputable and have good support with replacement depots within a reasonable distance. In this world we need to avoid proprietary technology wherever we can. There are two reasons for this:
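One way to make "consistent low latency" something you can test is to check measured round-trip times against a budget for both the worst case and the sample-to-sample variation. The sketch below is an assumption-laden illustration: the function name, the jitter definition (mean absolute difference between consecutive samples) and the budget values are mine, not taken from any SCADA standard.

```python
from statistics import mean


def latency_report(samples_ms, max_latency_ms, max_jitter_ms):
    """Summarise round-trip samples and compare them to a latency budget."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to estimate jitter")
    # Jitter here is the mean absolute change between consecutive samples.
    jitter = mean(abs(b - a) for a, b in zip(samples_ms, samples_ms[1:]))
    worst = max(samples_ms)
    return {
        "mean_ms": mean(samples_ms),
        "worst_ms": worst,
        "jitter_ms": jitter,
        "within_budget": worst <= max_latency_ms and jitter <= max_jitter_ms,
    }


# Hypothetical measurements and budget, for illustration only.
report = latency_report([10.0, 12.0, 11.0, 13.0],
                        max_latency_ms=15.0, max_jitter_ms=2.0)
print(report)
```

Note that the check looks at the worst sample rather than the average, because a link that is fast on average but occasionally stalls is exactly what a control loop cannot tolerate.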
a) Proprietary means increased complexity when troubleshooting a mixed environment.
b) Also remember our SCADA environments are often run by operations employees, not IT. They may be remote, and finding people in the event of an emergency may be difficult. Don’t exacerbate the situation by using proprietary technology.
When considering what types of devices to use in your environment, always be cognizant of the latency ramifications. I’m not just talking about security products but about any device that we would normally have sitting in line acting as a proxy or real-time deep inspection device. Where possible, pick the appropriate architecture. For example, when buying load balancers, consider a switch-based architecture. Switch-based architectures don’t offer some of the application or resource optimization capabilities of proxy-based architectures, but they also don’t add the latency associated with those architectures. If you feel you may one day have the need, consider a dual-stack architecture that switches at L4 and can act as a reverse proxy at L7 when needed.
For security devices, the ideal solution is an out-of-path solution that has full security messaging capabilities. The inspection devices can then review any suspicious activity and send the appropriate remediation or block command to the head end, which is otherwise in line and passive. Because inspection happens out of band it adds no latency, and because the in-line blocking device does no inspection of its own it does not increase the risk of latency either.
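The out-of-path arrangement can be sketched as a toy model. Everything here is hypothetical: the class names, the marker used to flag traffic and the block-by-source policy are stand-ins for illustration and do not describe any particular product.

```python
from dataclasses import dataclass, field

# Hypothetical marker standing in for whatever a real inspection engine flags.
SUSPICIOUS_MARKER = b"UNEXPECTED_WRITE"


def is_suspicious(payload: bytes) -> bool:
    """Placeholder for deep inspection, which runs off the forwarding path."""
    return SUSPICIOUS_MARKER in payload


@dataclass
class HeadEnd:
    """In-line device: it never inspects payloads, it only checks a block list."""
    blocked_sources: set = field(default_factory=set)

    def forward(self, source: str) -> bool:
        return source not in self.blocked_sources

    def block(self, source: str) -> None:
        self.blocked_sources.add(source)


@dataclass
class OutOfPathInspector:
    """Receives a copy of the traffic and messages the head end when needed."""
    head_end: HeadEnd

    def inspect_copy(self, source: str, payload: bytes) -> None:
        if is_suspicious(payload):
            self.head_end.block(source)


head_end = HeadEnd()
inspector = OutOfPathInspector(head_end)

inspector.inspect_copy("10.0.0.5", b"routine poll")
print(head_end.forward("10.0.0.5"))   # still forwarded

inspector.inspect_copy("10.0.0.5", b"UNEXPECTED_WRITE to controller")
print(head_end.forward("10.0.0.5"))   # now dropped by the head end
```

The point of the split is that the forwarding decision is a cheap lookup, while the expensive inspection works on a copy and cannot slow the live traffic. The trade-off is that the first suspicious packet has already passed by the time the block is issued.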
To emphasize this point, consider that a firewall in the SCADA space can be a device that literally allows one-way communication (fiber transceiver either missing the RX or the TX) but does no inspection (inspection adds latency). If you have never played in the SCADA world, you would definitely question if that is actually a firewall. Rest assured it is a solution in the PCD/SCADA space and it is called a firewall.
In summary, remember that availability and reliability are of paramount importance. Design the network with forethought, using accepted standards while taking business requirements into account. Keep the design as simple as possible, use enterprise/SCADA-ready components (ones from an established company that offers comprehensive support, nearby parts depots, etc.), and layer on networking best practices and non-proprietary technology wherever possible.