Over the last year, we’ve published a number of blog posts about NewEdge, the network and infrastructure on which we deliver the Netskope Security Cloud services, and how it compares to other approaches cloud security vendors have taken. We’ve talked at length about Netskope’s fundamental approach to overcoming the inherent performance limitations of the public Internet, as well as why backhauling (or “hairpinning”) inside the cloud is a bad strategy, why coverage isn’t just about counting data centers, and how important peering and an aggressive interconnection strategy are for the best performance and user experience.
One topic that frequently comes up in conversations with networking and infrastructure leaders, and garners much interest, is the approach we’re taking at Netskope to actually building and scaling out NewEdge. I’m personally very excited to tell this story: my day-to-day responsibilities entail making this happen, and it aligns with my previous experience at AWS, the world’s largest and best-known public cloud. We’re now at a point where we want to demystify NewEdge and share with our customers and partners the details that excite our internal teams.
We build NewEdge with a number of design tenets in mind, with the goal of effectively balancing availability, performance, and scalability. Because these can be at odds with one another, we have to not only make intentional decisions about what hardware and software to use now, but also see around corners and predict what we’ll need before we need it. Our current 50+ site footprint uses infrastructure that is less than three years old at the oldest location and has been rigorously tested by third parties and our own QA teams. We use advanced platform features where required, but do not rush headlong toward bleeding-edge functionality provided by third parties; where we need specific functionality, we build it ourselves. We over-provision the network to give us a buffer to scale ahead of need, and we operate in a non-blocking mode so that even with all services enabled, the application of our security features does not throttle customer traffic to and from their SaaS applications. At Netskope, everything we design, develop, and deploy is governed by a collection of core tenets or values that we firmly believe in, and our culture dictates that we measure our progress against those tenets. We want to be able to answer whether we’ve made life better or worse for our customers, and we use data to do it.
In this blog, I’m going to spend some time unpacking the NewEdge data center strategy and introducing you to the “data center factory” behind NewEdge. I will not only drill down into what we’ve done, but also explain why we’ve done it this way, and share some of our best practices along the way. The goal is not just to be transparent and convince you of the power of NewEdge, but also to share insights that you can incorporate as your organization makes the significant transition to the cloud and looks to leverage key learnings from the cloud and hyperscale pioneers.
Leverage a lean footprint
Since the introduction of NewEdge roughly two years ago, we have completely redesigned our data center footprint by finding and implementing optimizations across our infrastructure and software portfolio. Our previous architecture required a hybrid mix of several physical, on-premises racks as well as a substantial physical presence in hosted compute environments. Like many of our competitors, we relied on the public cloud for a period of time, and we saw for ourselves the unpredictable performance of these architectures when it came to delivering real-time, inline security services. As an example, prior to NewEdge, we experienced significant variability in performance as public cloud providers routed traffic according to their own business needs, cost or otherwise. With latency for user traffic ranging from single digits to dozens of milliseconds (ms) in some locations, customers experienced application issues, especially where applications were sensitive to jitter. Today, with the NewEdge footprint, we strive for consistent single-digit-millisecond latency. Accordingly, we needed a solution that was mean, lean, and focused, providing more capacity, flexibility, and performance in a single rack. The single-rack approach also allows us to move fast when we need to scale in a particular geographic area or address a changing usage pattern. This approach was only possible with a significant investment in capital and expertise; to this end, Netskope invested 100 people (myself included) and $100M to do the initial build-out of the NewEdge security private cloud.
Create data centers without personalities
We’ve worked very hard to make our infrastructure unified and 100% homogeneous, so no single data center looks any different from any other. As part of moving to a lean, performance-focused footprint, we have implemented the concept of integrated racks in our data center factory approach. We build, pre-stage, configure, test, and ship a data center as a pre-built rack, with every rack constructed in exactly the same way. This ensures total consistency across every data center in what services are available and in the configuration of the surrounding infrastructure and underlying components.
This level of consistency and uniformity extends all the way down to ensuring cables are plugged into the exact same ports across every data center globally! That allows us to use automation to speed deployment and to auto-remediate when needed, as the sketch below illustrates. Case in point: while most of the world was in the midst of a global pandemic in 2020, our use of automation let us roll out more than 20 data centers globally, including four in Latin America in roughly 30 days! That is an unheard-of pace of deployment and scaling, even for leading cloud and hyperscale companies. Gone are the days of cabling together physical boxes and sending costly staff all over the world to launch a data center.
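To make the idea concrete, here is a minimal sketch, in Python, of what validating a rack against a “golden” wiring map might look like. The data structures, names, and LLDP-based discovery are illustrative assumptions, not Netskope’s actual tooling:

```python
# A minimal sketch of validating a rack build against a "golden" wiring map.
# Everything here is hypothetical illustration, not Netskope's actual tooling.

# Golden map: every rack in the fleet must land every cable on these exact ports.
GOLDEN_PORT_MAP = {
    ("tor-switch-1", "eth1/1"): "server-01:nic0",
    ("tor-switch-1", "eth1/2"): "server-02:nic0",
}

def validate_cabling(discovered: dict) -> list:
    """Compare discovered port neighbors (e.g., via LLDP) to the golden map.

    Returns a list of (port, expected, actual) mismatches; an empty list
    means the rack is wired identically to every other rack.
    """
    mismatches = []
    for port, expected in GOLDEN_PORT_MAP.items():
        actual = discovered.get(port)
        if actual != expected:
            mismatches.append((port, expected, actual))
    return mismatches

# Because every site is wired the same way, the same automation (and the
# same auto-remediation playbooks) can run unchanged against any data center.
discovered = {("tor-switch-1", "eth1/1"): "server-01:nic0"}
for port, expected, actual in validate_cabling(discovered):
    print(f"{port}: expected {expected}, found {actual}")
```

The design payoff is that a wiring mismatch becomes a data problem rather than a site visit: the same check, and the same fix, applies everywhere.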
Employ extensive pre- and post-deployment testing
Although the configuration of NewEdge data centers is automated and executed in exactly the same way every single time, we know things will break and issues will inevitably occur. To de-risk our deployments, before a NewEdge data center leaves for its final destination, we collect and evaluate more than 2,000 unique metrics that are indicative of overall system health. These include voltage at each power supply, memory parity and performance levels, simulated load on our infrastructure, and granular testing of service functionality. A data center doesn’t ship until all items are in their expected state and all criteria have been met. The same tests are performed again after the data center arrives in the region, and the data center does not launch and go live in production until all metrics are in 100% alignment and all tests complete successfully.
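As a hedged illustration of this kind of gate, the following Python sketch checks a handful of hypothetical metrics against their expected ranges and only clears the rack to ship (or launch) when every check passes. The metric names and thresholds are invented for this example:

```python
# A sketch of a ship/launch gate: every metric must be in its expected state
# before the rack ships, and the same checks re-run on arrival in the region.
# Metric names and thresholds below are illustrative, not Netskope's real set.

from dataclasses import dataclass

@dataclass
class Check:
    name: str
    value: float
    lo: float
    hi: float

    def passes(self) -> bool:
        return self.lo <= self.value <= self.hi

def gate(checks: list) -> bool:
    """Return True only when 100% of checks pass."""
    failures = [c for c in checks if not c.passes()]
    for c in failures:
        print(f"FAIL {c.name}: {c.value} outside [{c.lo}, {c.hi}]")
    return not failures

checks = [
    Check("psu1.voltage_v", 12.02, 11.8, 12.2),       # per-supply voltage
    Check("mem.parity_errors", 0, 0, 0),              # memory health
    Check("proxy.simulated_load_p99_ms", 8.5, 0, 10), # load-test latency
]
print("clear to ship" if gate(checks) else "hold for remediation")
```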
Once in production, it’s inevitable that a data center will eventually hit its utilization limits, and that’s when the process repeats itself. This is an important point: we don’t just add capacity to an existing data center. Instead, the NewEdge approach is to expand to a different location in the same region, which maximizes the overall resilience of our network by design. For example, we’ve done this in the United Kingdom, first with our London data center and then by adding Manchester. It’s a similar story in France, with Paris followed by Marseille, and in Germany, where Frankfurt will soon be followed by Dusseldorf. It’s also worth mentioning that while many vendors will push their utilization to, and often beyond, the “breaking point,” 20% is the target utilization that triggers a NewEdge expansion event. This headroom ensures we can handle unusual traffic spikes, onboard tens or hundreds of thousands of enterprise users at speed with ease, and generally harden our underlying infrastructure to achieve the absolute best performance and service resilience.
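The trigger itself is simple to illustrate. Here is a toy Python sketch of the policy described above; the 20% threshold mirrors the post, but the site names and utilization figures are hypothetical:

```python
# A toy sketch of the expansion policy: crossing a conservative utilization
# threshold triggers a *new* site in the same region, not in-place growth.

EXPANSION_THRESHOLD = 0.20  # 20% utilization triggers an expansion event

def should_expand(peak_utilization: float) -> bool:
    return peak_utilization >= EXPANSION_THRESHOLD

region_sites = {"London": 0.22, "Manchester": 0.07}  # hypothetical figures
for site, util in region_sites.items():
    if should_expand(util):
        # e.g., London crossing 20% leads to building Manchester,
        # rather than adding capacity to London itself.
        print(f"{site} at {util:.0%}: plan a new data center in the same region")
```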
Take more control over customer experience
Whether it’s our data center factory approach, in-the-weeds decisions about the components that make up our integrated racks (e.g. bare metal servers, NVMe SSDs, or high-bandwidth network switches), data center site locations, or peering and transit relationships, we take full responsibility for the customer experience, and it’s our job to architect around issues. It’s important to recognize that the vast majority of our engagement with customers involves traffic that traverses the Internet, whether for accessing web content, workloads in the public cloud, or SaaS applications. Today, web traffic dominates, accounting for 90% of traffic in most enterprises, 53% of which is cloud-related. It’s a known fact that using the Internet without any special configuration or thought around routing results in an experience that’s largely out of the sender’s or receiver’s control. We’ve all experienced slowness (or, in the worst case, outages) of our favorite websites or apps that leave us helpless until our Internet Service Provider or IT Helpdesk is able to address whatever is causing the problem.
Fundamentally, we all rely on the Internet to connect with others, and this is most acute in business, where the Internet is absolutely critical to connecting employees with one another and with their customers, business partners, and suppliers. These issues around slowness or outages are a big deal. That’s why, with the NewEdge architecture, we’ve taken the approach of creating an “express lane” that overlays the traditional (and less predictable) public Internet. This has been discussed at length in previous blogs, but our technical and business approach aligns with and prioritizes peering with our customers, business partners, and web, cloud, and SaaS providers to route traffic as directly and deterministically as possible. For example, to get from point A to point B, we keep the traffic on private links for as long as possible, and in some cases for the entirety of the traffic path. To put a finer point on this: today, Netskope actually sends less traffic over the public Internet than through our semi-private or private peering links. This end-to-end control is precisely what allows NewEdge to deliver a superior user experience and application performance. You can see this for yourself by taking the NewEdge Speed Test, which highlights the industry-leading fast on-ramps to our network. This is also why we peer directly with Microsoft and Google, as just two prominent examples, in every NewEdge data center location.
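To illustrate the idea (and only the idea; this is not Netskope’s routing stack), here is a toy Python sketch of preferring private peering over semi-private links over public transit when multiple paths to a destination exist:

```python
# A toy illustration of the path preference described above: direct private
# peering first, then semi-private links, falling back to public transit
# only when nothing better exists. All names and figures are hypothetical.

PATH_PREFERENCE = {"private_peering": 0, "semi_private": 1, "public_transit": 2}

def pick_path(candidates: list) -> dict:
    # Prefer the most deterministic path class; break ties on measured latency.
    return min(candidates, key=lambda p: (PATH_PREFERENCE[p["kind"]], p["latency_ms"]))

candidates = [
    {"kind": "public_transit", "latency_ms": 18.0},
    {"kind": "private_peering", "latency_ms": 4.0},  # e.g., direct peering with a SaaS provider
]
print(pick_path(candidates))  # the private peering path wins
```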
Get closer to our customers
Because of the lean, modular nature of the NewEdge physical footprint, we’ve been able to position our data centers in more places globally (and more quickly) than our competitors. We’ve executed an extremely ambitious plan for global coverage, and have built a network with more locations with compute resources for security traffic processing than the largest cloud providers. This has required us to establish a physical presence in areas of the world where space constraints, limited or unstable power supply, geopolitical upheaval, or other concerns typically present an insurmountable barrier to entry for most organizations. The good news is that since we do the very hard work of getting our racks into these physical locations (as opposed to providing only a visual representation of being in a location, for example through virtual POPs that are incapable of actually processing traffic), we can connect NewEdge directly to end-user “eyeball” networks or to web, cloud, and SaaS providers that have an in-market presence. In alignment with the previous tenet of taking “more control over customer experience,” this is precisely what gives us our performance and user experience advantage.
Reduce the blast radius
As yet another core tenet of NewEdge, in order to deliver high availability and maximize our network resilience, significant work goes into planning for and reducing the potential scope of any incident that could degrade performance. Since many of our competitors use large, concentrated data centers (and fewer of them), combined with an on-paper-only local market presence (with vPOPs, for example), the potential impact of any single outage can be very significant. If a single data center goes down or offline, the experience of a large subset of customers (potentially millions of users) can be impacted, which in turn ripples through the operations of their business. This is a totally unacceptable outcome, and it’s precisely why Service Level Agreements (SLAs) are so important to customers to back up any vendor claims. (For the record,