Overcoming DNS Challenges with Secure DNS v2

As we covered in our Project Autobahn blog, we challenged our engineering team to push the boundaries of Secure Access Service Edge (SASE) performance while improving security. With Project Autobahn released, we set our sights on Secure DNS to further improve the user experience through our pursuit of “Invisible Security.”

We started by speaking with our users to understand the requirements for creating a high-performance, seamless experience for staff in the office, out of the office, and distributed around the globe. In addition, we cataloged the wide range of DNS architectures in play, including on-premises Active Directory, hybrid Active Directory, Azure AD DNS, third-party DNS, and other implementations, to ensure the design was flexible enough to integrate with a wide variety of environments.

The feedback was tremendous, and we landed on the following goals:

Improve resiliency and eliminate single points of failure in customer environments
Reduce latency and increase performance for distributed organizations
Promote configuration flexibility with a new policy engine, based on user, device, location, and more
Provide more control with overrides and custom DNS records
Simplify cloud-native implementations with automatic SGN IP resolution
Increase visibility and simplify troubleshooting with visualizer tools

After reviewing the current implementation of Secure DNS, we decided to overhaul the module for performance and scalability to create a strong foundation for future enhancements.

The engineering behind Secure DNS v2

Below we cover some of the challenges, conversations, and architecture decisions made during the ground-up overhaul to create Secure DNS v2.

Starting with the basics: Forwarding vs. recursive resolution and blocklists

With DNS, the way that domains are resolved matters greatly to both security and usability. Typically, there are two main ways of configuring DNS server domain resolution:

Using forwarders: Relying on “outsourced” DNS servers (i.e., ISP-provided or public DNS servers)
Recursive resolution: Querying multiple DNS servers, starting from root servers, in succession

Both resolution methods have their advantages and drawbacks. Forwarders, for example, are simple, requiring little to no effort on the part of the business using them. On the flip side, forwarders are less secure: domain resolution is more susceptible to spoofing, and third parties gain visibility into your internet activity (a DNS leak).

We opted to continue leveraging recursive resolution, even though it is more intensive from a resource and development standpoint. Because best-in-class security is of the utmost importance, we wanted to remove external actors like ISPs from the DNS picture, and recursive resolution does exactly that.
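To make the difference concrete, here is a minimal sketch of recursive resolution over a toy, in-memory hierarchy. The server names, zone data, and address are made up for illustration, and no network traffic is sent; a real resolver also handles caching, timeouts, and DNSSEC, none of which is modeled here.

```python
from __future__ import annotations

# Toy model of recursive resolution. All server names and records below are
# hypothetical; each "server" either refers us onward or answers directly.
ROOT = "root-server"

SERVERS = {
    "root-server": {"referrals": {"com.": "com-tld-server"}},
    "com-tld-server": {"referrals": {"example.com.": "example-auth-server"}},
    "example-auth-server": {"answers": {"www.example.com.": "192.0.2.10"}},
}


def resolve(name: str, max_steps: int = 10) -> tuple[str, list[str]]:
    """Walk from the root towards the authoritative server for `name`.

    Returns the answer and the list of servers queried along the way.
    """
    server, path = ROOT, []
    for _ in range(max_steps):
        path.append(server)
        zone = SERVERS[server]
        if name in zone.get("answers", {}):
            return zone["answers"][name], path
        # Follow the referral whose zone is the longest suffix of the name.
        referrals = zone.get("referrals", {})
        matches = [z for z in referrals if name == z or name.endswith("." + z)]
        if not matches:
            raise LookupError(f"{name} does not exist")
        server = referrals[max(matches, key=len)]
    raise LookupError("too many referrals")
```

Calling `resolve("www.example.com.")` visits the root, the TLD server, and the authoritative server in turn. A forwarder, by contrast, would hand the whole question to a single third-party server and trust its answer.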

Another foundational component of Secure DNS is malicious domain blocklists. Speed is critical when it comes to blocking malicious hostnames. Our real-time intelligence pipeline draws on paid and open-source feeds, along with submissions from the Research, MXDR, and Detection Engineering teams, to ensure that all users are protected when a new threat is discovered.
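As an illustration of why blocklist checks can be fast, here is a small sketch that tests a hostname and each of its parent domains against an in-memory set. The entries are hypothetical, and treating subdomains of a listed domain as blocked is an assumption of this sketch rather than a description of how Secure DNS matches names.

```python
from __future__ import annotations

# Hypothetical blocklist entries, stored in a set for constant-time lookups.
BLOCKLIST = {"malware.example", "phish.example.net"}


def is_blocked(hostname: str, blocklist: set[str] = BLOCKLIST) -> bool:
    """Return True if the hostname or any parent domain is on the blocklist."""
    labels = hostname.lower().rstrip(".").split(".")
    # For "a.b.c" this checks "a.b.c", then "b.c", then "c".
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))
```

Matching on whole labels means `cdn.malware.example` is blocked while `notmalware.example` is not.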

Secure DNS v2 provides a better user experience for malicious domains through an integration with the SASE Proxy module. This allows for a simple, easy-to-read page explaining the block to users, whether the request was made over HTTP or HTTPS.

Internal domains, uptime, and geographic distance

Secure DNS operates within Todyl’s multi-tenant global network, creating the need to ensure the separation of configurations and policy at internet scale. Internal DNS records vary tenant by tenant and require routing and other configurable resolution behaviors depending on user, device, location, and more, going far beyond basic DNS resolution and routing.

A common use case is to integrate SASE with company-managed DNS servers that handle internal domain resolution, where administrators have control over records and resolution behavior. This introduces several challenges as company servers add latency depending on where users are connected. If the company-managed server experiences any issues, the impact is far-reaching across the organization.

For example, if a company has a DNS server at its headquarters in Los Angeles with a small remote site in London, DNS queries would incur latency as the request would have to traverse a significant distance. This creates a poor user experience with slower application response and page load times. Continuing with this common architecture example, if the servers located in the Los Angeles HQ experience an issue or require maintenance, all connected devices as well as users would be impacted.

Another challenge introduced by internal domains across multi-tenant SASE environments is caching. Secure DNS v2 adds different policy-based configurations within a tenant, including routing, upstream, overrides, and more. This requires a multi-tiered caching design that is policy aware, rather than one that operates only at the tenant level, along with an easy way to flush the cache via the portal.

Introducing Smart-Cache, our new intelligent caching engine

So, what was our solution for minimizing latency and improving resiliency against server failure or maintenance in a multi-tenant, multi-policy environment? A new caching layer in Secure DNS called smart-cache.

Smart-cache operates at the policy level in every point of presence (PoP), drastically improving DNS lookup time for internal domains when the user is far from the server. Referencing the example above, the users connected to the London PoP would first have their DNS requests served from the smart-cache, eliminating the round trip from London to Los Angeles. In the event the record is not cached, only the first query would incur the additional latency, and all other users and devices in the same policy would leverage the cache for subsequent lookups. Because the smart-cache is policy aware, users and devices receive the same performance improvement wherever they connect, while each policy still gets its own answers.
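The idea of a policy-aware cache can be sketched as a cache keyed by tenant, policy, and hostname instead of tenant and hostname alone. The class, function, and address values below are hypothetical and only illustrate the keying; TTL handling is left out.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str, str]  # (tenant, policy, hostname)


class PolicyCache:
    """Caches answers per (tenant, policy, hostname), not just per tenant."""

    def __init__(self, upstream: Callable[[str, str, str], str]) -> None:
        self._upstream = upstream
        self._entries: Dict[CacheKey, str] = {}
        self.upstream_queries = 0  # counts slow round trips to the server

    def lookup(self, tenant: str, policy: str, hostname: str) -> str:
        key = (tenant, policy, hostname.lower())
        if key not in self._entries:
            # Only the first query for this policy pays the upstream latency.
            self.upstream_queries += 1
            self._entries[key] = self._upstream(tenant, policy, hostname)
        return self._entries[key]


def demo_upstream(tenant: str, policy: str, hostname: str) -> str:
    # Hypothetical upstream: each policy resolves the name differently.
    return {"office": "10.0.0.5", "remote": "10.1.0.5"}[policy]
```

Two lookups under the same policy trigger a single upstream query, while a lookup under a different policy gets its own entry and cannot be served an answer cached for another policy.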

Smart-cache was also designed to increase resiliency against server failure and maintenance. One key design decision is that smart-cache checks the upstream server for a response before expiring a cached entry. This allows Secure DNS to continue serving internal lookups even if the DNS server is offline for a limited amount of time. Smart-cache otherwise honors upstream TTLs, and when a record change needs to take effect immediately, users can flush the cache via the Clear DNS Cache button in the portal.
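One way to picture this behavior is a cache that tries to refresh an entry when its TTL runs out and keeps serving the old answer for a bounded window if the upstream is unreachable. This is a sketch of that general technique, not the smart-cache implementation; the class name, the `max_stale` window, and the injectable clock are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Entry:
    address: str
    expires_at: float


class ResilientCache:
    def __init__(
        self,
        upstream: Callable[[str], Tuple[str, float]],  # returns (address, ttl)
        max_stale: float = 3600.0,  # hypothetical limit on serving stale data
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream
        self._max_stale = max_stale
        self._clock = clock
        self._entries: Dict[str, Entry] = {}

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry and now < entry.expires_at:
            return entry.address  # still fresh: honor the upstream TTL
        try:
            address, ttl = self._upstream(name)
        except OSError:
            # Upstream unreachable: keep serving the old answer for a while.
            if entry and now < entry.expires_at + self._max_stale:
                return entry.address
            raise
        self._entries[name] = Entry(address, now + ttl)
        return address

    def flush(self) -> None:
        """Drop everything, similar in spirit to a manual cache clear."""
        self._entries.clear()
```

Passing a fake clock makes the behavior easy to exercise: the entry is served from cache while fresh, served stale while the upstream is down, and finally fails once the stale window has passed.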

Split horizon & in and out of office configurations

Running SASE on a device is beneficial in almost any scenario. When the device operates from a secured location like a physical office, however, SASE can add complexity: IT and security teams may have different access levels and ownership, and different locations may expect different resolution behavior from split-horizon DNS.

Having users turn off SASE while in the office to make local resources available directly through the local network is a potential workaround. This, however, is not recommended. First off, users may forget to turn SASE back on when they are out of the office, leading to decreased security. Additionally, SASE still provides additional benefits that make it important regardless of where the user operates. Instead, the best way to remedy these issues altogether is to have location-aware DNS that leverages the proper DNS configuration based on a given device’s location.

Introducing Location-Based Policies

To solve this, we built location awareness into Secure DNS v2 that allows you to configure different DNS settings based on the device’s location: e.g., Server A when the device is remote, but Server B when the device is in the office. This can be refined even further via policies, allowing for DNS resolution behaviors that are specific to an individual device or set of devices.
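As a rough sketch of how location-based selection could work, the snippet below picks an upstream server from a list of policies, preferring the most specific match. The policy fields, the device names, and the rule that a device-specific policy beats a location-only one are assumptions for illustration, not the product's actual matching rules.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DnsPolicy:
    upstream: str
    location: Optional[str] = None  # None matches any location
    device: Optional[str] = None    # None matches any device

    def matches(self, device: str, location: str) -> bool:
        return self.location in (None, location) and self.device in (None, device)

    def specificity(self) -> tuple:
        # Assumed ordering: device-specific policies outrank location-only ones.
        return (self.device is not None, self.location is not None)


def select_upstream(policies: Sequence[DnsPolicy], device: str, location: str) -> str:
    candidates = [p for p in policies if p.matches(device, location)]
    if not candidates:
        raise LookupError("no DNS policy matches this device and location")
    return max(candidates, key=DnsPolicy.specificity).upstream


POLICIES = [
    DnsPolicy("server-a", location="remote"),
    DnsPolicy("server-b", location="office"),
    DnsPolicy("server-c", location="office", device="kiosk-01"),
]
```

With these example policies, a laptop resolves through `server-a` when remote and `server-b` in the office, while the device named `kiosk-01` gets `server-c` in the office.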

Cloud native resolution behavior

Every SASE device receives a static SGN IP address that can be used for device-to-device communication via SASE. For example, a common use case is secure remote access. A desktop in an office (running the Todyl Agent, with the SASE module enabled in the tenant) can accept a remote connection via its SGN IP from another device in the same tenant, such as the laptop of a user working from home.

Zero Trust Network Access can be configured as well, limiting connections to a specific user or device and even requiring MFA before a connection is allowed. The complexity, from the user’s standpoint, comes from needing to know and use the SGN IP. The Todyl Agent features location-aware shortcuts that can be configured at the user level, so if the laptop in the example above leaves the office and the user is MFA-authenticated, they would see the RDP shortcut in the agent. However, after speaking with users, the need for additional flexibility was clear.

Introducing SGN IP Resolution Policies

Policies in Secure DNS v2 feature SGN IP Resolution, automatically resolving device names to SGN IPs across a tenant. This can be combined with other policy attributes including users, devices, and location, giving flexibility to address several use cases. Referencing the use case above, a policy can be created to enable SGN IP resolution when the laptop is out of the office, simplifying the RDP connection process. This feature also comes in handy for server environments in a multi-cloud configuration to eliminate the need for tunnels and other complex underlying infrastructure. The server would simply resolve directly to the SGN IPs, giving portability across cloud environments and making the connections cloud native via SASE.
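The laptop example can be sketched as a lookup that returns a device's SGN IP only when the active policy enables SGN IP resolution, and otherwise falls back to the local record. The device names, addresses, and helper functions here are hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical data: SGN IPs assigned per device, and the office LAN records.
SGN_IPS = {"office-desktop": "10.200.0.21", "build-server": "10.200.0.40"}
LAN_RECORDS = {"office-desktop": "192.168.1.21"}


def sgn_resolution_enabled(location: str) -> bool:
    """Example policy: resolve device names to SGN IPs when out of the office."""
    return location != "office"


def resolve_device(
    name: str,
    sgn_resolution: bool,
    sgn_ips: Mapping[str, str] = SGN_IPS,
    lan_records: Mapping[str, str] = LAN_RECORDS,
) -> Optional[str]:
    key = name.lower()
    if sgn_resolution and key in sgn_ips:
        return sgn_ips[key]
    # Policy does not enable SGN resolution: use the local record, if any.
    return lan_records.get(key)
```

The user types the same device name in both places. Away from the office it resolves to the SGN IP, and in the office it resolves to the LAN address.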

Resolution control and eliminating the need for on-premises DNS

Not every environment needs on-premises DNS. Some use cases leverage public DNS, third-party DNS, or a DNS server built into the gateway. In these cases, custom DNS records need to be created in the Secure DNS module itself, giving administrators control of upstream responses.

The complexity comes from the way the SSL Inspection Proxy works, which presents an interesting engineering challenge. At a high level, the multi-tenant DNS modules are separate from the SSL Inspection Proxy module. When an HTTPS connection is intercepted by the SSL Inspection Proxy module, the client connection is terminated, and a new connection to the upstream is opened by resolving the hostname in the SNI to obtain the destination IP.

Enforcing an SNI match and separating the DNS resolution is a critical security feature for SSL Inspection. In the event of a compromise, if malware were to modify the hosts file on a device, it could redirect traffic to a different destination, such as a malicious upstream server mimicking the true destination.

Let’s add some additional color to the example. The malware creates a hosts file entry for mybank.com, pointing the domain to a malicious server with a replicated webpage. When the connection terminates at the SSL Inspection Proxy, the proxy reads the SNI, resolves the destination IP, and opens a new connection. In this case, the SSL Inspection Proxy would resolve the SNI (mybank.com) to its actual IP address, not the one in the poisoned hosts file. Furthermore, even if mybank.com resolved to the IP address of the malicious server (a poisoned DNS entry), the threat actor would not have a valid mybank.com certificate, and the SSL Inspection Proxy would reject the connection (defense in depth).
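Both layers of this example can be modeled in a few lines. The sketch below is a simplification with made-up addresses: certificate validation is reduced to checking which names a server can present a valid public certificate for, whereas a real proxy performs full chain and hostname verification during the TLS handshake.

```python
from typing import Mapping, Set

# What the proxy's own resolver returns (hypothetical addresses).
TRUSTED_DNS = {"mybank.com": "203.0.113.10"}

# Names each server holds a valid public certificate for (hypothetical).
CERT_NAMES = {
    "203.0.113.10": {"mybank.com"},
    "198.51.100.66": {"evil.example"},  # the attacker's look-alike server
}


def upstream_for(
    sni: str,
    client_dest_ip: str,
    dns: Mapping[str, str] = TRUSTED_DNS,
    cert_names: Mapping[str, Set[str]] = CERT_NAMES,
) -> str:
    """Return the IP the proxy connects to for an intercepted connection.

    `client_dest_ip` is what the device tried to reach. It is deliberately
    ignored, so a poisoned hosts file cannot steer the proxy.
    """
    ip = dns[sni]
    if sni not in cert_names.get(ip, set()):
        raise PermissionError(f"{ip} has no valid certificate for {sni}")
    return ip
```

With a poisoned hosts file, the device dials `198.51.100.66` but the proxy still connects to `203.0.113.10`. With poisoned DNS, the lookup returns the attacker's address, the certificate check fails, and the connection is rejected.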

SNI enforcement is here to stay, as security is paramount. But what about the case where the user owns the domain and wants to change the destination based on a user’s location? In this case, a custom DNS record could be created, and resolution behavior would change via policy. However, the overridden destination IP would be ignored by the SSL Inspection Proxy, since the proxy’s DNS resolution needs to remain separate for security. The DNS override would therefore have no effect on HTTPS traffic, which is a very common use case.

As a matter of policy, the proxy must always enforce public PKI and SNI validation. But if the user controls a valid public certificate for the domain, we needed a way to redirect the proxy traffic to the destination that the host resolved via DNS.

Introducing custom DNS records

This was an exciting problem to solve. We landed on a secure approach that would only honor valid Public PKI, enforce SNI, and allow the proxy to redirect traffic to a specific destination that was entered into the DNS policy. We created a static domain name module in the multi-tenant proxy that ties into the Secure DNS policy engine. This allows for DNS to work with the SSL Inspection Proxy while maintaining a strong security posture.
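Conceptually, the approach amounts to letting a policy-defined record choose the destination while leaving certificate and SNI checks untouched. The following sketch illustrates that idea under simplifying assumptions: the function, the data, and the reduction of PKI validation to a set lookup are all hypothetical, and this is not the static domain name module itself.

```python
from typing import Mapping, Set


def proxy_destination(
    sni: str,
    policy_records: Mapping[str, str],   # custom records from the DNS policy
    public_dns: Mapping[str, str],       # the proxy's normal resolution
    cert_names: Mapping[str, Set[str]],  # names each IP has a valid cert for
) -> str:
    """Pick the upstream IP, honoring a custom record only if the cert is valid."""
    ip = policy_records.get(sni) or public_dns.get(sni)
    if ip is None:
        raise LookupError(f"cannot resolve {sni}")
    # The override changes where we connect, never whether we validate.
    if sni not in cert_names.get(ip, set()):
        raise PermissionError(f"{ip} has no valid certificate for {sni}")
    return ip


# Hypothetical data: the tenant owns app.example.com and runs a regional copy.
PUBLIC_DNS = {"app.example.com": "203.0.113.20"}
CERTS = {"203.0.113.20": {"app.example.com"}, "10.20.0.8": {"app.example.com"}}
```

An override that points at a server holding a valid certificate for the name is honored, while an override that points anywhere else is still rejected.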

This also eliminates the need for on-premises DNS servers as custom entries can be created, and tied to a policy, giving the ability to control resolution policy based on device, user, and location. There are a lot of possibilities with the flexibility this module delivers.

Additional features of Secure DNS v2

Beyond these challenges, here are some of the other facets we built into Secure DNS v2.

Simplicity

Whether or not you have internal (local) domains or multiple device locations (e.g., office, remote), we designed our DNS workflow to always be the same: all SASE-connected devices use Secure DNS for resolution by default. That way, there are no longer any complicated workflows such as using Secure DNS as an upstream forwarder or shutting off DNS entirely and losing its security features.

No single point of failure

This simplified workflow solves another problem: what happens when a "local" DNS server goes down. With Secure DNS v2, public domains are always resolved, because the "local" DNS server is now used only for internal domains. Users set a policy that designates the local DNS server as the server of choice for given domains, and Secure DNS automatically forwards requests for those domains to your local AD/DNS.
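This kind of per-domain routing is often called conditional forwarding. A minimal sketch, assuming hypothetical zone names and server addresses, picks the local server for names inside a designated zone and falls back to normal recursive resolution for everything else:

```python
from __future__ import annotations

# Hypothetical policy: internal zones mapped to the local DNS server to use.
INTERNAL_ZONES = {
    "corp.example": "10.0.0.53",
    "lab.corp.example": "10.0.9.53",
}

RECURSIVE = "recursive"  # marker meaning "resolve normally, no local server"


def pick_resolver(hostname: str, zones: dict[str, str] = INTERNAL_ZONES) -> str:
    """Return the local server for internal names, else the RECURSIVE marker."""
    name = hostname.lower().rstrip(".")
    matches = [z for z in zones if name == z or name.endswith("." + z)]
    if matches:
        # The longest matching zone wins, so lab.corp.example beats corp.example.
        return zones[max(matches, key=len)]
    return RECURSIVE
```

Because only names under the listed zones ever reach the local server, an outage there leaves lookups for every other domain unaffected.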

Built-in simple DNS server / domain overrides

Don’t have your own AD/DNS server but want local names resolved to internal IP addresses? Secure DNS v2 offers a simple and configurable built-in DNS server so you can set up any domain names to be resolved.

Another use case here is to override any domain resolution behavior. Want `google.com` to be resolved to `10.10.10.10`? No problem: just set up a policy.
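An override of this kind can be thought of as a per-policy record table that is consulted before normal resolution. The sketch below uses hypothetical policy names and records, and a stand-in `fallback` function in place of real resolution:

```python
from typing import Callable, Mapping

# Hypothetical per-policy records: custom names and overrides side by side.
OVERRIDES = {
    "office": {
        "google.com": "10.10.10.10",
        "printer.local": "192.168.1.50",
    },
}


def resolve_with_overrides(
    hostname: str,
    policy: str,
    fallback: Callable[[str], str],
    overrides: Mapping[str, Mapping[str, str]] = OVERRIDES,
) -> str:
    """Answer from the policy's records if present, else resolve normally."""
    key = hostname.lower().rstrip(".")
    record = overrides.get(policy, {}).get(key)
    return record if record is not None else fallback(hostname)
```

Devices under the `office` policy receive `10.10.10.10` for `google.com`, while devices under any other policy fall through to normal resolution.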

DNS visualizer / device name lookups

To cap it all off, Secure DNS v2 also provides you with a way to double-check all your DNS resolution behaviors. Using our plugin, you can examine DNS behaviors by toggling several different configurations, including the device or device group, location, a combination of the two, and how they behave when heading to a specific web address or IP.

In addition, to streamline the process even further, Secure DNS v2 allows you to use the device’s name when configuring policies instead of its SGN IP address to make it easier to pick the correct devices in an instant.

Learn More

Secure DNS v2 is just one of the many exciting updates we have in store for this year, so stay tuned to see everything we’re adding to the Todyl Platform.