This is a guest post written by Jeff Schmidt, CEO of JAS Global Advisors LLC. JAS is currently authoring a “Name Collision Occurrence Management Framework” for the new gTLD program under contract with ICANN.
One of JAS’ commitments during this process was to “float” ideas and solicit feedback. This set of thoughts poses an alternative to the “trial delegation” proposals in SAC062. The idea springs from past DNS-related experiences and has an effect we have named “controlled interruption.”
Learning from the Expired Registration Recovery Policy
Many are familiar with the infamous Microsoft Hotmail domain expiration in 1999. In short, a Microsoft registration for passport.com (Microsoft’s then-unified identity service) expired on Christmas Eve 1999, denying millions of users access to the Hotmail email service (and several other Microsoft services) for roughly 20 hours.
Fortunately, a well-intended technology consultant recognized the problem and renewed the registration on Microsoft’s behalf, yielding a nice “thank you” from Microsoft and Network Solutions. Had a bad actor realized the situation, the outcome could have been far different.
The Microsoft Hotmail case and others like it led to the current Expired Registration Recovery Policy.
More recently, Regions Bank made news when its domains expired, and countless others go unreported. In the case of Regions Bank, the Expired Registration Recovery Policy seemed to work exactly as intended – the interruption inspired immediate action and the problem was solved, resulting in only a bit of embarrassment.
Importantly, there was no opportunity for malicious activity.
For the most part, the Expired Registration Recovery Policy is effective at preventing unintended expirations. Why? We call it the application of “controlled interruption.”
The Expired Registration Recovery Policy calls for extensive notification before the expiration, then a period when “the existing DNS resolution path specified by the Registrant at Expiration (“RAE”) must be interrupted” – as a last-ditch effort to inspire the registrant to take action.
Nothing inspires urgent action more effectively than service interruption.
But critically, in the case of the Expired Registration Recovery Policy, the interruption is immediately corrected if the registrant takes the required action — renewing the registration.
It’s nothing more than another notification attempt – just a more aggressive round after all of the passive notifications failed. In the case of a registration in active use, the interruption will be recognized immediately, inspiring urgent action. Problem solved.
What does this have to do with collisions?
A Trial Delegation Implementing Controlled Interruption
There has been a lot of talk about various “trial delegations” as a technical mechanism to gather additional data regarding collisions and/or attempt to notify offending parties and provide self-help information. SAC062 touched on the technical models for trial delegations and the related issues.
Ideally, the approach should achieve these objectives:
- Notifies systems administrators of possible improper use of the global DNS;
- Protects these systems from malicious actors during a “cure period”;
- Doesn’t direct potentially sensitive traffic to Registries, Registrars, or other third parties;
- Inspires urgent remediation action; and
- Is easy to implement and deterministic for all parties.
Like unintended expirations, collisions are largely a notification problem. The offending system administrator must be notified and take action to preserve the security and stability of their system.
One approach to consider as an alternative trial delegation concept is to apply controlled interruption to this notification problem. The approach draws on the effectiveness of the Expired Registration Recovery Policy, and its implementation would resemble a modified “Application and Service Testing and Notification (Type II)” trial delegation as proposed in SAC062.
But instead of responding with pointers to application layer listeners, the authoritative nameserver would respond with an address inside 127/8, the range reserved for loopback (localhost) addresses. This approach could be applied to A queries directly and to MX queries via an intermediary A record (the vast majority of collision behavior observed in DITL data stems from A and MX queries).
Responding with an address inside 127/8 will likely break any application that depends on an NXDOMAIN or some other response, but importantly it also prevents traffic from leaving the requestor’s network and blocks a malicious actor’s ability to intercede.
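To make the idea concrete, here is a minimal sketch of the answer logic described above. It is not a nameserver implementation. The `Record` type, the function name, the `mail-interrupt` label, the MX preference of 10, and the default address of 127.0.0.1 are all our assumptions for illustration; any address inside 127/8 could be substituted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """A simplified resource record (hypothetical type, for illustration only)."""
    name: str
    rtype: str
    rdata: str


def controlled_interruption_answer(
    qname: str,
    qtype: str,
    loopback_ip: str = "127.0.0.1",      # assumption: any address inside 127/8
    mx_label: str = "mail-interrupt",    # hypothetical label for the MX target
) -> List[Record]:
    """Return the records a controlled-interruption server might hand back."""
    name = qname.rstrip(".").lower() + "."   # normalise to a fully qualified name
    rtype = qtype.upper()

    if rtype == "A":
        # A queries are answered directly with the loopback address.
        return [Record(name, "A", loopback_ip)]

    if rtype == "MX":
        # MX queries point at an intermediary name whose A record is loopback.
        target = f"{mx_label}.{name}"
        return [
            Record(name, "MX", f"10 {target}"),
            Record(target, "A", loopback_ip),
        ]

    # Other query types are out of scope for this sketch.
    return []
```

In this sketch a mail server that follows the MX answer ends up trying to deliver to itself, so the message never leaves the requestor's own machine.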
Just as the Expired Registration Recovery Policy calls for “the existing DNS resolution path specified by the RAE [to] be interrupted”, responding with localhost will hopefully inspire immediate action by the offending party while not exposing them to new malicious activity.
If legacy/unintended use of a DNS name is present, one could think of controlled interruption as a “buffer” prior to use by a legitimate new registrant. This is similar to the CA Revocation Period as proposed in the New gTLD Collision Occurrence Management Plan, which “buffers” the legacy use of certificates in internal namespaces from new use in the global DNS. Like the CA Revocation Period approach, a set period of controlled interruption is deterministic for all parties.
Moreover, instead of using the typical 127.0.0.1 address for localhost, we could use a “flag” IP like 127.0.53.53.
Why? While troubleshooting the problem, the administrator will likely at some point notice the strange IP address and search the Internet for assistance. Making it known that new TLDs may behave in this fashion and publicizing the “flag” IP (along with self-help materials) may help administrators isolate the problem more quickly than just using the common 127.0.0.1.
We could also suggest that systems administrators proactively search their logs for this flag IP as a possible indicator of problems.
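As a rough illustration of what such a proactive search might look like, the sketch below scans log lines for the flag IP. The function name and the sample log format are our assumptions; real logs vary widely, and a simple text search with an existing tool would do just as well.

```python
import re
from typing import Iterable, Iterator, Tuple

# Match 127.0.53.53 only as a whole address, so that strings such as
# 127.0.53.530 or 10.127.0.53.53 are not reported as hits.
_FLAG_RE = re.compile(r"(?<![\d.])127\.0\.53\.53(?!\.?\d)")


def find_flag_hits(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line that mentions the flag IP."""
    for lineno, line in enumerate(lines, start=1):
        if _FLAG_RE.search(line):
            yield lineno, line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical log excerpt, made up for this example.
    sample = [
        "connect to 10.0.0.5 ok\n",
        "resolve mail.corp -> 127.0.53.53\n",
        "peer 127.0.53.530 looks bogus\n",
    ]
    for lineno, line in find_flag_hits(sample):
        print(f"line {lineno}: {line}")
```

Because `find_flag_hits` accepts any iterable of strings, an open file object can be passed in directly.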
Why the repeated 53? Staying within 127.0/16 seems prudent to make sure the IP is treated as localhost by a wide range of systems; the repeated 53, the well-known port number for DNS, will hopefully draw attention to the IP and provide another hint that the issue is DNS-related.
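The range claims are easy to check with Python's standard `ipaddress` module, as in the short sketch below (the function name and dictionary keys are ours). Note that this only confirms which reserved block the address falls in; it says nothing about how a particular operating system or application actually treats the address, which is exactly the kind of thing we would like feedback on.

```python
import ipaddress


def describe_flag_ip(ip: str = "127.0.53.53") -> dict:
    """Report where an IPv4 address sits relative to the loopback ranges."""
    addr = ipaddress.IPv4Address(ip)
    return {
        # True for any address inside 127.0.0.0/8
        "is_loopback": addr.is_loopback,
        # True only if the first two octets are 127.0
        "in_127_0_slash_16": addr in ipaddress.ip_network("127.0.0.0/16"),
        # The four octets as integers
        "octets": list(addr.packed),
    }


if __name__ == "__main__":
    print(describe_flag_ip())
```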
Two controlled interruption periods could even be used: one phase returning 127.0.53.53 for some period of time, and a second, slightly more aggressive phase returning 127.0.0.1. Such an approach may cover more failure modes of a wide variety of requestors while still providing helpful hints for troubleshooting.
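A two-phase schedule of this kind could be expressed as a simple date lookup. In the sketch below the phase lengths (60 and 30 days) are placeholders we made up purely for illustration; no durations have been proposed, and the function name is hypothetical.

```python
from datetime import date
from typing import Optional


def interruption_ip(
    today: date,
    start: date,
    phase1_days: int = 60,   # placeholder length, not a proposed value
    phase2_days: int = 30,   # placeholder length, not a proposed value
) -> Optional[str]:
    """Return the address to serve on a given day, or None outside both phases."""
    if today < start:
        return None                      # controlled interruption not yet begun
    elapsed = (today - start).days
    if elapsed < phase1_days:
        return "127.0.53.53"             # phase one: the "flag" IP
    if elapsed < phase1_days + phase2_days:
        return "127.0.0.1"               # phase two: the common localhost IP
    return None                          # both phases are over
```

Keeping the schedule a pure function of the date is one way to make the behavior deterministic for all parties: anyone can compute in advance which address will be served on a given day.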
A period of controlled interruption could be implemented before individual registrations are activated, or for an entire TLD zone using a wildcard. In the case of the latter, this could occur simultaneously with the CA Revocation Period as described in the New gTLD Collision Occurrence Management Plan.
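For the whole-zone variant, the wildcard could amount to as little as two records. The sketch below prints them in the usual zone-file presentation format; the TTL, the MX preference, and the `controlled-interruption` label are assumptions on our part, not proposed values.

```python
from typing import List


def wildcard_interruption_records(
    tld: str,
    flag_ip: str = "127.0.53.53",
    ttl: int = 3600,                              # assumed TTL
    mx_label: str = "controlled-interruption",    # hypothetical MX target label
) -> List[str]:
    """Build zone-file style lines for a TLD-wide controlled interruption."""
    origin = tld.strip(".").lower() + "."
    mx_target = f"{mx_label}.{origin}"
    return [
        # Any name under the TLD that has no records of its own resolves to the flag IP.
        f"*.{origin} {ttl} IN A {flag_ip}",
        # Mail is directed to a name that is itself covered by the wildcard A record.
        f"*.{origin} {ttl} IN MX 10 {mx_target}",
    ]


if __name__ == "__main__":
    for line in wildcard_interruption_records("example"):
        print(line)
```

No separate A record is written for the MX target here, since the wildcard A record would already answer for it.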
The ability to “schedule” the controlled interruption would further mitigate possible effects.
One concern in dealing with collisions is the reality that a potentially harmful collision may not be identified until months or years after a TLD goes live, when a particular second-level string is registered.
A key advantage of applying controlled interruption to all second-level strings in a given TLD in advance and at once via wildcard is that most failure modes will be identified during a scheduled time and before a registration takes place.
This has many positive features, including easier troubleshooting and the ability to execute a far less intrusive rollback if a problem does occur. From a practical perspective, avoiding a complex string-by-string approach is also valuable.
If there were to be a catastrophic impact, a rollback could be implemented relatively quickly, easily, and with low risk while the impacted parties worked on a long-term solution. At that point, a new registrant and its associated new dependencies would likely not yet be adding complexity.
Request for Feedback
As stated above, one of JAS’ commitments during this process was to “float” ideas and solicit feedback early in the process. Please consider these questions:
- What unintended consequences may surface if localhost IPs are served in this fashion?
- Will serving localhost IPs cause the kind of visibility required to inspire action?
- What are the pros and cons of a “TLD-at-once” wildcard approach running simultaneously with the CA Revocation Period?
- Is there a better IP (or set of IPs) to use?
- Should the controlled interruption plan described here be included as part of the mitigation plan? Why or why not?
- To what extent would this methodology effectively address the perceived problem?
- Other feedback?
We eagerly await your feedback, whether in comments to this blog, on the DNS-OARC Collisions list, or directly. Thank you and Happy New Year!