After spending the last 36 hours crunching ICANN’s lists of almost 10 million new gTLD name collisions, we have the DI PRO collisions database back online and can start reporting some interesting facts.
First, while we reported yesterday that 1,318 new gTLD applicants will be asked to block a total of 9.8 million unique domain names, the number of distinct second-level strings involved is somewhat smaller.
It’s 6,806,050, according to our calculations, still a bewilderingly high number.
The most commonly blocked string, as expected, is “www”. It’s on the block-lists for 1,195 gTLDs, over 90% of the total.
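For readers who want to run this kind of count themselves, here is a minimal Python sketch. It assumes the block-lists have already been loaded into a dict mapping each gTLD to its set of blocked second-level strings. The function name and the sample data are invented for illustration and are not the DI PRO database.

```python
from collections import Counter

def summarize_blocklists(blocklists):
    """Return (total blocked names, distinct strings, per-string gTLD counts)."""
    # Every (string, gTLD) pair is one blocked domain name.
    total_names = sum(len(slds) for slds in blocklists.values())
    # Count how many gTLD lists each second-level string appears on.
    tld_counts = Counter()
    for slds in blocklists.values():
        tld_counts.update(set(slds))
    return total_names, len(tld_counts), tld_counts

# Hypothetical sample data, for illustration only.
sample = {
    "example1": {"www", "2010", "wpad"},
    "example2": {"www", "mail"},
    "example3": {"www", "2010"},
}
total, distinct, counts = summarize_blocklists(sample)
print(total, distinct, counts.most_common(2))
```

The gap between the first two numbers is the same effect described above: a string blocked in many gTLDs counts once as a distinct string but many times as a blocked domain name.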
Second is “2010”. I currently have no explanation for this, but I’m wondering if it’s an artifact of the years of Day In The Life data upon which ICANN based its lists.
Protocol-related strings such as “wpad” and “isatap” also rank highly, as do strings matching popular TLDs such as “com”, “org”, “uk” and “de”. Single-character strings are also very popular.
The brand with the most blocks (free trademark protection?) is unsurprisingly Google.
The string “google” appears as an exact match on 930 gTLDs’ lists. It appears as a substring of 1,235 additional blocked strings, such as “google-toolbar” and “googlemaps”.
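The two Google figures measure different things: the first counts gTLD lists containing the exact string, the second counts distinct longer strings that contain it. A rough Python sketch of that distinction follows, again assuming a dict of gTLD to blocked strings. The helper name and sample data are hypothetical.

```python
def brand_stats(brand, blocklists):
    """Return (number of gTLD lists with the exact string,
    set of other blocked strings that contain it)."""
    # Set membership is an exact match on the whole string.
    exact_tlds = sum(1 for slds in blocklists.values() if brand in slds)
    # String containment is a substring match; the exact string is excluded.
    containing = {
        s
        for slds in blocklists.values()
        for s in slds
        if brand in s and s != brand
    }
    return exact_tlds, containing

# Hypothetical sample data, for illustration only.
sample = {
    "example1": {"google", "googlemaps"},
    "example2": {"google", "google-toolbar", "www"},
    "example3": {"googlemaps", "mail"},
}
exact_tlds, containing = brand_stats("google", sample)
print(exact_tlds, sorted(containing))
```

A plain substring test like this will also catch unrelated strings that happen to contain a short brand name, so the counts it produces are upper bounds rather than a measure of brand-related blocks.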
Facebook, Yahoo, Gmail, YouTube and Hotmail also feature in the top 100 blocked brands.
DI PRO subscribers can use the database to search for strings that interest them and discover how many gTLDs, and which ones, block them.
Here’s a table of the top 50 blocked strings.
With all the talk recently about DNS Namespace Collisions, the heretofore relatively obscure Day In The Life (“DITL”) datasets maintained by the DNS-OARC have been getting a lot of attention.
While these datasets are well known to researchers, I’d like to take the opportunity to provide some background and talk a little about how these datasets are being used to research the DNS Namespace Collision issue.
The Domain Name System Operations Analysis and Research Center (“DNS-OARC”) began working with the root server operators to collect data in 2006. The effort was dubbed “Day In The Life of the Internet (DITL).”
Root server participation in the DITL collection is voluntary and the number of contributing operators has steadily increased; in 2010, all of the 13 root server letters participated. DITL data collection occurs on an annual basis and covers approximately 50 contiguous hours.
DNS-OARC’s DITL datasets are attractive for researching the DNS Namespace Collision issue because:
- DITL contains data from multiple root operators;
- The robust annual sampling methodology (with samples dating back to 2006) allows trending; and
- It’s available to all DNS-OARC Members.
More information on the DITL collection is available on DNS-OARC’s site at https://www.dns-oarc.net/oarc/data/ditl.
Terabytes and terabytes of data
The data consists of the raw network “packets” destined for each root server. Contained within the network packets are the DNS queries. The raw data consists of many terabytes of compressed network capture files and processing the raw data is very time-consuming and resource-intensive.
While several researchers have looked at DITL datasets over the years, the current collisions-oriented research started with Roy Hooper of Demand Media. Roy created a process to iterate through this data and convert it into intermediate forms that are much more usable for researching the proposed new TLDs.
We started with his process and continued working with it; our code is available on GitHub for others to review.
Finding needles in DITL haystacks
The first problem faced by researchers interested in new TLDs is isolating the relatively few queries of interest among many terabytes of traffic that are not of interest.
Each root operator contributes several hundred – or several thousand – files full of captured packets in time-sequential order. These packets contain every DNS query reaching the root that requests information about DNS names falling within delegated and undelegated TLDs.
The first step is to search these packets for DNS queries involving the TLDs of interest. The result is one file per TLD containing all queries from all roots involving that TLD. If the input packet is considered a “horizontal” slice of root DNS traffic, then this intermediary work product is a “vertical” slice per TLD.
These intermediary files are much more manageable, ranging from just a few records to 3 GB. To support additional investigation and debugging, the intermediary files that JAS produces are fully “traceable” such that a record in the intermediary file can be traced back to the source raw network packet.
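The slicing step can be pictured with a short Python sketch. This is not the code referred to above; it assumes the packets have already been decoded into simple records, and the field names, file names and offsets are invented for illustration. Each record carries a "source" pointer so that a line in a per-TLD slice can be traced back to the capture file it came from.

```python
from collections import defaultdict

def tld_of(qname):
    """Return the rightmost label of a query name, lowercased."""
    return qname.rstrip(".").lower().split(".")[-1]

def slice_by_tld(queries, tlds_of_interest):
    """Group query records by TLD, keeping only the TLDs of interest."""
    slices = defaultdict(list)
    for record in queries:
        tld = tld_of(record["qname"])
        if tld in tlds_of_interest:
            # The whole record is kept, including its "source" pointer,
            # so the slice stays traceable to the raw capture.
            slices[tld].append(record)
    return dict(slices)

# Hypothetical decoded records: (capture file, packet index) as the source.
queries = [
    {"qname": "www.corp.", "source": ("a-root-0001.pcap", 17)},
    {"qname": "example.com.", "source": ("a-root-0001.pcap", 18)},
    {"qname": "Printer.HOME", "source": ("k-root-0042.pcap", 3)},
    {"qname": "mail.corp", "source": ("k-root-0042.pcap", 9)},
]
slices = slice_by_tld(queries, {"corp", "home"})
for tld, records in sorted(slices.items()):
    print(tld, len(records))
```

In practice each slice would be written to its own file rather than held in memory, since the real inputs run to terabytes.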
The DITL data contain quite a bit of noise, primarily DNS traffic that was not actually destined for the root. Our process filters the data by destination IP address so that the only remaining data is that which was originally destined for the root name servers.
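A destination filter of this kind might look like the following Python sketch. The addresses are placeholders from the documentation ranges, not real root server addresses, and the "dst" field name is an assumption about how decoded records are stored.

```python
import ipaddress

# Placeholder addresses (documentation ranges), standing in for the
# real root server addresses, which are not listed here.
ROOT_ADDRESSES = {
    ipaddress.ip_address("192.0.2.1"),
    ipaddress.ip_address("2001:db8::1"),
}

def destined_for_root(record, root_addresses=ROOT_ADDRESSES):
    """True if the record's destination address is a known root address."""
    return ipaddress.ip_address(record["dst"]) in root_addresses

# Hypothetical decoded records, for illustration only.
records = [
    {"dst": "192.0.2.1", "qname": "www.corp."},
    {"dst": "198.51.100.7", "qname": "www.corp."},     # noise: not a root address
    {"dst": "2001:DB8::1", "qname": "printer.home."},  # same address, different case
]
kept = [r for r in records if destined_for_root(r)]
print(len(kept))
```

Parsing the addresses with the ipaddress module, rather than comparing strings, means differently written forms of the same IPv6 address still match.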
JAS has made these intermediary per-TLD files available to DNS-OARC members for further analysis.
The intermediary files are comparatively small and easy to parse, opening the door to more elaborate research. For example, JAS has written various “second passes” that classify queries, separate queries that use valid syntax at the second level from those that don’t, detect “randomness,” fit regular expressions to the queries, and more.
We have also checked to confirm that second-level queries that look like Punycode IDNs (start with ‘xn--’) are valid Punycode. It is interesting to note the tremendous volume of erroneous, technically invalid, and/or nonsensical DNS queries that make it to the root.
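As an illustration of what such a second pass could involve, here is a small Python sketch that sorts second-level labels into rough buckets. It is an assumption about the approach, not the actual code: the syntax rule is the usual letters-digits-hyphen convention for hostnames, and the Punycode test only checks that the label decodes with Python's built-in punycode codec, which is weaker than full IDNA validation.

```python
import re

# Letters, digits and hyphens; 1 to 63 characters; no leading or trailing hyphen.
LDH_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")

def classify_label(label):
    """Classify a second-level label as ascii, idn, invalid-punycode
    or invalid-syntax."""
    label = label.lower()
    if not LDH_LABEL.fullmatch(label):
        return "invalid-syntax"
    if label.startswith("xn--"):
        try:
            # Decode the part after the prefix; malformed input raises
            # UnicodeError, which is a subclass of ValueError.
            label[4:].encode("ascii").decode("punycode")
        except ValueError:
            return "invalid-punycode"
        return "idn"
    return "ascii"

for example in ["www", "_dmarc", "-bad-", "xn--p1ai", "xn--z"]:
    print(example, classify_label(example))
```

Labels containing an underscore, or beginning or ending with a hyphen, fall into the invalid-syntax bucket, which is where much of the nonsensical traffic would be expected to land.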
Also of interest is that the datasets are dominated by query strings that appear random and/or machine-generated.
Google’s Chrome browser generates three random 10-character queries upon startup in an effort to detect network properties. Those “Chrome 10” queries together with a relatively small number of other common patterns comprise a significant proportion of the entire dataset.
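A crude way to flag candidate “Chrome 10” queries is a pattern match on the label. The Python sketch below assumes the random strings consist of exactly ten ASCII letters; that is an assumption made for illustration, and the sample labels are invented.

```python
import re

# Assumed shape of a "Chrome 10" label: exactly ten ASCII letters.
CHROME10 = re.compile(r"[a-z]{10}")

def looks_like_chrome10(label):
    """True if the label matches the assumed ten-letter pattern."""
    return CHROME10.fullmatch(label.lower()) is not None

def chrome10_share(labels):
    """Fraction of labels that match the pattern (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return sum(looks_like_chrome10(label) for label in labels) / len(labels)

# Hypothetical sample labels, for illustration only.
labels = ["qpzmxnvbty", "www", "hxkqwzjrtp", "mail", "basketball"]
print(chrome10_share(labels))
```

Note that “basketball” is flagged too. A pattern match alone cannot tell a random string from a real ten-letter word, which is why some measure of randomness is needed on top of it.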
Research is being done in order to better understand the source of these machine-generated queries.
More technical details and information on running the process are available on the DNS-OARC web site.
This is a guest post written by Kevin White, VP Technology, JAS Global Advisors LLC. JAS is currently authoring a “Name Collision Occurrence Management Framework” for the new gTLD program under contract with ICANN.
JAS Global Advisors, the consultancy hired by ICANN to provide the final analysis on the risks posed by name collisions in new gTLDs, is to exclusively guest-blog its work here on DI.
ICANN picked JAS to provide a “Name Collision Occurrence Management Framework” earlier this week.
Its job is to basically figure out how new gTLD registries — some of which have been told to block many thousands of potential collisions from their zones — can identify and mitigate the risks, if any, posed by these names.
The framework will help registries reduce the size of their block-lists, in other words.
JAS expects to provide a short series of guest posts over the next few months, explaining the state of the project as it progresses. Reader comments will be read, I’m assured.
JAS CEO Jeff Schmidt said: “The macro intent is to shorten the feedback cycle so folks can see where we are incrementally and comment along the way.”
I’m hoping that the guest posts will provide DI readers with insight into the issue that is as disinterested as DI’s usual coverage, but better informed on the nitty-gritty of the affected technologies.
JAS is a regular consultant for ICANN. It was one of the independent evaluators for the new gTLD program itself.
I’m told that JAS has no financial relationships with any new gTLD applicants, which generally think the collision risks have been overstated, or with Verisign, which says they could cause real damage.
JAS isn’t getting paid for the posts; nor is DI getting paid to carry them.
The first post in the series will appear soon, probably Friday.
A lot of people have noticed since the first four new gTLDs were delegated yesterday that Google’s Chrome browser doesn’t seem to handle internationalized domain names.
In fact it does, but if you’re an English-speaking user you’ll probably need to make a few small configuration changes, which should take less than a minute, to make it work.
As far as the DNS is concerned, these are the same URLs. They’re just displayed differently by Chrome, depending on your browser’s display languages settings.
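To see why the two forms are the same name, here is a short Python sketch using the standard library's built-in idna codec (which implements the older IDNA 2003 rules). The helper names are made up, and the Cyrillic label is just an example.

```python
def to_ascii_form(name):
    """Convert a Unicode domain name to the ASCII form used in the DNS."""
    return name.encode("idna").decode("ascii")

def to_display_form(name):
    """Convert the ASCII (xn--) form back to Unicode for display."""
    return name.encode("ascii").decode("idna")

unicode_name = "онлайн"  # example label only
ascii_name = to_ascii_form(unicode_name)
print(ascii_name)
print(to_display_form(ascii_name))
```

The DNS only ever sees the ASCII form beginning with “xn--”. Which of the two forms appears in the address bar is purely a display decision made by the browser.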
If you want to see the Cyrillic version in your address bar, simply:
- Go to the Chrome Settings menu via the toolbar menu or by typing chrome://settings into the address bar.
- Click the “Language and input settings” button. It’s in the Advanced options bit, which may be hidden at first. Scroll all the way down to unhide.
- Click the Add button to add the languages you want to support in the address bar.
Right now, you can see all three active IDN gTLDs in their intended scripts by adding Arabic, Chinese (Simplified Han) and Russian. As gTLDs in other scripts are added, you’ll need to add those too.
Thanks to DNS jack o’ all trades Jothan Frakes for telling me how to do this.
ICANN has given blessed relief to many new gTLD applicants by wiping potentially months off their path to delegation.
Its New gTLD Program Committee this week adopted a new “New gTLD Collision Occurrence Management Plan” which aims to tackle the problem of clashes between new gTLDs and names used on private networks.
The good news is that the previous categorization of strings according to risk, which would have delayed “uncalculated risk” gTLDs by months pending further study, has been scrapped.
The two “high risk” strings — .home and .corp — don’t catch a break, however. ICANN says it will continue to refuse to delegate them “indefinitely”.
For everyone else, ICANN said it will conduct additional studies into the risk of name collisions, above and beyond what Interisle Consulting already produced.
The study will take into account not only the frequency with which new gTLDs currently generate NXDOMAIN traffic in the DNS root, but also the number of second-level domains queried, the diversity of requesting sources, and other factors.
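To make those factors concrete, here is a rough Python sketch that computes three of them per TLD from a list of (source address, query name) pairs. It illustrates the kind of measurement involved and is not ICANN's methodology. The function name and sample data are invented, and the addresses come from documentation ranges.

```python
from collections import defaultdict

def collision_metrics(queries):
    """Per TLD: query count, distinct second-level labels, distinct sources."""
    stats = defaultdict(lambda: {"queries": 0, "slds": set(), "sources": set()})
    for src, qname in queries:
        labels = qname.rstrip(".").lower().split(".")
        entry = stats[labels[-1]]
        entry["queries"] += 1
        entry["sources"].add(src)
        # A query for the bare TLD has no second-level label.
        if len(labels) >= 2:
            entry["slds"].add(labels[-2])
    return {
        tld: {
            "queries": e["queries"],
            "distinct_slds": len(e["slds"]),
            "distinct_sources": len(e["sources"]),
        }
        for tld, e in stats.items()
    }

# Hypothetical sample data, for illustration only.
queries = [
    ("198.51.100.1", "www.corp"),
    ("198.51.100.1", "mail.corp."),
    ("198.51.100.2", "WWW.Corp"),
    ("203.0.113.9", "printer.home"),
]
metrics = collision_metrics(queries)
print(metrics["corp"])
```

A TLD with heavy traffic from a single source is a different kind of risk from one with modest traffic spread across many sources, which is why raw query frequency alone says little.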
Any new gTLD applicant that does not wish to wait for this study will be able to proceed to delegation without delay, but only if they block huge numbers of second-level domains at launch.
The registries will have to block every SLD that was queried in their gTLD according to the Day in the Life of the Internet data that Interisle used in its study.
This list will vary by TLD, but in the most severe cases is likely to extend to tens of thousands of names. In many cases, it’s likely to be a few thousand names.
Fortunately, studies conducted by the likes of Donuts and Neustar indicate that many of these SLDs — maybe even the majority — are likely to be invalid strings, such as those with an underscore or other non-DNS character, or randomly generated 10-character strings of gibberish generated by Google Chrome.
In other words, the actual number of potentially salable domains that registries will have to block may turn out to be much lower than it appears at first glance.
Each SLD will have to be blocked in such a way that it continues to return NXDOMAIN responses, as they all do today.
Because the DITL data represented a 48-hour snapshot in May 2013, and may not include every potentially affected string, ICANN is also proposing to give organizations a way to:
report and request the blocking of a domain name (SLD) that causes demonstrably severe harm as a consequence of name collision occurrences.
The process will allow the deactivation (SLD removal from the TLD zone) of the name for a period of up to two (2) years in order to allow the affected party to effect changes to its network to eliminate the DNS request leakage that causes collisions, or mitigate the harmful impact.
One has to wonder if any trademark lawyers reading this will think: “Ooh, free defensive registration!” It will be interesting to see if any of them give it a cheeky shot.
I’ve got a feeling that most new gTLD applicants will want to take ICANN up on its offer. It’s not an ideal solution for them, but it does give them a way to get into the root relatively quickly.
There’s no telling what ICANN’s additional studies will find, but there’s a chance the findings could be negative for their string(s). Getting delegated at least mitigates the risk of never getting delegated.
The new ICANN proposal may in some cases interfere with their plans to market and use their TLDs, however.
Take a dot-brand such as .cisco, which the networking company has applied for. Its block list is likely to have about 100,000 strings on it, increasing the chances that useful, brandable SLDs are going to be taken out of circulation for a while.
ICANN is also proposing to conduct an awareness-raising campaign, using the media, to let network operators know about the risks that new gTLDs may present to their networks.
Depending on how effective this is, new registries may be able to forget about getting positive column inches for their launch — if a journalist is handed a negative angle for a story on a plate, they’ll take it.