With all the talk recently about DNS Namespace Collisions, the heretofore relatively obscure Day In The Life (“DITL”) datasets maintained by the DNS-OARC have been getting a lot of attention.
While these datasets are well known to researchers, I’d like to take the opportunity to provide some background and talk a little about how these datasets are being used to research the DNS Namespace Collision issue.
The Domain Name System Operations Analysis and Research Center (“DNS-OARC”) began working with the root server operators to collect data in 2006. The effort was dubbed “Day In The Life of the Internet” (DITL).
Root server participation in the DITL collection is voluntary and the number of contributing operators has steadily increased; in 2010, all of the 13 root server letters participated. DITL data collection occurs on an annual basis and covers approximately 50 contiguous hours.
DNS-OARC’s DITL datasets are attractive for researching the DNS Namespace Collision issue because:
- DITL contains data from multiple root operators;
- The robust annual sampling methodology (with samples dating back to 2006) allows trending; and
- It’s available to all DNS-OARC Members.
More information on the DITL collection is available on DNS-OARC’s site at https://www.dns-oarc.net/oarc/data/ditl.
Terabytes and terabytes of data
The data consists of the raw network “packets” destined for each root server; contained within those packets are the DNS queries. The raw data amounts to many terabytes of compressed network capture files, and processing it is very time-consuming and resource-intensive.
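To illustrate what “contained within the network packets are the DNS queries” means in practice, here is a minimal sketch of decoding a query name from the DNS wire format (RFC 1035). This is a simplification for illustration only: it assumes a single, uncompressed question section and ignores the compression pointers that appear in real traffic.

```python
import struct

def parse_qname(message: bytes) -> str:
    """Extract the query name from a raw DNS query message.

    Simplified sketch: assumes a fixed 12-byte header followed by one
    uncompressed question (no RFC 1035 compression pointers).
    """
    offset = 12  # skip the fixed-size DNS header
    labels = []
    while True:
        length = message[offset]
        if length == 0:  # a zero-length label terminates the name
            break
        offset += 1
        labels.append(message[offset:offset + length].decode("ascii"))
        offset += length
    return ".".join(labels)

# Build a query for "example.corp" by hand and decode it back.
header = struct.pack("!6H", 0x1234, 0x0100, 1, 0, 0, 0)
question = b"\x07example\x04corp\x00" + struct.pack("!2H", 1, 1)  # QTYPE=A, QCLASS=IN
print(parse_qname(header + question))  # example.corp
```

Real DITL processing works over pcap files, so a production pipeline would first strip the link-layer, IP and UDP headers before reaching the DNS message shown here.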
While several researchers have looked at DITL datasets over the years, the current collisions-oriented research started with Roy Hooper of Demand Media. Roy created a process to iterate through this data and convert it into intermediate forms that are much more usable for researching the proposed new TLDs.
We started with his process and continued working with it; our code is available on GitHub for others to review.
Finding needles in DITL haystacks
The first problem faced by researchers interested in new TLDs is isolating the relatively few queries of interest among many terabytes of traffic that are not of interest.
Each root operator contributes several hundred to several thousand files of captured packets in time-sequential order. These packets contain every DNS query reaching the root, covering names in both delegated and undelegated TLDs.
The first step is to search these packets for DNS queries involving the TLDs of interest. The result is one file per TLD containing all queries from all roots involving that TLD. If the input packet is considered a “horizontal” slice of root DNS traffic, then this intermediary work product is a “vertical” slice per TLD.
These intermediary files are much more manageable, ranging from just a few records to 3 GB. To support additional investigation and debugging, the intermediary files that JAS produces are fully “traceable” such that a record in the intermediary file can be traced back to the source raw network packet.
The DITL data contain quite a bit of noise, primarily DNS traffic that was not actually destined for the root. Our process filters the data by destination IP address so that the only remaining data is that which was originally destined for the root name servers.
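The two transformations described above (filtering by destination IP, then cutting the “horizontal” capture into “vertical” per-TLD slices) can be sketched as follows. This is a hypothetical illustration, not JAS’s actual code: the helper names are invented, the IP set shows only two well-known root addresses, and a real pipeline would list all 13 root letters’ IPv4/IPv6 addresses as of the capture date and write one output file per TLD rather than returning in-memory lists.

```python
from collections import defaultdict

# Illustrative subset; the real filter would include every root server
# address (all 13 letters, IPv4 and IPv6) valid at collection time.
ROOT_SERVER_IPS = {"198.41.0.4", "192.228.79.201"}

def slice_by_tld(packets, tlds_of_interest):
    """Turn 'horizontal' capture records into 'vertical' per-TLD slices.

    `packets` yields (dst_ip, qname) tuples already decoded from the raw
    captures; queries not destined for a root server are dropped as noise.
    """
    buckets = defaultdict(list)
    for dst_ip, qname in packets:
        if dst_ip not in ROOT_SERVER_IPS:
            continue  # noise: traffic not actually destined for the root
        tld = qname.rstrip(".").rsplit(".", 1)[-1].lower()
        if tld in tlds_of_interest:
            buckets[tld].append(qname)
    return buckets

capture = [
    ("198.41.0.4", "host1.corp"),
    ("10.0.0.53", "host2.corp"),        # not root-destined: filtered out
    ("198.41.0.4", "www.example.home"),
]
print(dict(slice_by_tld(capture, {"corp", "home"})))
```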
JAS has made these intermediary per-TLD files available to DNS-OARC members for further analysis.
The intermediary files are comparatively small and easy to parse, opening the door to more elaborate research. For example, JAS has written various “second passes” that classify queries, separate queries that use valid syntax at the second level from those that don’t, detect “randomness,” fit regular expressions to the queries, and more.
We have also checked to confirm that second-level queries that look like Punycode IDNs (those starting with 'xn--') are valid Punycode. It is interesting to note the tremendous volume of erroneous, technically invalid, and/or nonsensical DNS queries that make it to the root.
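A validity check along those lines can be sketched with Python's built-in punycode codec. This is a simplification of whatever JAS actually runs: it tests only RFC 3492 Punycode decodability, whereas full IDNA validation imposes additional rules on the decoded label.

```python
def is_valid_punycode_label(label: str) -> bool:
    """Check whether an 'xn--' label decodes as valid Punycode (RFC 3492)."""
    if not label.lower().startswith("xn--"):
        return False
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except (UnicodeError, ValueError):
        return False
    return len(decoded) > 0

print(is_valid_punycode_label("xn--bcher-kva"))  # True: decodes to "bücher"
print(is_valid_punycode_label("xn--garbage!!"))  # False: not valid Punycode
```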
Also of interest is that the datasets are dominated by query strings that appear random and/or machine-generated.
Google’s Chrome browser generates three random 10-character queries upon startup in an effort to detect network properties. Those “Chrome 10” queries together with a relatively small number of other common patterns comprise a significant proportion of the entire dataset.
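A simple heuristic for flagging such “Chrome 10” probes might look like the following. This is a hypothetical sketch based on the 10-character pattern described above, not the actual classifier; a production second pass would also consider query timing (three probes close together from one source) and other label lengths Chrome has used.

```python
import re

# Chrome's startup probes, as described above, are single random labels
# of ten lowercase ASCII letters.
CHROME_PROBE = re.compile(r"^[a-z]{10}$")

def looks_like_chrome_probe(qname: str) -> bool:
    """Flag single-label, ten-letter queries as likely Chrome startup probes."""
    label = qname.rstrip(".")
    return "." not in label and bool(CHROME_PROBE.match(label))

print(looks_like_chrome_probe("qwxzkrplmv"))       # True
print(looks_like_chrome_probe("www.example.corp")) # False
```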
Research is ongoing to better understand the sources of these machine-generated queries.
More technical details, along with instructions for running the process, are available on the DNS-OARC web site.
This is a guest post written by Kevin White, VP Technology, JAS Global Advisors LLC. JAS is currently authoring a “Name Collision Occurrence Management Framework” for the new gTLD program under contract with ICANN.
Initial Evaluation on the first round of new gTLD applications is almost done, with only two bids now remaining in that stage of the program.
ICANN last night published the delayed IE results for PricewaterhouseCoopers’ .pwc and the Better Business Bureau’s .bbb, both of which were passes.
The only two applications remaining in IE are Kosher Marketing Assets’ .kosher and Google’s .search.
The latter is believed to be hung up on technical changes it has made to its bid, to remove the plan to make .search a “dotless” gTLD, which ICANN has banned on stability grounds.
Eight applications are currently in Extended Evaluation, having failed to achieve passing scores during IE.
Ken Hansen has surprised many by resigning from Neustar, where he was general manager of the slam-dunk .nyc new gTLD initiative, to become CEO of .co.com, a new pseudo-TLD registry.
The announcement raises a couple of big questions.
First, why is .co.com being launched as a registry?
The name belongs to domain investor Paul Goldstone. He put it up for sale in March 2012, with broker DomainAdvisors speculating aloud that it would fetch a price in the millions.
We wondered at the time whether CentralNic, whose bread and butter back then (before its interests in new gTLDs became clear) was two-letter country-codes in .com, would swoop to buy it.
We also wondered whether .CO Internet would make an offer, in order to eliminate competition and reduce existing and potential confusion with its own ccTLD, .co.
If either company made an offer, it does not seem to have been accepted.
Goldstone is instead going to try to build a registry around the name, with Hansen as CEO and himself as president. DomainAdvisors founder Gregg McNair is chairman of the new venture.
Second, why on earth would Hansen, who has been leading business development for Neustar’s own .nyc — the forthcoming new gTLD for the city of New York — join an unproven .com subdomain provider?
He tells us that his confidence in .nyc’s prospects has not waned, but that he is one of the owners of the new company.
He said in an email:
Sometimes following the crowd is not the best thing to do in business. New gTLDs have always been about choice from my perspective. I still believe in new gTLDs in general, but there is still a VERY significant market for short recognizable domains ending in .com. We will meet that demand. Not to mention, we can move quickly without waiting on ICANN.
Gaining visibility for a subdomain product can be tricky at the best of times, but with hundreds of new generic TLDs coming to market, Hansen, Goldstone and McNair really do have a challenge on their hands.
The new company intends to run sunrise, landrush and “premium” names phases for its launch, which is expected to kick off in the first quarter next year. No word yet on whether it will follow the CentralNic model and also voluntarily incorporate ICANN policies on UDRP, Whois and so forth.
Internet governance expert Wolfgang Kleinwächter has joined ICANN’s board of directors with immediate effect.
Kleinwächter is the emergency replacement for Judith Vazquez, who quit with no explanation last month. He’ll serve out the remainder of Vazquez’s term, which is due to end a year from now.
He’s a rare insider appointment from the Nominating Committee, which regularly looks outside of ICANN for its board expertise.
He has been involved with ICANN since almost the beginning, and currently sits on the GNSO Council (a term due to expire this week) as a representative of the Non-Commercial Users Constituency.
He’s a German national and currently employed by the University of Aarhus, Denmark, where he teaches internet policy and regulation.
He also has experience in UN-related policy projects such as the World Summit on the Information Society and the Internet Governance Forum.
The third batch of new gTLDs has gone live.
Uniregistry’s .sexy and .tattoo are currently in the DNS root zone, the first two of its portfolio to become active.
The TLDs .bike, .construction, .contractors, .estate, .gallery, .graphics, .land, .plumbing, and .technology from Donuts have also gone live today.
Donuts already had 10 new gTLDs in the root from the first two batches.
There are now 24 live new gTLDs.
The first second-level domains to become available will be nic.tld in each, per the ICANN contract they’ve all signed.
You’ll notice that they’re all ASCII strings, despite the fact that IDNs get priority treatment in the new gTLD program.