This page documents the public interface of the BETA version of DDec, CAIDA's public DNS Decoding database.
DDec understands two types of patterns: regexp and hostpat.
DDec's regexp syntax is a subset of Perl regular expressions (see the tables below).
DDec's "hostpat" (hostname pattern) is a simpler pattern syntax designed specifically for matching hostnames. All legal lowercase hostname characters stand for themselves, and the most common special pattern expressions can be written with a single uppercase letter. No special characters are needed to match a literal string; e.g., hostpat "foo.bar.com" matches hostname "foo.bar.com" (and no other hostname).
Basic hostpat and regexp syntax:
|any digit|| || |
|any letter|| || |
|any alphanumeric|| || |
|any hex digit|| || |
|any label character|| || ||anything allowed in a hostname label|
|period|| || |
|any 1 char|| || ||"_" as in SQL "like" operator|
|any 0 or more chars|| || ||"%" as in SQL "like" operator|
|word boundary|| || ||[a-z0-9] on one side, [-.] or nothing on the other|
Other more advanced syntax that works in both hostpat and regexp:
|char class: match "a", "b", or "c"|| ||may contain only -a-z0-9|
|negated char class: match anything but "a", "b", or "c"|| |
|numbered grouping|| ||used to capture a substring for decoding|
|unnumbered grouping|| |
|alternation: match "com" or "net"|| ||allowed only inside (...) or (?:...)|
|0 or more of last item|| |
|1 or more of last item|| |
|0 or 1 of last item|| |
|n repeats of last item|| |
|n or more of last item|| |
|n..m repeats of last item|| |
Both regexps and hostpats are case insensitive.
Hostpats are always anchored, i.e. must match the entire string. To get the effect of an unanchored beginning, start the hostpat with "%". E.g., the hostpat "bar.com" matches only "bar.com", but the hostpat "%bar.com" also matches "foobar.com", "quux.bar.com", or any other hostname ending in "bar.com".
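The anchoring and case rules above can be sketched in a few lines of Python. This is only an illustration, not DDec's implementation: the names `SPECIALS`, `hostpat_to_regex`, and `hostpat_matches` are hypothetical, and the translation covers only the handful of hostpat specials that appear in this page's examples ("D", "L", "_", "%", and "."), with the regex equivalents chosen as assumptions.

```python
import re

# Assumed subset of hostpat specials, inferred from this page's examples.
# The real DDec syntax has more (char classes, groups, other letters).
SPECIALS = {
    "D": r"\d",     # any digit
    "L": r"[a-z]",  # any letter
    "_": r".",      # any 1 char
    "%": r".*",     # any 0 or more chars
    ".": r"\.",     # a period stands for a literal period
}
LITERALS = set("abcdefghijklmnopqrstuvwxyz0123456789-")
QUANTIFIERS = set("*+?")


def hostpat_to_regex(hostpat):
    """Translate a simple hostpat into a compiled, case-insensitive regex."""
    parts = []
    for ch in hostpat:
        if ch in SPECIALS:
            parts.append(SPECIALS[ch])
        elif ch in LITERALS or ch in QUANTIFIERS:
            parts.append(ch)
        else:
            # Anything else is outside this sketch's subset.
            raise ValueError("unsupported hostpat character: %r" % ch)
    return re.compile("".join(parts), re.IGNORECASE)


def hostpat_matches(hostpat, hostname):
    # fullmatch() models the rule that hostpats must match the entire string.
    return hostpat_to_regex(hostpat).fullmatch(hostname) is not None
```

With this sketch, `hostpat_matches("bar.com", "foobar.com")` is false while `hostpat_matches("%bar.com", "foobar.com")` is true, mirroring the example in the text.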
A ruleset is a collection of related rules and/or encodings for decoding a set of related hostnames (usually belonging to the same domain or organization). Here is an example of a simple ruleset containing 2 rules:
---
name: examplecorp
note: ExampleCorp, Inc.
rules:
  - hostpat: <iata>D+.example.com
  - hostpat: %.<clli>.example.net
DDec rulesets are displayed in a YAML format.
Each YAML document (beginning with "---") contains one ruleset.
Every ruleset must have a name made of letters, digits, and underscores. An optional note can be used to describe the ruleset.
Rulesets used for decoding must define a list of one or more rules. The simplest type of rule contains just a pattern, either a hostpat or regexp. That pattern usually contains embedded <...> or <<...>> variable bindings that describe
- a pattern to match a relevant substring of the hostname,
- how to interpret that substring,
- what variable to assign the result to.
For example, if a hostpat containing "<iata>" is matched against a hostname that has "lax" at that position, "<iata>" would match the 3-letter substring "lax", interpret it as an IATA airport code, and assign the result "Los Angeles" to the variable "loc".
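The three steps of a variable binding (match, interpret, assign) can be sketched as follows. This is a minimal illustration under stated assumptions, not DDec's code: `ENCODINGS`, `compile_rule`, and `decode` are hypothetical names, the one-entry "iata" table is a stand-in for the real encoding, and the hostname "lax1.example.com" is invented for the example.

```python
import re

# Tiny stand-in for the standard "iata" encoding (3 letters, variable "loc").
ENCODINGS = {
    "iata": {
        "pattern": r"[a-z]{3}",
        "var": "loc",
        "mapping": {"lax": "Los Angeles"},
    },
}

BINDING = re.compile(r"<([a-z]+)>")


def compile_rule(regexp):
    """Expand each <ENCODING> into a named group using the encoding's pattern."""
    def expand(m):
        name = m.group(1)
        return "(?P<%s>%s)" % (name, ENCODINGS[name]["pattern"])
    return re.compile(BINDING.sub(expand, regexp), re.IGNORECASE)


def decode(rule, hostname):
    """Return {var: value} for a matching hostname, or None if no match."""
    m = rule.search(hostname)
    if m is None:
        return None
    result = {}
    for name, code in m.groupdict().items():
        enc = ENCODINGS[name]
        # Assumption: a code missing from the mapping is passed through as-is.
        result[enc["var"]] = enc["mapping"].get(code.lower(), code)
    return result


rule = compile_rule(r"^<iata>\d+\.example\.com$")
```

Calling `decode(rule, "lax1.example.com")` yields a dictionary that assigns "Los Angeles" to "loc".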
Embedded variable bindings
Finer control over the pattern, mapping, or variable can be had by using more complex expressions, described below.
|syntax||match||interpret with||assign result to|
| ||ENCODING's pattern||ENCODING's mapping||ENCODING's var|
| ||ENCODING's pattern||VAR's mapping or ENCODING's mapping||VAR|
| ||PATTERN||VAR's mapping||VAR|
| ||PATTERN||VAR's mapping or ENCODING's mapping||VAR|
The most commonly used variable is loc (geographic location).
The difference between "<...>" and "<<...>>" is that the latter matches only if the subpattern is not adjacent to a letter. So, for example, a hostpat like "%<<iata>>.example.com" but written with "<iata>" would match a hostname in which "las" sits directly next to an "l", and interpret "las" as a 3-letter IATA airport code for Las Vegas, which is probably not what was intended; "%<<iata>>.example.com" would not match, because the "l" is adjacent to the "las".
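One way to picture the "not adjacent to a letter" condition is with regex lookarounds. The sketch below is an assumption about how the two forms could be expressed, not a description of DDec internals; the hostnames "dallas.example.com" and "core-las.example.com" and the helper `extract` are hypothetical.

```python
import re

# Single-bracket form: the 3-letter code may sit anywhere before ".example.com".
single = re.compile(
    r"^.*(?P<iata>[a-z]{3})\.example\.com$", re.IGNORECASE
)

# Double-bracket form: same, but no letter may touch the code on either side.
double = re.compile(
    r"^.*(?<![a-z])(?P<iata>[a-z]{3})(?![a-z])\.example\.com$", re.IGNORECASE
)


def extract(rule, hostname):
    """Return the captured 3-letter code, or None if the rule does not match."""
    m = rule.match(hostname)
    return m.group("iata") if m else None
```

Here `extract(single, "dallas.example.com")` captures "las", while `extract(double, "dallas.example.com")` returns `None` because the preceding "l" is a letter; `extract(double, "core-las.example.com")` still captures "las" since "-" is not a letter.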
DDec defines a number of standard encodings, and rulesets may also define their own custom encodings.
Some standard encodings:
|encoding||description||var||pattern|
|iata||IATA 3-letter airport code||loc||LLL|
|icao||ICAO 4-letter airport code||loc||LLLL|
|clli||first 6 characters of CLLI code||loc||LLL[L-]LL|
Reusable named custom encodings can be defined in an encodings entry at the top level of a ruleset, and an anonymous one-time mapping can be defined under a var entry in a rule. For example:
---
name: examplecorp
note: ExampleCorp, Inc.
encodings:
  - citycode:
      mapping:
        la: Los Angeles, CA, US
        par: Paris, FR
        nyc: New York, NY, US
  - airport:
      extends: iata
      mapping:
        lnd: London, UK
rules:
  - hostpat: <citycode>D+.example.com
  - hostpat: %.<airport>.example.net
  - hostpat: %.<loc=L+>.example.org
    vars:
      - loc:
          mapping:
            chic: Chicago, IL, US
            bos: Boston, MA, US
            lond: London, UK
The rule for example.com uses a custom encoding named "citycode", which is defined earlier in the ruleset with custom codes for "la", "par", and "nyc".
The rule for example.net uses the "airport" encoding, which is a custom encoding that has all the attributes of the standard "iata" encoding, with the addition of a nonstandard code for "lnd" that overrides the "lnd" code already defined by iata.
Both of these named encodings could be reused by other rules.
The rule for example.org doesn't use a named encoding. Instead, it says that a substring that matches "L+" should be assigned to "loc", but first it should be looked up in the mapping defined for "loc" in the rule's "vars" entry.
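The way "airport" inherits from "iata" and overrides a single code can be modeled as a dictionary merge. The following is a rough sketch, assuming single-parent inheritance only; `resolve` and the contents of the stand-in `iata` entry (including its placeholder value for "lnd") are hypothetical.

```python
# Stand-in encodings; the iata values are placeholders, not real DDec data.
ENCODINGS = {
    "iata": {
        "pattern": "LLL",
        "var": "loc",
        "mapping": {"lax": "Los Angeles", "lnd": "(iata's own value for lnd)"},
    },
    "airport": {
        "extends": "iata",
        "mapping": {"lnd": "London, UK"},
    },
}


def resolve(name, encodings):
    """Return the encoding with inherited attributes and mappings filled in."""
    enc = encodings[name]
    parent = enc.get("extends")
    base = resolve(parent, encodings) if parent else {}
    merged = dict(base)
    # The child's own attributes win over inherited ones.
    for key, value in enc.items():
        if key not in ("extends", "mapping"):
            merged[key] = value
    # Codes are merged entry by entry; the child's codes override the parent's.
    merged["mapping"] = {**base.get("mapping", {}), **enc.get("mapping", {})}
    return merged
```

After `airport = resolve("airport", ENCODINGS)`, the entry keeps the inherited pattern and variable, still maps "lax", and maps "lnd" to "London, UK".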
A rule's pattern must indicate a specific domain or small set of specific domains that DDec can identify. This means regexps must end in "$" and hostpats must not end in "%". For example:
| || ||foo.com|
| || ||foo.com, foo.net|
| || ||foo.net, foo.net.uk|
| || ||foo1.com, foo2.com, foo3.com|
| || ||(invalid)|
| || ||(invalid)|
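The two syntactic conditions stated above (regexps must end in "$", hostpats must not end in "%") are easy to check mechanically. The helper below is a hypothetical sketch that tests only those two conditions; it does not attempt to decide whether a pattern really names a small set of domains, and the sample patterns in the usage note are invented.

```python
def ends_in_specific_domain(kind, pattern):
    """Check the stated end-of-pattern condition for a rule's pattern.

    kind is "regexp" or "hostpat". Returns True if the pattern passes the
    corresponding condition, False otherwise.
    """
    if kind == "regexp":
        return pattern.endswith("$")
    if kind == "hostpat":
        return not pattern.endswith("%")
    raise ValueError("kind must be 'regexp' or 'hostpat'")
```

For instance, `ends_in_specific_domain("hostpat", "%.foo.com")` is true, while `ends_in_specific_domain("hostpat", "foo.%")` and `ends_in_specific_domain("regexp", r"\.foo\.")` are false.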
Full ruleset syntax
The documentation above described only the most common syntax. The complete syntax for rulesets is described below.
---
name: name                    # name of ruleset (required)
source: source                # where did ruleset come from
note: arbitrary additional information
hostpat: hostpat              # pattern that must match hostnames
regexp: regexp                # pattern that must match hostnames
encodings:
  ENCODING_NAME_1:
    source: where did encoding come from
    note: arbitrary additional information
    extends: ENCODING         # name of another encoding from which this encoding will inherit
    (OR)
    extends:
      - ENCODING1             # name of another encoding from which this encoding will inherit
      - (...more encoding names)
    hostpat: hostpat          # pattern to match (unless overridden in <...>)
    regexp: regexp            # pattern to match (unless overridden in <...>)
    var: name                 # name of variable to assign to (unless overridden in <...>)
    mapping:
      .encoding: ENCODING     # optional encoding for re-interpreting VALUEs
      CODE1: VALUE1
      CODE2: VALUE2
      (... more mappings)
  (... more encodings)
rules:
  - hostpat: hostpat          # pattern that hostname must match (hostpat or regexp is required)
    regexp: regexp            # pattern that hostname must match (hostpat or regexp is required)
    note: arbitrary additional information
    mapping_required: 1       # if mapping_required is "true" or "1", and a var binding has a mapping
                              # but the value extracted from the hostname does not match any of the
                              # mapping's codes, then the rule is treated as not matching.
    vars:
      - VAR1: VALUE
      (OR)
      - VAR1:
          value: VALUE        # string, possibly with $-substitutions
          encoding: ENCODING  # name of encoding to use to decode extracted value
          (OR)
          encoding:
            - ENCODING1       # name of encoding to use to decode extracted value
            - ... (more encoding names)
          mapping:
            .encoding: ENCODING   # optional encoding for re-interpreting VALUEs
            CODE1: VALUE1
            CODE2: VALUE2
            (... more mappings)
      - (... more vars)
  - (... more rules)
The value in a variable binding can be a simple literal string like "San Diego", or may contain $-substitutions:
- "$N", where N is a number, will be replaced with the part of the hostname matched by the Nth set of (...) or <...> in the pattern.
- "$var" will be replaced with the value of variable named "var" (which must be defined earlier in the same rule).
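The two kinds of $-substitution listed above can be sketched with a single regular expression. This is an illustrative sketch, not DDec's code; `substitute` and its arguments (`groups` for the captured substrings in order, `variables` for values defined earlier in the rule) are hypothetical.

```python
import re

# "$N" (a number) or "$var" (a variable name).
_SUBST = re.compile(r"\$(\d+|[A-Za-z_]\w*)")


def substitute(value, groups, variables):
    """Expand $N and $var references in a variable-binding value.

    groups: list of captured substrings, so "$1" refers to groups[0].
    variables: dict of variables defined earlier in the same rule.
    """
    def repl(m):
        key = m.group(1)
        if key.isdigit():
            return groups[int(key) - 1]
        return variables[key]
    return _SUBST.sub(repl, value)
```

A literal value such as "San Diego" passes through unchanged, while `substitute("$city, $1", ["us"], {"city": "San Diego"})` expands both references.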
In a rule, variable bindings may be embedded in the pattern or listed under "vars". E.g., these two rules are equivalent:
- hostpat: r<router=DD>.foo.net
- hostpat: r(DD).foo.net
  vars:
    - router: $1
A variable binding can even be both embedded and listed under "vars", which is useful if you need a mapping or multiple encodings, which can't be embedded:
- hostpat: <loc=LL>.bar.net
  vars:
    - loc:
        mapping:
          sd: San Diego, CA, US
          la: Los Angeles, CA, US
Encoding name syntax:
- "/foo" refers to the global encoding "foo".
- "foo" refers to the encoding "foo" in the current ruleset if there is one, otherwise the global encoding "foo".
- "bar/foo" refers to the encoding "foo" in ruleset "bar".
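The three name forms above amount to a small lookup procedure, sketched here in Python. The function `lookup_encoding` and the dictionary layout (`global_encodings` keyed by encoding name, `rulesets` keyed by ruleset name and then encoding name) are assumptions made for illustration.

```python
def lookup_encoding(name, current_ruleset, rulesets, global_encodings):
    """Resolve an encoding name written as "/foo", "foo", or "bar/foo"."""
    if name.startswith("/"):
        # "/foo": always the global encoding.
        return global_encodings[name[1:]]
    if "/" in name:
        # "bar/foo": encoding "foo" in ruleset "bar".
        ruleset, encoding = name.split("/", 1)
        return rulesets[ruleset][encoding]
    # "foo": the current ruleset's encoding if it has one, else the global one.
    local = rulesets.get(current_ruleset, {})
    if name in local:
        return local[name]
    return global_encodings[name]
```

For example, with a global "iata" and a ruleset "examplecorp" that defines its own "iata", the bare name "iata" resolves to the local definition inside "examplecorp", while "/iata" still reaches the global one.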