URL normalization rules

April 11, 2024

ID 231546

Malicious software often tries to hide its activity by obfuscating URLs, for example by using internationalized (national) domain names, including single-character ones, by writing IP addresses in octal notation, or by inserting repeated slashes. As a result, the same content is often reachable through technically different addresses (for example, addresses that differ in scheme, port, or character case).

Consequently, matching URLs in their original form against lists of indicators of compromise (IoCs) can miss threats, because an obfuscated URL does not match the corresponding IoC entry.

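For example, in github.com@520966948 the part before the @ sign is only authorization information, and 520966948 is the IP address 31.13.83.36 written as a single decimal number. This address actually belongs to facebook.com.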
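The decoding in this example can be reproduced with a minimal Python sketch. The function name `decode_obfuscated_host` is hypothetical and is not part of CyberTrace; the sketch assumes the input has no scheme, port, or path, and relies only on the standard `ipaddress` module.

```python
import ipaddress


def decode_obfuscated_host(url: str) -> str:
    """Drop authorization info and convert an integer host to dot-decimal.

    Hypothetical helper for illustration; assumes no scheme, port, or path.
    """
    # Everything before the last "@" is authorization info, not the host.
    host = url.rsplit("@", 1)[-1]
    # A host made only of ASCII digits is a 32-bit IPv4 address in decimal.
    if host.isascii() and host.isdigit() and int(host) < 2**32:
        return str(ipaddress.IPv4Address(int(host)))
    return host


print(decode_obfuscated_host("github.com@520966948"))  # 31.13.83.36
```

The arithmetic behind the result is 31·2^24 + 13·2^16 + 83·2^8 + 36 = 520966948.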

CyberTrace offers two advantages:

  • URL normalization, which is usually not available in SIEM solutions.
  • Masks, which Kaspersky data feeds use to cover groups of malicious URLs.

Kaspersky data feeds cannot list every variant of a URL that differs only in how it is normalized, because this would increase the feed size unreasonably. Instead, when a user submits a known URL in some specific form, CyberTrace normalizes it and searches the feeds for a match, so the URL is still detected.

Thirteen URL normalization rules are currently used. Examples of applying these rules follow:

  • Remove dot segments ("." and "..") according to the algorithm described in RFC 3986, section 5.2.4 Remove Dot Segments (https://www.ietf.org/rfc/rfc3986.txt):

    http://www.example.com/../a/b/../c/./d.html => http://www.example.com/a/c/d.html

  • Remove the protocol:

    http://example.com => example.com

  • Convert internationalized domain names according to the Punycode algorithm described in RFC 3492 (https://www.ietf.org/rfc/rfc3492.txt):

    тест.рф => xn--e1aybc.xn--p1ai

  • Remove the www prefix:

    www.example.com => example.com

  • Remove repeated slashes:

    example.com//dir/test.html => example.com/dir/test.html

  • Remove the trailing slash at the end of the URL:

    example.com/ => example.com

  • Remove the authorization information:

    login:password@example.com => example.com

  • Remove the port number:

    example.com:80/index => example.com/index

  • Remove the #fragment reference:

    example.com#fragment => example.com

  • Remove dots at the end of the host name:

    example.com./index.html => example.com/index.html

  • Convert percent-encoded symbols to UTF-8 according to RFC 3986 (https://www.ietf.org/rfc/rfc3986.txt) and RFC 2279 (https://www.ietf.org/rfc/rfc2279.txt).

  • Convert all characters to lower case:

    EXAMPLE.COM => example.com

  • Convert the IP address (if any) leading to the requested host to dot-decimal notation:

    0112.0175.0117.0150 => 74.125.79.104
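The rules listed above can be combined into a single pass, as in the following Python sketch. It is an illustration under stated assumptions, not the CyberTrace implementation: the function names (`normalize_url`, `host_to_dotted`, `remove_dot_segments`) are hypothetical, the order in which the rules are applied is a guess, and IPv6 hosts and other edge cases are ignored.

```python
import ipaddress
import re
from urllib.parse import unquote


def host_to_dotted(host: str) -> str:
    """Rewrite an IPv4 host given in octal, hex, or integer form as dot-decimal."""
    parts = host.split(".")
    if len(parts) not in (1, 4):
        return host
    nums = []
    for part in parts:
        if re.fullmatch(r"0x[0-9a-f]+", part):
            nums.append(int(part, 16))
        elif re.fullmatch(r"0[0-7]+", part):
            nums.append(int(part, 8))
        elif re.fullmatch(r"[0-9]+", part):
            nums.append(int(part))
        else:
            return host  # not a numeric host, leave it unchanged
    if len(nums) == 1 and nums[0] < 2**32:
        return str(ipaddress.IPv4Address(nums[0]))
    if len(nums) == 4 and all(n < 256 for n in nums):
        return ".".join(str(n) for n in nums)
    return host


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in the spirit of RFC 3986, section 5.2.4."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == "..":
            if len(out) > 1:  # never pop the empty segment before the root "/"
                out.pop()
            if last:
                out.append("")
        elif seg == ".":
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def normalize_url(url: str) -> str:
    """Apply the normalization rules in one assumed order (sketch only)."""
    # Remove the protocol and the #fragment reference.
    url = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.strip(), flags=re.IGNORECASE)
    url = url.split("#", 1)[0]
    # Split the authority (host part) from the path and query.
    cut = re.search(r"[/?]", url)
    authority, rest = (url[:cut.start()], url[cut.start():]) if cut else (url, "")
    host = authority.rsplit("@", 1)[-1]   # remove authorization information
    host = re.sub(r":\d*$", "", host)     # remove the port number
    host = host.rstrip(".").lower()       # remove trailing dots, lower-case
    try:
        host = host.encode("idna").decode("ascii")  # Punycode for IDN labels
    except UnicodeError:
        pass  # leave hosts that the codec rejects unchanged
    if host.startswith("www."):
        host = host[4:]
    host = host_to_dotted(host)
    path, sep, query = rest.partition("?")
    path = re.sub(r"/{2,}", "/", unquote(path))  # decode, collapse slashes
    path = remove_dot_segments(path)
    result = host + path + (sep + unquote(query) if sep else "")
    if result.endswith("/"):
        result = result[:-1]              # remove the trailing slash
    return result.lower()


print(normalize_url("http://www.example.com/../a/b/../c/./d.html"))  # example.com/a/c/d.html
print(normalize_url("0112.0175.0117.0150"))                          # 74.125.79.104
```

Because every variant of a URL collapses to one normalized string, a feed needs to store only that string, and a lookup reduces to normalizing the incoming URL and comparing the result with the stored entries.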

To cover groups of malicious URLs, the feeds use eight types of entries, which are divided into masked and unmasked entries.

When matching a normalized URL against URL-based entries from the databases, take the purpose of each entry type into account. Using URL normalization together with masks increases the feed's detection rate, reduces the volume of supplied data, and decreases false positives.

Detailed information is provided in the Kaspersky Threat Intelligence Data Feeds Implementation Guide.
