
Change-Id: Ibf3e6936d438cd18868235e9fb783a3b9b78b6ff Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/2128377 Commit-Queue: Mustafa Emre Acer <meacer@chromium.org> Reviewed-by: Joe DeBlasio <jdeblasio@chromium.org> Cr-Commit-Position: refs/heads/master@{#755157}
# Internationalized Domain Names (IDN) in Google Chrome

## Background

Many years ago, domains could only consist of the Latin letters A to Z, digits,
and a few other characters. [Internationalized Domain Names
(IDNs)](https://en.wikipedia.org/wiki/Internationalized_domain_name) were
created to better support non-Latin alphabets for web users around the globe.

Different characters from different (or even the same!) languages can look very
similar, and we’ve seen
[reports](https://bugs.chromium.org/p/chromium/issues/detail?id=683314) of
proof-of-concept attacks that exploit this. These are called [homograph
attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). For example, the
Latin "a" looks a lot like the Cyrillic "а", so someone could register
`http://ebаy.com` (using Cyrillic "`а`"), which could be confused for
`http://ebay.com`. This is a limitation of how URLs are displayed in browsers in
general, not a specific bug in Chrome.

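To make the Latin/Cyrillic example concrete, here is a small Python sketch (not
part of Chrome) that prints the code points behind the two look-alike letters
and the ASCII form of the spoofed hostname. It relies only on Python's standard
`unicodedata` module and built-in `idna` codec; the variable names are
illustrative.

```python
import unicodedata

latin_a = "a"
cyrillic_a = "\u0430"  # renders almost identically to the Latin letter

# Print the code point and official Unicode name of each letter.
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Build the spoofed hostname from the example above.
spoof = "eb" + cyrillic_a + "y.com"
print(spoof == "ebay.com")  # False: the strings differ even if they look alike

# The ASCII (punycode) form makes the difference visible.
spoof_ascii = spoof.encode("idna").decode("ascii")
print(spoof_ascii)
```
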
In a perfect world, domain registrars would not allow these confusable domain
names to be registered. Some domain registrars do exactly that, mostly by
restricting the characters allowed, but many do not. To better protect against
these attacks, browsers display some domains in
[punycode](https://en.wikipedia.org/wiki/Punycode) (looks like `xn--...`)
instead of the original IDN, according to their own IDN policies.

This is a challenging problem space. Chrome has a global user base of billions
of people around the world, many of whom are not viewing URLs with Latin
letters. We want to prevent confusion, while ensuring that users across
languages have a great experience in Chrome. Displaying either punycode or a
visible security warning on too wide a set of URLs would hurt web usability
for people around the world.

Chrome and other browsers try to balance these needs by implementing IDN
policies in a way that allows IDN to be shown for valid domains, but protects
against confusable homograph attacks.

Chrome's IDN policy is one of several tools that aim to protect users.
[Google Safe Browsing](https://safebrowsing.google.com/) continues to help
protect over two billion devices every day by showing warnings to users when
they attempt to navigate to dangerous or deceptive sites or download dangerous
files. Password managers continue to remember which domain each saved login is
for, and won’t automatically fill a password into a domain that is not exactly
the correct one.

## How IDN works

IDNs were devised to support arbitrary Unicode characters in hostnames in a
backward-compatible way. This works by having user agents transform hostnames
containing non-ASCII Unicode characters into an ASCII-only hostname, which can
then be sent on to DNS servers. This is done by encoding each domain label into
its punycode representation. This representation consists of a four-character
prefix (`xn--`) followed by the Unicode label translated into ASCII Compatible
Encoding (ACE). For example, `http://öbb.at` is transformed to
`http://xn--bb-eka.at`.

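The round trip described above can be tried with Python's built-in `idna` and
`punycode` codecs. This is only a sketch: Python's `idna` codec implements the
older IDNA 2003 rules rather than the UTS 46 processing that Chrome uses, so
results may differ for some inputs. The helper names `to_ascii` and
`to_unicode` are made up for this example.

```python
def to_ascii(hostname: str) -> str:
    """Encode every label of a hostname into its ACE (xn--) form."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Decode ACE labels back into Unicode."""
    return hostname.encode("ascii").decode("idna")


print(to_ascii("öbb.at"))           # xn--bb-eka.at
print(to_unicode("xn--bb-eka.at"))  # öbb.at

# The raw punycode of a single label, without the xn-- prefix.
print("öbb".encode("punycode"))     # b'bb-eka'
```
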
## Google Chrome's IDN policy

Since version 51, Chrome has used an IDN display policy that does not take into
account the language settings (the Accept-Language list) of the browser. A
[similar strategy](https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm) is
used by Firefox.

Google Chrome decides if it should show Unicode or punycode for each domain
label (component) of a hostname separately. To decide if a component should be
shown in Unicode, Google Chrome uses the following algorithm:

1. Convert each component stored in the ACE to Unicode per [UTS 46 transitional
    processing](http://unicode.org/reports/tr46/#Processing) (`ToUnicode`).

2. If there is an error in `ToUnicode` conversion (e.g. contains [disallowed
    characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auts46%3Ddisallowed%3A%5D&abb=on&g=&i=),
    [starts with a combining
    mark](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da390a6b3d9844a1dcc1f99fb1ae478ecf),
    or [violates BiDi
    rules](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da8a9311811fb0f3db1644ac1a88056370)),
    show punycode.

3. If there is a character in a label not belonging to [Characters allowed in
    identifiers](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AIdentifierStatus%3DAllowed%3A&abb=on&g=&i=)
    per [Unicode Technical Standard 39 (UTS
    39)](http://www.unicode.org/reports/tr39/#Identifier_Status_and_Type), show
    punycode.

4. If any character in a label belongs to [the disallowed
    list](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu01CD-%5Cu01DC%5D+%5B%5Cu1c80-%5Cu1c8f%5D++%5B%5Cu1e90-%5Cu1e9b%5D++%5B%5Cu1f00-%5Cu1fff%5D++%5B%5Cua640-%5Cua69f%5D-%5B%5Cua720-%5Cua72f%5D+%5B%5Cu0338+%5Cu058a+%5Cu2010+%5Cu2019+%5Cu2027+%5Cu30a0+%5Cu02bb+%5Cu02bc+%5D&abb=on&g=&i=),
    show punycode.

5. If the component uses characters drawn from multiple scripts, it is subject
    to a script mixing check based on the ["Highly Restrictive" profile of UTS
    39](http://www.unicode.org/reports/tr39/#Restriction_Level_Detection) with an
    additional restriction on Latin. If the component fails the check, show the
    component in punycode.
    - Latin, Cyrillic or Greek characters cannot be mixed with each other
    - Latin characters in the ASCII range can be mixed ONLY with Chinese (Han,
      Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja)
    - Han (CJK Ideographs) can be mixed with Bopomofo
    - Han can be mixed with Hiragana and Katakana
    - Han can be mixed with Korean Hangul

6. If two or more numbering systems (e.g. European digits + Bengali digits) are
    mixed, show punycode.

7. If there are any invisible characters (e.g. a sequence of the same combining
    mark or a sequence of Kana combining marks), show punycode.

8. If there are any characters used in an unusual way, show punycode. E.g.
    [`LATIN MIDDLE DOT (·)`](https://unicode.org/cldr/utility/character.jsp?a=00B7)
    used outside [ela geminada](https://en.wiktionary.org/wiki/ela_geminada).

9. Test the label for [mixed script confusable per UTS
    39](http://unicode.org/reports/tr39/#Mixed_Script_Confusables). If a
    mixed-script confusable is detected, show punycode.

10. Test the label for [whole script
    confusables](http://unicode.org/reports/tr39/#Whole_Script_Confusables): If all
    the letters in a given label belong to a set of whole-script-confusable letters
    in one of the [whole-script-confusable
    scripts](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=kWholeScriptConfusables&sq=package:chromium)
    and if the hostname doesn't have a corresponding
    [allowed top-level-domain](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.h?type=cs&q=allowed_tlds)
    for that script, show punycode.

    **Example for Cyrillic:**
    The first label in hostname `аррӏе.com` (`xn--80ak6aa92e.com`) is all [Cyrillic
    letters that look like Latin letters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1%5D&g=gc&i=)
    **AND** the TLD (`com`) is not Cyrillic **AND** the TLD is not one of the TLDs
    known to host a large number of Cyrillic domains (e.g. `ru`, `su`, `pyc`, `ua`).
    The label is therefore shown in punycode.

11. If the label contains only [digits and digit
    spoofs](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=IsDigitLookalike),
    show punycode.

12. If the label matches a [dangerous
    pattern](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&g=0&l=422),
    show punycode.

13. If the [skeleton](http://unicode.org/reports/tr39/#def-skeleton) of the
    registrable part of a hostname is identical to one of the top domains after
    removing diacritic marks and mapping each character to its spoofing skeleton
    (e.g. `www.googlé.com` with `é` in place of `e`), show punycode.

Otherwise, show Unicode.

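As a rough illustration of steps 5 and 6 above, the following Python sketch
approximates the script mixing and digit mixing checks. It is not Chrome's
implementation, which relies on ICU's spoof checker. Python's standard library
has no Unicode Script property, so `rough_script` guesses the script from the
first word of the character's Unicode name; that heuristic, and all function
and constant names here, are assumptions made for the example.

```python
import unicodedata

# Script combinations permitted by the list in step 5 (Latin handled below).
ALLOWED_CJK_COMBINATIONS = (
    {"HAN", "BOPOMOFO"},
    {"HAN", "HIRAGANA", "KATAKANA"},
    {"HAN", "HANGUL"},
)


def rough_script(ch):
    """Approximate a script from the Unicode name (heuristic, not UTS 39)."""
    if ch.isascii() and (ch.isdigit() or ch == "-"):
        return None  # treated as script-neutral in this sketch
    first_word = unicodedata.name(ch, "UNKNOWN").split()[0]
    return "HAN" if first_word == "CJK" else first_word


def script_mix_allowed(label):
    """Return True if the label passes the simplified step 5 check."""
    scripts = {s for s in map(rough_script, label) if s is not None}
    if len(scripts) <= 1:
        return True  # single-script labels are not subject to this check
    if "LATIN" in scripts:
        # Only Latin characters in the ASCII range may be mixed.
        if any(rough_script(c) == "LATIN" and not c.isascii() for c in label):
            return False
        scripts = scripts - {"LATIN"}
    return any(scripts <= combo for combo in ALLOWED_CJK_COMBINATIONS)


def mixes_digit_systems(label):
    """Return True if decimal digits from two or more systems appear (step 6)."""
    # Each numbering system occupies ten consecutive code points, so
    # subtracting the digit value yields that system's zero character.
    zeros = {
        ord(c) - unicodedata.decimal(c)
        for c in label
        if unicodedata.category(c) == "Nd"
    }
    return len(zeros) > 1


print(script_mix_allowed("p\u0430ypal"))  # Latin + Cyrillic
print(script_mix_allowed("google\u4e2d\u6587"))  # ASCII Latin + Han
print(mixes_digit_systems("12\u09e7"))  # European + Bengali digits
```

A label that fails either check would be shown in punycode under the policy
above; a label that passes still has to clear the remaining steps.
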
This is implemented by `IDNToUnicodeOneComponent()` and `IsIDNComponentSafe()`
in
[`components/url_formatter/url_formatter.cc`](https://cs.chromium.org/search/?q=components/url_formatter/url_formatter.cc)
and the `IDNSpoofChecker` class in
[`components/url_formatter/spoof_checks/idn_spoof_checker.cc`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc).

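The skeleton comparison in step 13 can be pictured with a toy Python sketch.
Chrome uses ICU and the UTS 39 confusables data for this; the tiny
`TOY_CONFUSABLES` table, the `TOP_DOMAINS` set, and both function names below
are hypothetical stand-ins chosen only to illustrate the idea.

```python
import unicodedata

# Hypothetical excerpt of a confusables mapping (Cyrillic to Latin lookalikes).
TOY_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u04cf": "l",  # CYRILLIC SMALL LETTER PALOCHKA
}

# Hypothetical stand-in for the list of top domains.
TOP_DOMAINS = {"google.com", "apple.com"}


def toy_skeleton(domain):
    """Strip diacritic marks, then map each character to its lookalike."""
    decomposed = unicodedata.normalize("NFD", domain)
    without_marks = "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    )
    return "".join(TOY_CONFUSABLES.get(c, c) for c in without_marks)


def looks_like_top_domain(registrable_domain):
    """Return True if a different domain shares a skeleton with a top domain."""
    top_skeletons = {toy_skeleton(d) for d in TOP_DOMAINS}
    return (
        registrable_domain not in TOP_DOMAINS
        and toy_skeleton(registrable_domain) in top_skeletons
    )


print(looks_like_top_domain("googl\u00e9.com"))  # True
print(looks_like_top_domain("google.com"))       # False
```
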
## Additional Protections

In addition to the spoof checks above, Chrome also implements a full page
security warning to protect against lookalike URLs. You can find an example of
this warning at `chrome://interstitials/lookalike`. This warning blocks main
frame navigations that involve lookalike URLs, either as a direct navigation or
as part of a redirect.

The algorithm to show this warning is as follows:

1. If the scheme of the navigation is not `http` or `https`, allow
    the navigation.

2. If the navigation is a redirect, check the redirect chain. If the redirect
    chain is safe, allow the navigation. (See the Defensive Registrations
    section for details.)

3. If the hostname of the navigation has at least a medium site engagement
    score, allow the navigation. Site engagement score is assigned to sites by the
    [Site Engagement
    Service](https://www.chromium.org/developers/design-documents/site-engagement).

4. If the hostname of the navigation is in
    [`domains.list`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/top_domains/domains.list),
    allow the navigation.

5. If the user previously allowed the hostname of the navigation by clicking
    "Ignore" in the warning, allow the navigation. Currently, user decisions are
    stored per tab, so navigating to the same site in a new tab may show the
    warning.

6. If the hostname has the same skeleton as a recently engaged site or a top 500
    domain, block the navigation and show the warning.

All of these checks are done locally on the client side.

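The order of the checks above can be summarized in a short Python sketch. It
only models the decision flow: the `Navigation` fields are hypothetical
placeholders for signals that Chrome obtains from its own services (site
engagement, `domains.list`, per-tab user decisions, skeleton matching), and
`should_block` is an illustrative name, not a Chrome function.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Navigation:
    """Hypothetical inputs for one main frame navigation."""
    url: str
    redirect_chain_safe: Optional[bool] = None  # None means "not a redirect"
    engagement_at_least_medium: bool = False
    in_domains_list: bool = False
    user_clicked_ignore: bool = False
    skeleton_matches_known_site: bool = False


def should_block(nav: Navigation) -> bool:
    """Return True when the lookalike warning would be shown."""
    if urlsplit(nav.url).scheme not in ("http", "https"):
        return False  # step 1: only http and https are checked
    if nav.redirect_chain_safe:
        return False  # step 2: safe redirect chain
    if nav.engagement_at_least_medium:
        return False  # step 3: the user already engages with this site
    if nav.in_domains_list:
        return False  # step 4: hostname is itself in domains.list
    if nav.user_clicked_ignore:
        return False  # step 5: the user dismissed the warning in this tab
    return nav.skeleton_matches_known_site  # step 6


lookalike = Navigation(
    "https://xn--xample-9ua.com/", skeleton_matches_known_site=True
)
print(should_block(lookalike))  # True
```

Every early return allows the navigation, so the warning appears only when
none of the allow conditions hold and the skeleton check matches.
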
### Defensive Registrations

Domain owners can sometimes register multiple versions of their domains, such
as the ASCII and IDN versions, to improve user experience and prevent potential
spoofs. We call these supplementary domains defensive registrations.

In some cases, Chrome's lookalike warning may flag and block navigations to
these domains:
- If one of the sites is in `domains.list` but the other isn't, the latter will
  be blocked.
- If the user engaged with one of the sites but not the other, the latter will
  be blocked.

### Avoiding a lookalike warning on your site

**Domain owners can avoid the "Did you mean" warning by redirecting their
defensive registrations to their canonical domain.**

**Example**: If you own both `example.com` and `éxample.com` and the majority of
your traffic is to `example.com`, you can fix the warning by redirecting
`éxample.com` to `example.com`. The lookalike warning logic considers this a
safe redirect and allows the navigation. If you must also redirect `http`
navigations to `https`, do this in a single redirect such as
`http://éxample.com -> https://example.com`. Use HTTP 301 or HTTP 302
redirects; the lookalike warning ignores meta redirects.

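One way to issue such a redirect is sketched below as a minimal Python WSGI
application. It is an illustration under assumptions, not a recommended
deployment: the names `CANONICAL_ORIGIN`, `DEFENSIVE_HOSTS`, and `app` are
made up, and a real site would more likely configure this in its web server or
CDN. Browsers send the `Host` header in ASCII form, so the defensive hostname
is matched against its punycode encoding.

```python
CANONICAL_ORIGIN = "https://example.com"

# The IDN registration, converted to the ASCII form seen in the Host header.
DEFENSIVE_HOSTS = {"\u00e9xample.com".encode("idna").decode("ascii")}


def app(environ, start_response):
    """Redirect defensive registrations to the canonical HTTPS origin."""
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    if host in DEFENSIVE_HOSTS:
        # A single 301 straight to the canonical https URL, with no
        # intermediate hop through https on the defensive hostname.
        location = CANONICAL_ORIGIN + environ.get("PATH_INFO", "/")
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"canonical site\n"]
```

The application can be served locally for experimentation with
`wsgiref.simple_server.make_server` from the standard library.
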
## Reporting Security Bugs

We reward certain cases of IDN spoofs according to [Chrome's Vulnerability
Reward Program](https://www.google.com/about/appsecurity/chrome-rewards/index.html)
policies. Please see [this
document](https://docs.google.com/document/d/1_xJz3J9kkAPwk3pma6K3X12SyPTyyaJDSCxTfF8Y5sU/edit?usp=sharing)
before reporting a security bug.