| # Internationalized Domain Names (IDN) in Google Chrome |
| |
| ## Background |
| |
| Many years ago, domains could only consist of the Latin letters A to Z, digits, |
| and a few other characters. [Internationalized Domain Names |
| (IDNs)](https://en.wikipedia.org/wiki/Internationalized_domain_name) were |
| created to better support non-Latin alphabets for web users around the globe. |
| |
| Different characters from different (or even the same!) languages can look very |
| similar. We’ve seen |
| [reports](https://bugs.chromium.org/p/chromium/issues/detail?id=683314) of |
| proof-of-concept attacks. These are called [homograph |
| attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack). For example, the |
| Latin "a" looks a lot like the Cyrillic "а", so someone could register |
| `http://ebаy.com` (using Cyrillic "`а`"), which could be confused for |
| `http://ebay.com`. This is a limitation of how URLs are displayed in browsers in |
| general, not a specific bug in Chrome. |
| |
| In a perfect world, domain registrars would not allow these confusable domain |
| names to be registered. Some domain registrars do exactly that, mostly by |
| restricting the characters allowed, but many do not. To better protect against |
| these attacks, browsers display some domains in |
| [punycode](https://en.wikipedia.org/wiki/Punycode) (looks like `xn--...`) |
| instead of the original IDN, according to their own IDN policies. |
| |
| This is a challenging problem space. Chrome has a user base of billions of |
| people around the world, many of whom view URLs that are not written in Latin |
| letters. We want to prevent confusion while ensuring that users across |
| languages have a great experience in Chrome. Displaying either punycode or a |
| visible security warning on too broad a set of URLs would hurt web usability |
| for those users. |
| |
| Chrome and other browsers try to balance these needs by implementing IDN |
| policies in a way that allows IDN to be shown for valid domains, but protects |
| against confusable homograph attacks. |
| |
| Chrome's IDN policy is one of several tools that aim to protect users. |
| [Google Safe Browsing](https://safebrowsing.google.com/) continues to help |
| protect over two billion devices every day by showing warnings to users when |
| they attempt to navigate to dangerous or deceptive sites or download dangerous |
| files. Password managers continue to remember which domain each saved login |
| belongs to, and won’t automatically fill a password into any domain other than |
| that exact one. |
| |
| ## How IDN works |
| |
| IDNs were devised to support arbitrary Unicode characters in hostnames in a |
| backward-compatible way. This works by having user agents transform hostnames |
| containing non-ASCII Unicode characters into an ASCII-only hostname, which can |
| then be sent on to DNS servers. This is done by encoding each domain label into |
| its punycode representation. This representation consists of a four-character |
| prefix (`xn--`) followed by the Unicode label translated to ASCII Compatible |
| Encoding (ACE). For example, `http://öbb.at` is transformed to |
| `http://xn--bb-eka.at`. |
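
As a rough illustration of the transformation described above, the following Python sketch converts each non-ASCII label with the standard library's `punycode` codec and adds the `xn--` prefix. The helper names `to_ace` and `from_ace` are made up for this example. The sketch lowercases labels but otherwise skips the UTS 46 mapping and validation that a real user agent performs, so treat it as a demonstration of the encoding only.

```python
ACE_PREFIX = "xn--"


def to_ace(hostname: str) -> str:
    """Encode each non-ASCII label of a hostname as xn-- plus punycode."""
    labels = []
    for label in hostname.lower().split("."):
        if label.isascii():
            labels.append(label)  # ASCII labels are sent to DNS unchanged
        else:
            labels.append(ACE_PREFIX + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def from_ace(hostname: str) -> str:
    """Decode xn-- labels back to Unicode; other labels pass through."""
    labels = []
    for label in hostname.split("."):
        if label.startswith(ACE_PREFIX):
            encoded = label[len(ACE_PREFIX):].encode("ascii")
            labels.append(encoded.decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)


print(to_ace("öbb.at"))  # xn--bb-eka.at
assert from_ace("xn--bb-eka.at") == "öbb.at"
```

Each label is handled on its own, which mirrors the per-label encoding described above: the `at` label is already ASCII and is left untouched.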
| |
| ## Google Chrome's IDN policy |
| |
| Since Chrome 51, Chrome has used an IDN display policy that does not take into |
| account the language settings (the Accept-Language list) of the browser. A |
| [similar strategy](https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm) is |
| used by Firefox. |
| |
| Google Chrome decides if it should show Unicode or punycode for each domain |
| label (component) of a hostname separately. To decide if a component should be |
| shown in Unicode, Google Chrome uses the following algorithm: |
| 1. Convert each component stored in the ACE to Unicode per [UTS 46 transitional |
| processing](http://unicode.org/reports/tr46/#Processing) (`ToUnicode`). |
| |
| 2. If there is an error in `ToUnicode` conversion (e.g. contains [disallowed |
| characters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Auts46%3Ddisallowed%3A%5D&abb=on&g=&i=), |
| [starts with a combining |
| mark](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da390a6b3d9844a1dcc1f99fb1ae478ecf), |
| or [violates BiDi |
| rules](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uidna_8h.html#a0411cd49bb5b71852cecd93bcbf0ca2da8a9311811fb0f3db1644ac1a88056370)), |
| show punycode. |
| |
| 3. If there is a character in a label not belonging to [Characters allowed in |
| identifiers](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AIdentifierStatus%3DAllowed%3A&abb=on&g=&i=) |
| per [Unicode Technical Standard 39 (UTS |
| 39)](http://www.unicode.org/reports/tr39/#Identifier_Status_and_Type), show |
| punycode. |
| |
| 4. If any character in a label belongs to [the disallowed |
| list](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5Cu01CD-%5Cu01DC%5D+%5B%5Cu1c80-%5Cu1c8f%5D++%5B%5Cu1e90-%5Cu1e9b%5D++%5B%5Cu1f00-%5Cu1fff%5D++%5B%5Cua640-%5Cua69f%5D-%5B%5Cua720-%5Cua72f%5D+%5B%5Cu0338+%5Cu058a+%5Cu2010+%5Cu2019+%5Cu2027+%5Cu30a0+%5Cu02bb+%5Cu02bc+%5D&abb=on&g=&i=), |
| show punycode. |
| |
| 5. If the component uses characters drawn from multiple scripts, it is subject |
| to a script mixing check based on the ["Highly Restrictive" profile of UTS |
| 39](http://www.unicode.org/reports/tr39/#Restriction_Level_Detection) with an |
| additional restriction on Latin. If the component fails the check, show the |
| component in punycode. |
| - Latin, Cyrillic or Greek characters cannot be mixed with each other |
| - Latin characters in the ASCII range can be mixed ONLY with Chinese (Han, |
| Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja) |
| - Han (CJK Ideographs) can be mixed with Bopomofo |
| - Han can be mixed with Hiragana and Katakana |
| - Han can be mixed with Korean Hangul |
| |
| 6. If two or more numbering systems (e.g. European digits + Bengali digits) are |
| mixed, show punycode. |
| |
| 7. If there are any invisible characters (e.g. a sequence of the same combining |
| mark or a sequence of Kana combining marks), show punycode. |
| |
| 8. If there are any characters used in an unusual way, show punycode. For |
| example, [`MIDDLE DOT (·)`](https://unicode.org/cldr/utility/character.jsp?a=00B7) |
| (U+00B7) used outside the Catalan [ela geminada](https://en.wiktionary.org/wiki/ela_geminada) |
| (`l·l`). |
| |
| 9. Test the label for [mixed-script confusables per UTS |
| 39](http://unicode.org/reports/tr39/#Mixed_Script_Confusables). If a |
| mixed-script confusable is detected, show punycode. |
| |
| 10. Test the label for [whole script |
| confusables](http://unicode.org/reports/tr39/#Whole_Script_Confusables): If all |
| the letters in a given label belong to a set of whole-script-confusable letters |
| in one of the [whole-script-confusable |
| scripts](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=kWholeScriptConfusables&sq=package:chromium) |
| and if the hostname doesn't have a corresponding |
| [allowed top-level-domain](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.h?type=cs&q=allowed_tlds) |
| for that script, show punycode. |
| **Example for Cyrillic:** |
| The first label in hostname `аррӏе.com` (`xn--80ak6aa92e.com`) is all [Cyrillic |
| letters that look like Latin letters](http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%D0%B0%D1%81%D4%81%D0%B5%D2%BB%D1%96%D1%98%D3%8F%D0%BE%D1%80%D4%9B%D1%95%D4%9D%D1%85%D1%83%D1%8A%D0%AC%D2%BD%D0%BF%D0%B3%D1%B5%D1%A1%5D&g=gc&i=) |
| **AND** the TLD (`com`) is not Cyrillic **AND** the TLD is not one of the TLDs |
| known to host a large number of Cyrillic domains (e.g. `ru`, `su`, `pyc`, `ua`). |
| Show it in punycode. |
| |
| 11. If the label contains only [digits and digit |
| spoofs](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&q=IsDigitLookalike), |
| show punycode. |
| |
| 12. If the label matches a [dangerous |
| pattern](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc?type=cs&g=0&l=422), |
| show punycode. |
| |
| 13. If the [skeleton](http://unicode.org/reports/tr39/#def-skeleton) of the |
| registrable part of a hostname is identical to one of the top domains after |
| removing diacritic marks and mapping each character to its spoofing skeleton |
| (e.g. `www.googlé.com` with `é` in place of `e`), show punycode. |
| |
| Otherwise, show Unicode. |
| |
| This is implemented by `IDNToUnicodeOneComponent()` and `IsIDNComponentSafe()` |
| in |
| [`components/url_formatter/url_formatter.cc`](https://cs.chromium.org/search/?q=components/url_formatter/url_formatter.cc) |
| and the `IDNSpoofChecker` class in |
| [`components/url_formatter/spoof_checks/idn_spoof_checker.cc`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/idn_spoof_checker.cc). |
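
To make two of the checks more concrete, here is a minimal Python sketch of the Latin/Cyrillic/Greek mixing rule from step 5 and the Cyrillic whole-script-confusable example from step 10. It is not Chrome's implementation, which is the C++ code referenced above. The function names, the name-prefix heuristic for guessing a script (Python's standard library exposes no Script property), and the three-entry TLD set are assumptions made for illustration. The lookalike letters are the ones in the set linked from step 10.

```python
import unicodedata

# Cyrillic letters that look like Latin letters (the set linked from step 10),
# written as escapes so they cannot be confused with Latin in this source.
CYRILLIC_LATIN_LOOKALIKES = set(
    "\u0430\u0441\u0501\u0435\u04bb\u0456\u0458\u04cf\u043e\u0440\u051b"
    "\u0455\u051d\u0445\u0443\u044a\u042c\u04bd\u043f\u0433\u0475\u0461"
)
# Assumption: a tiny illustrative subset, not Chrome's actual allowed-TLD list.
CYRILLIC_FRIENDLY_TLDS = {"ru", "su", "ua"}


def rough_script(ch: str):
    """Guess a script from the Unicode character name (heuristic only)."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(script):
            return script
    return None  # digits, hyphens, and scripts this sketch does not track


def mixes_latin_cyrillic_greek(label: str) -> bool:
    """Step 5 (partial): Latin, Cyrillic and Greek must not be mixed."""
    scripts = {rough_script(ch) for ch in label} - {None}
    return len(scripts) > 1


def is_cyrillic_whole_script_confusable(label: str, tld: str) -> bool:
    """Step 10 (Cyrillic case only): all letters are Latin lookalikes and
    the TLD is neither Cyrillic nor in the allowed set."""
    letters = [ch for ch in label if ch.isalpha()]
    if not letters or any(ch not in CYRILLIC_LATIN_LOOKALIKES for ch in letters):
        return False
    tld_is_cyrillic = any(rough_script(ch) == "CYRILLIC" for ch in tld)
    return not tld_is_cyrillic and tld not in CYRILLIC_FRIENDLY_TLDS


def show_label_in_unicode(label: str, tld: str) -> bool:
    """Combine the two sketched checks; every other step is omitted."""
    return not (
        mixes_latin_cyrillic_greek(label)
        or is_cyrillic_whole_script_confusable(label, tld)
    )
```

With these definitions, a label that swaps a single Latin `a` for Cyrillic `а` fails the mixing check, while the all-Cyrillic label from the step 10 example passes the mixing check and is caught only by the whole-script check when paired with `com`.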
| |
| ## Additional Protections |
| |
| In addition to the spoof checks above, Chrome also implements a full-page |
| security warning to protect against lookalike URLs. You can find an example of |
| this warning at `chrome://interstitials/lookalike`. This warning blocks main |
| frame navigations that involve lookalike URLs, either as a direct navigation or |
| as part of a redirect. |
| |
| The algorithm to show this warning is as follows: |
| |
| 1. If the scheme of the navigation is not `http` or `https`, allow |
| the navigation. |
| |
| 2. If the navigation is a redirect, check the redirect chain. If the redirect |
| chain is safe, allow the navigation. (See the Defensive Registrations section |
| for details.) |
| |
| 3. If the hostname of the navigation has at least a medium site engagement |
| score, allow the navigation. Site engagement score is assigned to sites by the |
| [Site Engagement |
| Service](https://www.chromium.org/developers/design-documents/site-engagement). |
| |
| 4. If the hostname of the navigation is in |
| [`domains.list`](https://cs.chromium.org/chromium/src/components/url_formatter/spoof_checks/top_domains/domains.list), |
| allow the navigation. |
| |
| 5. If the user previously allowed the hostname of the navigation by clicking |
| "Ignore" in the warning, allow the navigation. Currently, user decisions are |
| stored per tab, so navigating to the same site in a new tab may show the |
| warning. |
| |
| 6. If the hostname has the same skeleton as a recently engaged site or a top 500 |
| domain, block the navigation and show the warning. |
| |
| All of these checks are done locally on the client side. |
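
The order of these checks can be sketched in Python as follows. This is a simplified model, not Chrome's code: `LookalikeContext`, `should_show_warning`, and `simple_skeleton` are hypothetical names, a single `top_domains` set stands in for both `domains.list` and the top 500 list, the redirect-chain verdict is passed in as a flag, and the skeleton function only strips diacritics, whereas a UTS 39 skeleton also maps confusable characters.

```python
import unicodedata
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def simple_skeleton(hostname: str) -> str:
    """Strip diacritics only; real skeletons also map confusable letters."""
    decomposed = unicodedata.normalize("NFD", hostname.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


@dataclass
class LookalikeContext:
    # Hosts with at least a medium site engagement score.
    engaged_sites: set = field(default_factory=set)
    # Stand-in for domains.list and the top 500 domains (simplification).
    top_domains: set = field(default_factory=set)
    # Hosts for which the user clicked "Ignore" in this tab.
    ignored_in_tab: set = field(default_factory=set)


def should_show_warning(url: str, ctx: LookalikeContext,
                        redirect_chain_is_safe: bool = False) -> bool:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # 1. only http and https navigations are checked
    if redirect_chain_is_safe:
        return False  # 2. safe redirect chain
    host = parts.hostname or ""
    if host in ctx.engaged_sites:
        return False  # 3. the user already engages with this site
    if host in ctx.top_domains:
        return False  # 4. the host itself is a known domain
    if host in ctx.ignored_in_tab:
        return False  # 5. the user dismissed the warning earlier
    # 6. block when the skeleton matches an engaged site or a top domain
    known = ctx.engaged_sites | ctx.top_domains
    return simple_skeleton(host) in {simple_skeleton(k) for k in known}
```

For example, with `google.com` in `top_domains`, a navigation to `https://googlé.com/` returns `True`, while the same navigation returns `False` once the host is added to `ignored_in_tab`.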
| |
| ### Defensive Registrations |
| |
| Domain owners sometimes register multiple versions of their domains, such |
| as the ASCII and IDN versions, to improve user experience and prevent potential |
| spoofs. We call these supplementary domains defensive registrations. |
| |
| In some cases, Chrome's lookalike warning may flag and block navigations to |
| these domains: |
| - If one of the sites is in `domains.list` but the other isn't, the latter will |
| be blocked. |
| - If the user engaged with one of the sites but not the other, the latter will |
| be blocked. |
| |
| ### Avoiding a lookalike warning on your site |
| |
| **Domain owners can avoid the "Did you mean" warning by redirecting their |
| defensive registrations to their canonical domain.** |
| |
| **Example**: If you own both `example.com` and `éxample.com` and the majority of |
| your traffic is to `example.com`, you can fix the warning by redirecting |
| `éxample.com` to `example.com`. The lookalike warning logic considers this a |
| safe redirect and allows the navigation. If you must also redirect `http` |
| navigations to `https`, do this in a single redirect such as |
| `http://éxample.com -> https://example.com`. Use HTTP 301 or HTTP 302 |
| redirects, because the lookalike warning ignores meta redirects. |
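
One way to implement such a single-hop redirect is sketched below with Python's standard `http.server` module. This is only an illustration: the canonical origin, the handler name `DefensiveRedirectHandler`, and the helper `redirect_location` are assumptions, and a production site would more likely configure the redirect in its web server or CDN. The server for the defensive domain answers every request with an HTTP 301 that points directly at the `https` canonical origin, so there is no intermediate `http` to `https` hop.

```python
from http.server import BaseHTTPRequestHandler

# Assumption: the canonical site from the example above.
CANONICAL_ORIGIN = "https://example.com"


def redirect_location(path: str) -> str:
    """Build the Location header value, keeping the path and query string."""
    if not path.startswith("/"):
        path = "/" + path
    return CANONICAL_ORIGIN + path


class DefensiveRedirectHandler(BaseHTTPRequestHandler):
    """Serve the defensive registration: always redirect in a single hop."""

    def do_GET(self):
        self.send_response(301)  # an HTTP redirect, not a meta redirect
        self.send_header("Location", redirect_location(self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_HEAD = do_GET

# To try it locally (blocks until interrupted):
#   from http.server import HTTPServer
#   HTTPServer(("", 8080), DefensiveRedirectHandler).serve_forever()
```

Sending the redirect in the response headers rather than in the page body matters here, since meta redirects are not taken into account.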
| |
| ## Reporting Security Bugs |
| |
| We reward certain cases of IDN spoofs according to [Chrome's Vulnerability |
| Reward Program](https://www.google.com/about/appsecurity/chrome-rewards/index.html) |
| policies. Please see [this |
| document](https://docs.google.com/document/d/1_xJz3J9kkAPwk3pma6K3X12SyPTyyaJDSCxTfF8Y5sU/edit?usp=sharing) |
| before reporting a security bug. |