test/utf8.txt
author Ryan C. Gordon <icculus@icculus.org>
Fri, 12 Aug 2016 19:59:00 -0400
changeset 10266 c09f06c4e8c8
parent 1518 4d711949cd9a
permissions -rw-r--r--
emscripten: send fake mouse events for touches, like other targets do. (This really should be handled at the higher level and not in the individual targets, but this fixes the immediate bug.)
1501
73dc5d39bbf8 Added UTF-8 <-> UTF-16 <-> UTF-32 <-> UCS-2 <-> UCS-4 conversion capability
Sam Lantinga <slouken@libsdl.org>
UTF-8 decoder capability and stress test
----------------------------------------

Markus Kuhn <http://www.cl.cam.ac.uk/~mgk25/> - 2003-02-19

This test file can help you examine how your UTF-8 decoder handles
various types of correct, malformed, or otherwise interesting UTF-8
sequences. This file is not meant to be a conformance test. It does
not prescribe any particular outcome and therefore there is no way to
"pass" or "fail" this test file, even though the text suggests a
preferable decoder behaviour at some places. The aim is instead to
help you think about and test the behaviour of your UTF-8 decoder on a
systematic collection of unusual inputs. Experience so far suggests
that most first-time authors of UTF-8 decoders find at least one
serious problem in their decoder by using this file.

The test lines below cover boundary conditions, malformed UTF-8
sequences as well as correctly encoded UTF-8 sequences of Unicode code
points that should never occur in a correct UTF-8 file.

According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" and
"characters that are not within the adopted subset shall be indicated
to the user" by a receiving device. A quite commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. It might be a good idea to
visually distinguish a malformed UTF-8 sequence from a correctly
encoded Unicode character that is just not available in the current
font but otherwise fully legal, even though ISO 10646-1 doesn't
mandate this. In any case, just ignoring malformed sequences or
unavailable characters does not conform to ISO 10646, will make
debugging more difficult, and can lead to user confusion.

Please check whether a malformed UTF-8 sequence is (1) represented at
all, (2) represented by exactly one single replacement character (or
equivalent signal), and (3) the following quotation mark after an
illegal UTF-8 sequence is correctly displayed, i.e. proper
resynchronization takes place immediately after any malformed
sequence. This file says "THE END" in the last line, so if you don't
see that, your decoder crashed somehow before, which should always be
cause for concern.

All lines in this file are exactly 79 characters long (plus the line
feed). In addition, all lines end with "|", except for the two test
lines 2.1.1 and 2.2.1, which contain non-printable ASCII controls
U+0000 and U+007F. If you display this file with a fixed-width font,
these "|" characters should all line up in column 79 (right margin).
This allows you to test quickly whether your UTF-8 decoder finds the
correct number of characters in every line, that is whether each
malformed sequence is replaced by a single replacement character.

Note that as an alternative to the notion of malformed sequence used
here, it is also a perfectly acceptable (and in some situations even
preferable) solution to represent each individual byte of a malformed
sequence by a replacement character. If you follow this strategy in
your decoder, then please ignore the "|" column.

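
The column check described in the introduction can also be done
mechanically. The following is a minimal sketch in Python, assuming the
interpreter's built-in UTF-8 codec stands in for the decoder under test.
The helper displayed_length and the handler name "replace_per_byte" are
hypothetical names introduced for this sketch; they are not part of this
file or of SDL. It counts the characters a decoded line occupies under two
policies: the codec's own "replace" handling, and one replacement character
for every rejected byte.

```python
import codecs

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def _per_byte(exc):
    """Error handler: emit one U+FFFD for every byte the decoder rejected."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return REPLACEMENT * (exc.end - exc.start), exc.end


# "replace_per_byte" is a name made up for this sketch.
codecs.register_error("replace_per_byte", _per_byte)


def displayed_length(raw_line, errors="replace"):
    """Count the characters in one decoded line, excluding the line feed."""
    return len(raw_line.rstrip(b"\n").decode("utf-8", errors))


# A 4-byte sequence cut short after three bytes, inside quotation marks.
sample = b'"\xf0\x90\x80"\n'
print(displayed_length(sample))                      # built-in policy
print(displayed_length(sample, "replace_per_byte"))  # one U+FFFD per byte
```

Running displayed_length over every raw line of the file and comparing the
result with 79 gives the same information as inspecting the "|" column by
eye. Under the per-byte policy the counts are expected to differ, as the
note above says.
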
Here come the tests:                                                          |
                                                                              |
1  Some correct UTF-8 text                                                    |
                                                                              |
1518
4d711949cd9a Updated by Ryan Gordon
Sam Lantinga <slouken@libsdl.org>
parents: 1501
(The codepoints for this test are:                                            |
  U+03BA U+1F79 U+03C3 U+03BC U+03B5  --ryan.)                                |
                                                                              |
You should see the Greek word 'kosme':       "κόσμε"                          |
                                                                              |
                                                                              |
2  Boundary condition test cases                                              |
                                                                              |
2.1  First possible sequence of a certain length                              |
                                                                              |
(byte zero skipped...there's a null added at the end of the test. --ryan.)    |
                                                                              |
2.1.2  2 bytes (U-00000080):        "€"                                       |
2.1.3  3 bytes (U-00000800):        "ࠀ"                                       |
2.1.4  4 bytes (U-00010000):        "𐀀"                                       |
                                                                              |
(5 and 6 byte sequences were made illegal in rfc3629. --ryan.)                |
2.1.5  5 bytes (U-00200000):        ""                                       |
2.1.6  6 bytes (U-04000000):        ""                                       |
                                                                              |
2.2  Last possible sequence of a certain length                               |
                                                                              |
2.2.1  1 byte  (U-0000007F):        ""                                       |
2.2.2  2 bytes (U-000007FF):        "߿"                                       |
                                                                              |
(Section 5.3.2 below calls this illegal. --ryan.)                             |
2.2.3  3 bytes (U-0000FFFF):        "￿"                                       |
                                                                              |
(5 and 6 byte sequences, and 4 byte sequences > 0x10FFFF were made illegal    |
 in rfc3629, so these next three should be replaced with an invalid           |
 character codepoint. --ryan.)                                                |
2.2.4  4 bytes (U-001FFFFF):        ""                                       |
2.2.5  5 bytes (U-03FFFFFF):        ""                                       |
2.2.6  6 bytes (U-7FFFFFFF):        ""                                       |
                                                                              |
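
The boundary values in sections 2.1 and 2.2 follow from the number of
payload bits available at each sequence length. As a sketch only, the
hypothetical Python helper encode_legacy below (not part of this file)
builds the 1- to 6-byte forms of the original UTF-8 definition, which
rfc3629 later restricted to at most four bytes and U+10FFFF. It is meant
for regenerating test bytes, not as a production encoder.

```python
def encode_legacy(cp):
    """Encode cp using the original 1- to 6-byte UTF-8 scheme."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, first value too large for it, lead-byte marker)
    forms = ((2, 0x800, 0xC0), (3, 0x10000, 0xE0), (4, 0x200000, 0xF0),
             (5, 0x4000000, 0xF8), (6, 0x80000000, 0xFC))
    length, _, marker = next(f for f in forms if cp < f[1])
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # low six bits per continuation byte
        cp >>= 6
    # What remains of cp fits in the lead byte next to the length marker.
    return bytes([marker | cp] + tail[::-1])


# First and last code point of each sequence length, as in 2.1 and 2.2.
for cp in (0x80, 0x800, 0x10000, 0x200000, 0x4000000,
           0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF):
    print(f"U-{cp:08X} =", " ".join(f"{b:02x}" for b in encode_legacy(cp)))
```

For values up to U+10FFFF outside the surrogate range, the output should
agree with a standard encoder; the longer forms exist only to exercise a
decoder's rejection path.
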
2.3  Other boundary conditions                                                |
                                                                              |
2.3.1  U-0000D7FF = ed 9f bf = "퟿"                                            |
2.3.2  U-0000E000 = ee 80 80 = ""                                            |
2.3.3  U-0000FFFD = ef bf bd = "�"                                            |
2.3.4  U-0010FFFF = f4 8f bf bf = "􏿿"                                         |
                                                                              |
(This one is bogus in rfc3629. --ryan.)                                       |
2.3.5  U-00110000 = f4 90 80 80 = ""                                         |
                                                                              |
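
The byte values listed in section 2.3 can be cross-checked against any
strict decoder. The sketch below assumes Python's built-in UTF-8 codec as
that decoder; the helper name rejects is hypothetical. It confirms the four
legal boundary encodings in both directions and checks that the sequence
for U-00110000 is refused.

```python
# Code point -> byte sequence, copied from test lines 2.3.1 to 2.3.4.
cases = {
    0xD7FF: "ed 9f bf",
    0xE000: "ee 80 80",
    0xFFFD: "ef bf bd",
    0x10FFFF: "f4 8f bf bf",
}

for cp, hex_bytes in cases.items():
    raw = bytes.fromhex(hex_bytes)
    assert chr(cp).encode("utf-8") == raw  # encoder agrees with the table
    assert raw.decode("utf-8") == chr(cp)  # decoder round-trips it


def rejects(raw):
    """Return True if strict decoding of raw fails."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Test line 2.3.5: one past the last code point allowed by rfc3629.
print(rejects(bytes.fromhex("f4 90 80 80")))  # prints True
```
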
3  Malformed sequences                                                        |
                                                                              |
3.1  Unexpected continuation bytes                                            |
                                                                              |
Each unexpected continuation byte should be separately signalled as a         |
malformed sequence of its own.                                                |
                                                                              |
3.1.1  First continuation byte 0x80: ""                                      |
3.1.2  Last  continuation byte 0xbf: ""                                      |
                                                                              |
3.1.3  2 continuation bytes: ""                                             |
3.1.4  3 continuation bytes: ""                                            |
3.1.5  4 continuation bytes: ""                                           |
3.1.6  5 continuation bytes: ""                                          |
3.1.7  6 continuation bytes: ""                                         |
3.1.8  7 continuation bytes: ""                                        |
                                                                              |
3.1.9  Sequence of all 64 possible continuation bytes (0x80-0xbf):            |
                                                                              |
   "                                                          |
                                                              |
                                                              |
    "                                                         |
                                                                              |
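
Section 3.1 expects each stray continuation byte to be signalled on its
own. A quick way to see whether a decoder does this is to count the
replacement characters it produces. The sketch below assumes Python's
built-in codec with errors="replace" as the decoder being examined;
count_replacements is a hypothetical helper name.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(raw):
    """Decode raw leniently and count the U+FFFD characters produced."""
    return raw.decode("utf-8", errors="replace").count(REPLACEMENT)


# Test 3.1.9: all 64 possible continuation bytes, 0x80 through 0xbf.
continuation = bytes(range(0x80, 0xC0))
print(len(continuation), count_replacements(continuation))
```

A decoder following the recommendation in 3.1 reports the same number for
both values. A lower count suggests that several bytes were merged into one
malformed sequence or dropped.
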
3.2  Lonely start characters                                                  |
                                                                              |
3.2.1  All 32 first bytes of 2-byte sequences (0xc0-0xdf),                    |
       each followed by a space character:                                    |
                                                                              |
   "                                                          |
                    "                                         |
                                                                              |
3.2.2  All 16 first bytes of 3-byte sequences (0xe0-0xef),                    |
       each followed by a space character:                                    |
                                                                              |
   "                "                                         |
                                                                              |
3.2.3  All 8 first bytes of 4-byte sequences (0xf0-0xf7),                     |
       each followed by a space character:                                    |
                                                                              |
   "        "                                                         |
                                                                              |
3.2.4  All 4 first bytes of 5-byte sequences (0xf8-0xfb),                     |
       each followed by a space character:                                    |
                                                                              |
   "    "                                                                 |
                                                                              |
3.2.5  All 2 first bytes of 6-byte sequences (0xfc-0xfd),                     |
       each followed by a space character:                                    |
                                                                              |
   "  "                                                                     |
                                                                              |
3.3  Sequences with last continuation byte missing                            |
                                                                              |
All bytes of an incomplete sequence should be signalled as a single           |
malformed sequence, i.e., you should see only a single replacement            |
character in each of the next 10 tests. (Characters as in section 2)          |
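
The behaviour asked for in 3.3 and 3.4 can be sketched in Python as
follows. This is only an illustration under stated assumptions, not the
decoder used by any particular library: the names expected_length and
decode_with_replacement are hypothetical, and the test bytes are
reconstructed from the section headings (an n-byte encoding with its
last byte dropped). The lead byte and every continuation byte that
follows it are consumed together, so an incomplete sequence produces
exactly one U+FFFD. Complete sequences are handed to Python's strict
built-in codec.

```python
REPLACEMENT = "\ufffd"

def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # continuation byte (0x80-0xbf) or impossible byte (0xfe, 0xff)

def decode_with_replacement(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            # Stray continuation byte or impossible byte: one mark per byte.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Take the lead byte plus up to n-1 continuation bytes.
        j = i + 1
        while j < len(data) and j < i + n and data[j] & 0xC0 == 0x80:
            j += 1
        if j - i < n:
            # Incomplete: all bytes seen so far become ONE replacement.
            out.append(REPLACEMENT)
        else:
            try:
                # Strict codec rejects overlong forms, surrogates and
                # 5/6-byte sequences.
                out.append(data[i:j].decode("utf-8"))
            except UnicodeDecodeError:
                out.append(REPLACEMENT)
        i = j
    return "".join(out)

# The ten incomplete sequences of 3.3, concatenated as in 3.4.
ten = bytes.fromhex(
    "c0 e080 f08080 f8808080 fc80808080 df efbf f7bfbf fbbfbfbf fdbfbfbfbf")
print(decode_with_replacement(ten).count(REPLACEMENT))  # 10
```

Note that Python's own errors="replace" handler follows a different
convention and may emit more than one replacement character for some
of these inputs, which is why the grouping is done by hand here.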
                                                                              |
3.3.1  2-byte sequence with last byte missing (U+0000):     ""               |
3.3.2  3-byte sequence with last byte missing (U+0000):     ""               |
3.3.3  4-byte sequence with last byte missing (U+0000):     ""               |
3.3.4  5-byte sequence with last byte missing (U+0000):     ""               |
3.3.5  6-byte sequence with last byte missing (U+0000):     ""               |
3.3.6  2-byte sequence with last byte missing (U-000007FF): ""               |
3.3.7  3-byte sequence with last byte missing (U-0000FFFF): ""               |
3.3.8  4-byte sequence with last byte missing (U-001FFFFF): ""               |
3.3.9  5-byte sequence with last byte missing (U-03FFFFFF): ""               |
3.3.10 6-byte sequence with last byte missing (U-7FFFFFFF): ""               |
                                                                              |
3.4  Concatenation of incomplete sequences                                    |
                                                                              |
With all 10 sequences of 3.3 concatenated, you should see 10 malformed        |
sequences being signalled:                                                    |
                                                                              |
   ""                                                               |
                                                                              |
3.5  Impossible bytes                                                         |
                                                                              |
The following two bytes cannot appear in a correct UTF-8 string:              |
                                                                              |
3.5.1  fe = ""                                                               |
3.5.2  ff = ""                                                               |
3.5.3  fe fe ff ff = ""                                                   |
                                                                              |
4  Overlong sequences                                                         |
                                                                              |
The following sequences are not malformed according to the letter of          |
the Unicode 2.0 standard. However, they are longer than necessary and         |
a correct UTF-8 encoder is not allowed to produce them. A "safe UTF-8         |
decoder" should reject them just like malformed sequences for two             |
reasons: (1) It helps to debug applications if overlong sequences are         |
not treated as valid representations of characters, because this helps        |
to spot problems more quickly. (2) Overlong sequences provide                 |
alternative representations of characters that could maliciously be           |
used to bypass filters that check only for ASCII characters. For              |
instance, a 2-byte encoded line feed (LF) would not be caught by a            |
line counter that counts only 0x0a bytes, but it would still be               |
processed as a line feed by an unsafe UTF-8 decoder later in the              |
pipeline. From a security point of view, ASCII compatibility of UTF-8         |
sequences also means that ASCII characters are *only* allowed to be           |
represented by ASCII bytes in the range 0x00-0x7f. To ensure this             |
aspect of ASCII compatibility, use only "safe UTF-8 decoders" that            |
reject overlong UTF-8 sequences for which a shorter encoding exists.          |
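
A minimal Python sketch of the overlong check described above, for
illustration only. The names MIN_FOR_LENGTH, sequence_length, raw_value
and is_overlong are hypothetical and not taken from any library.
raw_value deliberately behaves like an "unsafe" decoder so that the
line feed example can be reproduced; is_overlong then compares the
decoded value with the smallest value that actually needs that many
bytes in the original (up to 6-byte) UTF-8 scheme.

```python
# Smallest code point that legitimately needs an n-byte sequence.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}

def sequence_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0

def raw_value(seq: bytes) -> int:
    """Decode exactly one complete sequence WITHOUT any safety checks."""
    n = sequence_length(seq[0])
    if n == 0 or n != len(seq):
        raise ValueError("not exactly one complete sequence")
    if n == 1:
        return seq[0]
    value = seq[0] & (0xFF >> (n + 1))  # payload bits of the lead byte
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value

def is_overlong(seq: bytes) -> bool:
    """True if a shorter encoding of the same value exists."""
    return raw_value(seq) < MIN_FOR_LENGTH[len(seq)]

# The 2-byte encoded line feed: invisible to a byte-level line counter,
# yet an unsafe decoder still turns it into U+000A.
sneaky_lf = b"\xc0\x8a"
print(sneaky_lf.count(b"\n"), hex(raw_value(sneaky_lf)),
      is_overlong(sneaky_lf))  # 0 0xa True
```

A safe decoder would call a check like is_overlong before accepting the
value and treat a True result exactly like a malformed sequence.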
                                                                              |
4.1  Examples of an overlong ASCII character                                  |
                                                                              |
With a safe UTF-8 decoder, all of the following five overlong                 |
representations of the ASCII character slash ("/") should be rejected         |
like a malformed UTF-8 sequence, for instance by substituting it with         |
a replacement character. If you see a slash below, you do not have a          |
safe UTF-8 decoder!                                                           |
                                                                              |
4.1.1 U+002F = c0 af             = ""                                        |
4.1.2 U+002F = e0 80 af          = ""                                        |
4.1.3 U+002F = f0 80 80 af       = ""                                        |
4.1.4 U+002F = f8 80 80 80 af    = ""                                        |
4.1.5 U+002F = fc 80 80 80 80 af = ""                                        |
                                                                              |
4.2  Maximum overlong sequences                                               |
                                                                              |
Below you see the highest Unicode value that still results in an              |
overlong sequence if represented with the given number of bytes. This         |
is a boundary test for safe UTF-8 decoders. All five characters should        |
be rejected like malformed UTF-8 sequences.                                   |
                                                                              |
4.2.1  U-0000007F = c1 bf             = ""                                   |
4.2.2  U-000007FF = e0 9f bf          = ""                                   |
4.2.3  U-0000FFFF = f0 8f bf bf       = ""                                   |
4.2.4  U-001FFFFF = f8 87 bf bf bf    = ""                                   |
4.2.5  U-03FFFFFF = fc 83 bf bf bf bf = ""                                   |
                                                                              |
4.3  Overlong representation of the NUL character                             |
                                                                              |
The following five sequences should also be rejected like malformed           |
UTF-8 sequences and should not be treated like the ASCII NUL                  |
character.                                                                    |
                                                                              |
4.3.1  U+0000 = c0 80             = ""                                       |
4.3.2  U+0000 = e0 80 80          = ""                                       |
4.3.3  U+0000 = f0 80 80 80       = ""                                       |
4.3.4  U+0000 = f8 80 80 80 80    = ""                                       |
4.3.5  U+0000 = fc 80 80 80 80 80 = ""                                       |
                                                                              |
5  Illegal code positions                                                     |
                                                                              |
The following UTF-8 sequences should be rejected like malformed               |
sequences, because they never represent valid ISO 10646 characters and        |
a UTF-8 decoder that accepts them might introduce security problems           |
comparable to overlong UTF-8 sequences.                                       |
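
The check that this section calls for can be sketched in Python as
below. It is an illustration only; decode_three_bytes and
is_illegal_code_position are hypothetical names. All test sequences in
5.1 to 5.3 are 3-byte sequences (or pairs of them), so the sketch
decodes just that form and then tests the value against the positions
listed in this file: the UTF-16 surrogate range U+D800..U+DFFF, and
U+FFFE and U+FFFF.

```python
def decode_three_bytes(seq: bytes) -> int:
    """Raw value of one 3-byte sequence 1110xxxx 10xxxxxx 10xxxxxx."""
    if len(seq) != 3 or seq[0] & 0xF0 != 0xE0:
        raise ValueError("expected a 3-byte sequence")
    if any(b & 0xC0 != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)

def is_illegal_code_position(cp: int) -> bool:
    """True for the code positions that section 5 asks decoders to reject."""
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    is_ffff_or_fffe = cp in (0xFFFE, 0xFFFF)
    return is_surrogate or is_ffff_or_fffe

for hex_bytes in ("ed a0 80", "ed bf bf", "ef bf be", "e2 82 ac"):
    cp = decode_three_bytes(bytes.fromhex(hex_bytes))
    print(hex_bytes, "U+%04X" % cp, is_illegal_code_position(cp))
```

The last input, e2 82 ac (U+20AC), is included as a legal value for
contrast. A paired surrogate as in 5.2 is simply two such sequences in
a row, each of which fails the check on its own.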
                                                                              |
5.1 Single UTF-16 surrogates                                                  |
                                                                              |
5.1.1  U+D800 = ed a0 80 = ""                                                |
5.1.2  U+DB7F = ed ad bf = ""                                                |
5.1.3  U+DB80 = ed ae 80 = ""                                                |
5.1.4  U+DBFF = ed af bf = ""                                                |
5.1.5  U+DC00 = ed b0 80 = ""                                                |
5.1.6  U+DF80 = ed be 80 = ""                                                |
5.1.7  U+DFFF = ed bf bf = ""                                                |
                                                                              |
5.2 Paired UTF-16 surrogates                                                  |
                                                                              |
5.2.1  U+D800 U+DC00 = ed a0 80 ed b0 80 = ""                               |
5.2.2  U+D800 U+DFFF = ed a0 80 ed bf bf = ""                               |
5.2.3  U+DB7F U+DC00 = ed ad bf ed b0 80 = ""                               |
5.2.4  U+DB7F U+DFFF = ed ad bf ed bf bf = ""                               |
5.2.5  U+DB80 U+DC00 = ed ae 80 ed b0 80 = ""                               |
5.2.6  U+DB80 U+DFFF = ed ae 80 ed bf bf = ""                               |
5.2.7  U+DBFF U+DC00 = ed af bf ed b0 80 = ""                               |
5.2.8  U+DBFF U+DFFF = ed af bf ed bf bf = ""                               |
                                                                              |
5.3 Other illegal code positions                                              |
                                                                              |
5.3.1  U+FFFE = ef bf be = "￾"                                                |
5.3.2  U+FFFF = ef bf bf = "￿"                                                |
                                                                              |
THE END                                                                       |
1518
4d711949cd9a Updated by Ryan Gordon
Sam Lantinga <slouken@libsdl.org>
parents: 1501
diff changeset
   287