
2 Lexical conventions

[lex] 

  1. The text of the program is kept in units called source files in this International Standard. A source file together with all the headers (17.4.1.2) and source files included (16.2) via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion (16.1) preprocessing directives, is called a translation unit. [Note: a C++ program need not all be translated at the same time].
  2. [Note: previously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate (3.5) by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program. (3.5)].

2.1 Phases of translation 

[lex.phases] 

  1. The precedence among the syntax rules of translation is specified by the following phases 13).
    1. Physical source file characters are mapped, in an implementation­defined manner, to the basic source character set (introducing new­line characters for end­of­line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single­character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal­character­name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal­character­name (i.e. using the \uXXXX notation), are handled equivalently.)
    2. Each instance of a new­line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. If, as a result, a character sequence that matches the syntax of a universal­character­name is produced, the behavior is undefined. If a source file that is not empty does not end in a new­line character, or ends in a new­line character immediately preceded by a backslash character, the behavior is undefined.
    3. The source file is decomposed into preprocessing tokens (2.4) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or partial comment 14). Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined. The process of dividing a source file's characters into preprocessing tokens is context-dependent. [Example: see the handling of < within a #include preprocessing directive.]
    4. Preprocessing directives are executed and macro invocations are expanded. If a character sequence that matches the syntax of a universal­character­name is produced by token concatenation (16.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.
    5. Each source character set member, escape sequence, or universal­character­name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).
    6. Adjacent ordinary string literal tokens are concatenated. Adjacent wide string literal tokens are concatenated.
    7. Whitespace characters separating tokens are no longer significant. Each preprocessing token is converted into a token (2.6). The resulting tokens are syntactically and semantically analyzed and translated. [Note: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one­to­one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation].
    8. Translated translation units and instantiation units are combined as follows: [Note: some or all of these may be supplied from a library.] Each translated translation unit is examined to produce a list of required instantiations. [Note: this may include instantiations which have been explicitly requested(14.7.2).] The definitions of the required templates are located. It is implementation­defined whether the source of the translation units containing these definitions is required to be available. [Note: an implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here.] All the required instantiations are performed to produce instantiation units. [Note: these are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions.] The program is ill­formed if any instantiation fails.
    9. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
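
A minimal sketch of phases 2, 4 and 6 acting on one definition. The macro name TWO_PART_MESSAGE and the function message_length are invented for this illustration and do not come from the standard.

```cpp
#include <cstddef>
#include <cstring>

// Phase 2: the backslash-newline pair below is deleted, so the macro
// definition occupies two physical lines but a single logical line.
#define TWO_PART_MESSAGE "phases of " \
                         "translation"

// Phase 4 expands the macro; phase 6 then concatenates the two adjacent
// ordinary string literals into one literal.
std::size_t message_length() {
    return std::strlen(TWO_PART_MESSAGE);
}
```

Calling message_length() measures the single concatenated literal, which suggests that the splice and the concatenation both happened before the translator analyzed the expression.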


2.2 Character sets

[lex.charset]

  1. The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters: 15)
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    0 1 2 3 4 5 6 7 8 9
    _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
  2. The universal-character-name construct provides a way to name other characters.
    hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
    universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad
    The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal-character-name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal-character-name designates a character in the basic source character set, then the program is ill-formed.

  3. The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.
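
The restriction on universal-character-name values stated in this subclause can be sketched as a predicate. This is an illustration only: the function name ucn_value_allowed is hypothetical, and the check assumes that the basic source character set is identified with its ASCII code points, which footnote 15 describes as the intent. It does not check whether the value names an assigned ISO/IEC 10646 character.

```cpp
// Returns true if a universal-character-name with the given hexadecimal
// value passes the three restrictions quoted in this subclause.
// Assumption: basic source characters are identified by ASCII code point.
bool ucn_value_allowed(unsigned long value) {
    if (value < 0x20) return false;                    // below 0x20
    if (value >= 0x7F && value <= 0x9F) return false;  // 0x7F-0x9F inclusive
    if (value <= 0x7E) {
        // Of the codes 0x20-0x7E, only '$', '@' and '`' lie outside the
        // 96-character basic source character set.
        return value == 0x24 || value == 0x40 || value == 0x60;
    }
    return true;
}
```

Under these assumptions \u0041 (the letter A) is rejected because it designates a basic source character, while \u0024 and \u00E9 are accepted.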


2.3 Trigraph sequences

[lex.trigraph]

  1. Before any other processing takes place, each occurrence of one of the following sequences of three characters ("trigraph sequences") is replaced by the single character indicated in Table 1.


    Table 1 - trigraph sequences

    trigraph  replacement    trigraph  replacement    trigraph  replacement
    ??=       #              ??(       [              ??<       {
    ??/       \              ??)       ]              ??>       }
    ??'       ^              ??!       |              ??-       ~

    [Example:
    ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
    becomes
    #define arraycheck(a,b) a[b] || b[a]
    --end example]

  2. No other trigraph sequence exists. Each ? that does not begin one of the trigraphs listed above is not changed.
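
The replacement rule can be sketched as a small text filter. This is not how any particular translator implements it; the function names trigraph_replacement and replace_trigraphs are hypothetical. The code deliberately avoids spelling a trigraph in its own source, so that it behaves the same whether or not the compiler used to build it performs trigraph replacement.

```cpp
#include <string>

// Maps the third character of a candidate trigraph to its replacement
// (Table 1), or returns '\0' if the character does not complete a trigraph.
char trigraph_replacement(char third) {
    switch (third) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

// Replaces every trigraph in the input; any question mark that does not
// begin a trigraph is copied unchanged.
std::string replace_trigraphs(const std::string& in) {
    std::string out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            char r = trigraph_replacement(in[i + 2]);
            if (r != '\0') {
                out += r;
                i += 3;
                continue;
            }
        }
        out += in[i];
        ++i;
    }
    return out;
}
```

Feeding the arraycheck line from the example through replace_trigraphs should give the #define line shown there. In a test, the input is best written with the \? escape so that the test source itself contains no trigraph.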

2.4 Preprocessing tokens

[lex.pptoken]

preprocessing­token:
header­name
identifier
pp­number
character­literal
string-literal
preprocessing­op­or­punc
each non­white­space character that cannot be one of the above
  1. Each preprocessing token that is converted to a token (2.6) shall have the lexical form of a keyword, an identifier, a literal, an operator, or a punctuator.
  2. A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals, string literals, preprocessing­op­or­punc, and single non­white­space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (2.7), or white­space characters (space, horizontal tab, new­line, vertical tab, and form­feed), or both. As described in clause 16, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
  3. If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.
  4. [Example: The program fragment 1Ex is parsed as a preprocessing number token (one that is not a valid floating or integer literal token), even though a parse as the pair of preprocessing tokens 1 and Ex might produce a valid expression (for example, if Ex were a macro defined as +1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating literal token), whether or not E is a macro name.]

  5. [Example: The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y are of built-in types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression.]
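
The longest-match rule can be illustrated with a toy tokenizer. This is a sketch restricted to identifiers, + and ++; it is not a full preprocessing-token scanner, and the function name munch is invented for the illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Greedy tokenizer over identifiers, "+" and "++" only: at each position
// it takes the longest sequence that can form a token.
std::vector<std::string> munch(const std::string& src) {
    std::vector<std::string> tokens;
    std::string::size_type i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        std::string::size_type start = i;
        if (std::isalpha(c) || c == '_') {
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
        } else if (c == '+' && i + 1 < src.size() && src[i + 1] == '+') {
            i += 2;  // prefer the longer token "++" over "+"
        } else {
            ++i;     // "+" or any other single character
        }
        tokens.push_back(src.substr(start, i - start));
    }
    return tokens;
}
```

Applied to x+++++y, this sketch yields the five tokens x, ++, ++, + and y, matching the parse described in the example rather than the grouping x ++ + ++ y.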

2.5 Alternative tokens

[lex.digraph]

  1. Alternative token representations are provided for some operators and punctuators 16).
  2. In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling 17). The set of alternative tokens is defined in Table 2.
Table 2 - alternative tokens

    alternative  primary    alternative  primary    alternative  primary
    <%           {          and          &&         and_eq       &=
    %>           }          bitor        |          or_eq        |=
    <:           [          or           ||         xor_eq       ^=
    :>           ]          xor          ^          not          !
    %:           #          compl        ~          not_eq       !=
    %:%:         ##         bitand       &
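
A short sketch of alternative tokens in use. The function names are hypothetical. Each function should behave as if it had been written with the corresponding primary tokens; some compilers may need a conformance option before they accept the keyword-like spellings.

```cpp
// <% and %> stand for { and }, "and" stands for &&.
bool in_range(int v, int lo, int hi) <%
    return lo <= v and v <= hi;
%>

// <: and :> stand for [ and ].
int second_element(const int (&a)<:3:>) <%
    return a<:1:>;
%>

// "bitand" stands for &, "compl" stands for ~.
unsigned clear_high_nibble(unsigned x) <%
    return x bitand compl 0xF0u;
%>
```

For instance, clear_high_nibble(0xFFu) computes 0xFFu & ~0xF0u, exactly as the primary spelling would.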
   

2.6 Tokens

[lex.token]

token:
identifier
keyword
literal
operator
punctuator
  1. There are five kinds of tokens: identifiers, keywords, literals 18), operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, "white space"), as described below, are ignored except as they serve to separate tokens. [Note: Some white space is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters.]

2.7 Comments

[lex.comment]

  1. The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates with the next new-line character. If there is a formfeed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.]
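
A brief sketch of the two comment forms and of the note about comment characters inside comments. The function name comment_demo is invented for the illustration.

```cpp
int comment_demo() {
    int n = 1;  // a /* in a line comment does not open a block comment
    /* a // in a block comment does not start a line comment */ n += 1;
    return n;
}
```

Because the block comment on the second statement line ends at the first */, the increment after it is still compiled, so both statements take effect.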


2.8 Header names

[lex.header]

    header­name:
    <h­char­sequence>
    "q­char­sequence"
    h­char­sequence:
    h­char
    h­char­sequence h­char
    h­char:
    any member of the source character set except new­line and >
    q­char­sequence:
    q­char
    q­char­sequence q­char
    q­char:
    any member of the source character set except new­line and "
  1. Header name preprocessing tokens shall only appear within a #include preprocessing directive (16.2). The sequences in both forms of header­names are mapped in an implementation­defined manner to headers or to external source file names as specified in 16.2.
  2. If either of the characters ' or \, or either of the character sequences /* or // appears in a q­char­sequence or a h­char­sequence, or the character " appears in a h­char­sequence, the behavior is undefined 19).

2.9 Preprocessing numbers

[lex.ppnumber]

    pp­number:
    digit
    . digit
    pp­number digit
    pp­number nondigit
    pp­number e sign
    pp­number E sign
    pp­number .
  1. Preprocessing number tokens lexically include all integral literal tokens (2.13.1) and all floating literal tokens (2.13.3).
  2. A preprocessing number does not have a type or a value; it acquires both after a successful conversion (as part of translation phase 7, 2.1) to an integral literal token or a floating literal token.
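
The pp-number grammar is deliberately looser than the grammars for integer and floating literals. A minimal recognizer, written as a sketch, makes that visible. The function name is_pp_number is hypothetical, and universal-character-names as nondigits are not handled here.

```cpp
#include <cctype>
#include <string>

// Returns true if the whole string matches the pp-number grammar.
// Assumption: universal-character-names are not considered.
bool is_pp_number(const std::string& s) {
    std::string::size_type i = 0;
    // A pp-number begins with "digit" or ". digit".
    if (i < s.size() && s[i] == '.') ++i;
    if (i >= s.size() || !std::isdigit(static_cast<unsigned char>(s[i]))) {
        return false;
    }
    ++i;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c == 'e' || c == 'E') && i + 1 < s.size() &&
            (s[i + 1] == '+' || s[i + 1] == '-')) {
            i += 2;  // pp-number e sign, pp-number E sign
        } else if (std::isalnum(c) || c == '_' || c == '.') {
            ++i;     // pp-number digit, pp-number nondigit, pp-number .
        } else {
            return false;
        }
    }
    return true;
}
```

The sketch accepts 1Ex and 1e+ even though neither converts to a valid literal in phase 7, and rejects 1+2 because a sign is only absorbed directly after e or E.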


2.10 Identifiers

[lex.name]

    identifier:
    nondigit
    identifier nondigit
    identifier digit
    nondigit: one of
    universal­character­name
    _ a b c d e f g h i j k l m
    n o p q r s t u v w x y z
    A B C D E F G H I J K L M
    N O P Q R S T U V W X Y Z
    digit: one of
    0 1 2 3 4 5 6 7 8 9
  1. An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Annex E. Upper- and lower-case letters are different. All characters are significant 20).
  2. In addition, some identifiers are reserved for use by C++ implementations and standard libraries (17.4.3.1.2) and shall not be used otherwise; no diagnostic is required.

2.11 Keywords

[lex.key]

  1. The identifiers shown in Table 3 are reserved for use as keywords (that is, they are unconditionally treated as keywords in phase 7):

    Table 3 - keywords

    asm         do            if                return       typedef
    auto        double        inline            short        typeid
    bool        dynamic_cast  int               signed       typename
    break       else          long              sizeof       union
    case        enum          mutable           static       unsigned
    catch       explicit      namespace         static_cast  using
    char        export        new               struct       virtual
    class       extern        operator          switch       void
    const       false         private           template     volatile
    const_cast  float         protected         this         wchar_t
    continue    for           public            throw        while
    default     friend        register          true
    delete      goto          reinterpret_cast  try

  2. Furthermore, the alternative representations shown in Table 4 for certain operators and punctuators (2.5) are reserved and shall not be used otherwise:

    Table 4 - alternative representations

    and     and_eq  bitand  bitor  compl   not
    not_eq  or      or_eq   xor    xor_eq


2.12 Operators and punctuators

[lex.operators]

  1. The lexical representation of C++ programs includes a number of preprocessing tokens which are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:

    preprocessing­op­or­punc: one of

    { } [ ] # ## ( )
    <: :> <% %> %: %:%: ; : ...
    new delete ? :: . .*
    + - * / % ^ & | ~
    ! = < > += -= *= /= %=
    ^= &= |= << >> >>= <<= == !=
    <= >= && || ++ -- , ->* ->
    and and_eq bitand bitor compl not not_eq
    or or_eq xor xor_eq
    Each preprocessing­op­or­punc is converted to a single token in translation phase 7 (2.1).

2.13 Literals

[lex.literal]

  1. There are several kinds of literals 21).

    literal:

    integer­literal
    character­literal
    floating­literal
    string­literal
    boolean­literal

2.13.1 Integer literals

[lex.icon]

    integer­literal:
    decimal-literal integer-suffix opt
    octal-literal integer-suffix opt
    hexadecimal-literal integer-suffix opt
    decimal­literal:
    nonzero­digit
    decimal­literal digit
    octal­literal:
    0
    octal­literal octal­digit
    hexadecimal­literal:
    0x hexadecimal­digit
    0X hexadecimal­digit
    hexadecimal­literal hexadecimal­digit
    nonzero­digit: one of
    1 2 3 4 5 6 7 8 9
    octal­digit: one of
    0 1 2 3 4 5 6 7

    hexadecimal­digit: one of
    0 1 2 3 4 5 6 7 8 9
    a b c d e f
    A B C D E F
    integer­suffix:
    unsigned-suffix long-suffix opt
    long-suffix unsigned-suffix opt
    unsigned­suffix: one of
    u U
    long­suffix: one of
    l L
  1. An integer literal is a sequence of digits that has no period or exponent part. An integer literal may have a prefix that specifies its base and a suffix that specifies its type. The lexically first digit of the sequence of digits is the most significant. A decimal integer literal (base ten) begins with a digit other than 0 and consists of a sequence of decimal digits. An octal integer literal (base eight) begins with the digit 0 and consists of a sequence of octal digits 22). A hexadecimal integer literal (base sixteen) begins with 0x or 0X and consists of a sequence of hexadecimal digits, which include the decimal digits and the letters a through f and A through F with decimal values ten through fifteen. [Example: the number twelve can be written 12, 014, or 0XC.]
  2. The type of an integer literal depends on its form, value, and suffix. If it is decimal and has no suffix, it has the first of these types in which its value can be represented: int, long int; if the value cannot be represented as a long int, the behavior is undefined. If it is octal or hexadecimal and has no suffix, it has the first of these types in which its value can be represented: int, unsigned int, long int, unsigned long int. If it is suffixed by u or U, its type is the first of these types in which its value can be represented: unsigned int, unsigned long int. If it is suffixed by l or L, its type is the first of these types in which its value can be represented: long int, unsigned long int. If it is suffixed by ul, lu, uL, Lu, Ul, lU, UL, or LU, its type is unsigned long int.
  3. A program is ill­formed if one of its translation units contains an integer literal that cannot be represented by any of the allowed types.
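
The base prefixes and type suffixes can be checked with a few small literals. This sketch uses decltype, static_assert and std::is_same, which are C++11 facilities and therefore later than this text; they serve only to inspect the types. The function name twelve_three_ways is invented. Only values that fit in int are used, so the results do not depend on the widths of the integer types.

```cpp
#include <type_traits>

// Suffix determines the type when the value fits in the first candidate.
static_assert(std::is_same<decltype(12), int>::value, "no suffix, fits in int");
static_assert(std::is_same<decltype(12u), unsigned int>::value, "u suffix");
static_assert(std::is_same<decltype(12L), long>::value, "L suffix");
static_assert(std::is_same<decltype(12UL), unsigned long>::value, "UL suffix");
static_assert(std::is_same<decltype(12lu), unsigned long>::value, "lu suffix");

// The number twelve written in decimal, octal and hexadecimal.
int twelve_three_ways() {
    return 12 + 014 + 0XC;
}
```

Since all three spellings denote twelve, the sum is thirty-six.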

2.13.2 Character literals

[lex.ccon]

    character­literal:
    'c­char­sequence'
    L'c­char­sequence'
    c­char­sequence:
    c­char
    c­char­sequence c­char
    c­char:
    any member of the source character set except the single­quote ', backslash \, or new­line character
    escape­sequence
    universal­character­name

    escape­sequence:
    simple­escape­sequence
    octal­escape­sequence
    hexadecimal­escape­sequence
    simple­escape­sequence: one of
    \' \" \? \\
    \a \b \f \n \r \t \v
    octal­escape­sequence:
    \ octal­digit
    \ octal­digit octal­digit
    \ octal­digit octal­digit octal­digit
    hexadecimal­escape­sequence:
    \x hexadecimal­digit
    hexadecimal­escape­sequence hexadecimal­digit
  1. A character literal is one or more characters enclosed in single quotes, as in 'x', optionally preceded by the letter L, as in L'x'. A character literal that does not begin with L is an ordinary character literal, also referred to as a narrow­character literal. An ordinary character literal that contains a single c­char has type char, with value equal to the numerical value of the encoding of the c­char in the execution character set. An ordinary character literal that contains more than one c­char is a multicharacter literal. A multicharacter literal has type int and implementation­defined value.
  2. A character literal that begins with the letter L, such as L'x', is a wide­character literal. A wide­character literal has type wchar_t 23). The value of a wide­character literal containing a single c­char has value equal to the numerical value of the encoding of the c­char in the execution wide­character set. The value of a wide­character literal containing multiple c­chars is implementation­defined.
  3. Certain nongraphic characters, the single quote ', the double quote ", the question mark ?, and the back­slash \, can be represented according to Table 5.
    Table 5 - escape sequences

    new-line         NL (LF)  \n
    horizontal tab   HT       \t
    vertical tab     VT       \v
    backspace        BS       \b
    carriage return  CR       \r
    form feed        FF       \f
    alert            BEL      \a
    backslash        \        \\
    question mark    ?        \?
    single quote     '        \'
    double quote     "        \"
    octal number     ooo      \ooo
    hex number       hhh      \xhhh

    The double quote " and the question mark ? can be represented as themselves or by the escape sequences \" and \? respectively, but the single quote ' and the backslash \ shall be represented by the escape sequences \' and \\ respectively. If the character following a backslash is not one of those specified, the behavior is undefined. An escape sequence specifies a single character.

  4. The escape \ooo consists of the backslash followed by one, two, or three octal digits that are taken to specify the value of the desired character. The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. There is no limit to the number of digits in a hexadecimal sequence. A sequence of octal or hexadecimal digits is terminated by the first character that is not an octal digit or a hexadecimal digit, respectively. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char (for ordinary literals) or wchar_t (for wide literals).
  5. A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding. [Note: in translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained.]
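
A few properties of character literals that do not depend on the execution character set can be sketched as follows. The type checks use C++11 facilities (decltype, static_assert, std::is_same) purely for inspection, and the function name escapes_agree is invented.

```cpp
#include <type_traits>

static_assert(std::is_same<decltype('x'), char>::value,
              "an ordinary character literal with one c-char has type char");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value,
              "a wide-character literal has type wchar_t");

bool escapes_agree() {
    return '\101' == '\x41'  // octal 101 and hexadecimal 41 name the same value
        && '\?' == '?'       // the question mark may be written either way
        && '\"' == '"'       // so may the double quote
        && '\0' == 0;        // the null character has value zero
}
```

The comparisons avoid assuming any particular encoding: they only compare two spellings of the same character or value.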

2.13.3 Floating literals

[lex.fcon]

    floating­literal:
    fractional-constant exponent-part opt floating-suffix opt
    digit-sequence exponent-part floating-suffix opt
    fractional-constant:
    digit-sequence opt . digit-sequence
    digit-sequence .
    exponent-part:
    e sign opt digit-sequence
    E sign opt digit-sequence
    sign: one of
    + -
    digit­sequence:
    digit
    digit­sequence digit
    floating­suffix: one of
    f l F L
  1. A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. The integer and fraction parts both consist of a sequence of decimal (base ten) digits. Either the integer part or the fraction part (not both) can be omitted; either the decimal point or the letter e (or E) and the exponent (not both) can be omitted. The integer part, the optional decimal point and the optional fraction part form the significant part of the floating literal. The exponent, if present, indicates the power of 10 by which the significant part is to be scaled. If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation­defined manner. The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double. If the scaled value is not in the range of representable values for its type, the program is ill­formed.
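
The optional parts of a floating literal and the effect of the suffixes can be sketched with values that are exactly representable in binary floating point. The type checks rely on C++11 facilities for inspection only, and the function name spellings_agree is invented.

```cpp
#include <type_traits>

static_assert(std::is_same<decltype(1.5), double>::value, "no suffix: double");
static_assert(std::is_same<decltype(1.5f), float>::value, "f suffix: float");
static_assert(std::is_same<decltype(1.5L), long double>::value, "L suffix: long double");

bool spellings_agree() {
    return 1. == 1.0       // fraction part omitted
        && .5 == 0.5       // integer part omitted
        && 15e-1 == 1.5    // decimal point omitted, exponent present
        && 1e3 == 1000.0;  // exponent scales by a power of 10
}
```

Each pair names a value that the type can represent exactly, so the scaled value itself is the result and the two spellings compare equal.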


2.13.4 String literals

[lex.string]

    string-literal:
    "s-char-sequence opt"
    L"s-char-sequence opt"
    s-char-sequence:
    s-char
    s-char-sequence s-char
    s­char:
    any member of the source character set except the double­quote ", backslash \, or new-line character
    escape­sequence
    universal­character­name
  1. A string literal is a sequence of characters (as defined in 2.13.2) surrounded by double quotes, optionally beginning with the letter L, as in "..." or L"...". A string literal that does not begin with L is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type "array of n const char" and static storage duration (3.7), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.
  2. Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation­defined. The effect of attempting to modify a string literal is undefined.
  3. In translation phase 6 (2.1), adjacent narrow string literals are concatenated and adjacent wide string literals are concatenated. If a narrow string literal token is adjacent to a wide string literal token, the behavior is undefined. Characters in concatenated strings are kept distinct. [Example:
    "\xA" "B"

    contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').]

  4. After any necessary concatenation, in translation phase 7 (2.1), '\0' is appended to every string literal so that programs that scan a string can find its end.
  5. Escape sequences and universal­character­names in string literals have the same meaning as in character literals (2.13.2), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal­character­name may map to more than one char element due to multibyte encoding. The size of a wide string literal is the total number of escape sequences, universal­character­names, and other characters, plus one for the terminating L'\0'. The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal­character­name, plus one for the terminating '\0'.
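
The size rules and the "\xA" "B" example can be checked with sizeof. This is a sketch; the names joined and string_sizes_as_described are invented, and only narrow-with-narrow and wide-alone literals are used.

```cpp
#include <cstddef>

// Concatenation keeps the two characters distinct: '\xA' followed by 'B'.
const char joined[] = "\xA" "B";

bool string_sizes_as_described() {
    return sizeof("abc") == 4                     // three characters plus '\0'
        && sizeof(joined) == 3                    // '\xA', 'B', '\0'
        && joined[0] == '\xA'
        && joined[1] == 'B'
        && joined[2] == '\0'
        && sizeof(L"ab") == 3 * sizeof(wchar_t);  // two characters plus L'\0'
}
```

If the hexadecimal escape had absorbed the B, joined would hold a single character plus the terminator, and its size would be 2 rather than 3.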

2.13.5 Boolean literals

[lex.bool]

    boolean­literal:
    false
    true
  1. The Boolean literals are the keywords false and true. Such literals have type bool. They are not lvalues.


13) Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.

14) A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.

15) The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/ IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation­defined, an implementation is required to document how the basic source characters are represented in source files.

16) These include "digraphs" and additional reserved words. The term "digraph" (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as "digraphs".

17) Thus the "stringized" values (16.3.2) of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged.

18) Literals include strings and character and numeric literals.

19) Thus, sequences of characters that resemble escape sequences cause undefined behavior.

20) On systems in which linkers cannot accept extended characters, an encoding of the universal-character-name may be used in forming valid external identifiers. For example, some otherwise unused character or sequence of characters may be used to encode the \u in a universal-character-name. Extended characters may produce a long external identifier, but C++ does not place a translation limit on significant characters for external identifiers. In C++, upper- and lower-case letters are considered different for all identifiers, including external identifiers.

21) The term "literal" generally designates, in this International Standard, those tokens that are called "constants" in ISO C.

22) The digits 8 and 9 are not octal digits.

23) They are intended for character sets where a character does not fit into a single byte.