A Deep Dive into the JavaScriptCore Engine, Part 3: Lexical Analysis

Introduction

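Lexical analysis is the first task a compiler performs. After preprocessing the source (stripping comments and unneeded line breaks, locating included files, and so on), it breaks the whole program into individual words (tokens). In the classic textbook treatment these tokens fall into five classes: identifiers, keywords, constants, operators, and delimiters. They prepare the ground for the syntax analysis and semantic analysis that follow. In other words, lexical analysis works on individual characters and groups them into valid words (strings), so the main job of this phase is to build a lexer. There are two ways to do so: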

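  • Write it by hand. For a specific language such as JavaScript, we code the conversion ourselves according to the language's rules. This approach is laborious and error-prone, but it gives precise control over every part of the lexer and can be more efficient.
  • Generate it automatically. The pipeline is: declarative specification -> generator -> lexer. The generator is a tool whose input is a declarative specification and whose output is a lexer. "Declarative" means we only state the rules for recognizing the character stream, not how to carry out the recognition, and then hand those rules to the generator. The specification is written as regular expressions: the keywords, identifiers, and integers of a language such as C can all be described by regular expressions. Fed with these expressions, the generator produces a lexer in the form of a DFA (deterministic finite automaton). Internally it converts the regular expressions into a nondeterministic finite automaton (NFA), converts the NFA into a DFA, and then minimizes the DFA to obtain the final lexer. If the input code is accepted by the automaton, compilation proceeds to the syntax analysis phase.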

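The best-known tools for lexical and syntax analysis are lex/yacc and their successors flex/bison (The LEX & YACC Page). They supply language and text parsing for a great deal of software, and they are both powerful and fun to use.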

JavaScriptCore does not use them; its lexer is written by hand. The basic ideas, however, are similar.
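To make the hand-written approach concrete, here is a minimal sketch of a character-driven tokenizer in C++. It is not JavaScriptCore code: the Kind enum, the Token struct, and the tokenize function are hypothetical names invented for this illustration, and the sketch recognizes only ASCII identifiers, decimal integers, and single-character punctuators.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; not JavaScriptCore's JSTokenType.
enum class Kind { Identifier, Number, Punctuator };

struct Token {
    Kind kind;
    std::string text;
};

// Splits `source` into tokens by inspecting one character at a time,
// the way a hand-written lexer walks a character stream.
std::vector<Token> tokenize(const std::string& source)
{
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i; // whitespace separates tokens but produces none
            continue;
        }
        std::size_t start = i;
        Kind kind;
        if (std::isalpha(c) || c == '_' || c == '$') {
            // Identifier: letter, '_' or '$', then letters, digits, '_' or '$'.
            while (i < source.size()
                && (std::isalnum(static_cast<unsigned char>(source[i])) || source[i] == '_' || source[i] == '$'))
                ++i;
            kind = Kind::Identifier;
        } else if (std::isdigit(c)) {
            // Number: a run of decimal digits.
            while (i < source.size() && std::isdigit(static_cast<unsigned char>(source[i])))
                ++i;
            kind = Kind::Number;
        } else {
            // Anything else is treated as a one-character punctuator.
            ++i;
            kind = Kind::Punctuator;
        }
        assert(i > start); // every iteration consumes at least one character
        tokens.push_back({kind, source.substr(start, i - start)});
    }
    return tokens;
}
```

Calling tokenize on the text var x1 = 42; yields five tokens: var, x1, =, 42, and ;. The sketch does not separate keywords from identifiers; a real lexer adds that step, along with multi-character operators, strings, and comments.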

Data Types

As usual, let's start with the basic data types involved.

The CharacterType type

The lexer classifies every character of JavaScript source into one of the following character types:

// JavaScriptCore/parser/Lexer.cpp
enum CharacterType {
    // Types for the main switch

    // The first three types are fixed, and also used for identifying
    // ASCII alpha and alphanumeric characters (see isIdentStart and isIdentPart).
    CharacterIdentifierStart,
    CharacterZero,
    CharacterNumber,

    CharacterInvalid,
    CharacterLineTerminator,
    CharacterExclamationMark,
    CharacterOpenParen,
    CharacterCloseParen,
    CharacterOpenBracket,
    CharacterCloseBracket,
    CharacterComma,
    CharacterColon,
    CharacterQuestion,
    CharacterTilde,
    CharacterQuote,
    CharacterBackQuote,
    CharacterDot,
    CharacterSlash,
    CharacterBackSlash,
    CharacterSemicolon,
    CharacterOpenBrace,
    CharacterCloseBrace,

    CharacterAdd,
    CharacterSub,
    CharacterMultiply,
    CharacterModulo,
    CharacterAnd,
    CharacterXor,
    CharacterOr,
    CharacterLess,
    CharacterGreater,
    CharacterEqual,

    // Other types (only one so far)
    CharacterWhiteSpace,
    CharacterPrivateIdentifierStart
};
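A classification enum like this is typically paired with a lookup table, so that finding a character's class costs a single array load. The following is a minimal sketch of that idea, not the real table: CharClass, buildTable, and classify are hypothetical names, and only a handful of classes are modeled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical, much smaller stand-in for CharacterType.
enum class CharClass : std::uint8_t {
    Invalid,
    IdentifierStart,
    Zero,
    Number,
    WhiteSpace,
    LineTerminator
};

// Builds the 256-entry table once; entries not set below stay Invalid.
std::array<CharClass, 256> buildTable()
{
    std::array<CharClass, 256> table;
    table.fill(CharClass::Invalid);
    for (int c = 'a'; c <= 'z'; ++c)
        table[c] = CharClass::IdentifierStart;
    for (int c = 'A'; c <= 'Z'; ++c)
        table[c] = CharClass::IdentifierStart;
    table['_'] = CharClass::IdentifierStart;
    table['$'] = CharClass::IdentifierStart;
    table['0'] = CharClass::Zero;
    for (int c = '1'; c <= '9'; ++c)
        table[c] = CharClass::Number;
    table[' '] = CharClass::WhiteSpace;
    table['\t'] = CharClass::WhiteSpace;
    table['\n'] = CharClass::LineTerminator;
    table['\r'] = CharClass::LineTerminator;
    return table;
}

// Classifies one 8-bit character with a single table lookup.
CharClass classify(unsigned char c)
{
    static const std::array<CharClass, 256> table = buildTable();
    return table[c];
}
```

Keeping '0' in its own class mirrors the CharacterZero entry above: a leading zero may start a prefixed literal, so it helps to dispatch on it separately from the digits 1 to 9.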

The JSTokenType type

JSTokenType is defined as follows. Every JavaScript keyword, arithmetic operator, and logical operator gets its own token type (there are many of them, so some are omitted here…)

// JavaScriptCore/parser/ParserTokens.h
enum JSTokenType {
    NULLTOKEN = KeywordTokenFlag,
    TRUETOKEN,
    FALSETOKEN,
    BREAK,
    CASE,
    DEFAULT,
    FOR,
    NEW,
    VAR,
    LET,
    CONSTTOKEN,
    CONTINUE,
    FUNCTION,
    RETURN,
    IF,
    THISTOKEN,
    DO,
    WHILE,
    SWITCH,
    WITH,
    RESERVED,
    RESERVED_IF_STRICT,
    THROW,
    TRY,
    CATCH,
    FINALLY,
    DEBUGGER,
    ELSE,
    IMPORT,
    EXPORT,
    YIELD,
    CLASSTOKEN,
    EXTENDS,
    SUPER,
    OPENBRACE = 0,
    CLOSEBRACE,
    OPENPAREN,
    CLOSEPAREN,
    OPENBRACKET,
    CLOSEBRACKET,
    COMMA,
    QUESTION,

    ...

The following file maps keywords to the enum values above (again, some entries are omitted for brevity):

// JavaScriptCore/parser/Keywords.table
# Types.
null        NULLTOKEN
true        TRUETOKEN
false       FALSETOKEN

# Keywords.
break       BREAK
case        CASE
catch       CATCH
class       CLASSTOKEN
const       CONSTTOKEN
default     DEFAULT
extends     EXTENDS
finally     FINALLY
for         FOR
instanceof  INSTANCEOF
new         NEW
var         VAR
let         LET
continue    CONTINUE
function    FUNCTION
return      RETURN
void        VOIDTOKEN
delete      DELETETOKEN
if          IF
this        THISTOKEN
do          DO
while       WHILE
else        ELSE
in          INTOKEN
super       SUPER
switch      SWITCH
throw       THROW
try         TRY
typeof      TYPEOF
with        WITH
debugger    DEBUGGER
yield       YIELD

...
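The effect of such a table is a lookup from a scanned word to its token type, with "plain identifier" as the fallback. The sketch below shows that lookup with std::unordered_map as a stand-in; how JavaScriptCore actually turns Keywords.table into lookup code is not covered in this article, and SketchTokenType and keywordOrIdentifier are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical subset of token types; IDENT means "not a keyword".
enum SketchTokenType { IDENT, NULLTOKEN, TRUETOKEN, FALSETOKEN, BREAK, CASE, VAR, LET };

// Returns the keyword token for `word`, or IDENT when `word` is not in the table.
SketchTokenType keywordOrIdentifier(const std::string& word)
{
    static const std::unordered_map<std::string, SketchTokenType> keywords = {
        {"null", NULLTOKEN},
        {"true", TRUETOKEN},
        {"false", FALSETOKEN},
        {"break", BREAK},
        {"case", CASE},
        {"var", VAR},
        {"let", LET},
    };
    auto it = keywords.find(word);
    return it == keywords.end() ? IDENT : it->second;
}
```

The lookup requires an exact match, so breaks or Var come back as IDENT even though they resemble keywords.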

The TokenType type

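The source also contains an enum named TokenType. It is defined specifically for JSON parsing and should not be confused with JSTokenType above (it does not really belong in this section, but it is mentioned here to avoid confusion when reading the source).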

// JavaScriptCore/runtime/LiteralParser.h
enum TokenType {
    TokLBracket,        // [
    TokRBracket,        // ]
    TokLBrace,          // {
    TokRBrace,          // }
    TokString,          // string
    TokIdentifier,      // identifier
    TokNumber,          // number
    TokColon,           // :
    TokLParen,          // (
    TokRParen,          // )
    TokComma,           // ,
    TokTrue,            // true
    TokFalse,           // false
    TokNull,            // null
    TokEnd,             // end
    TokDot,             // .
    TokAssign,          // =
    TokSemi,            // ;
    TokError            // error
};

The JSToken type

This type matters most: running JavaScript source through lexical analysis yields a sequence of JSToken values. As you can see, it is a composite type:

struct JSToken {
    JSTokenType m_type;             // token type
    JSTokenData m_data;             // token data
    JSTokenLocation m_location;     // location
    JSTextPosition m_startPosition; // start position
    JSTextPosition m_endPosition;   // end position
};

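JSTokenData is a union with the four members shown below. Which member is meaningful depends on the token, for example:

  • 1. A numeric literal stores its value in doubleValue;
  • 2. A string literal, a function name, or a variable name stores an Identifier pointer in ident (parseString, shown later, assigns tokenData->ident);
  • 3. Tokens such as (, {, } and => record their position in the first struct (line, offset, lineStartOffset), as the lex method shows later;
  • 4. The last struct (cooked, raw, isTail) appears to serve template literals, which have both a cooked and a raw form.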
union JSTokenData {
    struct {
        uint32_t line;
        uint32_t offset;
        uint32_t lineStartOffset;
    };
    double doubleValue;
    const Identifier* ident;
    struct {
        const Identifier* cooked;
        const Identifier* raw;
        bool isTail;
    };
};
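Because a union stores only one member at a time, the token type acts as the tag that says which member may be read. The sketch below models that convention with simplified, hypothetical types (SketchType, SketchData, SketchToken); std::string stands in for Identifier, and the position struct is named because anonymous structs are not standard C++.

```cpp
#include <cassert>
#include <string>

// Hypothetical token kinds; the real JSTokenType has many more.
enum class SketchType { Number, Identifier, OpenParen };

// Simplified stand-in for JSTokenData.
union SketchData {
    struct {
        unsigned line;
        unsigned offset;
        unsigned lineStartOffset;
    } position;
    double doubleValue;
    const std::string* ident;
};

struct SketchToken {
    SketchType type;
    SketchData data;
};

SketchToken makeNumber(double value)
{
    SketchToken token;
    token.type = SketchType::Number;
    token.data.doubleValue = value;
    return token;
}

SketchToken makeIdentifierToken(const std::string* name)
{
    SketchToken token;
    token.type = SketchType::Identifier;
    token.data.ident = name;
    return token;
}

SketchToken makeOpenParen(unsigned line, unsigned offset, unsigned lineStartOffset)
{
    SketchToken token;
    token.type = SketchType::OpenParen;
    token.data.position.line = line;
    token.data.position.offset = offset;
    token.data.position.lineStartOffset = lineStartOffset;
    return token;
}

// Each accessor checks the tag before touching the matching union member.
double numberValue(const SketchToken& token)
{
    assert(token.type == SketchType::Number);
    return token.data.doubleValue;
}

unsigned openParenLine(const SketchToken& token)
{
    assert(token.type == SketchType::OpenParen);
    return token.data.position.line;
}
```

The union keeps every token the same small size regardless of its kind, at the cost of the reader having to respect the tag.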

JSTokenLocation holds the line number, the offset of the line start, and the token's start and end offsets in the source

struct JSTokenLocation {
    int line;
    unsigned lineStartOffset;
    unsigned startOffset;
    unsigned endOffset;
};

JSTextPosition describes a position: the line number, the offset in the source, and the offset of the line start

struct JSTextPosition {
    int line;
    int offset;
    int lineStartOffset;
};
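Storing the line-start offset next to the absolute offset makes the column a simple subtraction. Here is a small sketch of how the three fields relate; SketchTextPosition, positionAt, and column are hypothetical helpers, only '\n' is treated as a line break, and numbering lines from 1 is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mirror of the three JSTextPosition fields.
struct SketchTextPosition {
    int line;
    int offset;
    int lineStartOffset;
};

// Computes the position of `offset` by scanning `source` from the beginning.
SketchTextPosition positionAt(const std::string& source, int offset)
{
    assert(offset >= 0 && static_cast<std::size_t>(offset) <= source.size());
    SketchTextPosition position{1, offset, 0};
    for (int i = 0; i < offset; ++i) {
        if (source[i] == '\n') {
            ++position.line;
            position.lineStartOffset = i + 1; // the next line starts after '\n'
        }
    }
    return position;
}

// Zero-based column of the position within its line.
int column(const SketchTextPosition& position)
{
    assert(position.offset >= position.lineStartOffset);
    return position.offset - position.lineStartOffset;
}
```

For the two-line source "var a;\nlet b;", the character b sits at offset 11, on line 2, whose line start is offset 7, so its column is 4.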

The Lexing Process

Lexer initialization

In the project, the Lexer class serves the Parser class. Parser has a member variable m_lexer, which is initialized in its constructor:

// Declaration
std::unique_ptr<LexerType> m_lexer;

// Initialization in the constructor
m_lexer = std::make_unique<LexerType>(vm, builtinMode);
m_lexer->setCode(source, &m_parserArena);
m_token.m_location.line = source.firstLine();
m_token.m_location.startOffset = source.startOffset();
m_token.m_location.endOffset = source.startOffset();
m_token.m_location.lineStartOffset = source.startOffset();

Identifier parsing

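The lexer is then invoked from Parser's nextExpectIdentifier method. As the name suggests, it advances to the next token on the expectation that it is an identifier: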

ALWAYS_INLINE void nextExpectIdentifier(unsigned lexerFlags = 0)
{
    // Save the end position of the previous token
    int lastLine = m_token.m_location.line;
    int lastTokenEnd = m_token.m_location.endOffset;
    int lastTokenLineStart = m_token.m_location.lineStartOffset;
    m_lastTokenEndPosition = JSTextPosition(lastLine, lastTokenEnd, lastTokenLineStart);

    // Set the line number
    m_lexer->setLastLineNumber(lastLine);

    // Determine the JSTokenType
    m_token.m_type = m_lexer->lexExpectIdentifier(&m_token, lexerFlags, strictMode());
}

The logic lives in lexExpectIdentifier (some ASSERTs have been removed to keep the code clear)

template <typename T>
ALWAYS_INLINE JSTokenType Lexer<T>::lexExpectIdentifier(JSToken* tokenRecord, unsigned lexerFlags, bool strictMode)
{
    JSTokenData* tokenData = &tokenRecord->m_data;
    JSTokenLocation* tokenLocation = &tokenRecord->m_location;

    const T* start = m_code;
    const T* ptr = start;
    const T* end = m_codeEnd;
    JSTextPosition startPosition = currentPosition();
    if (ptr >= end) {
        goto slowCase;
    }

    // The first character must be an ASCII letter
    if (!WTF::isASCIIAlpha(*ptr))
        goto slowCase;

    // Advance until the character is no longer an ASCII letter or digit
    ++ptr;
    while (ptr < end) {
        if (!WTF::isASCIIAlphanumeric(*ptr))
            break;
        ++ptr;
    }

    // Here's the shift
    if (ptr < end) {
        if ((!WTF::isASCII(*ptr)) || (*ptr == '\\') || (*ptr == '_') || (*ptr == '$'))
            goto slowCase;
        m_current = *ptr;
    } else
        m_current = 0;

    m_code = ptr;

    // Build the identifier; the JSTokenType is IDENT
    if (lexerFlags & LexexFlagsDontBuildKeywords)
        tokenData->ident = 0;
    else
        tokenData->ident = makeLCharIdentifier(start, ptr - start);

    tokenLocation->line = m_lineNumber;
    tokenLocation->lineStartOffset = currentLineStartOffset();
    tokenLocation->startOffset = offsetFromSourcePtr(start);
    tokenLocation->endOffset = currentOffset();
    tokenRecord->m_startPosition = startPosition;
    tokenRecord->m_endPosition = currentPosition();
    m_lastToken = IDENT;
    return IDENT;

slowCase:
    // Fall back to the core logic in lex
    return lex(tokenRecord, lexerFlags, strictMode);
}

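This method exists to give identifiers a fast path. Source code contains a great many identifiers, while keywords, constants, operators, and delimiters are comparatively fewer, so any given JSToken is fairly likely to be an identifier. Trying the cheap identifier scan first therefore saves time, and only input it cannot handle falls back to lex. The method name says as much: the next token is expected to be an identifier.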
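The fast-path decision can be isolated into a small function. The sketch below mirrors the checks in lexExpectIdentifier under simplifying assumptions: it works on an 8-bit std::string, and fastIdentifierLength, isAsciiAlpha, and isAsciiAlphanumeric are hypothetical helpers. A return value of 0 means "give up and use the general lexer".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

static bool isAsciiAlpha(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool isAsciiAlphanumeric(unsigned char c)
{
    return isAsciiAlpha(c) || (c >= '0' && c <= '9');
}

// Returns the length of the identifier at the start of `text` when the fast
// path applies, or 0 when the caller must fall back to the slow path.
std::size_t fastIdentifierLength(const std::string& text)
{
    if (text.empty() || !isAsciiAlpha(static_cast<unsigned char>(text[0])))
        return 0;

    std::size_t i = 1;
    while (i < text.size() && isAsciiAlphanumeric(static_cast<unsigned char>(text[i])))
        ++i;

    if (i < text.size()) {
        unsigned char next = static_cast<unsigned char>(text[i]);
        // These characters could still belong to the identifier (non-ASCII
        // letters, a backslash escape, '_' or '$'), so the cheap scan cannot
        // decide where the identifier ends.
        if (next >= 0x80 || next == '\\' || next == '_' || next == '$')
            return 0;
    }
    return i;
}
```

For foo.bar the function reports 3, while foo_bar and $x both return 0 and would be handed to the slow path. The fast path only ever accepts the simplest identifier shape, which keeps its checks cheap.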

Core logic

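Apart from this identifier fast path, the core of the lexer lives in the lex method. The details are in the code and comments below (the listing is long):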

template <typename T>
JSTokenType Lexer<T>::lex(JSToken* tokenRecord, unsigned lexerFlags, bool strictMode)
{
    JSTokenData* tokenData = &tokenRecord->m_data;
    JSTokenLocation* tokenLocation = &tokenRecord->m_location;
    m_lastTokenLocation = JSTokenLocation(tokenRecord->m_location);

    JSTokenType token = ERRORTOK;
    m_terminator = false;

start:
    // Skip whitespace
    skipWhitespace();

    // At the end of the input, return EOFTOK
    if (atEnd()) return EOFTOK;

    tokenLocation->startOffset = currentOffset();
    ASSERT(currentOffset() >= currentLineStartOffset());
    tokenRecord->m_startPosition = currentPosition();

    // Classify the first character of the token
    CharacterType type;
    if (LIKELY(isLatin1(m_current)))
        // typesOfLatin1Characters maps each Latin-1 character to a CharacterType
        type = static_cast<CharacterType>(typesOfLatin1Characters[m_current]);
    else if (isNonLatin1IdentStart(m_current))
        type = CharacterIdentifierStart;
    else if (isLineTerminator(m_current))
        type = CharacterLineTerminator;
    else
        type = CharacterInvalid;

    switch (type) {
    case CharacterGreater:
        shift();
        if (m_current == '>') 
        {
            shift();
            if (m_current == '>') {
                shift();
                if (m_current == '=') {
                    shift();
                    // >>>= operator
                    token = URSHIFTEQUAL;
                    break;
                }

                // >>> unsigned (zero-fill) right shift
                token = URSHIFT;
                break;
            }
            if (m_current == '=') {
                shift();
                // >>= operator
                token = RSHIFTEQUAL;
                break;
            }

            // >> sign-propagating right shift
            token = RSHIFT;
            break;
        }

        // >= greater than or equal
        if (m_current == '=') {
            shift();
            token = GE;
            break;
        }

        // > greater than
        token = GT;
        break;


    case CharacterEqual: 
    {
        if (peek(1) == '>') {
            // Arrow function: x => x * x
            token = ARROWFUNCTION;
            tokenData->line = lineNumber();
            tokenData->offset = currentOffset();
            tokenData->lineStartOffset = currentLineStartOffset();
            ASSERT(tokenData->offset >= tokenData->lineStartOffset);
            shift();
            shift();
            break;
        }

        shift();
        if (m_current == '=') {
            shift();
            if (m_current == '=') {
                shift();
                // === strict equality (value and type)
                token = STREQ;
                break;
            }

            // == equality comparison
            token = EQEQ;
            break;
        }

        // = assignment
        token = EQUAL;
        break;
    }

    case CharacterLess:
        shift();
        if (m_current == '!' && peek(1) == '-' && peek(2) == '-') {
            // <!-- HTML-style single-line comment
            goto inSingleLineComment;
        }
        if (m_current == '<') {
            shift();
            if (m_current == '=') {
                shift();
                // <<= 
                token = LSHIFTEQUAL;
                break;
            }
            // << left shift
            token = LSHIFT;
            break;
        }
        if (m_current == '=') {
            shift();
            // <= less than or equal
            token = LE;
            break;
        }

        // < less than
        token = LT;
        break;
    case CharacterExclamationMark:
        shift();
        if (m_current == '=') {
            shift();
            if (m_current == '=') {
                shift();
                // !== strict inequality
                token = STRNEQ;
                break;
            }
            // != inequality
            token = NE;
            break;
        }
        // ! logical NOT
        token = EXCLAMATION;
        break;
    case CharacterAdd:
        shift();
        if (m_current == '+') {
            shift();
            token = (!m_terminator) ? PLUSPLUS : AUTOPLUSPLUS;
            break;
        }
        if (m_current == '=') {
            shift();
            // += 
            token = PLUSEQUAL;
            break;
        }
        // + addition
        token = PLUS;
        break;

    case CharacterSub:
        shift();
        if (m_current == '-') {
            shift();
            if (m_atLineStart && m_current == '>') {
                shift();
                // --> at the start of a line: single-line comment
                goto inSingleLineComment;
            }
            token = (!m_terminator) ? MINUSMINUS : AUTOMINUSMINUS;
            break;
        }
        if (m_current == '=') {
            shift();
            // -= operator
            token = MINUSEQUAL;
            break;
        }
        token = MINUS;
        break;
    case CharacterMultiply:
        shift();
        if (m_current == '=') {
            shift();
            // *=
            token = MULTEQUAL;
            break;
        }
        // *
        token = TIMES;
        break;
    case CharacterSlash:
        shift();
        if (m_current == '/') {
            shift();
            // Single-line comment
            goto inSingleLineCommentCheckForDirectives;
        }
        if (m_current == '*') {
            // Multi-line comment
            shift();
            if (parseMultilineComment())
                goto start;
            m_lexErrorMessage = ASCIILiteral("Multiline comment was not closed properly");
            token = UNTERMINATED_MULTILINE_COMMENT_ERRORTOK;
            goto returnError;
        }
        if (m_current == '=') {
            shift();
            // /= 
            token = DIVEQUAL;
            break;
        }
        // / division
        token = DIVIDE;
        break;
    case CharacterAnd:
        shift();
        if (m_current == '&') {
            shift();
            // && logical AND
            token = AND;
            break;
        }
        if (m_current == '=') {
            shift();
            // &= bitwise AND assignment
            token = ANDEQUAL;
            break;
        }
        // & bitwise AND
        token = BITAND;
        break;
    case CharacterXor:
        shift();
        if (m_current == '=') {
            shift();
            token = XOREQUAL;
            break;
        }
        token = BITXOR;
        break;
    case CharacterModulo:
        shift();
        if (m_current == '=') {
            shift();
            token = MODEQUAL;
            break;
        }
        token = MOD;
        break;
    case CharacterOr:
        shift();
        if (m_current == '=') {
            shift();
            // |= operator
            token = OREQUAL;
            break;
        }
        if (m_current == '|') {
            shift();
            // || logical OR
            token = OR;
            break;
        }
        // | bitwise OR
        token = BITOR;
        break;

    case CharacterOpenParen:
        // Open parenthesis (
        token = OPENPAREN;
        tokenData->line = lineNumber();
        tokenData->offset = currentOffset();
        tokenData->lineStartOffset = currentLineStartOffset();
        shift();
        break;
    case CharacterCloseParen:
        // Close parenthesis )
        token = CLOSEPAREN;
        shift();
        break;
    case CharacterOpenBracket:
        // Open bracket [
        token = OPENBRACKET;
        shift();
        break;
    case CharacterCloseBracket:
        // Close bracket ]
        token = CLOSEBRACKET;
        shift();
        break;
    case CharacterComma:
        // Comma ,
        token = COMMA;
        shift();
        break;
    case CharacterColon:
        // Colon :
        token = COLON;
        shift();
        break;
    case CharacterQuestion:
        // Question mark ?
        token = QUESTION;
        shift();
        break;
    case CharacterTilde:
        // Tilde ~
        token = TILDE;
        shift();
        break;
    case CharacterSemicolon:
        // Semicolon ;
        shift();
        token = SEMICOLON;
        break;
    case CharacterOpenBrace:
        tokenData->line = lineNumber();
        tokenData->offset = currentOffset();
        tokenData->lineStartOffset = currentLineStartOffset();
        shift();
        // { open brace
        token = OPENBRACE;
        break;
    case CharacterCloseBrace:
        tokenData->line = lineNumber();
        tokenData->offset = currentOffset();
        tokenData->lineStartOffset = currentLineStartOffset();
        shift();
        // } close brace
        token = CLOSEBRACE;
        break;
    case CharacterDot:
        shift();
        if (!isASCIIDigit(m_current)) {
            if (UNLIKELY((m_current == '.') && (peek(1) == '.'))) {
                shift();
                shift();
                // ... operator
                token = DOTDOTDOT;
                break;
            }
            // . operator
            token = DOT;
            break;
        }
        goto inNumberAfterDecimalPoint;
    case CharacterZero:
        shift();
        // Hexadecimal literal
        if ((m_current | 0x20) == 'x') {
            if (!isASCIIHexDigit(peek(1))) {
                m_lexErrorMessage = ASCIILiteral("No hexadecimal digits after '0x'");
                token = UNTERMINATED_HEX_NUMBER_ERRORTOK;
                goto returnError;
            }

            // Shift out the 'x' prefix.
            shift();

            parseHex(tokenData->doubleValue);
            if (isIdentStart(m_current)) {
                m_lexErrorMessage = ASCIILiteral("No space between hexadecimal literal and identifier");
                token = UNTERMINATED_HEX_NUMBER_ERRORTOK;
                goto returnError;
            }
            token = tokenTypeForIntegerLikeToken(tokenData->doubleValue);
            m_buffer8.shrink(0);
            break;
        }
        // Binary literal
        if ((m_current | 0x20) == 'b') {
            if (!isASCIIBinaryDigit(peek(1))) {
                m_lexErrorMessage = ASCIILiteral("No binary digits after '0b'");
                token = UNTERMINATED_BINARY_NUMBER_ERRORTOK;
                goto returnError;
            }

            // Shift out the 'b' prefix.
            shift();

            parseBinary(tokenData->doubleValue);
            if (isIdentStart(m_current)) {
                m_lexErrorMessage = ASCIILiteral("No space between binary literal and identifier");
                token = UNTERMINATED_BINARY_NUMBER_ERRORTOK;
                goto returnError;
            }
            token = tokenTypeForIntegerLikeToken(tokenData->doubleValue);
            m_buffer8.shrink(0);
            break;
        }
        // Octal literal
        if ((m_current | 0x20) == 'o') {
            if (!isASCIIOctalDigit(peek(1))) {
                m_lexErrorMessage = ASCIILiteral("No octal digits after '0o'");
                token = UNTERMINATED_OCTAL_NUMBER_ERRORTOK;
                goto returnError;
            }

            // Shift out the 'o' prefix.
            shift();

            parseOctal(tokenData->doubleValue);
            if (isIdentStart(m_current)) {
                m_lexErrorMessage = ASCIILiteral("No space between octal literal and identifier");
                token = UNTERMINATED_OCTAL_NUMBER_ERRORTOK;
                goto returnError;
            }
            token = tokenTypeForIntegerLikeToken(tokenData->doubleValue);
            m_buffer8.shrink(0);
            break;
        }

        record8('0');
        if (strictMode && isASCIIDigit(m_current)) {
            m_lexErrorMessage = ASCIILiteral("Decimal integer literals with a leading zero are forbidden in strict mode");
            token = UNTERMINATED_OCTAL_NUMBER_ERRORTOK;
            goto returnError;
        }
        if (isASCIIOctalDigit(m_current)) {
            if (parseOctal(tokenData->doubleValue)) {
                token = tokenTypeForIntegerLikeToken(tokenData->doubleValue);
            }
        }
        FALLTHROUGH;
    case CharacterNumber:
        if (LIKELY(token != INTEGER && token != DOUBLE)) 
        {
            if (!parseDecimal(tokenData->doubleValue)) {
                token = INTEGER;

                // Parse the fractional part
                if (m_current == '.') {
                    shift();
inNumberAfterDecimalPoint:
                    parseNumberAfterDecimalPoint();
                    // Definitely a fractional number: type DOUBLE
                    token = DOUBLE;
                }

                // Parse the exponent part
                if ((m_current | 0x20) == 'e') {
                    if (!parseNumberAfterExponentIndicator()) {
                        m_lexErrorMessage = ASCIILiteral("Non-number found after exponent indicator");
                        token = atEnd() ? UNTERMINATED_NUMERIC_LITERAL_ERRORTOK : INVALID_NUMERIC_LITERAL_ERRORTOK;
                        goto returnError;
                    }
                }
                size_t parsedLength;
                tokenData->doubleValue = parseDouble(m_buffer8.data(), m_buffer8.size(), parsedLength);
                if (token == INTEGER)
                    token = tokenTypeForIntegerLikeToken(tokenData->doubleValue);
            } else
                token = tokenTypeForIntegerLikeToken(tokenData->doubleValue);
        }

        if (UNLIKELY(isIdentStart(m_current))) {
            m_lexErrorMessage = ASCIILiteral("No identifiers allowed directly after numeric literal");
            token = atEnd() ? UNTERMINATED_NUMERIC_LITERAL_ERRORTOK : INVALID_NUMERIC_LITERAL_ERRORTOK;
            goto returnError;
        }
        m_buffer8.shrink(0);
        break;

    case CharacterQuote: {
        // Quote character: parse a string literal
        StringParseResult result = StringCannotBeParsed;
        if (lexerFlags & LexerFlagsDontBuildStrings)
            result = parseString<false>(tokenData, strictMode);
        else
            result = parseString<true>(tokenData, strictMode);

        if (UNLIKELY(result != StringParsedSuccessfully)) {
            token = result == StringUnterminated ? UNTERMINATED_STRING_LITERAL_ERRORTOK : INVALID_STRING_LITERAL_ERRORTOK;
            goto returnError;
        }
        shift();
        token = STRING;
        break;
        }
    case CharacterBackQuote: {
        // Skip backquote.
        shift();

        // Parse the template literal
        StringParseResult result = StringCannotBeParsed;
        if (lexerFlags & LexerFlagsDontBuildStrings)
            result = parseTemplateLiteral<false>(tokenData, RawStringsBuildMode::BuildRawStrings);
        else
            result = parseTemplateLiteral<true>(tokenData, RawStringsBuildMode::BuildRawStrings);

        if (UNLIKELY(result != StringParsedSuccessfully)) {
            token = result == StringUnterminated ? UNTERMINATED_TEMPLATE_LITERAL_ERRORTOK : INVALID_TEMPLATE_LITERAL_ERRORTOK;
            goto returnError;
        }
        token = TEMPLATE;
        break;
        }
    case CharacterIdentifierStart:
        ASSERT(isIdentStart(m_current));
        FALLTHROUGH;
    case CharacterBackSlash:
        // Backslash: begins a Unicode escape inside an identifier

        parseIdent:
        if (lexerFlags & LexexFlagsDontBuildKeywords)
            token = parseIdentifier<false>(tokenData, lexerFlags, strictMode);
        else
            token = parseIdentifier<true>(tokenData, lexerFlags, strictMode);
        break;
    case CharacterLineTerminator:
        // LF (10), CR (13), or the non-Latin-1 line terminators U+2028 and U+2029
        ASSERT(isLineTerminator(m_current));
        shiftLineTerminator();
        m_atLineStart = true;
        m_terminator = true;
        m_lineStart = m_code;
        goto start;
    case CharacterPrivateIdentifierStart:
        if (m_parsingBuiltinFunction)
            goto parseIdent;
        FALLTHROUGH;
    case CharacterInvalid:
        m_lexErrorMessage = invalidCharacterMessage();
        token = ERRORTOK;
        goto returnError;
    default:
        RELEASE_ASSERT_NOT_REACHED();
        m_lexErrorMessage = ASCIILiteral("Internal Error");
        token = ERRORTOK;
        goto returnError;
    }

    m_atLineStart = false;
    goto returnToken;

inSingleLineCommentCheckForDirectives:
    // Script comment directives like "//# sourceURL=test.js".
    if (UNLIKELY((m_current == '#' || m_current == '@') && isWhiteSpace(peek(1)))) {
        shift();
        shift();
        parseCommentDirective();
    }
    // Fall through to complete single line comment parsing.

inSingleLineComment:
    while (!isLineTerminator(m_current)) {
        if (atEnd())
            return EOFTOK;
        shift();
    }
    shiftLineTerminator();
    m_atLineStart = true;
    m_terminator = true;
    m_lineStart = m_code;
    if (!lastTokenWasRestrKeyword())
        goto start;

    token = SEMICOLON;
    // Fall through into returnToken.

returnToken:
    // Return the successfully lexed token
    tokenLocation->line = m_lineNumber;
    tokenLocation->endOffset = currentOffset();
    tokenLocation->lineStartOffset = currentLineStartOffset();
    ASSERT(tokenLocation->endOffset >= tokenLocation->lineStartOffset);
    tokenRecord->m_endPosition = currentPosition();
    m_lastToken = token;
    return token;

returnError:
    // Error result
    m_error = true;
    tokenLocation->line = m_lineNumber;
    tokenLocation->endOffset = currentOffset();
    tokenLocation->lineStartOffset = currentLineStartOffset();
    ASSERT(tokenLocation->endOffset >= tokenLocation->lineStartOffset);
    tokenRecord->m_endPosition = currentPosition();
    RELEASE_ASSERT(token & ErrorTokenFlag);
    return token;
}

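The listing is long, but this is the most important function in this article, so nothing has been omitted or shortened.
It becomes clearer when viewed from another angle, so it is worth reading alongside the code.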
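One way to see the pattern behind the operator cases is that each of them performs a longest-match ("maximal munch") scan: keep consuming characters as long as a longer operator is still possible. The sketch below isolates the scan for operators that start with '>'. It is a standalone illustration, not the engine's code: scanGreaterOperator is a hypothetical function that returns the operator text rather than a token type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the longest operator beginning with '>' at `pos`,
// or an empty string when source[pos] is not '>'.
std::string scanGreaterOperator(const std::string& source, std::size_t pos)
{
    // Reads a character, treating anything past the end as '\0'.
    auto at = [&](std::size_t i) { return i < source.size() ? source[i] : '\0'; };

    if (at(pos) != '>')
        return "";

    std::size_t i = pos + 1;
    if (at(i) == '>') {
        ++i;
        if (at(i) == '>') {
            ++i;
            if (at(i) == '=')
                ++i;        // >>>=
            // otherwise >>>
        } else if (at(i) == '=')
            ++i;            // >>=
        // otherwise >>
    } else if (at(i) == '=')
        ++i;                // >=
    // otherwise >
    return source.substr(pos, i - pos);
}
```

The nesting order matters: the longer candidates are tested before the shorter ones, so the text a >>>= b is never split into >> followed by >=.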

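Once the switch statement above has run, a JSToken has been produced. Most of the cases are simple and self-explanatory, but a few important methods deserve a closer look.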

String parsing

template <typename T>
template <bool shouldBuildStrings> ALWAYS_INLINE typename Lexer<T>::StringParseResult Lexer<T>::parseString(JSTokenData* tokenData, bool strictMode)
{
    int startingOffset = currentOffset();
    int startingLineStartOffset = currentLineStartOffset();
    int startingLineNumber = lineNumber();
    T stringQuoteCharacter = m_current;
    shift();

    const T* stringStart = currentSourcePtr();

    // Loop until the matching closing quote is reached
    while (m_current != stringQuoteCharacter) 
    {
        if (UNLIKELY(m_current == '\\')) {
            // Escape sequence
            if (stringStart != currentSourcePtr() && shouldBuildStrings)
                append8(stringStart, currentSourcePtr() - stringStart);
            shift();

            // singleEscape returns the character a simple escape denotes, or 0 if it is not one
            LChar escape = singleEscape(m_current);

            // Most common escape sequences first.
            if (escape) {
                // Simple single-character escape: record the character it denotes and advance
                if (shouldBuildStrings)
                    record8(escape);
                shift();
            } else if (UNLIKELY(isLineTerminator(m_current)))
                // Line continuation: skip the line terminator after the backslash
                shiftLineTerminator();
            else if (m_current == 'x') 
            {
                // Hexadecimal escape inside a string:
                // \x must be followed by two hex digits, range \x00 to \xff
                shift();
                if (!isASCIIHexDigit(m_current) || !isASCIIHexDigit(peek(1))) {
                    // Not a valid two-digit hexadecimal escape
                    m_lexErrorMessage = ASCIILiteral("\\x can only be followed by a hex character sequence");
                    return (atEnd() || (isASCIIHexDigit(m_current) && (m_code + 1 == m_codeEnd))) ? StringUnterminated : StringCannotBeParsed;
                }

                // Valid hexadecimal escape: \x00 ~ \xff
                T prev = m_current;
                shift();
                if (shouldBuildStrings)
                    record8(convertHex(prev, m_current));
                shift();
            }
            else
            {
                // Everything else takes the slow path: parseStringSlowCase
                setOffset(startingOffset, startingLineStartOffset);
                setLineNumber(startingLineNumber);
                m_buffer8.shrink(0);
                return parseStringSlowCase<shouldBuildStrings>(tokenData, strictMode);
            }
            stringStart = currentSourcePtr();
            continue;
        }

        // Take the slow string-parsing path
        if (UNLIKELY(characterRequiresParseStringSlowCase(m_current))) {
            setOffset(startingOffset, startingLineStartOffset);
            setLineNumber(startingLineNumber);
            m_buffer8.shrink(0);
            return parseStringSlowCase<shouldBuildStrings>(tokenData, strictMode);
        }

        shift();
    }

    // Append the remaining run of characters (starting at stringStart) to m_buffer8
    if (currentSourcePtr() != stringStart && shouldBuildStrings)
        append8(stringStart, currentSourcePtr() - stringStart);

    if (shouldBuildStrings) {
        // Create the identifier
        tokenData->ident = makeIdentifier(m_buffer8.data(), m_buffer8.size());
        m_buffer8.shrink(0);
    } else
        tokenData->ident = 0;

    return StringParsedSuccessfully;
}

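parseStringSlowCase follows almost the same flow as parseString: both handle escape sequences and line terminators, and both finish by creating the identifier. The difference is the container: parseStringSlowCase accumulates into m_buffer16, whereas parseString uses m_buffer8.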

typedef unsigned char LChar;
typedef uint16_t UChar;

Vector<LChar> m_buffer8;
Vector<UChar> m_buffer16;
template <typename T>
template <bool shouldBuildStrings> auto Lexer<T>::parseStringSlowCase(JSTokenData* tokenData, bool strictMode) -> StringParseResult
{
    T stringQuoteCharacter = m_current;
    shift();

    const T* stringStart = currentSourcePtr();

    while (m_current != stringQuoteCharacter) {
        if (UNLIKELY(m_current == '\\')) {
            // Special handling for escape sequences
            if (stringStart != currentSourcePtr() && shouldBuildStrings)
                append16(stringStart, currentSourcePtr() - stringStart);
            shift();

            LChar escape = singleEscape(m_current);

            // Most common escape sequences first
            if (escape) {
                if (shouldBuildStrings)
                    record16(escape);
                shift();
            } else if (UNLIKELY(isLineTerminator(m_current)))
                shiftLineTerminator();
            else {
                StringParseResult result = parseComplexEscape<shouldBuildStrings>(EscapeParseMode::String, strictMode, stringQuoteCharacter);
                if (result != StringParsedSuccessfully)
                    return result;
            }

            stringStart = currentSourcePtr();
            continue;
        }

        // Fast check for characters that require special handling.
        // Catches 0, \n, \r, 0x2028, and 0x2029 as efficiently
        // as possible, and lets through all common ASCII characters.
        if (UNLIKELY(((static_cast<unsigned>(m_current) - 0xE) & 0x2000))) {
            // New-line or end of input is not allowed
            if (atEnd() || isLineTerminator(m_current)) {
                m_lexErrorMessage = ASCIILiteral("Unexpected EOF");
                return atEnd() ? StringUnterminated : StringCannotBeParsed;
            }
            // Anything else is just a normal character
        }
        shift();
    }

    // Normal completion reaches this point
    if (currentSourcePtr() != stringStart && shouldBuildStrings)
        append16(stringStart, currentSourcePtr() - stringStart);
    if (shouldBuildStrings)
        tokenData->ident = makeIdentifier(m_buffer16.data(), m_buffer16.size());
    else
        tokenData->ident = 0;

    m_buffer16.shrink(0);
    return StringParsedSuccessfully;
}
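The expression ((static_cast<unsigned>(m_current) - 0xE) & 0x2000) in the loop above is a deliberately coarse filter: one subtraction and one mask flag every character the comment lists, at the price of some false positives that the exact check inside the if then sorts out. The sketch below reproduces the expression so its behavior can be tested in isolation; mightNeedSpecialHandling and isEndOrLineTerminator are hypothetical helper names.

```cpp
#include <cassert>

// Coarse filter: the same arithmetic as in parseStringSlowCase.
// Values below 0xE wrap around to a huge unsigned number whose 0x2000 bit
// is set; values from 0x200E to 0x400D (which include 0x2028 and 0x2029)
// also have that bit set after the subtraction.
bool mightNeedSpecialHandling(char16_t c)
{
    return ((static_cast<unsigned>(c) - 0xE) & 0x2000) != 0;
}

// Exact check for the characters named in the original comment.
bool isEndOrLineTerminator(char16_t c)
{
    return c == 0 || c == '\n' || c == '\r' || c == 0x2028 || c == 0x2029;
}
```

Ordinary ASCII text such as letters and spaces fails the filter and skips the slower check entirely. A character such as U+3042 passes the filter without being a line terminator, which is why the original code follows up with atEnd() and isLineTerminator before reporting an error.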

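With the string's start, length, and contents determined, let's look at how the identifier is created. Note that, as its signature shows, the listing below is the LiteralParser (JSON parser) version of makeIdentifier; it is used here to illustrate the caching logic.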

template <typename CharType>
ALWAYS_INLINE const Identifier LiteralParser<CharType>::makeIdentifier(const LChar* characters, size_t length)
{
    // Zero length: return emptyIdentifier
    if (!length)
        return m_exec->vm().propertyNames->emptyIdentifier;

    // static const int MaximumCachableCharacter = 128;
    if (characters[0] >= MaximumCachableCharacter)
        return Identifier::fromString(&m_exec->vm(), characters, length);

    // Single-character identifiers are cached in m_shortIdentifiers
    if (length == 1) {
        if (!m_shortIdentifiers[characters[0]].isNull())
            return m_shortIdentifiers[characters[0]];
        m_shortIdentifiers[characters[0]] = Identifier::fromString(&m_exec->vm(), characters, length);
        return m_shortIdentifiers[characters[0]];
    }

    // characters[0] < 128 && length > 1: cache in m_recentIdentifiers, indexed by the first character

    if (!m_recentIdentifiers[characters[0]].isNull() && Identifier::equal(m_recentIdentifiers[characters[0]].impl(), characters, length))
        return m_recentIdentifiers[characters[0]];
    m_recentIdentifiers[characters[0]] = Identifier::fromString(&m_exec->vm(), characters, length);
    return m_recentIdentifiers[characters[0]];
}
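
To make the two caches concrete, here is a minimal sketch of the same lookup order. It is not JSC code: IdentifierCacheSketch is a hypothetical class, std::string stands in for Identifier, and the creations counter stands in for calls to Identifier::fromString.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical stand-in for the two caches used by makeIdentifier.
class IdentifierCacheSketch {
public:
    static constexpr unsigned MaximumCachableCharacter = 128;

    std::string make(const std::string& text)
    {
        if (text.empty())
            return std::string(); // plays the role of emptyIdentifier

        const unsigned first = static_cast<unsigned char>(text[0]);
        if (first >= MaximumCachableCharacter)
            return create(text); // non-ASCII first character: never cached

        if (text.size() == 1) {
            // One permanent slot per ASCII character.
            if (m_short[first].empty())
                m_short[first] = create(text);
            return m_short[first];
        }

        // One slot per first character, holding only the most recent name.
        if (m_recent[first] != text)
            m_recent[first] = create(text);
        return m_recent[first];
    }

    // Number of times the slow path ran (stand-in for fromString calls).
    int creations() const { return m_creations; }

private:
    std::string create(const std::string& text)
    {
        ++m_creations;
        return text;
    }

    std::array<std::string, 128> m_short;
    std::array<std::string, 128> m_recent;
    int m_creations = 0;
};
```

With this sketch, asking for "age" twice creates it once, while alternating between "age" and "apple" creates a new entry every time, because both names compete for the slot of the first character 'a'. The cache is cheap rather than complete: it targets the common case where the same name shows up several times in a row.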

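Whichever path is taken, everything goes through Identifier::fromString. This method reaches into the virtual machine: creating an identifier has to be registered with the VM so that it can be looked up or reused later. This is not the place to dig further, or we would never get out of the swamp; interested readers can explore it on their own.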

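Decimal parsing

First, look at the DFA diagram for decimal numbers.

The decimal string that was recognized is then evaluated, and the result is stored in returnValue.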

template <typename T>
ALWAYS_INLINE bool Lexer<T>::parseDecimal(double& returnValue)
{
    // Optimization: most decimal values fit into 4 bytes.
    uint32_t decimalValue = 0;

    // Since parseOctal may be executed before parseDecimal,
    // the m_buffer8 may hold ascii digits.
    if (!m_buffer8.size()) {
        const unsigned maximumDigits = 10;
        int digit = maximumDigits - 1;
        // Temporary buffer for the digits. Makes easier
        // to reconstruct the input characters when needed.
        LChar digits[maximumDigits];

        // Walk the digits one at a time: for each character, multiply the
        // running decimalValue by 10 and add (m_current - '0').
        do {
            decimalValue = decimalValue * 10 + (m_current - '0');
            digits[digit] = m_current;
            shift();
            --digit;
        } while (isASCIIDigit(m_current) && digit >= 0);

        if (digit >= 0 && m_current != '.' && (m_current | 0x20) != 'e') {
            // Neither a decimal point nor an exponent marker follows:
            // return decimalValue directly.
            returnValue = decimalValue;
            return true;
        }

        for (int i = maximumDigits - 1; i > digit; --i)
            record8(digits[i]);
    }

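    // Slow path: the number has too many digits for the fast path, or a '.'
    // or exponent follows. Record the remaining digits in m_buffer8 and
    // return false so that the caller finishes the parse from the buffer.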
    while (isASCIIDigit(m_current)) {
        record8(m_current);
        shift();
    }

    return false;
}
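
The control flow of parseDecimal can be tried out in isolation with the sketch below. It is a simplified model under stated assumptions: the input is a std::string instead of the lexer's cursor, DecimalScan and scanDecimal are hypothetical names, and the temporary digit array is filled front to back rather than back to front. The fast path is only taken when fewer than 10 digits were read, so at most 9 digits are accumulated there and the 32-bit value cannot overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct DecimalScan {
    bool complete;      // true: `value` is final; false: finish from `buffer`
    double value;
    std::string buffer; // plays the role of m_buffer8
    std::size_t end;    // index of the first unconsumed character
};

inline bool isDecimalDigit(char c) { return c >= '0' && c <= '9'; }

// Precondition: source[pos] is an ASCII digit.
DecimalScan scanDecimal(const std::string& source, std::size_t pos)
{
    auto current = [&]() -> char { return pos < source.size() ? source[pos] : '\0'; };

    DecimalScan result { false, 0.0, std::string(), pos };
    const int maximumDigits = 10;
    char digits[maximumDigits];
    int count = 0;
    std::uint32_t decimalValue = 0;

    do {
        decimalValue = decimalValue * 10 + static_cast<std::uint32_t>(current() - '0');
        digits[count++] = current();
        ++pos;
    } while (isDecimalDigit(current()) && count < maximumDigits);

    // Fast path: fewer than 10 digits and neither '.' nor 'e'/'E' follows.
    if (count < maximumDigits && current() != '.' && (current() | 0x20) != 'e') {
        result.complete = true;
        result.value = decimalValue;
        result.end = pos;
        return result;
    }

    // Slow path: hand the digits seen so far, plus any remaining ones,
    // to the buffer for a later full parse.
    result.buffer.assign(digits, digits + count);
    while (isDecimalDigit(current())) {
        result.buffer.push_back(current());
        ++pos;
    }
    result.end = pos;
    return result;
}
```

For "42;" the sketch reports a complete value of 42. For "3.14" it stops at the decimal point with "3" in the buffer, which is where the decimal-point handling takes over.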

Hexadecimal parsing

First, see how a hexadecimal number is recognized.

Then the value of the hexadecimal number is computed.

template <typename T>
ALWAYS_INLINE void Lexer<T>::parseHex(double& returnValue)
{
    // Optimization: most hexadecimal values fit into 4 bytes.
    uint32_t hexValue = 0;
    int maximumDigits = 7;

    // toASCIIHexValue : character < 'A' ? character - '0' : (character - 'A' + 10) & 0xF
    // Shift hexValue left by 4 bits, then add the numeric value of the current hex digit.
    do {
        hexValue = (hexValue << 4) + toASCIIHexValue(m_current);
        shift();
        --maximumDigits;
    } while (isASCIIHexDigit(m_current) && maximumDigits >= 0);

    if (maximumDigits >= 0) {
        returnValue = hexValue;
        return;
    }

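    // maximumDigits < 0 means eight digits have been consumed, so any further
    // digit would no longer fit into the 4-byte hexValue.
    // No more place in the hexValue buffer.
    // The values are shifted out and placed into the m_buffer8 vector.
    // The digits parsed so far are written back, most significant first.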
    for (int i = 0; i < 8; ++i) {
         int digit = hexValue >> 28;
         if (digit < 10)
             record8(digit + '0');
         else
             record8(digit - 10 + 'a');
         hexValue <<= 4;
    }

    // Then append the remaining hex digits to m_buffer8.
    while (isASCIIHexDigit(m_current)) {
        record8(m_current);
        shift();
    }

    // Let parseIntOverflow compute the value from the buffered digits (radix 16).
    returnValue = parseIntOverflow(m_buffer8.data(), m_buffer8.size(), 16);
}
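
The two bit-level steps in parseHex, accumulating a digit and shifting the digits back out, can be checked with a small sketch. The function names here are hypothetical; the formulas are the ones quoted in the listing.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Same formula as the toASCIIHexValue comment in the listing.
// The "& 0xF" folds lowercase letters onto the same values as uppercase ones.
constexpr int toHexValue(char c)
{
    return c < 'A' ? c - '0' : (c - 'A' + 10) & 0xF;
}

// Precondition: `digits` holds at most 8 valid hex digits.
std::uint32_t accumulateHex(const std::string& digits)
{
    std::uint32_t value = 0;
    for (char c : digits)
        value = (value << 4) + static_cast<std::uint32_t>(toHexValue(c));
    return value;
}

// Turns a 32-bit value back into its 8 hex digits, most significant first,
// the way parseHex refills m_buffer8 before handing over to parseIntOverflow.
std::string spillHexDigits(std::uint32_t value)
{
    std::string out;
    for (int i = 0; i < 8; ++i) {
        const unsigned digit = value >> 28; // top 4 bits
        out.push_back(static_cast<char>(digit < 10 ? digit + '0' : digit - 10 + 'a'));
        value <<= 4;                        // move the next digit to the top
    }
    return out;
}
```

Spilling always produces exactly 8 digits. That is safe in parseHex because the spill only happens after exactly 8 digits were consumed from the source.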

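Octal parsing

The same pattern applies, so there is little to add. One difference is worth noting: if a decimal digit 8 or 9 follows the octal digits, the function returns false, because the literal is not a valid octal number.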

template <typename T>
ALWAYS_INLINE bool Lexer<T>::parseOctal(double& returnValue)
{
    // Optimization: most octal values fit into 4 bytes.
    uint32_t octalValue = 0;
    const unsigned maximumDigits = 10;
    int digit = maximumDigits - 1;
    // Temporary buffer for the digits. Makes easier
    // to reconstruct the input characters when needed.
    LChar digits[maximumDigits];

    do {
        octalValue = octalValue * 8 + (m_current - '0');
        digits[digit] = m_current;
        shift();
        --digit;
    } while (isASCIIOctalDigit(m_current) && digit >= 0);

    if (!isASCIIDigit(m_current) && digit >= 0) {
        returnValue = octalValue;
        return true;
    }

    for (int i = maximumDigits - 1; i > digit; --i)
         record8(digits[i]);

    while (isASCIIOctalDigit(m_current)) {
        record8(m_current);
        shift();
    }

    if (isASCIIDigit(m_current))
        return false;

    returnValue = parseIntOverflow(m_buffer8.data(), m_buffer8.size(), 8);
    return true;
}

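Radix-N integer parsing on overflow

Next comes the general integer parsing routine. Binary, octal, decimal and hexadecimal integers can all be parsed by this function; radix is the base. It walks the digits from the least significant end and accumulates the result in a double, so it can handle values that no longer fit into 32 bits.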

double parseIntOverflow(const LChar* s, unsigned length, int radix)
{
    double number = 0.0;
    double radixMultiplier = 1.0;

    for (const LChar* p = s + length - 1; p >= s; p--) {
        if (radixMultiplier == std::numeric_limits<double>::infinity()) {
            if (*p != '0') {
                number = std::numeric_limits<double>::infinity();
                break;
            }
        } else {
            int digit = parseDigit(*p, radix);
            number += digit * radixMultiplier;
        }

        radixMultiplier *= radix;
    }

    return number;
}
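
Since parseDigit is not shown in the listing, here is a self-contained sketch of the same loop that can be compiled on its own. The helper digitValue is a hypothetical replacement for parseDigit, and std::string replaces the pointer and length pair.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical helper: numeric value of `c`, or -1 if it is not a digit in `radix`.
int digitValue(char c, int radix)
{
    int digit = -1;
    if (c >= '0' && c <= '9')
        digit = c - '0';
    else if (c >= 'a' && c <= 'z')
        digit = c - 'a' + 10;
    else if (c >= 'A' && c <= 'Z')
        digit = c - 'A' + 10;
    return digit < radix ? digit : -1;
}

// Precondition: every character of `digits` is a valid digit in `radix`.
double parseIntOverflowSketch(const std::string& digits, int radix)
{
    const double infinity = std::numeric_limits<double>::infinity();
    double number = 0.0;
    double radixMultiplier = 1.0;

    // Walk from the least significant digit to the most significant one.
    for (std::size_t i = digits.size(); i-- > 0;) {
        if (radixMultiplier == infinity) {
            // The place value itself overflowed: any non-zero digit here
            // makes the whole number infinite.
            if (digits[i] != '0') {
                number = infinity;
                break;
            }
        } else {
            number += digitValue(digits[i], radix) * radixMultiplier;
        }
        radixMultiplier *= radix;
    }
    return number;
}
```

For example, "ff" in radix 16 gives 255, and "4294967296" in radix 10 gives 4294967296, a value that is one past the largest uint32_t.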

Decimal point and exponent parsing

The parsing of a floating-point number can be expressed as a DFA transition diagram:

template <typename T>
ALWAYS_INLINE void Lexer<T>::parseNumberAfterDecimalPoint()
{
    // record8 appends the given character to m_buffer8 (a Vector<LChar>).
    record8('.');

    // After the decimal point, keep appending to m_buffer8 as long as the
    // current character is a digit.
    while (isASCIIDigit(m_current)) {
        record8(m_current);
        shift();
    }
}

template <typename T>
ALWAYS_INLINE bool Lexer<T>::parseNumberAfterExponentIndicator()
{
    record8('e');
    shift();

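    // Record the sign of the exponent, if present.
    if (m_current == '+' || m_current == '-') {
        record8(m_current);
        shift();
    }

    // At least one digit must follow; otherwise the literal is malformed.
    if (!isASCIIDigit(m_current))
        return false;

    // Append every following digit to m_buffer8.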
    do {
        record8(m_current);
        shift();
    } while (isASCIIDigit(m_current));

    return true;
}
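
The two functions above only collect characters; the numeric value is computed afterwards from the buffer. The sketch below strings the steps together for a whole literal. It rests on several assumptions: scanNumberLiteral is a hypothetical name, the input is a std::string that starts with a digit, and std::strtod is used as a stand-in for whatever routine the engine uses to convert the buffered text to a double.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

inline bool isAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// Precondition: `source` starts with an ASCII digit.
// Returns false for a malformed exponent such as "1e+".
bool scanNumberLiteral(const std::string& source, double& value)
{
    std::size_t pos = 0;
    std::string buffer; // plays the role of m_buffer8
    auto current = [&]() -> char { return pos < source.size() ? source[pos] : '\0'; };
    auto takeDigits = [&]() {
        while (isAsciiDigit(current()))
            buffer.push_back(source[pos++]);
    };

    takeDigits(); // integer part

    if (current() == '.') { // fraction, as in parseNumberAfterDecimalPoint
        ++pos;
        buffer.push_back('.');
        takeDigits();
    }

    if ((current() | 0x20) == 'e') { // exponent, as in parseNumberAfterExponentIndicator
        buffer.push_back('e');
        ++pos;
        if (current() == '+' || current() == '-')
            buffer.push_back(source[pos++]);
        if (!isAsciiDigit(current()))
            return false; // an exponent needs at least one digit
        takeDigits();
    }

    value = std::strtod(buffer.c_str(), nullptr);
    return true;
}
```

Collecting the characters first and converting once at the end keeps the scanning loop simple and leaves the rounding to a single conversion routine.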

Multi-line comment parsing

The DFA transition diagram for multi-line comments:

template <typename T>
ALWAYS_INLINE bool Lexer<T>::parseMultilineComment()
{
    while (true) {
        // Look for the first "*/" after the opening "/*"
        // (the '*' and the '/' must be adjacent).
        while (UNLIKELY(m_current == '*')) {
            shift();
            if (m_current == '/') {
                shift();
                return true;
            }
        }

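        // Reached the end of input without finding "*/": the comment is unterminated.
        if (atEnd())
            return false;

        if (isLineTerminator(m_current)) {
            // Line terminator: skip it and remember that one was seen.
            shiftLineTerminator();
            m_terminator = true;
        } else {
            // Any other character is simply skipped.
            shift();
        }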
    }
}
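
The same loop can be exercised on plain strings with the sketch below. CommentScan and skipMultilineComment are hypothetical names, and the sketch simplifies two things: only '\n' and '\r' count as line terminators, and a "\r\n" pair is not treated as a single unit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct CommentScan {
    bool closed;            // true if the closing "*/" was found
    bool sawLineTerminator; // plays the role of m_terminator
    std::size_t end;        // index of the first character after the comment
};

// `pos` is the index just after the opening "/*".
CommentScan skipMultilineComment(const std::string& source, std::size_t pos)
{
    bool sawLineTerminator = false;
    auto atEnd = [&]() { return pos >= source.size(); };
    auto current = [&]() -> char { return atEnd() ? '\0' : source[pos]; };

    while (true) {
        // A run of '*' may be followed by '/', which closes the comment.
        while (current() == '*') {
            ++pos;
            if (current() == '/') {
                ++pos;
                return { true, sawLineTerminator, pos };
            }
        }

        if (atEnd())
            return { false, sawLineTerminator, pos }; // unterminated comment

        if (current() == '\n' || current() == '\r')
            sawLineTerminator = true;
        ++pos;
    }
}
```

The inner while loop matters for inputs such as "**/": after the first '*' is skipped, the second one must still get a chance to pair with the '/'. Remembering whether a line terminator was seen is what lets a later stage apply rules that depend on line breaks, even though the comment text itself is discarded.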

Example (updated)

Consider the following piece of JavaScript, which contains a function, a variable, and a call to the function:

function test(age){
    if(age > 10){
        console.log(age);
    }
}

var age = 6 * 7;
test(age);

After lexical analysis with Esprima, we get the following result:

[
    {
        "type": "Keyword",
        "value": "function"
    },
    {
        "type": "Identifier",
        "value": "test"
    },
    {
        "type": "Punctuator",
        "value": "("
    },
    {
        "type": "Identifier",
        "value": "age"
    },
    {
        "type": "Punctuator",
        "value": ")"
    },
    {
        "type": "Punctuator",
        "value": "{"
    },
    {
        "type": "Keyword",
        "value": "if"
    },
    {
        "type": "Punctuator",
        "value": "("
    },
    {
        "type": "Identifier",
        "value": "age"
    },
    {
        "type": "Punctuator",
        "value": ">"
    },
    {
        "type": "Numeric",
        "value": "10"
    },
    {
        "type": "Punctuator",
        "value": ")"
    },
    {
        "type": "Punctuator",
        "value": "{"
    },
    {
        "type": "Identifier",
        "value": "console"
    },
    {
        "type": "Punctuator",
        "value": "."
    },
    {
        "type": "Identifier",
        "value": "log"
    },
    {
        "type": "Punctuator",
        "value": "("
    },
    {
        "type": "Identifier",
        "value": "age"
    },
    {
        "type": "Punctuator",
        "value": ")"
    },
    {
        "type": "Punctuator",
        "value": ";"
    },
    {
        "type": "Punctuator",
        "value": "}"
    },
    {
        "type": "Punctuator",
        "value": "}"
    },
    {
        "type": "Keyword",
        "value": "var"
    },
    {
        "type": "Identifier",
        "value": "age"
    },
    {
        "type": "Punctuator",
        "value": "="
    },
    {
        "type": "Numeric",
        "value": "6"
    },
    {
        "type": "Punctuator",
        "value": "*"
    },
    {
        "type": "Numeric",
        "value": "7"
    },
    {
        "type": "Punctuator",
        "value": ";"
    },
    {
        "type": "Identifier",
        "value": "test"
    },
    {
        "type": "Punctuator",
        "value": "("
    },
    {
        "type": "Identifier",
        "value": "age"
    },
    {
        "type": "Punctuator",
        "value": ")"
    },
    {
        "type": "Punctuator",
        "value": ";"
    }
]
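
To connect this output back to the hand-written approach, here is a toy tokenizer that produces the same four categories for the sample above. It is neither Esprima nor JavaScriptCore code: tokenizeSketch is a hypothetical function, the keyword list contains only the three keywords used in the sample, numbers are plain digit runs, and every other non-space character becomes a one-character punctuator.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    std::string type;
    std::string value;
};

inline bool isIdentifierPart(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == '$';
}

std::vector<Token> tokenizeSketch(const std::string& source)
{
    // Assumption: only the keywords that occur in the sample.
    static const char* const keywords[] = { "function", "if", "var" };

    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[pos]);
        if (std::isspace(c)) {
            ++pos; // whitespace separates tokens but produces none
            continue;
        }

        const std::size_t start = pos;
        if (std::isalpha(c) || c == '_' || c == '$') {
            while (pos < source.size() && isIdentifierPart(source[pos]))
                ++pos;
            const std::string word = source.substr(start, pos - start);
            bool isKeyword = false;
            for (const char* keyword : keywords)
                isKeyword = isKeyword || word == keyword;
            tokens.push_back({ isKeyword ? "Keyword" : "Identifier", word });
        } else if (std::isdigit(c)) {
            while (pos < source.size() && std::isdigit(static_cast<unsigned char>(source[pos])))
                ++pos;
            tokens.push_back({ "Numeric", source.substr(start, pos - start) });
        } else {
            ++pos;
            tokens.push_back({ "Punctuator", std::string(1, source[start]) });
        }
    }
    return tokens;
}
```

Run on the sample program, this sketch yields 34 tokens, the same number of entries as in the list above. A real lexer differs mainly in the cases this sketch leaves out: strings, comments, multi-character operators, and the full numeric grammar covered in the earlier sections.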

Summary

After going through this process, a complete token in the JSC world has been produced, and the whole world is quiet again…


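Title: JavaScriptCore Engine In-Depth Analysis 3: Lexical Analysis

Author: lingyun

Published: July 30, 2018 - 00:07

Last updated: November 7, 2018 - 00:11

Original link: https://tsuijunxi.github.io/2018/07/30/JavaScriptCore引擎深度解析-3-词法分析篇/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and the author's name when reposting.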