libaditof: tokenizer.cc Source File

Go to the documentation of this file.
 // Protocol Buffers - Google's data interchange format
 // Copyright 2008 Google Inc.  All rights reserved.
 // https://developers.google.com/protocol-buffers/
 //
 // Redistribution and use in source and binary forms, with or without
 // modification, are permitted provided that the following conditions are
 // met:
 //
 //     * Redistributions of source code must retain the above copyright
 // notice, this list of conditions and the following disclaimer.
 //     * Redistributions in binary form must reproduce the above
 // copyright notice, this list of conditions and the following disclaimer
 // in the documentation and/or other materials provided with the
 // distribution.
 //     * Neither the name of Google Inc. nor the names of its
 // contributors may be used to endorse or promote products derived from
 // this software without specific prior written permission.
 //
 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  
 // Author: kenton@google.com (Kenton Varda)
 //  Based on original Protocol Buffers design by
 //  Sanjay Ghemawat, Jeff Dean, and others.
 //
 // Here we have a hand-written lexer.  At first you might ask yourself,
 // "Hand-written text processing?  Is Kenton crazy?!"  Well, first of all,
 // yes I am crazy, but that's beside the point.  There are actually reasons
 // why I ended up writing this this way.
 //
 // The traditional approach to lexing is to use lex to generate a lexer for
 // you.  Unfortunately, lex's output is ridiculously ugly and difficult to
 // integrate cleanly with C++ code, especially abstract code or code meant
 // as a library.  Better parser-generators exist but would add dependencies
 // which most users won't already have, which we'd like to avoid.  (GNU flex
 // has a C++ output option, but it's still ridiculously ugly, non-abstract,
 // and not library-friendly.)
 //
 // The next approach that any good software engineer should look at is to
 // use regular expressions.  And, indeed, I did.  I have code which
 // implements this same class using regular expressions.  It's about 200
 // lines shorter.  However:
 // - Rather than error messages telling you "This string has an invalid
 //   escape sequence at line 5, column 45", you get error messages like
 //   "Parse error on line 5".  Giving more precise errors requires adding
 //   a lot of code that ends up basically as complex as the hand-coded
 //   version anyway.
 // - The regular expression to match a string literal looks like this:
 //     kString  = new RE("(\"([^\"\\\\]|"              // non-escaped
 //                       "\\\\[abfnrtv?\"'\\\\0-7]|"   // normal escape
 //                       "\\\\x[0-9a-fA-F])*\"|"       // hex escape
 //                       "\'([^\'\\\\]|"        // Also support single-quotes.
 //                       "\\\\[abfnrtv?\"'\\\\0-7]|"
 //                       "\\\\x[0-9a-fA-F])*\')");
 //   Verifying the correctness of this line noise is actually harder than
 //   verifying the correctness of ConsumeString(), defined below.  I'm not
 //   even confident that the above is correct, after staring at it for some
 //   time.
 // - PCRE is fast, but there's still more overhead involved than the code
 //   below.
 // - Sadly, regular expressions are not part of the C standard library, so
 //   using them would require depending on some other library.  For the
 //   open source release, this could be really annoying.  Nobody likes
 //   downloading one piece of software just to find that they need to
 //   download something else to make it work, and in all likelihood
 //   people downloading Protocol Buffers will already be doing so just
 //   to make something else work.  We could include a copy of PCRE with
 //   our code, but that obligates us to keep it up-to-date and just seems
 //   like a big waste just to save 200 lines of code.
 //
 // On a similar but unrelated note, I'm even scared to use ctype.h.
 // Apparently functions like isalpha() are locale-dependent.  So, if we used
 // that, then if this code is being called from some program that doesn't
 // have its locale set to "C", it would behave strangely.  We can't just set
 // the locale to "C" ourselves since we might break the calling program that
 // way, particularly if it is multi-threaded.  WTF?  Someone please let me
 // (Kenton) know if I'm missing something here...
 //
 // I'd love to hear about other alternatives, though, as this code isn't
 // exactly pretty.
  
 #include <google/protobuf/io/tokenizer.h>
 #include <google/protobuf/stubs/common.h>
 #include <google/protobuf/stubs/logging.h>
 #include <google/protobuf/stubs/stringprintf.h>
 #include <google/protobuf/io/strtod.h>
 #include <google/protobuf/io/zero_copy_stream.h>
 #include <google/protobuf/stubs/strutil.h>
 #include <google/protobuf/stubs/stl_util.h>
  
 namespace google {
 namespace protobuf {
 namespace io {
 namespace {
  
 // As mentioned above, I don't trust ctype.h due to the presence of "locales".
 // So, I have written replacement functions here.  Someone please smack me if
 // this is a bad idea or if there is some way around this.
 //
 // These "character classes" are designed to be used in template methods.
 // For instance, Tokenizer::ConsumeZeroOrMore<Whitespace>() will eat
 // whitespace.
  
 // Note:  No class is allowed to contain '\0', since this is used to mark end-
 //   of-input and is handled specially.
  
 #define CHARACTER_CLASS(NAME, EXPRESSION)                     \
   class NAME {                                                \
    public:                                                    \
     static inline bool InClass(char c) { return EXPRESSION; } \
   }
  
 CHARACTER_CLASS(Whitespace, c == ' ' || c == '\n' || c == '\t' || c == '\r' ||
                                 c == '\v' || c == '\f');
 CHARACTER_CLASS(WhitespaceNoNewline,
                 c == ' ' || c == '\t' || c == '\r' || c == '\v' || c == '\f');
  
 CHARACTER_CLASS(Unprintable, c<' ' && c> '\0');
  
 CHARACTER_CLASS(Digit, '0' <= c && c <= '9');
 CHARACTER_CLASS(OctalDigit, '0' <= c && c <= '7');
 CHARACTER_CLASS(HexDigit, ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') ||
                               ('A' <= c && c <= 'F'));
  
 CHARACTER_CLASS(Letter,
                 ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || (c == '_'));
  
 CHARACTER_CLASS(Alphanumeric, ('a' <= c && c <= 'z') ||
                                   ('A' <= c && c <= 'Z') ||
                                   ('0' <= c && c <= '9') || (c == '_'));
  
 CHARACTER_CLASS(Escape, c == 'a' || c == 'b' || c == 'f' || c == 'n' ||
                             c == 'r' || c == 't' || c == 'v' || c == '\\' ||
                             c == '?' || c == '\'' || c == '\"');
  
 #undef CHARACTER_CLASS
  
 // Given a char, interpret it as a numeric digit and return its value.
 // This supports any number base up to 36.
 inline int DigitValue(char digit) {
   if ('0' <= digit && digit <= '9') return digit - '0';
   if ('a' <= digit && digit <= 'z') return digit - 'a' + 10;
   if ('A' <= digit && digit <= 'Z') return digit - 'A' + 10;
   return -1;
 }
  
 // Inline because it's only used in one place.
 inline char TranslateEscape(char c) {
   switch (c) {
     case 'a':
       return '\a';
     case 'b':
       return '\b';
     case 'f':
       return '\f';
     case 'n':
       return '\n';
     case 'r':
       return '\r';
     case 't':
       return '\t';
     case 'v':
       return '\v';
     case '\\':
       return '\\';
     case '?':
       return '\?';  // Trigraphs = :(
     case '\'':
       return '\'';
     case '"':
       return '\"';
  
     // We expect escape sequences to have been validated separately.
     default:
       return '?';
   }
 }
  
 }  // anonymous namespace
  
 ErrorCollector::~ErrorCollector() {}
  
 // ===================================================================
  
 Tokenizer::Tokenizer(ZeroCopyInputStream* input,
                      ErrorCollector* error_collector)
     : input_(input),
       error_collector_(error_collector),
       buffer_(NULL),
       buffer_size_(0),
       buffer_pos_(0),
       read_error_(false),
       line_(0),
       column_(0),
       record_target_(NULL),
       record_start_(-1),
       allow_f_after_float_(false),
       comment_style_(CPP_COMMENT_STYLE),
       require_space_after_number_(true),
       allow_multiline_strings_(false) {
   current_.line = 0;
   current_.column = 0;
   current_.end_column = 0;
   current_.type = TYPE_START;
  
   Refresh();
 }
  
 Tokenizer::~Tokenizer() {
   // If we had any buffer left unread, return it to the underlying stream
   // so that someone else can read it.
   if (buffer_size_ > buffer_pos_) {
     input_->BackUp(buffer_size_ - buffer_pos_);
   }
 }
  
 // -------------------------------------------------------------------
 // Internal helpers.
  
 void Tokenizer::NextChar() {
   // Update our line and column counters based on the character being
   // consumed.
   if (current_char_ == '\n') {
     ++line_;
     column_ = 0;
   } else if (current_char_ == '\t') {
     column_ += kTabWidth - column_ % kTabWidth;
   } else {
     ++column_;
   }
  
   // Advance to the next character.
   ++buffer_pos_;
   if (buffer_pos_ < buffer_size_) {
     current_char_ = buffer_[buffer_pos_];
   } else {
     Refresh();
   }
 }
  
 void Tokenizer::Refresh() {
   if (read_error_) {
     current_char_ = '\0';
     return;
   }
  
   // If we're in a token, append the rest of the buffer to it.
   if (record_target_ != NULL && record_start_ < buffer_size_) {
     record_target_->append(buffer_ + record_start_,
                            buffer_size_ - record_start_);
     record_start_ = 0;
   }
  
   const void* data = NULL;
   buffer_ = NULL;
   buffer_pos_ = 0;
   do {
     if (!input_->Next(&data, &buffer_size_)) {
       // end of stream (or read error)
       buffer_size_ = 0;
       read_error_ = true;
       current_char_ = '\0';
       return;
     }
   } while (buffer_size_ == 0);
  
   buffer_ = static_cast<const char*>(data);
  
   current_char_ = buffer_[0];
 }
  
 inline void Tokenizer::RecordTo(std::string* target) {
   record_target_ = target;
   record_start_ = buffer_pos_;
 }
  
 inline void Tokenizer::StopRecording() {
   // Note:  The if() is necessary because some STL implementations crash when
   //   you call string::append(NULL, 0), presumably because they are trying to
   //   be helpful by detecting the NULL pointer, even though there's nothing
   //   wrong with reading zero bytes from NULL.
   if (buffer_pos_ != record_start_) {
     record_target_->append(buffer_ + record_start_,
                            buffer_pos_ - record_start_);
   }
   record_target_ = NULL;
   record_start_ = -1;
 }
  
 inline void Tokenizer::StartToken() {
   current_.type = TYPE_START;  // Just for the sake of initializing it.
   current_.text.clear();
   current_.line = line_;
   current_.column = column_;
   RecordTo(&current_.text);
 }
  
 inline void Tokenizer::EndToken() {
   StopRecording();
   current_.end_column = column_;
 }
  
 // -------------------------------------------------------------------
 // Helper methods that consume characters.
  
 template <typename CharacterClass>
 inline bool Tokenizer::LookingAt() {
   return CharacterClass::InClass(current_char_);
 }
  
 template <typename CharacterClass>
 inline bool Tokenizer::TryConsumeOne() {
   if (CharacterClass::InClass(current_char_)) {
     NextChar();
     return true;
   } else {
     return false;
   }
 }
  
 inline bool Tokenizer::TryConsume(char c) {
   if (current_char_ == c) {
     NextChar();
     return true;
   } else {
     return false;
   }
 }
  
 template <typename CharacterClass>
 inline void Tokenizer::ConsumeZeroOrMore() {
   while (CharacterClass::InClass(current_char_)) {
     NextChar();
   }
 }
  
 template <typename CharacterClass>
 inline void Tokenizer::ConsumeOneOrMore(const char* error) {
   if (!CharacterClass::InClass(current_char_)) {
     AddError(error);
   } else {
     do {
       NextChar();
     } while (CharacterClass::InClass(current_char_));
   }
 }
  
 // -------------------------------------------------------------------
 // Methods that read whole patterns matching certain kinds of tokens
 // or comments.
  
 void Tokenizer::ConsumeString(char delimiter) {
   while (true) {
     switch (current_char_) {
       case '\0':
         AddError("Unexpected end of string.");
         return;
  
       case '\n': {
         if (!allow_multiline_strings_) {
           AddError("String literals cannot cross line boundaries.");
           return;
         }
         NextChar();
         break;
       }
  
       case '\\': {
         // An escape sequence.
         NextChar();
         if (TryConsumeOne<Escape>()) {
           // Valid escape sequence.
         } else if (TryConsumeOne<OctalDigit>()) {
           // Possibly followed by two more octal digits, but these will
           // just be consumed by the main loop anyway so we don't need
           // to do so explicitly here.
         } else if (TryConsume('x')) {
           if (!TryConsumeOne<HexDigit>()) {
             AddError("Expected hex digits for escape sequence.");
           }
           // Possibly followed by another hex digit, but again we don't care.
         } else if (TryConsume('u')) {
           if (!TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
               !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>()) {
             AddError("Expected four hex digits for \\u escape sequence.");
           }
         } else if (TryConsume('U')) {
           // We expect 8 hex digits; but only the range up to 0x10ffff is
           // legal.
           if (!TryConsume('0') || !TryConsume('0') ||
               !(TryConsume('0') || TryConsume('1')) ||
               !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
               !TryConsumeOne<HexDigit>() || !TryConsumeOne<HexDigit>() ||
               !TryConsumeOne<HexDigit>()) {
             AddError(
                 "Expected eight hex digits up to 10ffff for \\U escape "
                 "sequence");
           }
         } else {
           AddError("Invalid escape sequence in string literal.");
         }
         break;
       }
  
       default: {
         if (current_char_ == delimiter) {
           NextChar();
           return;
         }
         NextChar();
         break;
       }
     }
   }
 }
  
 Tokenizer::TokenType Tokenizer::ConsumeNumber(bool started_with_zero,
                                               bool started_with_dot) {
   bool is_float = false;
  
   if (started_with_zero && (TryConsume('x') || TryConsume('X'))) {
     // A hex number (started with "0x").
     ConsumeOneOrMore<HexDigit>("\"0x\" must be followed by hex digits.");
  
   } else if (started_with_zero && LookingAt<Digit>()) {
     // An octal number (had a leading zero).
     ConsumeZeroOrMore<OctalDigit>();
     if (LookingAt<Digit>()) {
       AddError("Numbers starting with leading zero must be in octal.");
       ConsumeZeroOrMore<Digit>();
     }
  
   } else {
     // A decimal number.
     if (started_with_dot) {
       is_float = true;
       ConsumeZeroOrMore<Digit>();
     } else {
       ConsumeZeroOrMore<Digit>();
  
       if (TryConsume('.')) {
         is_float = true;
         ConsumeZeroOrMore<Digit>();
       }
     }
  
     if (TryConsume('e') || TryConsume('E')) {
       is_float = true;
       TryConsume('-') || TryConsume('+');
       ConsumeOneOrMore<Digit>("\"e\" must be followed by exponent.");
     }
  
     if (allow_f_after_float_ && (TryConsume('f') || TryConsume('F'))) {
       is_float = true;
     }
   }
  
   if (LookingAt<Letter>() && require_space_after_number_) {
     AddError("Need space between number and identifier.");
   } else if (current_char_ == '.') {
     if (is_float) {
       AddError(
           "Already saw decimal point or exponent; can't have another one.");
     } else {
       AddError("Hex and octal numbers must be integers.");
     }
   }
  
   return is_float ? TYPE_FLOAT : TYPE_INTEGER;
 }
  
 void Tokenizer::ConsumeLineComment(std::string* content) {
   if (content != NULL) RecordTo(content);
  
   while (current_char_ != '\0' && current_char_ != '\n') {
     NextChar();
   }
   TryConsume('\n');
  
   if (content != NULL) StopRecording();
 }
  
 void Tokenizer::ConsumeBlockComment(std::string* content) {
   int start_line = line_;
   int start_column = column_ - 2;
  
   if (content != NULL) RecordTo(content);
  
   while (true) {
     while (current_char_ != '\0' && current_char_ != '*' &&
            current_char_ != '/' && current_char_ != '\n') {
       NextChar();
     }
  
     if (TryConsume('\n')) {
       if (content != NULL) StopRecording();
  
       // Consume leading whitespace and asterisk;
       ConsumeZeroOrMore<WhitespaceNoNewline>();
       if (TryConsume('*')) {
         if (TryConsume('/')) {
           // End of comment.
           break;
         }
       }
  
       if (content != NULL) RecordTo(content);
     } else if (TryConsume('*') && TryConsume('/')) {
       // End of comment.
       if (content != NULL) {
         StopRecording();
         // Strip trailing "*/".
         content->erase(content->size() - 2);
       }
       break;
     } else if (TryConsume('/') && current_char_ == '*') {
       // Note:  We didn't consume the '*' because if there is a '/' after it
       //   we want to interpret that as the end of the comment.
       AddError(
           "\"/*\" inside block comment.  Block comments cannot be nested.");
     } else if (current_char_ == '\0') {
       AddError("End-of-file inside block comment.");
       error_collector_->AddError(start_line, start_column,
                                  "  Comment started here.");
       if (content != NULL) StopRecording();
       break;
     }
   }
 }
  
 Tokenizer::NextCommentStatus Tokenizer::TryConsumeCommentStart() {
   if (comment_style_ == CPP_COMMENT_STYLE && TryConsume('/')) {
     if (TryConsume('/')) {
       return LINE_COMMENT;
     } else if (TryConsume('*')) {
       return BLOCK_COMMENT;
     } else {
       // Oops, it was just a slash.  Return it.
       current_.type = TYPE_SYMBOL;
       current_.text = "/";
       current_.line = line_;
       current_.column = column_ - 1;
       current_.end_column = column_;
       return SLASH_NOT_COMMENT;
     }
   } else if (comment_style_ == SH_COMMENT_STYLE && TryConsume('#')) {
     return LINE_COMMENT;
   } else {
     return NO_COMMENT;
   }
 }
  
 // -------------------------------------------------------------------
  
 bool Tokenizer::Next() {
   previous_ = current_;
  
   while (!read_error_) {
     ConsumeZeroOrMore<Whitespace>();
  
     switch (TryConsumeCommentStart()) {
       case LINE_COMMENT:
         ConsumeLineComment(NULL);
         continue;
       case BLOCK_COMMENT:
         ConsumeBlockComment(NULL);
         continue;
       case SLASH_NOT_COMMENT:
         return true;
       case NO_COMMENT:
         break;
     }
  
     // Check for EOF before continuing.
     if (read_error_) break;
  
     if (LookingAt<Unprintable>() || current_char_ == '\0') {
       AddError("Invalid control characters encountered in text.");
       NextChar();
       // Skip more unprintable characters, too.  But, remember that '\0' is
       // also what current_char_ is set to after EOF / read error.  We have
       // to be careful not to go into an infinite loop of trying to consume
       // it, so make sure to check read_error_ explicitly before consuming
       // '\0'.
       while (TryConsumeOne<Unprintable>() ||
              (!read_error_ && TryConsume('\0'))) {
         // Ignore.
       }
  
     } else {
       // Reading some sort of token.
       StartToken();
  
       if (TryConsumeOne<Letter>()) {
         ConsumeZeroOrMore<Alphanumeric>();
         current_.type = TYPE_IDENTIFIER;
       } else if (TryConsume('0')) {
         current_.type = ConsumeNumber(true, false);
       } else if (TryConsume('.')) {
         // This could be the beginning of a floating-point number, or it could
         // just be a '.' symbol.
  
         if (TryConsumeOne<Digit>()) {
           // It's a floating-point number.
           if (previous_.type == TYPE_IDENTIFIER &&
               current_.line == previous_.line &&
               current_.column == previous_.end_column) {
             // We don't accept syntax like "blah.123".
             error_collector_->AddError(
                 line_, column_ - 2,
                 "Need space between identifier and decimal point.");
           }
           current_.type = ConsumeNumber(false, true);
         } else {
           current_.type = TYPE_SYMBOL;
         }
       } else if (TryConsumeOne<Digit>()) {
         current_.type = ConsumeNumber(false, false);
       } else if (TryConsume('\"')) {
         ConsumeString('\"');
         current_.type = TYPE_STRING;
       } else if (TryConsume('\'')) {
         ConsumeString('\'');
         current_.type = TYPE_STRING;
       } else {
         // Check if the high order bit is set.
         if (current_char_ & 0x80) {
           error_collector_->AddError(
               line_, column_,
               StringPrintf("Interpreting non ascii codepoint %d.",
                            static_cast<unsigned char>(current_char_)));
         }
         NextChar();
         current_.type = TYPE_SYMBOL;
       }
  
       EndToken();
       return true;
     }
   }
  
   // EOF
   current_.type = TYPE_END;
   current_.text.clear();
   current_.line = line_;
   current_.column = column_;
   current_.end_column = column_;
   return false;
 }
  
 namespace {
  
 // Helper class for collecting comments and putting them in the right places.
 //
 // This basically just buffers the most recent comment until it can be decided
 // exactly where that comment should be placed.  When Flush() is called, the
 // current comment goes into either prev_trailing_comments or detached_comments.
 // When the CommentCollector is destroyed, the last buffered comment goes into
 // next_leading_comments.
 class CommentCollector {
  public:
   CommentCollector(std::string* prev_trailing_comments,
                    std::vector<std::string>* detached_comments,
                    std::string* next_leading_comments)
       : prev_trailing_comments_(prev_trailing_comments),
         detached_comments_(detached_comments),
         next_leading_comments_(next_leading_comments),
         has_comment_(false),
         is_line_comment_(false),
         can_attach_to_prev_(true) {
     if (prev_trailing_comments != NULL) prev_trailing_comments->clear();
     if (detached_comments != NULL) detached_comments->clear();
     if (next_leading_comments != NULL) next_leading_comments->clear();
   }
  
   ~CommentCollector() {
     // Whatever is in the buffer is a leading comment.
     if (next_leading_comments_ != NULL && has_comment_) {
       comment_buffer_.swap(*next_leading_comments_);
     }
   }
  
   // About to read a line comment.  Get the comment buffer pointer in order to
   // read into it.
   std::string* GetBufferForLineComment() {
     // We want to combine with previous line comments, but not block comments.
     if (has_comment_ && !is_line_comment_) {
       Flush();
     }
     has_comment_ = true;
     is_line_comment_ = true;
     return &comment_buffer_;
   }
  
   // About to read a block comment.  Get the comment buffer pointer in order to
   // read into it.
   std::string* GetBufferForBlockComment() {
     if (has_comment_) {
       Flush();
     }
     has_comment_ = true;
     is_line_comment_ = false;
     return &comment_buffer_;
   }
  
   void ClearBuffer() {
     comment_buffer_.clear();
     has_comment_ = false;
   }
  
   // Called once we know that the comment buffer is complete and is *not*
   // connected to the next token.
   void Flush() {
     if (has_comment_) {
       if (can_attach_to_prev_) {
         if (prev_trailing_comments_ != NULL) {
           prev_trailing_comments_->append(comment_buffer_);
         }
         can_attach_to_prev_ = false;
       } else {
         if (detached_comments_ != NULL) {
           detached_comments_->push_back(comment_buffer_);
         }
       }
       ClearBuffer();
     }
   }
  
   void DetachFromPrev() { can_attach_to_prev_ = false; }
  
  private:
   std::string* prev_trailing_comments_;
   std::vector<std::string>* detached_comments_;
   std::string* next_leading_comments_;
  
   std::string comment_buffer_;
  
   // True if any comments were read into comment_buffer_.  This can be true even
   // if comment_buffer_ is empty, namely if the comment was "/**/".
   bool has_comment_;
  
   // Is the comment in the comment buffer a line comment?
   bool is_line_comment_;
  
   // Is it still possible that we could be reading a comment attached to the
   // previous token?
   bool can_attach_to_prev_;
 };
  
 }  // namespace
  
 bool Tokenizer::NextWithComments(std::string* prev_trailing_comments,
                                  std::vector<std::string>* detached_comments,
                                  std::string* next_leading_comments) {
   CommentCollector collector(prev_trailing_comments, detached_comments,
                              next_leading_comments);
  
   if (current_.type == TYPE_START) {
     // Ignore unicode byte order mark(BOM) if it appears at the file
     // beginning. Only UTF-8 BOM (0xEF 0xBB 0xBF) is accepted.
     if (TryConsume((char)0xEF)) {
       if (!TryConsume((char)0xBB) || !TryConsume((char)0xBF)) {
         AddError(
             "Proto file starts with 0xEF but not UTF-8 BOM. "
             "Only UTF-8 is accepted for proto file.");
         return false;
       }
     }
     collector.DetachFromPrev();
   } else {
     // A comment appearing on the same line must be attached to the previous
     // declaration.
     ConsumeZeroOrMore<WhitespaceNoNewline>();
     switch (TryConsumeCommentStart()) {
       case LINE_COMMENT:
         ConsumeLineComment(collector.GetBufferForLineComment());
  
         // Don't allow comments on subsequent lines to be attached to a trailing
         // comment.
         collector.Flush();
         break;
       case BLOCK_COMMENT:
         ConsumeBlockComment(collector.GetBufferForBlockComment());
  
         ConsumeZeroOrMore<WhitespaceNoNewline>();
         if (!TryConsume('\n')) {
           // Oops, the next token is on the same line.  If we recorded a comment
           // we really have no idea which token it should be attached to.
           collector.ClearBuffer();
           return Next();
         }
  
         // Don't allow comments on subsequent lines to be attached to a trailing
         // comment.
         collector.Flush();
         break;
       case SLASH_NOT_COMMENT:
         return true;
       case NO_COMMENT:
         if (!TryConsume('\n')) {
           // The next token is on the same line.  There are no comments.
           return Next();
         }
         break;
     }
   }
  
   // OK, we are now on the line *after* the previous token.
   while (true) {
     ConsumeZeroOrMore<WhitespaceNoNewline>();
  
     switch (TryConsumeCommentStart()) {
       case LINE_COMMENT:
         ConsumeLineComment(collector.GetBufferForLineComment());
         break;
       case BLOCK_COMMENT:
         ConsumeBlockComment(collector.GetBufferForBlockComment());
  
         // Consume the rest of the line so that we don't interpret it as a
         // blank line the next time around the loop.
         ConsumeZeroOrMore<WhitespaceNoNewline>();
         TryConsume('\n');
         break;
       case SLASH_NOT_COMMENT:
         return true;
       case NO_COMMENT:
         if (TryConsume('\n')) {
           // Completely blank line.
           collector.Flush();
           collector.DetachFromPrev();
         } else {
           bool result = Next();
           if (!result || current_.text == "}" || current_.text == "]" ||
               current_.text == ")") {
             // It looks like we're at the end of a scope.  In this case it
             // makes no sense to attach a comment to the following token.
             collector.Flush();
           }
           return result;
         }
         break;
     }
   }
 }
  
 // -------------------------------------------------------------------
 // Token-parsing helpers.  Remember that these don't need to report
 // errors since any errors should already have been reported while
 // tokenizing.  Also, these can assume that whatever text they
 // are given is text that the tokenizer actually parsed as a token
 // of the given type.
  
 bool Tokenizer::ParseInteger(const std::string& text, uint64 max_value,
                              uint64* output) {
   // Sadly, we can't just use strtoul() since it is only 32-bit and strtoull()
   // is non-standard.  I hate the C standard library.  :(
  
   //  return strtoull(text.c_str(), NULL, 0);
  
   const char* ptr = text.c_str();
   int base = 10;
   if (ptr[0] == '0') {
     if (ptr[1] == 'x' || ptr[1] == 'X') {
       // This is hex.
       base = 16;
       ptr += 2;
     } else {
       // This is octal.
       base = 8;
     }
   }
  
   uint64 result = 0;
   for (; *ptr != '\0'; ptr++) {
     int digit = DigitValue(*ptr);
     if (digit < 0 || digit >= base) {
       // The token provided by Tokenizer is invalid. i.e., 099 is an invalid
       // token, but Tokenizer still think it's integer.
       return false;
     }
     if (digit > max_value || result > (max_value - digit) / base) {
       // Overflow.
       return false;
     }
     result = result * base + digit;
   }
  
   *output = result;
   return true;
 }
  
 double Tokenizer::ParseFloat(const std::string& text) {
   const char* start = text.c_str();
   char* end;
   double result = NoLocaleStrtod(start, &end);
  
   // "1e" is not a valid float, but if the tokenizer reads it, it will
   // report an error but still return it as a valid token.  We need to
   // accept anything the tokenizer could possibly return, error or not.
   if (*end == 'e' || *end == 'E') {
     ++end;
     if (*end == '-' || *end == '+') ++end;
   }
  
   // If the Tokenizer had allow_f_after_float_ enabled, the float may be
   // suffixed with the letter 'f'.
   if (*end == 'f' || *end == 'F') {
     ++end;
   }
  
   GOOGLE_LOG_IF(DFATAL, end - start != text.size() || *start == '-')
       << " Tokenizer::ParseFloat() passed text that could not have been"
          " tokenized as a float: "
       << CEscape(text);
   return result;
 }
  
 // Helper to append a Unicode code point to a string as UTF8, without bringing
 // in any external dependencies.
 static void AppendUTF8(uint32 code_point, std::string* output) {
   uint32 tmp = 0;
   int len = 0;
   if (code_point <= 0x7f) {
     tmp = code_point;
     len = 1;
   } else if (code_point <= 0x07ff) {
     tmp = 0x0000c080 | ((code_point & 0x07c0) << 2) | (code_point & 0x003f);
     len = 2;
   } else if (code_point <= 0xffff) {
     tmp = 0x00e08080 | ((code_point & 0xf000) << 4) |
           ((code_point & 0x0fc0) << 2) | (code_point & 0x003f);
     len = 3;
   } else if (code_point <= 0x1fffff) {
     tmp = 0xf0808080 | ((code_point & 0x1c0000) << 6) |
           ((code_point & 0x03f000) << 4) | ((code_point & 0x000fc0) << 2) |
           (code_point & 0x003f);
     len = 4;
   } else {
     // UTF-16 is only defined for code points up to 0x10FFFF, and UTF-8 is
     // normally only defined up to there as well.
     StringAppendF(output, "\\U%08x", code_point);
     return;
   }
   tmp = ghtonl(tmp);
   output->append(reinterpret_cast<const char*>(&tmp) + sizeof(tmp) - len, len);
 }
  
 // Try to read <len> hex digits from ptr, and stuff the numeric result into
 // *result. Returns true if that many digits were successfully consumed.
 static bool ReadHexDigits(const char* ptr, int len, uint32* result) {
   *result = 0;
   if (len == 0) return false;
   for (const char* end = ptr + len; ptr < end; ++ptr) {
     if (*ptr == '\0') return false;
     *result = (*result << 4) + DigitValue(*ptr);
   }
   return true;
 }
  
 // Handling UTF-16 surrogate pairs. UTF-16 encodes code points in the range
 // 0x10000...0x10ffff as a pair of numbers, a head surrogate followed by a trail
 // surrogate. These numbers are in a reserved range of Unicode code points, so
 // if we encounter such a pair we know how to parse it and convert it into a
 // single code point.
 static const uint32 kMinHeadSurrogate = 0xd800;
 static const uint32 kMaxHeadSurrogate = 0xdc00;
 static const uint32 kMinTrailSurrogate = 0xdc00;
 static const uint32 kMaxTrailSurrogate = 0xe000;
  
 static inline bool IsHeadSurrogate(uint32 code_point) {
   return (code_point >= kMinHeadSurrogate) && (code_point < kMaxHeadSurrogate);
 }
  
 static inline bool IsTrailSurrogate(uint32 code_point) {
   return (code_point >= kMinTrailSurrogate) &&
          (code_point < kMaxTrailSurrogate);
 }
  
 // Combine a head and trail surrogate into a single Unicode code point.
 static uint32 AssembleUTF16(uint32 head_surrogate, uint32 trail_surrogate) {
   GOOGLE_DCHECK(IsHeadSurrogate(head_surrogate));
   GOOGLE_DCHECK(IsTrailSurrogate(trail_surrogate));
   return 0x10000 + (((head_surrogate - kMinHeadSurrogate) << 10) |
                     (trail_surrogate - kMinTrailSurrogate));
 }
  
 // Convert the escape sequence parameter to a number of expected hex digits.
 static inline int UnicodeLength(char key) {
   if (key == 'u') return 4;
   if (key == 'U') return 8;
   return 0;
 }
  
 // Given a pointer to the 'u' or 'U' starting a Unicode escape sequence, attempt
 // to parse that sequence. On success, returns a pointer to the first char
 // beyond that sequence, and fills in *code_point. On failure, returns ptr
 // itself.
 static const char* FetchUnicodePoint(const char* ptr, uint32* code_point) {
   const char* p = ptr;
   // Fetch the code point.
   const int len = UnicodeLength(*p++);
   if (!ReadHexDigits(p, len, code_point)) return ptr;
   p += len;
  
   // Check if the code point we read is a "head surrogate." If so, then we
   // expect it to be immediately followed by another code point which is a valid
   // "trail surrogate," and together they form a UTF-16 pair which decodes into
   // a single Unicode point. Trail surrogates may only use \u, not \U.
   if (IsHeadSurrogate(*code_point) && *p == '\\' && *(p + 1) == 'u') {
     uint32 trail_surrogate;
     if (ReadHexDigits(p + 2, 4, &trail_surrogate) &&
         IsTrailSurrogate(trail_surrogate)) {
       *code_point = AssembleUTF16(*code_point, trail_surrogate);
       p += 6;
     }
     // If this failed, then we just emit the head surrogate as a code point.
     // It's bogus, but so is the string.
   }
  
   return p;
 }
  
 // The text string must begin and end with single or double quote
 // characters.
 void Tokenizer::ParseStringAppend(const std::string& text,
                                   std::string* output) {
   // Reminder: text[0] is always a quote character.  (If text is
   // empty, it's invalid, so we'll just return).
   const size_t text_size = text.size();
   if (text_size == 0) {
     GOOGLE_LOG(DFATAL) << " Tokenizer::ParseStringAppend() passed text that could not"
                    " have been tokenized as a string: "
                 << CEscape(text);
     return;
   }
  
   // Reserve room for new string. The branch is necessary because if
   // there is already space available the reserve() call might
   // downsize the output.
   const size_t new_len = text_size + output->size();
   if (new_len > output->capacity()) {
     output->reserve(new_len);
   }
  
   // Loop through the string copying characters to "output" and
   // interpreting escape sequences.  Note that any invalid escape
   // sequences or other errors were already reported while tokenizing.
   // In this case we do not need to produce valid results.
   for (const char* ptr = text.c_str() + 1; *ptr != '\0'; ptr++) {
     if (*ptr == '\\' && ptr[1] != '\0') {
       // An escape sequence.
       ++ptr;
  
       if (OctalDigit::InClass(*ptr)) {
         // An octal escape.  May one, two, or three digits.
         int code = DigitValue(*ptr);
         if (OctalDigit::InClass(ptr[1])) {
           ++ptr;
           code = code * 8 + DigitValue(*ptr);
         }
         if (OctalDigit::InClass(ptr[1])) {
           ++ptr;
           code = code * 8 + DigitValue(*ptr);
         }
         output->push_back(static_cast<char>(code));
  
       } else if (*ptr == 'x') {
         // A hex escape.  May zero, one, or two digits.  (The zero case
         // will have been caught as an error earlier.)
         int code = 0;
         if (HexDigit::InClass(ptr[1])) {
           ++ptr;
           code = DigitValue(*ptr);
         }
         if (HexDigit::InClass(ptr[1])) {
           ++ptr;
           code = code * 16 + DigitValue(*ptr);
         }
         output->push_back(static_cast<char>(code));
  
       } else if (*ptr == 'u' || *ptr == 'U') {
         uint32 unicode;
         const char* end = FetchUnicodePoint(ptr, &unicode);
         if (end == ptr) {
           // Failure: Just dump out what we saw, don't try to parse it.
           output->push_back(*ptr);
         } else {
           AppendUTF8(unicode, output);
           ptr = end - 1;  // Because we're about to ++ptr.
         }
       } else {
         // Some other escape code.
         output->push_back(TranslateEscape(*ptr));
       }
  
     } else if (*ptr == text[0] && ptr[1] == '\0') {
       // Ignore final quote matching the starting quote.
     } else {
       output->push_back(*ptr);
     }
   }
 }
  
 template <typename CharacterClass>
 static bool AllInClass(const std::string& s) {
   for (int i = 0; i < s.size(); ++i) {
     if (!CharacterClass::InClass(s[i])) return false;
   }
   return true;
 }
  
 bool Tokenizer::IsIdentifier(const std::string& text) {
   // Mirrors IDENTIFIER definition in Tokenizer::Next() above.
   if (text.size() == 0) return false;
   if (!Letter::InClass(text.at(0))) return false;
   if (!AllInClass<Alphanumeric>(text.substr(1))) return false;
   return true;
 }
  
 }  // namespace io
 }  // namespace protobuf
 }  // namespace google