
   1 Write test for empty_char_constant
   2
   3 defined cannot be used as a macro name
   4 <strike>Add "defined" and only accept it in appropriate circumstances</strike>
   5
   6 Update that simple tokenizer compulsory test so things will compile
   7
   8 Handle cases like escaped question marks and pound symbols that I don't understand yet.
   9
  10 (done) Fix #include <stdio.h> to read include directive correctly
  11
  12 txt/orig state of affairs:
  13
  14 The problem is that there are two ways to interpret line,col:
  15         With respect to txt
  16         With respect to orig
  17
  18 This isn't a problem when txt and orig point to the same character, as in:
  19
  20 int in\
  21 dex
  22 int \
  23 index /*Here, the backslash break should be gobbled up by the space identifier*/
  24
  25 line,col has no ambiguity as to where it should point.  However, when txt and orig point to different characters (e.g. at the beginning of a line that starts with a backslash break):
  26
  27 \
  28 int index
  29
  30 line,col could point either to orig or to the first real character.  We will do the latter.
  31
  32 Moreover, will a newline followed by backslash breaks generate a token that gobbles up said breaks?  I believe it will, but no need to call this mandatory.
  33
  34 Thus, on a lookup with a txt pointer, the line/col/orig should match the real character and not preceding backslash breaks.
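
A minimal sketch of that rule, assuming a backslash break is exactly a backslash followed by '\n'.  The helper name is made up for illustration and is not part of ccan_tokenizer.  Given a pointer into orig, it steps past any leading backslash breaks so that line/col can be computed for the real character:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: skip every backslash-newline pair that sits at p,
 * returning a pointer to the first "real" character (or end).
 */
static const char *sketch_skip_backslash_breaks(const char *p, const char *end)
{
	assert(p <= end);
	while (end - p >= 2 && p[0] == '\\' && p[1] == '\n')
		p += 2;
	return p;
}
```

For the "\<newline>int index" case above, the returned pointer lands on the 'i' of int, which is the position line,col should describe.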
  35
  36
  37 I've been assuming that every token starts with its first character, neglecting the case where a line starts with backslash breaks.  The question is, given the txt pointer to the first character, where should the derived orig land?
  38
  39 Currently, orig lands after the leading backslash breaks, when it should probably land before them.
  40
  41 Here's what the tokenizer's text anchoring needs:
  42         Broken/unbroken text pointer -> line/col
  43         Unbroken contents per token to identify identifier text
  44         Original contents per token to rebuild the document
  45         Ability to change "original contents" so the document will be saved with modifications
  46         Ability to insert new tokens
  47
  48 Solution:
  49         New tokens will typically have identical txt and orig, yea even the same pointer.
  50         txt/txt_size for unbroken contents, orig/orig_size for original
  51         modify orig to change the document
  52         txt identifies identifier text
  53         Line lookup tables are used to resolve txt/orig pointers; other pointers can't be resolved in the same fashion and may require traversing backward through the list.
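
To make the txt/orig split concrete, here is a rough sketch.  The field names txt, txt_size, orig and orig_size come from the notes above; the struct name, the next pointer and both helper functions are assumptions for illustration only, not the real ccan_tokenizer definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token layout; only the four field names are from the notes. */
struct sketch_token {
	const char *txt;	/* unbroken contents (backslash breaks removed) */
	size_t txt_size;
	const char *orig;	/* original contents, used to rebuild the document */
	size_t orig_size;
	struct sketch_token *next;
};

/* Rebuild the document by concatenating every token's orig into buf. */
static size_t sketch_rebuild(const struct sketch_token *tok, char *buf, size_t cap)
{
	size_t n = 0;

	for (; tok; tok = tok->next) {
		assert(n + tok->orig_size <= cap);
		memcpy(buf + n, tok->orig, tok->orig_size);
		n += tok->orig_size;
	}
	return n;
}

/* Identify identifier text through txt, never through orig. */
static int sketch_txt_is(const struct sketch_token *tok, const char *name)
{
	size_t len = strlen(name);

	return tok->txt_size == len && memcmp(tok->txt, name, len) == 0;
}
```

With this shape, changing the saved document means pointing orig/orig_size at new bytes, and a freshly inserted token can simply set orig equal to txt.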
  54
  55 What this means:
  56         Token txt/txt_size, orig/orig_size, orig_lines, txt_lines, and tok_point_lookup are all still correct.
  57         Token line,col will be removed
  58
  59 Other improvements to do:
  60         Sanity check the point lookups like crazy
  61         Remove the array() structures in token_list, as these are supposed to be read-only
  62
  63 Make sure tok_point_lookup returns correct values for every single pointer possible, particularly those in orig that are on backslash-breaks
  64
  65 Convert the tok_message_queue into an array of messages bound to tokens.
  66
  67 Ask Rusty about the trailing newline in this case:
  68
  69 /* Blah
  70  *
  71  * blah
  72  */
  73
  74 Here, rather than the trailing space being blank, it is "blank" from the comment perspective.
  75 May require deeper analysis.
  76
  77 Todos from ccan_tokenizer.h
  78 /*
  79 Assumption:  Every token fits on exactly one line
  80 Counterexamples:
  81         Backslash-broken lines
  82         Multiline comments
  83
  84 Checks to implement in the tokenizer:
  85
  86 Is the $ character used in an identifier?  (Some configurations of GCC allow this.)
  87 Are there potentially ambiguous sequences used in a string literal?  (e.g. "\0000")
  88 Are there stray characters?  (e.g. '\0', '@', '\b')
  89 Are there trailing spaces at the end of lines (unless said spaces consume the entire line)?
  90         Are there trailing spaces after a backslash-broken line?
  91
  92
  93 Fixes todo:
  94
  95 backslash-newline sequence should register as an empty character, and the tokenizer's line value should be incremented accordingly.
  96 */
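
The trailing-space check from the list above could look roughly like this.  It is a sketch under assumptions: the function name is invented, the line is passed without its terminating newline, and only spaces and tabs count as trailing space.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if the line ends in spaces or tabs, unless those spaces
 * consume the entire line (a blank line is not reported).
 */
static int sketch_has_trailing_space(const char *line, size_t len)
{
	size_t end = len;

	assert(line != NULL || len == 0);
	while (end > 0 && (line[end - 1] == ' ' || line[end - 1] == '\t'))
		end--;
	return end != len && end != 0;
}
```

A line such as "foo \" followed by spaces is reported too, which covers the "trailing spaces after a backslash-broken line" case.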
  97
  98 Lex angle bracket strings in #include
  99
 100 Check the rules in the documentation
 101
 102 Examine the message queue as part of testing the tokenizer:
 103         Make sure there are no bug messages
 104         Make sure files compile with no warnings
 105 For the tokenizer sanity check, make sure integers and floats each have valid suffixes
 106         (e.g. no TOK_F for an integer, no TOK_ULL for a floating-point number)
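
One possible shape for the suffix check, working on the suffix text rather than on TOK_F / TOK_ULL style token types.  Both function names are hypothetical, and the accepted sets are the C99/C11 ones (u, l, ll and their combinations for integers; f or l for floating constants), so newer suffixes are out of scope for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if s is a valid C99 integer suffix (possibly empty). */
static int sketch_valid_int_suffix(const char *s)
{
	static const char *const ok[] = {
		"", "u", "l", "ul", "lu", "ll", "ull", "llu"
	};
	char low[4];
	size_t i, len;

	assert(s != NULL);
	len = strlen(s);
	if (len > 3)
		return 0;
	/* "lL" and "Ll" are not valid spellings of long long */
	if (strstr(s, "lL") || strstr(s, "Ll"))
		return 0;
	for (i = 0; i < len; i++)
		low[i] = (char)tolower((unsigned char)s[i]);
	low[len] = '\0';
	for (i = 0; i < sizeof(ok) / sizeof(ok[0]); i++)
		if (strcmp(low, ok[i]) == 0)
			return 1;
	return 0;
}

/* Returns 1 if s is a valid C99 floating suffix: empty, f, F, l or L. */
static int sketch_valid_float_suffix(const char *s)
{
	assert(s != NULL);
	if (s[0] == '\0')
		return 1;
	return (s[0] == 'f' || s[0] == 'F' || s[0] == 'l' || s[0] == 'L')
		&& s[1] == '\0';
}
```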
 107
 108 Update the scan_number sanity checks
 109 (done) Move scan_number et al. to a separate C file
 110
 111 Test:
 112         Overflow and underflow floats
 113         0x.p0
 114         (done) 0755f //octal 0755 with invalid suffix
 115         (done) 0755e1 //floating 7550
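
For cross-checking expected values in these tests, the C library's strtod can serve as a reference.  This is only a sketch of such a cross-check (the struct and function names are invented here), not how scan_number works.  Note that by the C grammar a hexadecimal floating constant needs at least one hex digit, so 0x.p0 should be rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_scan {
	double value;		/* value strtod produced */
	size_t used;		/* how many characters strtod consumed */
	int range_error;	/* nonzero if strtod reported ERANGE */
};

/* Parse the start of s with strtod and report what happened. */
static struct sketch_scan sketch_scan_double(const char *s)
{
	struct sketch_scan r;
	char *end;

	assert(s != NULL);
	errno = 0;
	r.value = strtod(s, &end);
	r.used = (size_t)(end - s);
	r.range_error = (errno == ERANGE);
	return r;
}
```

For "0755e1" this consumes all six characters and yields 7550, agreeing with the note above.  For "0755f" it stops before the f, leaving the suffix for a separate validity check.  An overflowing input such as "1e999" sets range_error.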
 116
 117 Figure out how keywords will be handled.
 118         Preprocessor directives are <strike>case-insensitive</strike> actually case-sensitive (except __VA_ARGS__)
 119         All C keywords are case sensitive
 120         __VA_ARGS__ should be read as an identifier unless it's in the expansion of a macro.  Otherwise, GCC generates a warning.
 121                 We are in the expansion of a macro after <startline> <space> # <space>
 122         Don't forget about __attribute__
 123         Except for __VA_ARGS__, all preprocessor keywords are preceded by <startline> <space> # <space>
 124
 125 Solution:
 126         All the words themselves will go into one opkw dictionary, and for both type and opkw, no distinction will be made between preprocessor and normal keywords.
 127         Instead, int type will become short type; unsigned short cpp:1;
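
A sketch of how the cpp bit might be derived, assuming <space> means spaces and tabs only (comments and backslash breaks are ignored here).  The short type / unsigned short cpp:1 pair is from the note above; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the two fields named in the note. */
struct sketch_tok_kind {
	short type;
	unsigned short cpp:1;	/* set when the word is a preprocessor keyword */
};

/* Returns 1 if word is preceded only by <space> # <space> on its line. */
static int sketch_after_hash(const char *line, const char *word)
{
	const char *p = line;

	assert(line <= word);
	while (p < word && (*p == ' ' || *p == '\t'))
		p++;
	if (p >= word || *p != '#')
		return 0;
	p++;
	while (p < word && (*p == ' ' || *p == '\t'))
		p++;
	return p == word;
}

/* Same dictionary id either way; only the cpp bit records the context. */
static struct sketch_tok_kind sketch_classify(short type, const char *line,
					      const char *word)
{
	struct sketch_tok_kind k;

	k.type = type;
	k.cpp = sketch_after_hash(line, word) ? 1 : 0;
	return k;
}
```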
 128
 129 Merge
 130 Commit ccan_tokenizer to the ccan repo
 131 Introduce ccan_tokenizer to ccanlint
 132
 133 Write testcases for scanning all available operators
 134 Support integer and floating point suffixes (e.g. 500UL, 0.5f)
 135 Examine the message queue after tokenizing
 136 Make sure single-character operators have an opkw < 128
 137 Make sure c_dictionary has no duplicate entries
 138 Write verifiers for other types than TOK_WHITE
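
The two dictionary checks in this list could be sketched as follows.  The entry layout is an assumption (the real c_dictionary may look different), and the duplicate scan is a plain quadratic loop, which should be fine for a table of this size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dictionary entry: an opkw id and its spelling. */
struct sketch_dict_entry {
	int id;
	const char *spelling;
};

/* Returns 1 if every single-character entry has an id below 128. */
static int sketch_single_chars_ok(const struct sketch_dict_entry *d, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(d[i].spelling != NULL);
		if (strlen(d[i].spelling) == 1 &&
		    (d[i].id < 0 || d[i].id >= 128))
			return 0;
	}
	return 1;
}

/* Returns 1 if any spelling appears more than once. */
static int sketch_has_duplicates(const struct sketch_dict_entry *d, size_t n)
{
	size_t i, j;

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (strcmp(d[i].spelling, d[j].spelling) == 0)
				return 1;
	return 0;
}
```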
 139
 140 What's been done:
 141
 142 Operator table has been organized
 143 Merged Rusty's changes
 144 Fixed if -> while in finalize
 145 Fixed a couple of mistakes in the run-simple-token.c testcases themselves
 146         Expected orig/orig_size sizes weren't right
 147 Made token_list_sanity_check a public function and used it throughout run-simple-token.c
 148 Tests succeed and pass valgrind
 149
 150 Lines/columns of every token are recorded
 151
 152 (done) Fix "0\nstatic"
 153 (done) Write tests to make sure backslash-broken lines have correct token locations.
 154 (done) Correctly handle backslash-broken lines
 155         One plan:  Separate the scanning code from the reading code.  Scanning sends valid ranges to reading, and reading fills valid tokens for the tokenizer/scanner to properly add
 156         Another plan:  Un-break backslash-broken lines into another copy of the input.  Create an array of the positions of each real line break so
 157 Annotate message queue messages with current token
 158
 159 Conversion to make:
 160         From:
 161                 Position in unbroken text
 162         To:
 163                 Real line number
 164                 Real offset from start of line
 165
 166 Thus, we want an array of real line start locations wrt the unbroken text
 167
 168 Here is a bro\
 169 ken line.  Here is a
 170 real line.
 171
 172 <LINE>Here is a bro<LINE>ken line.  Here is a
 173 <LINE>real line.
 174
 175 If we know the position of the token text wrt the unbroken text, we can look up the real line number and offset using only the array of real line start positions within the unbroken text.
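
A sketch of that lookup, using the example text above.  Everything here is illustrative: the function names are invented, backslash breaks are assumed to be exactly backslash plus '\n', and line numbers and offsets are zero-based.  The first function un-breaks the input while recording where each real line starts in the unbroken text; the second binary-searches that array.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy orig into txt with backslash-newline pairs removed.  starts[] receives
 * the offset within txt at which each real line begins.  Returns the number
 * of line starts; *txt_len receives the unbroken length.
 */
static size_t sketch_unbreak(const char *orig, size_t len, char *txt,
			     size_t *txt_len, size_t *starts, size_t max_starts)
{
	size_t i = 0, o = 0, n = 0;

	assert(max_starts > 0);
	starts[n++] = 0;
	while (i < len) {
		if (orig[i] == '\\' && i + 1 < len && orig[i + 1] == '\n') {
			i += 2;			/* drop the break itself */
			assert(n < max_starts);
			starts[n++] = o;	/* a real line starts here */
		} else {
			char c = orig[i++];

			txt[o++] = c;
			if (c == '\n' && i < len) {
				assert(n < max_starts);
				starts[n++] = o;
			}
		}
	}
	*txt_len = o;
	return n;
}

/* Index of the real line containing unbroken-text offset pos. */
static size_t sketch_line_of(const size_t *starts, size_t count, size_t pos)
{
	size_t lo = 0, hi = count;

	assert(count > 0 && starts[0] <= pos);
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (starts[mid] <= pos)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}
```

The real offset from the start of the line is then pos - starts[line].  For the example, the word "line" at unbroken offset 17 resolves to the second real line at offset 4, matching where it sits in "ken line."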
 176
 177 Because all we need is the orig and orig_size with respect to the unbroken text to orient