Trying to understand the Lex syntax for Standard ML (ml-lex)
I'm writing a compiler. I'm at the first phase, trying to tokenize everything. I wrote it up and got an error. I've read the docs (SML/NJ) three or four times, and the errors are not informative.
I think I must be messing up the state-change aspect of the program: it works fine for things that create tokens, but when I change state using YYBEGIN, it blows up.
Here is my lex file:
type pos = int;
type lexresult = Tokens.token;
val lineNum = ErrorMsg.lineNum;
val linePos = ErrorMsg.linePos;
val commentDepth = ref 0;
fun incCom (cmDepth) = cmDepth := !cmDepth + 1;
fun decCom (cmDepth) = cmDepth := !cmDepth - 1;
fun err (p1, p2) = ErrorMsg.error p1;
fun eof () = let val pos = hd (!linePos) in Tokens.EOF (pos, pos) end;

%%

digits=[0-9]+;
%s COMMENT STRING;

%%

<INITIAL,COMMENT>\n => (lineNum := !lineNum + 1; linePos := yypos :: !linePos; continue());
<INITIAL>"type"     => (Tokens.TYPE(yypos, yypos + 4));
<INITIAL>"var"      => (Tokens.VAR(yypos, yypos + 3));
<INITIAL>"function" => (Tokens.FUNCTION(yypos, yypos + 8));
<INITIAL>"break"    => (Tokens.BREAK(yypos, yypos + 5));
<INITIAL>"of"       => (Tokens.OF(yypos, yypos + 2));
<INITIAL>"end"      => (Tokens.END(yypos, yypos + 3));
<INITIAL>"in"       => (Tokens.IN(yypos, yypos + 2));
<INITIAL>"nil"      => (Tokens.NIL(yypos, yypos + 3));
<INITIAL>"let"      => (Tokens.LET(yypos, yypos + 3));
<INITIAL>"do"       => (Tokens.DO(yypos, yypos + 2));
<INITIAL>"to"       => (Tokens.TO(yypos, yypos + 2));
<INITIAL>"for"      => (Tokens.FOR(yypos, yypos + 3));
<INITIAL>"while"    => (Tokens.WHILE(yypos, yypos + 5));
<INITIAL>"else"     => (Tokens.ELSE(yypos, yypos + 4));
<INITIAL>"then"     => (Tokens.THEN(yypos, yypos + 4));
<INITIAL>"if"       => (Tokens.IF(yypos, yypos + 2));
<INITIAL>"array"    => (Tokens.ARRAY(yypos, yypos + 5));
<INITIAL>":="       => (Tokens.ASSIGN(yypos, yypos + 2));
<INITIAL>"|"        => (Tokens.OR(yypos, yypos + 1));
<INITIAL>"&"        => (Tokens.AND(yypos, yypos + 1));
<INITIAL>">="       => (Tokens.GE(yypos, yypos + 2));
<INITIAL>">"        => (Tokens.GT(yypos, yypos + 1));
<INITIAL>"<="       => (Tokens.LE(yypos, yypos + 2));
<INITIAL>"<"        => (Tokens.LT(yypos, yypos + 1));
<INITIAL>"<>"       => (Tokens.NEQ(yypos, yypos + 2));
<INITIAL>"="        => (Tokens.EQ(yypos, yypos + 1));
<INITIAL>"/"        => (Tokens.DIVIDE(yypos, yypos + 1));
<INITIAL>"*"        => (Tokens.TIMES(yypos, yypos + 1));
<INITIAL>"-"        => (Tokens.MINUS(yypos, yypos + 1));
<INITIAL>"+"        => (Tokens.PLUS(yypos, yypos + 1));
<INITIAL>"."        => (Tokens.DOT(yypos, yypos + 1));
<INITIAL>"}"        => (Tokens.RBRACE(yypos, yypos + 1));
<INITIAL>"{"        => (Tokens.LBRACE(yypos, yypos + 1));
<INITIAL>"]"        => (Tokens.RBRACK(yypos, yypos + 1));
<INITIAL>"["        => (Tokens.LBRACK(yypos, yypos + 1));
<INITIAL>")"        => (Tokens.RPAREN(yypos, yypos + 1));
<INITIAL>"("        => (Tokens.LPAREN(yypos, yypos + 1));
<INITIAL>";"        => (Tokens.SEMICOLON(yypos, yypos + 1));
<INITIAL>":"        => (Tokens.COLON(yypos, yypos + 1));
<INITIAL>","        => (Tokens.COMMA(yypos, yypos + 1));
<INITIAL>{digits}   => (Tokens.INT(valOf (Int.fromString yytext), yypos, yypos + size yytext));
<INITIAL>[a-z][a-z0-9_]* => (Tokens.ID(yytext, yypos, yypos + size yytext));
<INITIAL>(").*(")   => (Tokens.STRING(yytext, yypos, yypos + size yytext));
<INITIAL>"\""       => (YYBEGIN STRING; continue());
<STRING>"\""        => (YYBEGIN INITIAL; continue());
<INITIAL>"/*"       => (incCom commentDepth; YYBEGIN COMMENT; continue());
<COMMENT>"/*"       => (incCom commentDepth; continue());
<COMMENT>"*/"       => (print "other trace!\n"; decCom commentDepth;
                        if !commentDepth <= 0 then YYBEGIN INITIAL else (); continue());
<INITIAL,COMMENT>[\ \t]+ => (print "trace 22222\n"; continue());
<INITIAL>.          => (ErrorMsg.error yypos ("illegal character " ^ yytext); continue());
And here is the source file I'm tokenizing:
var , 123 /* comment */ 234 "d"
It doesn't handle comments and it doesn't handle strings. Help.
EDIT: Here is the updated lex file. I have pinpointed where it breaks: it detects the start of a new comment fine, switches to the COMMENT state fine, and detects the space after the comment fine, but then it breaks and never gets to the point where it eats the int.
Comments are terminated by */ and not *\ (your rule is <COMMENT>"*\\" => ...). And surely you need a <COMMENT>. rule to deal with the body of the comment itself.
I don't see a lexical rule for the <STRING> state; if there isn't one, that is the problem with strings. Otherwise, it's in those rules, I think.
EDIT based on the edited question (not the best use of SO, IMHO): I'm not an expert in SML lexing, but it seems to me that you need rules to deal with the contents of comments and strings (as I said above in the first paragraph). In other words, there is no rule that applies in the <COMMENT> state or the <STRING> state when a character other than the terminating sequence (or, in the case of comments, whitespace) is encountered.
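To make that concrete, here is a minimal sketch of the kind of rules I have in mind, reusing the state names and the continue() convention from your own lex file. Treat it as an illustration rather than a drop-in fix; a real lexer would also accumulate the string's characters instead of discarding them:

<COMMENT>. => (continue());   (* swallow anything inside a comment other than "/*", "*/", whitespace and newlines *)
<STRING>.  => (continue());   (* swallow string contents; eventually you would collect these for Tokens.STRING *)

Because ML-Lex always prefers the longest match, the two-character "/*" and "*/" rules still take precedence over the one-character "." rule, so these can simply be added near the end of the rule list. Also note that "." does not match a newline, so a string that spans lines would still need its own rule (or an error).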