-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathgrammar.txt
More file actions
368 lines (279 loc) · 13.6 KB
/
grammar.txt
File metadata and controls
368 lines (279 loc) · 13.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
About the grammar for micro-C * sestoft@itu.dk * 2009-09-29
-----------------------------------------------------------
Micro-C is a sublanguage of C. By design, a syntactically well-formed
micro-C program is also a syntactically well-formed C program, and the
intention is that the meaning of the program should be the same in
both languages (except that micro-C is untyped and permits assignment
to array variables, which C does not).
Many simplifications have been made compared to real C:
* datatypes: only int and char variables, arrays, and pointers
* no structs, unions, doubles, function pointers, ...
* no initializers in variable declarations
* functions can return only int, char, void
First attempt at a grammar
--------------------------
This first version of the grammar shows what we want the language to
include. However, it is ambiguous and will have to be rewritten for
use in a parser specification (see below). Some grammar rules are
already complex enough to deserve an explanation:
* The variable description vardesc reflects the variable declaration
syntax of C and C++, where the type specification surrounds the
variable; unlike in Standard ML, F#, Java and C#, the type cannot
be isolated syntactically from the variable name.
* The definition of a comma-separated list (of variables or argument
expressions) must be split into three cases:
empty list
one element (and no comma)
more than one element (separated by commas)
This is best achieved using two nonterminal symbols, as
illustrated by comma-separated argument expression lists:
exprs ::=
<empty>
expr1
expr1 ::=
expr
expr , expr1
The non-recursive nonterminal exprs distinguishes the empty case
from the non-empty case. The recursive nonterminal expr1
distinguishes the one-element case from the multi-element case.
* We distinguish between access expressions (those that have an
lvalue), and general expressions. The address operator (&) can be
applied only to access expressions. In an array element access
e1[e2], expression e1 must be an access expression.
main ::= program
topdecs EOF
topdecs ::= top-level declarations
<empty>
topdec topdecs
topdec ::= top-level declaration
vardec ; global variable declaration
fundec function declaration
vardec ::= variable or parameter declaration
typ vardesc
vardesc ::= variable description
NAME variable
* vardesc pointer to
( vardesc ) parenthesized variable description
vardesc [ ] array of
vardesc [ int ] array of given size
fundec ::= function declaration
void NAME ( paramdecs ) block
typ NAME ( paramdecs ) block
paramdecs ::= comma-separated parameter list
<empty>
paramdecs1
paramdecs1 ::= non-empty parameter declaration list
vardec
vardec , paramdecs1
stmt ::=
if (expr) stmt if-statement
if (expr) stmt else stmt if-else statement
while (expr) stmt while-loop
expr ; expression as statement
return ; return
return expr ; return
block block statement
block ::= block statement
{ stmtordecseq }
stmtordecseq ::= statements and declarations
<empty> empty sequence
stmt stmtordecseq statement
vardec ; stmtordecseq local variable declaration
expr ::=
access access
access = expr assignment
const constant literal
NAME ( exprs ) function call
( expr ) parenthesized expression
& access address of
! expr logical negation
print expr print integer expression
expr + expr plus
expr - expr minus
expr * expr times
expr / expr quotient
expr % expr remainder
expr == expr equal to
expr != expr not equal to
expr > expr greater than
expr < expr less than
expr >= expr greater than or equal to
expr <= expr less than or equal to
expr && expr sequential and
expr || expr sequential or
access ::=
NAME local or global variable
* expr pointer dereferencing
access [ expr ] array indexing
exprs ::=
<empty> empty list of expressions
exprs1 non-empty list of expressions
exprs1 ::=
expr list with one expression
expr , exprs1 list with more than one expr
const ::= constant literals
CSTINT integer literal
- CSTINT negative integer
NULL the NULL literal
typ ::=
int integer type
char character type
Second attempt at a grammar, more suitable for a parser specification
---------------------------------------------------------------------
Important changes:
* We split statements into two kinds: stmtm and stmtu. In the
former, there are no if-statements at top-level without a matching
else-branch. In the latter there maybe if-statements without a
matching else-branch. This `dangling else' problem is discussed in
Mogensen's book. The implication is that
if (e1) if (e2) s1 else s2
is parsed the same way as
if (e1) { if (e2) s1 else s2 }
not as
if (e1) { if (e2) s1 } else s2
* The grammar would still be ambiguous and cause shift/reduce
conflicts, if we did not use precedence declarations on the
operators that can appear in an expression:
right = /* lowest precedence */
nonassoc PRINT
left ||
left &&
left == !=
nonassoc > < >= <=
left + -
left * / %
nonassoc ! &
nonassoc [ /* highest precedence */
Most of these are quite obvious and can be taken straight from a
textbook on C or Java. For instance, the assignment operator (=)
must bind less strongly than all of the logical and arithmetic
operators, the logical connectives must bind less strongly than the
comparisons, logical or (||) binds less strongly than logical and
(&&), the <, >, <=, >= comparisons must be non-associative, lowest
precedence, and so on.
The high precedence given to the left bracket ([) is necessary to
avoid ambiguity and parse conflicts in expressions and variable
declarations. For expressions it implies that
* the parsing of &a[2] is &(a[2]), not (&a)[2]
* the parsing of *a[2] is *(a[2]), not (*a)[2]
For variable declarations, it implies that
* the parsing of int *a[10] is int *(a[10]), not int (*a)[10]
The low precedence given to PRINT is necessary to avoid ambiguity
and parse conflicts in expressions with two-argument operators.
It implies that
* the parsing of print 2 + 5 is print (2 + 5), not (print 2) + 5
By introducing extra nonterminals and grammar rules, one can live
without the precedence declarations. Mogensen's book describes how
to do that too.
* Including access expressions into general expressions leads to
ambiguity and conflicts. We add a new nonterminal exprnotaccess to
explicitly represent non-access expressions.
* For non-access expressions, we distinguish atomic expressions (a
variable, a constant, or an expression within parentheses) from
non-atomic ones. This avoids ambiguity in pointer dereferencing
expressions, so that
* the parsing of *x*2 is (*x)*2, not *(x*2)
main ::=
topdecs EOF program
topdecs ::= top-level declarations
<empty>
topdec topdecs
topdec ::= top-level declaration
vardec ;
fundec
vardec ::= variable or parameter declaration
typ vardesc
vardesc ::= variable description
NAME variable
* vardesc pointer to
( vardesc ) parenthesized variable description
vardesc [ ] array of
vardesc [ int ] array of (with allocation)
fundec ::= function declaration
void NAME ( paramdecs ) block
typ NAME ( paramdecs ) block
paramdecs ::= comma-separated parameter list
<empty>
paramdecs1
paramdecs1 ::= non-empty parameter declaration list
vardec
vardec , paramdecs1
stmt ::= statement
stmtm without unmatched trailing if-else
stmtu with unmatched trailing if-else
stmtm ::= no unmatched trailing if-else
expr ; expression statement
return ; return
return expr ; return
block block statement
if (expr) stmtm else stmtm if-else statement
while (expr) stmtm while-statement
stmtu ::=
if (expr) stmtm else stmtu unmatched trailing if
if (expr) stmt unmatched if
while (expr) stmtu unmatched
block ::= block statement
{ stmtordecseq }
stmtordecseq ::=
<empty> empty sequence
stmt stmtordecseq statement
vardec ; stmtordecseq local variable declaration
expr ::=
access access expression
exprnotaccess other expression types
exprnotaccess ::=
atexprnotaccess atomic expression, not access
access = expr assignment
NAME ( exprs ) function call
! expr logical negation
print expr print expr's value
println print value and newline
expr + expr plus
expr - expr minus
expr * expr times
expr / expr quotient
expr % expr remainder
expr == expr equal to
expr != expr not equal to
expr > expr greater than
expr < expr less than
expr >= expr greater than or equal to
expr <= expr less than or equal to
expr && expr sequential and
expr || expr sequential or
atexprnotaccess ::=
const constant
( exprnotaccess ) parenthesis
& access address operator
access ::=
NAME variable
( access ) parenthesis
* access pointer dereferencing
* atexprnotaccess pointer dereferencing
access [ expr ] array element access
exprs ::=
<empty> empty list of expressions
exprs1 non-empty list of expressions
exprs1 ::=
expr list with one expression
expr , exprs1 list with more than one expr
const ::=
CSTINT non-negative integer constant
- CSTINT negative integer constant
null null constant
type ::=
int the int type
char the char type
Lexical matters: tokens and comments
------------------------------------
NAME: [`a`-`z``A`-`Z`][`a`-`z``A`-`Z``0`-`9`]*
except for the keywords, which are:
char else false if int null print println return true void while
CSTINT: [0-9]+
CSTBOOL: false | true
CSTSTRING: `"`...`"`
with C string escapes \a \b \t \n \v \f \r \" \\ \ddd \uxxxx
OP: + - * / % = == != < > <= >= && ||
There are two kinds of comments (not inside strings, not nested):
// comment extends to end of line
/* ... */ delimited comment