The Syntax Extension Myth

May 14, 2012 · Didier Verna

Here’s a little Monday Troll.

To my greatest disappointment, I discovered today that it is not possible to replace Lisp parentheses with, say, curly braces. What a shame. Hell, it’s not even possible to freely mix the two. Very naively, I had expected that

(set-macro-character #\{ (get-macro-character #\())
(set-macro-character #\} (get-macro-character #\)))

would suffice, but no. All implementations that I have tested seem to agree on this, although the error messages may differ. For instance, trying to evaluate {and} gives you an “unmatched close parenthesis” error, except for CMU-CL, which chooses to ignore it but then reports an end-of-file error. The unmatched close parenthesis, of course, is the closing curly brace. So what is going on here?

When an opening curly brace is read, the original left paren macro function is called. In SBCL for instance, this is sb-impl::read-list, which looks for a hardwired right parenthesis on the stream. Yuck. It doesn’t find one, but it finds my closing brace which triggers the “standalone” right paren behavior (spurious parenthesis alert). In passing, it also surprised me that sb-impl::read-list is not implemented in terms of read-delimited-list.

EDIT: as mentioned in several comments, we could use read-delimited-list to look for a closing curly brace, but even this won’t work completely. The problem is with dotted lists (see Pascal’s comment). SBCL hard-wires #\) in its dotted-list parsing procedures.
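To make the limitation concrete, here is a minimal sketch of the read-delimited-list approach. The helper names (make-brace-readtable, read-with-braces, brace-read-fails-p) are hypothetical and not part of the post; the braces are installed in a private copy of the standard readtable so the global one is left alone. Proper lists read fine, but a dotted list is expected to signal a reader error, since read-delimited-list has no provision for the consing dot.

```lisp
(defun make-brace-readtable ()
  "Return a copy of the standard readtable where { and } delimit lists.
Hypothetical helper, for illustration only."
  (let ((readtable (copy-readtable nil)))
    (set-macro-character
     #\{
     (lambda (stream char)
       (declare (ignore char))
       ;; RECURSIVE-P is T because this function runs inside a call to READ.
       (read-delimited-list #\} stream t))
     nil readtable)
    ;; Give } the macro function of ), so that it terminates tokens
    ;; and complains when it appears on its own.
    (set-macro-character #\} (get-macro-character #\) readtable)
                         nil readtable)
    readtable))

(defun read-with-braces (string)
  "Read one object from STRING using the brace readtable."
  (let ((*readtable* (make-brace-readtable)))
    (values (read-from-string string))))

(defun brace-read-fails-p (string)
  "Return T if reading STRING with the brace readtable signals an error."
  (handler-case (progn (read-with-braces string) nil)
    (error () t)))
```

With these definitions, (read-with-braces "{a {1 2 3} {} c}") should return (a (1 2 3) nil c), whereas (brace-read-fails-p "{a . b}") should return t.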

So it appears that macro characters are shaky at best. What we miss is a true concept of syntactic categories (Common Lisp character syntax types are close, but not quite there yet). In fact, TeX, with its notion of “catcodes” (category codes), seems to be the only language that gets this right. Ideally, any character with a status of list terminator should do as well as a right paren (the problem is only with closing, not opening).

Instead of hard-wiring the right parenthesis in the Lisp parser, a quick workaround would be to check whether the next character on the stream is a macro character and, if so, whether its macro function is the one originally associated with the right paren. If it is, the character should simply stand as a list terminator. This is actually an interesting idea I think: could the built-in macro functions become equivalent to actual category codes, and could we completely remove hard-wired characters in Lisp parsers?
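As a rough illustration of that workaround, here is a user-level sketch rather than a patch to any implementation's reader. The names *standard-close-function*, list-terminator-p, read-list-by-category and read-mixed are hypothetical. It assumes that get-macro-character returns the same (eq) function object for every character sharing a macro function, which the standard does not spell out. It also ignores dotted lists and comments placed right before the terminator, both of which would need lower-level parsing.

```lisp
(defparameter *standard-close-function* (get-macro-character #\) nil)
  "Macro function of the right paren in the standard readtable.")

(defun list-terminator-p (char)
  "True if CHAR currently has the standard right paren macro function.
Assumes macro functions can be compared with EQ."
  (eq (get-macro-character char) *standard-close-function*))

(defun read-list-by-category (stream char)
  "Read objects until a character acting as a list terminator shows up."
  (declare (ignore char))
  (loop with items = '()
        ;; Skip whitespace and look at the next character without consuming it.
        for next = (peek-char t stream t nil t)
        until (list-terminator-p next)
        do (push (read stream t nil t) items)
        finally (read-char stream t nil t) ; consume the terminator itself
                (return (nreverse items))))

(defun read-mixed (string)
  "Read one object from STRING, with ( ) and { } all acting as list delimiters."
  (let ((*readtable* (copy-readtable nil)))
    (set-macro-character #\( #'read-list-by-category)
    (set-macro-character #\{ #'read-list-by-category)
    (set-macro-character #\} (get-macro-character #\)))
    (values (read-from-string string))))
```

Under these assumptions, (read-mixed "{and}") should return (and). Note that any terminator closes any opener, so (read-mixed "(a {b c) d}") should read as (a (b c) d).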

Anyway, this whole story is a true scandal because it ruined an otherwise cool live demo of mine. So much for syntax extensibility.

Comments

From Pascal Bourguignon

Indeed, the main point is that CL doesn’t export facilities to read dotted-lists, only proper lists. And implementing dotted list reading is not even easy (you cannot just use peek-char and read, you need lower level parsing).

Nonetheless, syntax extensibility can be as easy as:

cl-user> (make-lisp-list-macro-character #\{)
#\}
cl-user> '{a {1 2 3} {} c . d}
(a (1 2 3) nil c . d)

See http://paste.lisp.org/display/129445.

PS: Thanks for that blog entry, I found a bug in my lisp reader 😉

From Eivind Midtgård

This is explained in CLTL2, just the way you found it to be, in the entry for set-syntax-from-char, pp. 541-542.

From Greg Pfeil

You can achieve your goal almost as simply, though:

(set-macro-character #\{
  (lambda (stream char) (read-delimited-list #\} stream t)))
(set-macro-character #\} (get-macro-character #\)))

The problem with your original attempt, IMO, is that if CL were defined in a way that allowed that, the result would be to allow (and} and {and) in addition to the expected {and}, which seems very non-intuitive to me.

If you want (} and {) to work, it’s possible, but it does (and should) require more work to do so.

That said, with Unicode, an implementation might be able to define the #\( macro function such that it matched the mirrored character in the Unicode DB, and then (set-macro-character #\{ (get-macro-character #\()) could work in the intuitive way, but I don’t know if that would be conforming, and it certainly wouldn’t be portable.

From DanM

I’m sorry you’re disappointed, but I think that your expectation is simply not in accord with the CL reader’s implementation model. You should be able to quite easily do what you wanted thus:

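(set-macro-character #\{
  (lambda (stream char)
    (declare (ignore char))
    ;; recursive-p is t because this runs inside a call to read
    (read-delimited-list #\} stream t)))
(set-macro-character #\}
  (lambda (stream char)
    (declare (ignore stream char))
    (error "An object cannot start with #\\}")))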

The second call serves two purposes: it causes the closing brace to be treated as something other than a constituent character when it appears alone, and (because the optional non-terminating-p argument to set-macro-character is allowed to default to nil) makes it terminate a symbol without becoming part of it.
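The effect of the optional non-terminating-p argument can be checked with a small sketch. The function name read-token-before-brace is hypothetical, and the readtable is a private copy of the standard one, so symbol names are upcased.

```lisp
(defun read-token-before-brace (non-terminating-p)
  "Read the first object of \"abc}def\" with } installed as a macro character."
  (let ((*readtable* (copy-readtable nil)))
    (set-macro-character
     #\}
     (lambda (stream char)
       (declare (ignore stream char))
       (error "An object cannot start with }"))
     non-terminating-p)
    (values (read-from-string "abc}def"))))
```

With non-terminating-p left at nil, the brace ends the token, so (read-token-before-brace nil) should return the symbol named "ABC". With t, the brace is treated as a constituent in the middle of a token, and the result should be the symbol named "ABC}DEF".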

So you can do what you want easily with no additional infrastructure.

This is quite elegant and requires nothing further.

From me

@Pascal: exactly. There are 2 other places in SBCL where #\) is hard-wired, and these have to do with reading dotted lists. Thanks for the paste BTW!

@Eivind: thank you for the pointer.

@Greg, Cosman, Dan: yes, you can use read-delimited-list, but this will still not function properly in all cases. See Pascal’s comment (I should have mentioned this in my original post).

From Jerry Boetje

Yes, CLforJava and I are still around and working. When I saw your post, I went to see how CLforJava handles your test. Of course, it had a similar result. However, after reading the rest of your text, I remembered that we built a character syntax type system into the base of the reader. Might be an interesting job for a student…

From Marc Espie

TeX gets it right, except it does not have enough introspection.

It’s darn nigh impossible to get it to regurgitate text and fudge with the catcodes, leading to appendix D and the very obscure \afterassignment hack.