Heads up: the reposturgeon is mutating!

A few days ago I released reposurgeon 2.43. Since then I’ve been finishing up yet another conversion of an ancient repository – groff, this time, from CVS to git at the maintainer’s request. In the process, some ugly features and irregularities in the reposurgeon command language annoyed me enough that I began fixing them.

This, then, is a reposurgeon 3.0 release warning. If you’ve been using 2.43 or earlier versions, be aware that there are already significant non-backwards-compatible changes to the language in the repository head version, and there may be more before I ship. Explanation follows, embedded in more general thoughts about the art of language design.

First, a justification. Most computer languages (including domain-specific languages like reposurgeon’s) incur high costs when they change incompatibly. It’s a bad thing when a program breaks halfway through its expected lifetime – or worse, when its behavior changes in subtle ways without visibly breaking. Responsible language maintainers don’t make such changes at all if they can help it, and never do so casually.

But reposurgeon has an unusual usage pattern. Lift procedures written in reposurgeon are generally written once, used for a repository conversion, then discarded. This means that users are exposed to incompatibility problems only if they change versions while a conversion is in progress. This is usually easy to avoid, and when it can’t be avoided the lift recipes are generally short and relatively easy to verify.

Thus, the costs from reposurgeon compatibility breakage are unusually low, and I have correspondingly more freedom to experiment than most language designers. Still, conservatism about breaking compatibility sometimes does deter me, because I don’t want to casually obsolesce the knowledge of reposurgeon in my users’ heads. Making them re-learn the language at every release would be rude and obtrusive of me.

That conservatism has a downside beyond just slowing the evolution of the language, however. It can sometimes lead to design decisions, made to preserve compatibility, that produce warts on the language and that you come to regret later. Over time these pile up as a kind of technical debt that eventually has to be discharged. That discharge is what’s happening to reposurgeon now.

Now I’ll stop speaking abstractly and point at some actual ugly spots. The early design of reposurgeon’s language was strongly influenced by the sorts of things you can easily do in Python’s Cmd class for building line-oriented interpreters. What Cmd wants you to do is write command handler methods that are chosen based on the first whitespace-separated token on the line, and get the rest of the line as an argument. Thus, when reposurgeon interprets this:

read foobar.fi random extra text

what actually happens is that it’s turned into a method call to

do_read("foobar.fi random extra text")

and how you parse that text input in do_read() is up to you. Which is why, in the original reposurgeon design, I used the simplest possible syntax. If you said

read foobar.svn
delete /nasty content/ obliterate
write foobar2.fi

this was interpreted as “read and parse the Subversion stream dump in the file foobar.svn, delete every commit for which the change comment includes the string ‘nasty content’, then write out the resulting history as a fast-import stream to the file foobar2.fi”.

Looks innocent enough, yes? But there’s a problem lurking here. I first bumped into it when I wanted to specify an optional behavior for stream writes. In some circumstances you want some extra metainformation appended to each change comment as it goes out, a fossil identification (like, say, a Subversion commit number) retained from the source version control system. The obvious syntax for this would look like this:

write fossilize foobar2.fi

or, possibly, with the ‘fossilize’ command modifier after the filename rather than before it. But there’s a problem; “write” by itself on a line means “stream the currently selected history to standard output”, just as “read” means “read a history dump from standard input”. So, if I write

write fossilize

what do I mean? Is this “write a fossilized stream to standard output”, or “write an unfossilized stream to the file ‘fossilize'”? Ugh…
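A minimal sketch of that ambiguity, using Python’s standard cmd.Cmd class. The ToyInterpreter class and its handlers are hypothetical stand-ins, not reposurgeon’s actual code; the point is only that each handler receives one undifferentiated string and has no way to tell a modifier from a filename.

```python
import cmd


class ToyInterpreter(cmd.Cmd):
    """Hypothetical stand-in for a Cmd-based interpreter (not reposurgeon)."""

    def __init__(self):
        super().__init__()
        self.calls = []  # record (command, what the handler understood)

    def do_read(self, line):
        # Cmd passes everything after the command word as a single string.
        self.calls.append(("read", line))

    def do_write(self, line):
        # With purely positional tokens, a lone word looks like a filename,
        # so "fossilize" is taken to be the output file.
        tokens = line.split()
        target = tokens[0] if tokens else "<stdout>"
        self.calls.append(("write", target))


interp = ToyInterpreter()
interp.onecmd("read foobar.fi random extra text")
interp.onecmd("write fossilize")
print(interp.calls)
```

Nothing in do_write() can recover the intent behind the second command; the information simply is not in the token stream.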

What the universe was trying to tell me is that my Cmd-friendly token-oriented syntax wasn’t rich enough for my semantic domain. What I needed to do was take the complexity hit in my command language parser to allow it to look at this

write --fossilize foobar2.fi

and say “aha, --fossilize is led with two dashes so it’s an option rather than a command argument”. The handler would be called more or less like this:

do_write("foobar2.fi", options=["--fossilize"])

I chose at the time not to do this because I wanted to keep the implementation simplicity of just treating whitespace-separated tokens on the command line as positional arguments. What I did instead was introduce a “set” command (and a dual “clear” command) to manipulate global option flags. So the fossilized write came to look like this:

set fossilize
write foobar2.fi
clear fossilize

That was my first mistake. Those of you with experience at this sort of design will readily anticipate what came of opening this door – an ugly profusion of global option flags. By the time I shipped 2.43 there were seven of them.

What’s wrong with this is that global options don’t naturally have the same lifetime as the operations they’re modifying. You can get unexpected behavior in later operations due to persistent global state. That’s bad design; it’s a wart on the language.
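The lifetime mismatch can be sketched in a few lines. This is a hypothetical model of set/clear flags, not reposurgeon’s code, and the filenames foobar3.fi and foobar4.fi are invented for illustration.

```python
# Hypothetical model of global option flags; all names are illustrative.
flags = set()
log = []


def do_set(name):
    flags.add(name)


def do_clear(name):
    flags.discard(name)


def do_write(target):
    # The write consults whatever global state happens to be set right now.
    mode = "fossilized" if "fossilize" in flags else "plain"
    log.append((target, mode))


do_set("fossilize")
do_write("foobar2.fi")
# Forgetting "clear fossilize" here silently changes the next write too.
do_write("foobar3.fi")
do_clear("fossilize")
do_write("foobar4.fi")
print(log)
```

The second write is fossilized only because of a command issued two lines earlier, which is exactly the kind of action at a distance a per-command option avoids.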

Eventually I ended up having to write my own command parser anyway, for a different reason. There’s a “list” command in the language that generates summary listings of events in a history. I needed to be able to save reports from it to a file for later inspection. But I ran into the modifier-syntax problem again. How is the do_list() handler supposed to know which tokens in the line passed to it are target filenames?

Command shells like reposurgeon have faced this problem before. Nobody has ever improved on the Unix solution to the problem, which is to have an output redirection syntax. Here’s a reminder of how that works:

ls foo          # Give me a directory listing of foo on standard output
ls >bar         # Send a listing of the current directory to file bar 
ls foo >bar     # Send a listing of foo to the file bar
ls >bar foo     # same as above - ls never sees the ">bar"
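The convention above can be sketched as a small pre-parser that strips redirection tokens before the command handler ever sees the line. The helper name split_redirects is hypothetical, and the sketch assumes whitespace-separated tokens with the filename attached directly to the operator (no quoting, no “>>”).

```python
def split_redirects(line):
    """Separate '<file' and '>file' tokens from the remaining arguments.

    Hypothetical helper: returns (arguments, input file, output file).
    """
    infile = outfile = None
    rest = []
    for token in line.split():
        if token.startswith("<") and len(token) > 1:
            infile = token[1:]
        elif token.startswith(">") and len(token) > 1:
            outfile = token[1:]
        else:
            rest.append(token)
    return rest, infile, outfile


print(split_redirects("foo >bar"))
print(split_redirects(">bar foo"))
```

Both calls yield the same result, mirroring the last two ls examples: the position of the redirection token does not matter because the handler never sees it.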

In reposurgeon-2.9 I bit the bullet and implemented redirection parsing in a general way. I found almost all the commands that could be described as report generators and used my new parser to make them support it. A few commands that took file inputs got re-jiggered to use “<” instead.

For example, there’s an “authors read” command that reads text files mapping local Subversion- and CVS-style usernames to DVCS-style IDs. Before 2.9, the command to apply an author map looked like this:

authors read foo.map

That changed to

authors read <foo.map

But notice that I said “almost all”. To be completely consistent, the expected syntax of my first example should have changed to look like this:

read <foobar.svn
delete /nasty content/ obliterate
write >foobar2.fi

That is, read and write should have changed to always require redirection rather than ever taking filenames as arguments. But when I got to that point, I retained I/O filename arguments for those commands only, also supporting the new syntax but not decommissioning the old.

That was my second mistake. Technical debt piling up…but, you see, I thought I was being kind to my users. The other commands I had changed to require redirection were rarely used; “read” and “write”, on the other hand, pretty much have to occur in every lift script. Breaking my users’ mental model of them seemed like the single most disruptive change I could possibly make. Put plainly, I chickened out.

Now we fast-forward to 2.42 and the groff conversion, during which the technical debt finally piled high enough to topple over.

There’s a reposurgeon command “unite” that’s used to merge multiple repositories into one. I won’t go into the full algorithm it uses except to note that if you give it two repositories that are linear, and the root of one of them was committed later than the tip of the other, the obvious graft occurs – the later root commit is made the child of the earlier tip commit. I needed this during the groff conversion.

Every time you do a unite you have a namespace-management problem. The repositories you are gluing together may have collisions in their branch and tag names – in fact they almost certainly have one collision, on the default branch name “master”. The unite primitive needs to do some disambiguation.

The policy it had before 2.43 was very simple; every tag and branch name gets either prefixed or suffixed with the name of the repo it came from. Thus, if you merge two repos named “early” and “late”, you end up with two tags named “master-early” and “master-late”.

This turns out to be dumb and heavyhanded when applied to two linear repos with “master” as the only collision. The natural thing to do in that case is to leave all the (non-colliding) names alone, rename the early tip branch to “early-master” and leave the late repo’s “master” branch named “master”.
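The two naming policies can be compared with a rough sketch. The function names and the tag names v1.0 and v2.0 are hypothetical, and the second function assumes exactly the case described: two linear repos whose only colliding name is “master”.

```python
def suffix_everything(repos):
    """Older policy as described: every name gets its repo name appended."""
    return {repo: [f"{name}-{repo}" for name in names]
            for repo, names in repos.items()}


def natural(early, late, repos):
    """Sketch of the 'natural' policy, assuming 'master' is the only collision.

    Only the early repo's master is renamed; everything else is left alone.
    """
    renamed = {late: list(repos[late])}
    renamed[early] = [f"{early}-{name}" if name == "master" else name
                      for name in repos[early]]
    return renamed


repos = {"early": ["master", "v1.0"], "late": ["master", "v2.0"]}
print(suffix_everything(repos))
print(natural("early", "late", repos))
```

Under the first policy every name is disturbed; under the second only one is, which is why it feels like the obvious behavior for a simple graft.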

I decided I wanted to implement this as a policy option for unite – and then ran smack dab into the modifier-syntax problem again. Here’s what a unite command looks like (actual example from recent work):

unite groff-old.fi groff-new.fi

Aarrgh! Redirection sequence won’t save me this time. Any token I could put in that line as a policy switch would look like a third repository name. Dammit, I need a real modifier syntax and I need it now.

After reflecting on the matter, I once again copied Unix tradition and added a new syntax rule: tokens beginning with “--” are extracted from the command line and put in a separate option set also available to the command handler. Because why invent a new syntactic style when your audience already knows one that will suit? It’s good interface engineering to re-use classic notations.

I mentioned near the beginning of this rant that this is what I should have done to the parser much sooner. Now my new unite policy can be invoked something like this:

unite --natural groff-old.fi groff-new.fi
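A minimal sketch of that extraction rule, assuming simple whitespace tokenization. The helper name extract_options is hypothetical, and reposurgeon’s real parser may well differ in detail.

```python
def extract_options(line):
    """Split argument text into positional arguments and an option set.

    Hypothetical helper: any token starting with '--' is treated as an
    option; everything else stays positional, in its original order.
    """
    args, options = [], set()
    for token in line.split():
        if token.startswith("--"):
            options.add(token)
        else:
            args.append(token)
    return args, options


args, options = extract_options("--natural groff-old.fi groff-new.fi")
print(args, options)
```

The handler now gets two repository names and a separate policy switch, so nothing can be mistaken for a third repository.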

OK, so I implemented option extraction in my command parser. Then it hit me: if I’m prepared to accept a compatibility break, I can get rid of most or all of those ugly global flags – I can turn them into options for the read and write commands. Cue angelic choirs singing hosannahs…

Momentary aside: This is not exceptional. This is what designing domain-specific languages is like all the time. You run into these same sorts of tradeoffs over and over again. The interplay between domain semantics and expressive syntax, the anxieties about breaking compatibility, even the subtle sweetness of finding creative ways to re-use classic tropes from previous DSLs…I love this stuff. This is my absolute favorite kind of design problem.

So, I gathered up my shovels and rakes and other implements of destruction and went off to abolish global flags, re-tool the read & write syntax, and otherwise strive valiantly for truth, justice, and the American way. And that’s when I received my just comeuppance. I collided head-on with a kluge I had put in place to preserve the old, pre-redirection syntax of read and write.

Since 2.9 the code had supported two different syntaxes:

read foobar.fi       # Old
read <foobar.fi      # New

The problem was that the easiest way to do this had been to look at the argument line before the redirection parser sees it, and prepend “<” if it doesn’t already begin with one. But that means that if I type “read --fossilize foobar.fi” the read handler will get this: “<--fossilize foobar.fi”. With the “<” in entirely the wrong place!
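The failure is easy to reproduce with a sketch of the shim as described. The function name legacy_read_shim is hypothetical; the post does not show the actual code.

```python
def legacy_read_shim(line):
    """Prepend '<' so an old-style 'read foobar.fi' looks like a redirect.

    Sketch of the kluge as described in the text, not the real code.
    """
    return line if line.startswith("<") else "<" + line


print(legacy_read_shim("foobar.fi"))              # old syntax, rewritten
print(legacy_read_shim("<foobar.fi"))             # new syntax, untouched
print(legacy_read_shim("--fossilize foobar.fi"))  # option swallowed into the redirect
```

The shim works only as long as the filename is the first thing on the line, an assumption that option syntax breaks immediately.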

Friends, when this sort of thing happens to you, here is what you will do if you are foolish. You will compound your kluge with another kluge, groveling through the string with some kind of rule like “insert

29 comments

  1. I’ve no comment, or opinion on reposurgeon, and probably will never need it. (I started with bzr, and still use it, the tools work well for me.)

    However, I do want to say, this sort of posting is interesting regardless.

    Are there resources that you would recommend for learning how to deal with these sort of design decisions, or do you just have to learn by making mistakes?

    1. >Are there resources that you would recommend for learning how to deal with these sort of design decisions, or do you just have to learn by making mistakes?

      There’s that, and studying examples of good design. Nobody teaches this stuff effectively that I know of.

  2. Personally, I always use the following convention for any command-line type syntax and it has served me well :

    cmd [subcmd] [-opt value*]*

    Where an ‘option’ is always prefixed by ‘-‘ and may be abbreviated (eg. -f, -fo are both valid abbreviations of -foo) and may have zero or more white-space separated values. So, the following command/argument combinations are valid :

    foo
    foo bar (where ‘bar’ is a valid sub-command)
    foo -bar
    foo bar -baz
    foo -bar “tom”
    foo -bar “tom” “dick” -baz
    foo -bar tom -baz dick harry
    etc…

    and the following are invalid :

    foo bar baz
    foo bar (when ‘bar’ is a value and not a valid sub-command)

    I tend to think that more than one sub-command is never necessary.

    In my cases, for file, directory, link, etc arguments, I usually just use -f (-file) as the option and let the method/parser function interpret the value type based on the command context.

    BTW, in your posting above, “read foo” for a directory read would not be valid, regardless of how tempting it might be to have a shortened syntax; for example, the valid alternatives from my point of view would be “read -f foo”, “read -dir foo”, “read dir foo”, etc…

    Just a thought…feel free to shoot down in flames…

  3. Well, you usually use “cat file”, and not “cat <file”, so I’d leave “read file” alone.

    Though other changes (getting rid of global variables and state, adding command options) are a very good idea. BTW, do you implement “--”?

  4. @esr: Have you considered using argparse internally to parse your commands? It doesn’t do redirection but otherwise is a very powerful command line parser with nice option definition syntax. (Assuming reposurgeon is still a Python program.)

    1. >Have you considered using argparse internally to parse your commands?

      Not for a second. I’ve used argparse and don’t like it. I find it heavy and overcomplicated for what it does.

  5. There’s something to be said for splitting all these commands into separate executables and just letting people use their shell of choice. Not sure if it’s practical for this particular case, but that would be my first impulse instead of creating my own CLI.

    1. >There’s something to be said for splitting all these commands into separate executables and just letting people use their shell of choice. Not sure if it’s practical for this particular case, but that would be my first impulse instead of creating my own CLI.

      But it’s a complete non-starter. They have to share very complex data structures representing version control histories.

  6. @esr:

    > Not for a second. I’ve used argparse and don’t like it. I find it heavy and overcomplicated for what it does.

    Seconded. I’ve used it a few times when I had something requiring **lots** of arguments, and would say that in that scenario, it’s somewhat better than rolling your own. But that is not the use case here — you would need separate argparse parsers for every command, and most of those probably don’t have that many arguments.

    I find that if I only need a couple of arguments, and I don’t use it very often, I have to reach for the manual every time.

  7. >But it’s a complete non-starter. They have to share very complex data structures representing version control histories.

    You could have the executables output valid Python function-call syntax to their standard outputs, fork and exec the user’s $SHELL and use os.pipe, os.dup2 and friends to pipe its output back to the parent process. A named pipe would be better, since that way you could run other shell commands without confusing Reposurgeon. Then you’d have the parent process drop into a REPL using one of these methods so it can execute Python function calls generated by the child process.

    I guess it depends on how much work you’d be willing to do to avoid writing your own interactive shell.

  8. @Michael:
    > Are there resources that you would recommend for learning how to deal with these sort of design decisions, or do you just have to learn by making mistakes?

    Well, there is “The Architecture of Open Source Applications”, volume I and II, at http://aosabook.org

  9. @Patrick Maupin and @esr: And ditto for its predecessor, optparse. Like Patrick said, even for a situation where you need to handle dozens of arguments (in which case, there’s probably a better way), optparse/argparse is only marginally better than rolling your own.

    In the case of a DSL, though, argparse is almost definitely the wrong way to do that.

    @Max E

    > You could have the executables output valid Python function-call syntax to their standard outputs, fork and exec the user’s $SHELL and use os.pipe, os.dup2 and friends to pipe its output back to the parent process. A named pipe would be better, since that way you could run other shell commands without confusing Reposurgeon. Then you’d have the parent process drop into a REPL using one of these methods so it can execute Python function calls generated by the child process.

    Methinks the “cure” is worse than the “disease.”

  10. > But it’s a complete non-starter. They have to share very complex data structures representing version control histories.

    What about making it a module that people are expected to write a python script to utilize?

    1. >What about making it a module that people are expected to write a python script to utilize?

      And this would gain…what?

      We’re not talking about a domain in which the things programming languages are good at expressing are very important.

      Would people please actually study what the tool is doing before making these sorts of suggestions?

  11. @Random832:

    > What about making it a module that people are expected to write a python script to utilize?

    Although I haven’t looked at the source, I think I can fairly confidently say that that is available, and what is under discussion is the UI layer…

  12. Out of curiosity Eric, what was your rationale for a DSL over a more generalized extension language such as lua or scheme. Portability, avoiding overkill, or just because you felt like it? Language design and parsing is one of my favorite things too so I can understand the latter.

    1. >Out of curiosity Eric, what was your rationale for a DSL over a more generalized extension language such as lua or scheme. Portability, avoiding overkill, or just because you felt like it?

      Python and its Cmd class were the tool I had nearest to hand that I was confident could handle the job.

      I don’t think either lua or scheme would have fit as well. Python’s native ontology of types is well suited to the problem domain; in lua or scheme I would have had to build an equivalent layer more or less from scratch.

  13. A co-worker and friend who had the title “technical architect” used the term “stack” (as a verb) for reusing an analogous or closely matching pattern in a new context. I found the word potentially confusing, but it’s saved by two things:
    1) I’ve never heard another verb coined for the practice
    2) other uses of “stack” in CS / IT are all nouns, not verbs.

    He would apply it when, say, nomenclature for software would be reflected in nomenclature for hardware (or virtual machines); this was “stacking” the naming standard.

    In his terms, you are “stacking” the reposturgeon syntax on typical Unix command and Unix shell syntax.

    I say this mainly to point out that we need a term for this practice, and to suggest that we choose one. In honor of my friend, I suggest stacking, despite my mild reservations about that particular choice.

  14. I want to add one more remark about this concept of “stacking” — the idea goes beyond modelling on, or imitating another convention: the point is that the imitation or resemblance created between a stacked thing (a name, a syntax convention, an algorithm) and the original is sufficiently well-done that a person with prior experience with the original inspiration would, upon learning the stacking, find it obvious and intuitive what the correspondence means in the new context. This goes beyond imitation or resemblance in that it makes a stronger claim to the semantic significance of the resemblance. It’s a claim to meaningful and helpful resemblance between separate things.

    My friend saw this as a good or best practice in architecture, and his name for it is the only one I’ve ever seen for this.

  15. OT but Chrome started blocking your site this evening, claiming you’re distributing a worm:

    What happened when Google visited this site?
    Of the 241 pages we tested on the site over the past 90 days, 0 page(s) resulted in malicious software being downloaded and installed without user consent. The last time Google visited this site was on 2013-12-09, and the last time suspicious content was found on this site was on 2013-11-19.
    Malicious software includes 4 trojan(s), 1 worm(s).

    This site was hosted on 1 network(s) including AS36850 (UNC-CH).

    1. >OT but Chrome started blocking your site this evening, claiming you’re distributing a worm:

      ibiblio has a problem, obviously. They’ll fix it.

  16. What happened when Google visited this site [ibiblio.org]?

    Of the 307 pages we tested on the site over the past 90 days, 67 page(s) resulted in malicious software being downloaded and installed without user consent. The last time Google visited this site was on 2013-12-10, and the last time suspicious content was found on this site was on 2013-12-10.
    Malicious software includes 4 trojan(s), 1 worm(s).
    Malicious software is hosted on 1 domain(s), including yestblock.com/.
    1 domain(s) appear to be functioning as intermediaries for distributing malware to visitors of this site, including operaget.pp.ua/.

  17. Actually the problem was probably on Google’s end, since they’ve now revised history:

    What happened when Google visited this site?
    Of the 20 pages we tested on the site over the past 90 days, 0 page(s) resulted in malicious software being downloaded and installed without user consent. The last time Google visited this site was on 2013-12-15, and suspicious content was never found on this site within the past 90 days.

    http://safebrowsing.clients.google.com/safebrowsing/diagnostic?site=http%3A%2F%2Fesr.ibiblio.org%2F%3Fp%3D5135

  18. @Patrick

    I’m pretty sure I saw the malware warnings for the “esr” subdomain, before. I don’t have that tab open anymore, but I’ve never intentionally gone to the top-level “ibiblio” site, and when Chrome recommended I consult the safebrowsing site I’m sure it included “esr” in the query string.

    So, to clarify my statement, Google has determined that the “esr” subdomain was not the source of malware, even though they had lumped it in with the rest of “ibiblio” before. A very small revision of history, to be sure.
