2006-06-19

Futbol en masse

At the summer camp I attended as a yoot in the 60s, there was a game known as Mass Soccer. The chief feature of this game of games was that any number could play on either side. Kids being what they are, this meant that the entire center of the field was occupied by a permanent clot of forwards of all shapes and sizes, with a relative handful of backs whose chief task was to assist the goalies when a ball occasionally escaped from this huge scrum.

To keep things interesting, any number of balls could be in play at once. To keep things fair, each side was entitled to as many goalies as there were balls — when a ball went out of play, one of the goalies would do the usual thing to bring it back in while the game raged on around him. To keep things safe, the footwear rules were strictly enforced: no shoes allowed, socks required.

It was a hell of a lot of fun.

2006-06-15

TagSoup 1.0 Final released!

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
TagSoup is free and Open Source software, licensed under the Academic Free License version 3.0, a cleaned-up and patent-safe BSD-style license which allows proprietary re-use. It's also licensed under the GNU GPL version 2.0, since unfortunately the GPL and the AFL are incompatible. You can choose to license TagSoup from me under either the GPL or the AFL.
This release represents the end of my current plans for TagSoup. I will continue to fix bugs, but it now does everything that I foresaw back in 2002 when I started this project, and a great deal more. Thanks to everyone on the tagsoup-friends mailing list for their efforts.

2006-05-26

RELAX NG rant

I've added a slide presentation/rant on RELAX NG to my home page.

2006-05-17

Plagiarism

This blog is about recycled knowledge. That means there are often facts or ideas in it that I remember, but don't remember the source of. If you read this and recognize your own ideas, please let me know and I'll fix it up.

Sometimes I deliberately don't say where an idea comes from or mention people in a "background" sort of way, because I don't know if they want their names attached to something they wrote in a more private space than the Web five or ten years ago. If you are one of those, and you do want credit, again let me know.

2006-05-10

No cross, no crown

Someone used the phrase "No cross, no crown" on a mailing list, and explained it as meaning "Don't discuss religion or politics". I was fairly sympathetic with the intent, but unfamiliar with this use of the phrase: I had always understood it to mean "If you don't take pains you won't achieve anything", and to be a specifically Christian metaphor: "No earthly cross of metaphorical crucifixion, no heavenly crown of sainthood." I decided to look into the question.

I quickly checked the first 500 Google references to the phrase, and all of them except three clearly referred to the sense I already knew, drawn from all over Christianity, mostly Catholic and Quaker, but also Episcopal, AMEZ, Orthodox, and even Rosicrucian. The Christian Scientist symbol of a cross surmounted by a crown probably alludes to the saying as well. The saying is of course also found in purely secular contexts, with the same sense.

Two of the three exceptions are basically accidental: a song "Ojo por Ojo", which says "And in that place there is no cross / no crown, no sacred ground / all is done and left unsaid"; and a page denying that the Mormon Temple is a Christian edifice, enumerating the Christian symbols that it does not have (at least on the outside): "There is no cross, no crown, no alpha or omega, no icthys, no lion, no lamb, nor any other recognizable historical Christian symbol."

The final exception appears in a speech on Voltaire by Robert Ingersoll, the 19th-century agnostic, who is clearly using the expression in the mode of parody; he associates it with King James I's maxim "No bishop, no King", by which the King meant that if Presbyterianism came to dominate England as well as Scotland, he would swiftly find himself either out of a job or a tolerated figurehead.

The Quaker references clearly allude to a book of that name by William Penn, in which he says, "No pain, no palm; no thorns, no throne; no gall, no glory; no cross, no crown." The modern phrases "No pain, no gain" and "No guts, no glory" are clearly reminiscences of this. I also found "No pruning, no grapes; no grinding mill, no flour; no battle, no victory; no Cross, no Crown!" and "No laming, no naming, no struggle, no Promised Land; no cross, no crown" in the works of others.

2006-05-09

Changing names

The names of Unicode characters, once published, can't ever be changed again, not even when they are obviously wrong. In this case, stability is considered to trump correctness.
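For the curious, here is a small sketch in Python that looks up two names generally acknowledged to be wrong and still on the books. It relies on the standard unicodedata module; the helper name official_name is my own invention. U+FE18 has BRACKET misspelled, and U+01A2 is called OI although the letter it encodes is usually known as gha.

```python
import unicodedata

def official_name(code_point):
    """Return the published (and therefore frozen) name of a code point."""
    return unicodedata.name(chr(code_point))

# Two names that are widely regarded as mistakes but cannot be changed.
for cp in (0xFE18, 0x01A2):
    print("U+%04X %s" % (cp, official_name(cp)))
```

The first line printed ends in BRAKCET, typo and all.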

A participant in the Unicode development process once complained: "If biologists had insisted that names once assigned could not be changed because of advances in knowledge, or even to correct errors, then surely the system would have broken down centuries ago."

But in fact, the international Linnaean names of plants and animals are not changed for either of those reasons, nor for any other reason whatsoever: though we now know that Basilosaurus is a proto-whale and not any sort of reptile, Basilosaurus it will remain forever.

The only thing that can happen in Linnaean nomenclature is the recognition that two names are synonymous. In that case, there is a question which shall be the preferred name, and normally it is the first name to be published, but exceptions sometimes occur. Thus when the dinosaurs Brontosaurus and Apatosaurus were found to be the same, Apatosaurus was chosen as the preferred name because it was published first; however, this is not properly to be described as "changing the name of Brontosaurus to Apatosaurus". Brontosaurus is a perfectly good name and may still be used even though it is dispreferred.

When are later names preferred to earlier ones? Usually when the earlier name has long been forgotten, and the later name is widely used in the scientific literature.

And per se and

The name of the & character, ampersand, is short for and per se and, meaning and by itself and. People used to recite it at the end of the alphabet. About that much there's no doubt.

But of the two ands in that phrase, which one designates the ampersand? Is it and per se & or & per se and? It seemed clear to me that the former is the correct reading, so I did a little desultory research.

Most sources say the derivation is & per se and, but the story of reciting the alphabet is firmly established and I don't see how and, the conjunction, could possibly appear at the end. The Morris Dictionary of Word and Phrase Origins takes my point of view.

An alternative adopted by the American Heritage and Merriam-Webster dictionaries is that the words and per se and are to be construed as & by itself [means] 'and', but that seems far more strained to me than the natural x, y, z, and per se &.

So I suppose you can say what you like.

Zanzibar!

It was a dark and stormy night. I stalked my enemy through the tall grass. I saw the flash of his muzzle as his shot went whistling over my head. I fired! I killed him!

I walked to the nearest town. Casually smoking a cigarette, I entered the nearest bar.

"I have killed a man!", said I.

"His name?" demanded a tall, dark, and handsome stranger at the other end of the bar.

"His name? His name was Zanzibar!"

"Zanzibar! He was my brother! We must meet."

It was a dark and stormy night....

2006-05-05

Pity

"But this is terrible!" cried Frodo. "Far worse than the worst that I imagined from your hints and warnings. O Gandalf, best of friends, what am I to do? For now I am really afraid. What am I to do? What a pity that Bilbo did not stab that vile creature, when he had a chance!"

"Pity? It was Pity that stayed his hand. Pity, and Mercy: not to strike without need."

     --J.R.R. Tolkien

2006-04-24

The War (after Simonides)

This is a villanelle, a very special verse form that I have taken some liberties with. I had always wanted to write one, but I never could come up with a couplet strong enough to support all the required repetitions. It finally dawned on me that the couplet didn't have to be original with me, if I was clever enough about it.

Even if you don't click on all the links, be sure to mouse over them: they provide a first-level commentary on the poem.

"Go and tell the Spartans, passerby,"
   The man of Keos sings in lines that soar,
"That here, obedient to their laws, we lie."
The king of Lakedaimon will not fly,
   Though "Kill them all!" the hordes of Persia roar:
Go and tell the Spartans, passerby.
The news is carried of their terse goodbye
   From Hot Gates to far Atlantic shore
That there, obedient to their laws, they lie.
The law of nations ready to defy,
   Atlantis rising plots aggressive war --
Go and tell the Persians, passerby.
The Archon smiles, and smiles: real men don't cry.
   His friends sell not reality but lore
As here, obedient to his will, they lie.
Boxed in flags, the dead all verse deny.
   They cannot serve their country any more.
"Go and tell our people, passerby,
That here, obedient to their laws, we lie."

2006-04-11

Earth Day

This lyric was circulating around the time of the first Earth Day on April 22, 1970. George Carlin wrote the original version; this one's been modified by what Pete Seeger called the folk process. Sing it slowly, maestoso, and with great irony.

Oh beautiful for smoggy skies,
Insecticided grain,
For garbage mountain majesties
Above the asphalt plain:
America, America,
Man spreads his trash on thee,
And makes a mess with filthiness
From sea to oily sea.

(Original lyrics; melody.)

2006-03-29

Celebes Kalossi 2.0

I've decided it's time to post about my object-oriented programming model, Celebes Kalossi, again. All previous statements are inoperative, so you don't have to look back at my earlier postings. This posting will be mostly about terminology.

In CK, there are classes. A class contains declarations of state variables (aka instance variables, fields, data members) and both declarations and definitions of methods. State variables are only accessible from within the class: they are all private in Java terminology.

A declaration of a method specifies the method's name and its signature; that is, the type of its return value and the names and types of its arguments. In the model, no two methods in a class can have the same name; an actual implementation might provide Java-style method overloading, since overloading is resolved at compile time and is basically convenient syntactic sugar.

A definition of a method specifies everything the corresponding declaration does, but also includes the code of the method. If a class contains a definition of a method, it has no need to contain its declaration too.

A method may be public, private, or neither; the third type will be called standard methods here. A public method can be called from anywhere, and can be invoked on any object. A private method cannot be invoked outside the class in which it is defined, so there is no point in declaring one. Basically, it's just a subroutine. The difference between standard and private methods will be explained in another posting.

Standard and private methods can only be invoked on the self (this in Java) object, implicitly or explicitly. The most important rule of CK is that you cannot invoke a method on self that is not declared (not necessarily defined) in the current class.
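A very rough analogy in Python may help fix the terminology, though it is only an analogy: Python enforces none of this at compile time, and every name here (Counter, ByTwo, advance, and so on) is made up for illustration. I am assuming that an abstract method can stand in for a declaration, a double-underscore name for a private member, and a single-underscore name for a standard method. Supplying the definition from a subclass is also my assumption, since nothing above says where definitions come from.

```python
from abc import ABC, abstractmethod

class Counter(ABC):
    def __init__(self):
        # State variable: name-mangled, so in practice reachable only in this class.
        self.__count = 0

    @abstractmethod
    def _step(self) -> int:
        """Declaration only: a name and a signature, no code (a 'standard' method)."""

    def __bump(self, amount: int) -> None:
        # Private method: defined here, used only here. Basically a subroutine.
        self.__count += amount

    def advance(self) -> int:
        # Public method: callable from anywhere, on any Counter.
        # It invokes _step on self, which is fine because _step is
        # declared (though not defined) in this class.
        self.__bump(self._step())
        return self.__count

class ByTwo(Counter):
    def _step(self) -> int:
        # A definition: everything the declaration says, plus the code.
        return 2

c = ByTwo()
print(c.advance(), c.advance())
```

The point to notice is that advance never calls anything on self that Counter itself does not at least declare.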

Finally, by "Java" I mean "Java or C#". More later.

Speaking in Ander-Saxon

Some while back I wrote a posting on partially understanding languages that included a well-known quotation from Old English specialist Tom Shippey about how English became simplified over time.

Here's a translation (by me) of that explanation into Ander-Saxon, a variety of English in which French, Latin, and Greek words and roots are replaced by native English ones.

Reckon what happens when somebody who speaks, shall we say, good Old English from the south of the land runs into somebody from the northeast who speaks good Old Norse. They can without fear pass on with each other, but the hardnesses in both tongues are going to get lost. So if the Anglo-Saxon from the South wants to say (in good Old English) "I'll sell you the horse that pulls my cart", he says: Ic selle the that hors the draegeth minne waegn.
Now the old Norseman -- if he had to say this -- would say: Ek mun selja ther hrossit er dregr vagn mine. So, roughly speaking, they understand each other. One says waegn and the other says vagn. One says hors and draegeth; the other says hros and dregr, but broadly they are onpassing. They understand the root words. What they don't understand are the wizardly bits of the wholespeech.
For a showdeal, the man speaking good Old English says for one horse that hors, but for two horses he says tha hors. Now the Old Norse speaker understands the word hors all right, but he's not sound if it means one or two, byspring in Old English you say "one horse", "two horse". There is no apartness between the two words for "horse". The apartness is carted in the word for "the", and the old Norseman might not understand this, byspring his word for "the" doesn't behave like that. So: are you trying to sell me one horse or are you trying to sell me two horses? If you get enough sittings like that there is a strong drive toward straightening out the tongue.

(I ran this past Professor Shippey in email.)

2006-03-26

On this and that

Here's a few little bits scoured up from here and there.

Boswell on Johnson's Dictionary:

A few of his definitions must be admitted to be erroneous. Thus, Windward and Leeward, though directly of opposite meaning, are defined identically the same way; as to which inconsiderable specks it is enough to observe, that his Preface announces that he was aware there might be many such in so immense a work; nor was he at all disconcerted when an instance was pointed out to him. A lady once asked him how he came to define Pastern the knee of a horse: instead of making an elaborate defence, as she expected, he at once answered, "Ignorance, Madam, pure ignorance." His definition of Network ["Any thing reticulated or decussated, at equal distances, with interstices between the intersections"] has been often quoted with sportive malignity, as obscuring a thing in itself very plain.

To which we may add his definition of lexicographer: "a writer of dictionaries, a harmless drudge".

On the names for people with variously colored hair:

Blond and blonde are masculine and feminine forms, though the latter is rarely used as an adjective nowadays, only as a noun. Brunette, on the other hand, is feminine only; the form brunet which is sometimes found is not French, not English, and entirely barbarous. -ette is inherently both feminine and diminutive (though the latter sense dominates in English, as in cassette, diskette, kitchenette, statuette), and not to be split up into two separate affixes.

On whiteboards:

Whiteboards are common in corporations, but I have never seen one in any educational establishment in the U.S. (which is by no means to say there are none). The coolest variety have a large canvas which can be scrolled left or right, by full screens or by smaller steps, and can even save copies of what's currently in view using a giant scanner; you can hook up a conventional printer for hard copy or (I suppose) put them on a network. I only got to use such a Wundergerät once or twice, alas.

On Latin in Great Britain:

The Great Vowel Shift that changed the pronunciation of the English long vowels in the 15th century affected not only English but also the spoken Latin of the monasteries. Indeed, there was a period where English and Scottish Latiners could not understand one another, because Scottish Latin did not undergo the Shift even though Scots itself (mostly) did!

On how history could have gone:

Could the Internet have been invented if telephones hadn't been invented first? I think so. Telegraphy is a lot simpler than telephony, and telegraph operators had something socially very like the Internet (but involving a lot fewer people, of course) more than a hundred years ago. There were even routers and protocol gateways, instantiated by human beings.

A technical civilization might well go from semaphore telegraphs to electric telegraphs to teletypewriters to Morse-code radio to high-speed wired and wireless digital transmissions, missing analog telephones and radio altogether.

On the root *tag-:

Ruminating over the English words tact and tactics led me to realize how interestingly convergent in meaning they have become, descending from the same PIE root *tag- through different branches, respectively Latin tangere, tactus 'touch(ed)'; Greek taktikh 'deployment < arrangement'.

On tornadoes:

Conventional wisdom says tornadoes never happen in the Eastern U.S. Conventional wisdom, as all too often, does not know its history. Tornadoes have been recorded in all of the fifty states and D.C. Indeed, only the following 10 states have not had a major tornado (causing death or property damage) since 1980:

Alaska (1959), Hawaii (1971), Indiana (1974), Iowa (1979), Kentucky (1974), Minnesota (1978), Missouri (1973), North Dakota (1978), Vermont (1970), West Virginia (1974).

A Creative Choice

This piece was submitted by me to the mailing list Heroic Stories. It appears here in slightly modified form.

I used to work as a programmer for a news service, a small subsidiary of a larger news and financial information company. We wrote and published medical news over the Internet; our customers included companies with medical websites, pharmaceutical companies, newspapers, and specialized and general-use web portals.

Back in 2002, advertising-supported media (which means most media) had fallen on hard times as a result of the slow economy. Our subsidiary, like many media companies, had to cut back on its staff. For us the need was particularly acute, as most of our customers were Internet-based, and about half of them went belly-up after the dot-com bubble burst in 2001.

We had staved off the problem for about a year, thanks to having annual contracts. But eventually we had to cut costs, and the only way we could do that and still maintain service to our remaining customers was to cut staff. As a result, in August 2002 the "powers that be" declared that one or two people would have to be sacrificed from each department: sales, financial, news-writing, and technical.

The financial department was abolished altogether and its functions transferred to a group in the parent company. Most of the other groups naturally suffered as a result of losing journalists, editors, and salespeople -- but they survived, still able to perform their missions.

Our technical department, however, consisted of just two programmers and a system administrator. Without the programmers, we couldn't maintain our existing systems and implement new ones. Without the system administrator, who doubled as a help-desk person, we would have been unable to support the rest of the subsidiary or keep our production systems backed up and running smoothly. Terminating any of us would have meant a massive workload for the remaining two, much of it work they were not trained to perform. It was an ugly choice to make.

The director of the technical department, Sandra, decided to meet the challenge in a creative way. She was going on maternity leave just after the announcement came out, and decided to terminate herself instead of one of her staff. She said that she considered herself the "most expendable" person in the technical department.

Management was shocked by the idea of losing a department manager instead of regular staff. They protested loudly and tried to make Sandra change her mind, but to no avail. Her clear-headed analysis prevailed, and it was decided that after Sandra's departure we would report jointly to a technical manager within the parent company and the CEO of our subsidiary.

Sandra returned from maternity leave and worked until the end of 2002, then left to devote herself to motherhood and free-lance work. As a result of her selfless action, the three of us who remained were able to fulfill our customers' and employer's needs. In the end, however, both of the other two were let go, leaving me to perform all the remaining technical functions until the end of 2005, when I too was laid off.

2006-03-24

I had to say that

Josiah Willard Gibbs was the greatest American physicist of the 19th century, perhaps the only world-class American physicist of his time.

He was well-known among his friends and peers not only for his brilliance but also for his extraordinary modesty. They were astounded, therefore, when on one occasion Gibbs, testifying as an expert witness, was asked by the opposing lawyer what right he had to say such things, and replied "I am the greatest living authority on the matter."

Gibbs's explanation after the fact: "I had to say that; I was on oath".

French fries are not chips

It's commonly held that English "chips" are American "French fries", but I deny it. McDonald's-style fried potatoes are canonical French fries, but they are not canonical chips. Leftpondians don't eat chips very often, and perhaps think of them as "fat French fries" if they don't know any better, but the point is that the two terms refer to different things. Congress is the American Parliament, no doubt, but it would be absurd to say that the terms had the same referent!

Rightpondians often do claim that the fried potato products sold by McDonald's are "chips". Since they have only one term available, they will tend to use it for all fried potato products other than crisps ("potato chips" in American English), whether the potatoes are julienned (as in French fries proper) or sliced in large wedges or bars (as in chips proper). But taking a transatlantic perspective, the two terms are not really interchangeable, for they have different prototypes. This is not the case with "crisps" vs. "potato chips", which have only one prototype.

But when an American goes to a place (whether in America or elsewhere; in my case, about 600 meters away) where "chips" are served and openly called by that name, s/he will have quite a different gustatory experience from what results from eating "French fries".

To consider the flip side of the issue for the moment, if I went to England and saw an Erithacus rubecula, I would have to call it a "robin", because no other English term is available. That doesn't mean that I don't know it's a different bird from the specimens of Turdus migratorius that I commonly denote by that term.

Ralph vs. the Tortoise

Consider the following statements:

  1. Ralph believes that Ortcutt is not a spy.
  2. Ralph believes that the man in the brown hat is a spy.
  3. The man in the brown hat is Ortcutt.

Therefore:

  1. Ralph believes of Ortcutt that he is not a spy.
  2. Ralph believes of Ortcutt that he is a spy.

This is apparently no problem, as long as Ralph does not believe "Ortcutt is a spy and Ortcutt is not a spy", which he does not. People with appropriate false beliefs or appropriate ignorance can believe (de re) contradictory things.

But now consider Hofstadter's Tortoise:

  1. The Tortoise affirms "My shell is green".
  2. The Tortoise affirms "My shell is not green".
  3. The Tortoise rejects "My shell is green and my shell is not green".

It seems to follow that:

  1. The Tortoise believes of his shell that it is green.
  2. The Tortoise believes of his shell that it is not green.

Must we accept that the Tortoise's beliefs are not contradictory de re, but only de dicto? The de re version seems exactly parallel to Ralph's de re beliefs. Yet Ralph is merely ignorant of a key point (viz. #3), whereas the Tortoise seems to be "logically insane".

Writing out XML

You can't just embed plain text into an XML element or attribute; character content and attribute values have to be escaped in a number of ways, not necessarily obvious. Here's a checklist of things to make sure to do. (Once again, this post will look terrible in RSS readers that don't fully understand Atom.)

  1. Escape all & characters as &amp;.
  2. Escape all < characters as &lt;.
  3. Escape all > characters as &gt;. Technically it's enough to do so only when they are preceded by ]] in character content, but in my opinion making that check is more trouble than it's worth.
  4. Escape all carriage-return characters as &#xD;. These should be very rare in XML content, as they will have been converted to line-feeds on parsing.
  5. Escape all tab characters in attribute values as &#x9;. You can escape them in character content if you want, but it's not necessary.
  6. Escape all line-feed/newline characters in attribute values as &#xA; (not D as I first wrote).
  7. Output all line-feed/newline characters in character content as the local line terminator: carriage-return (on Mac Classic), line-feed (on Unix) or both (on Windows). You can provide alternative line terminators at user option.
  8. Escape all characters that can't be represented in the output character set. If the output character set is UTF-8 or UTF-16 (in any flavor), this step is not necessary.
  9. Directly output everything else.

I'm glad to say that XOM, my favorite XML tree representation, does all these things in its Serializer class.
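If you have to do it by hand, the checklist can be sketched in Python roughly as follows. This is a minimal sketch, not what XOM does internally: the function name escape_xml and its parameters are my own invention. It also escapes the double quote in attribute values as &quot;, which the checklist above doesn't mention but which is needed if the attribute is delimited by double quotes (an assumption here). It makes no attempt to reject characters that are not allowed in XML at all.

```python
import os

def _representable(ch, encoding):
    """True if ch can be written directly in the output character set."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def escape_xml(s, in_attribute=False, line_end=os.linesep, encoding="utf-8"):
    """Escape s for use as character content or, if in_attribute, as an
    attribute value that will be wrapped in double quotes (assumed)."""
    out = []
    for ch in s:
        if ch == "&":
            out.append("&amp;")                      # step 1
        elif ch == "<":
            out.append("&lt;")                       # step 2
        elif ch == ">":
            out.append("&gt;")                       # step 3, unconditionally
        elif ch == "\r":
            out.append("&#xD;")                      # step 4
        elif ch == "\t" and in_attribute:
            out.append("&#x9;")                      # step 5
        elif ch == "\n":
            # step 6 in attributes, step 7 in character content
            out.append("&#xA;" if in_attribute else line_end)
        elif ch == '"' and in_attribute:
            out.append("&quot;")                     # my addition, see above
        elif not _representable(ch, encoding):
            out.append("&#x%X;" % ord(ch))           # step 8
        else:
            out.append(ch)                           # step 9
    return "".join(out)

print(escape_xml("a < b && c > d"))
print(escape_xml("x\ty\nz", in_attribute=True))
```

The line_end parameter defaults to the local line terminator and can be overridden at user option, as in step 7. If you then write the result to a file opened in text mode, pass line_end="\n" so that Python does not translate the line ends a second time.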

2006-03-23

My favorite errata page

viii

ERRATA

p. viii: for "ERRATA" read "ERRATUM".