I guess I've been writing markup for 24 years already, but over the last year I realized I was missing the forest for the trees. This post lays out a markup categorization I came up with while prototyping some tools for single-sourcing documentation. Along the way, I'll also stop to examine two separation-of-concerns problems that the categorization helped me recognize.
Before I dive in, I should probably note that my taxonomy is on a slightly different axis than the punctuational, presentational, procedural, descriptive, referential, metamarkup taxonomy used by others including Wikipedia, Tim Bray, and Gregory Aist. This doesn't mean I have any bones to pick with that taxonomy! It predated HTML by several years and Markdown by more than 15--and my broader project here is to wrestle with why I find HTML and Markdown (and many other LWMLs) disappointing.
I'll start with three types, and give each a color:
- Red markup describes how to present something. One way to spot presentational markup is that it usually stops making sense if you imagine changing from one medium to another (if it's a print/web document, imagine translating it into an audio recording).
- Yellow markup describes the structure of the document itself. To spot structural markup, replace the content (or compare multiple documents in the same format) and see what labels are stable. You can also imagine porting the content to another format (article, book, slide show, and so on) and consider what would break.
- Blue markup describes the ontology of things that appear in the content--it names and categorizes them. It might indicate that a block of content is code, or clarify whether some inline text is referring to a person, place, book, class/function/argument, and so on.
To illustrate this, here's an example using labeled markup (pseudo HTML/XML, though the issue isn't specific to XML-like languages):
<yellow>
  I'll imply <red>uncomfortable</red> things about
  <blue>markDirection</blue>
</yellow>
<paragraph>
  I'll imply <bold>uncomfortable</bold> things about
  <language>markDirection</language>
</paragraph>
We've long learned to recognize the big trap here, so modern readers may know (or have heard) they should often prefer <strong>uncomfortable</strong> over <bold>uncomfortable</bold>. I think this pivot can be a subtler trap, so I'll give it its own color to keep it visible:
- Purple markup describes how to emphasize something. This is a lot like describing how to present something, but in the abstract (it isn't tied to a specific medium). (I think this actually describes something like verbal presentation, rhetoric, and prosody--but I'm focusing on just emphasis because it labels how something should sound instead of how it should look on a page.)
Note: Yes, I am avoiding a certain 8-letter S word. You can read more about why in semantic: the 8 letter s-word
The subtler trap here is mostly human. I think most of us learn to write on the physical/digital page in school, and come to markup later. Writing on the page typically involves memorizing typographic conventions (prescribed by some style guide) for marking up things like the titles of different kinds of content--we're trained to obscure this distinction. There's nothing to stop people from using <strong> for its typographic effect and many markup-generating editors/tools effectively ensure this will happen.
In my opinion, the gravity of these effects is too strong for us to uncritically accept that purple-labeled markup is used correctly. Here's another color to help visualize this trap:
- Magenta markup is ultimately purple or red--but we're going to have to puzzle out which. (Said the other way around, all purple markup might actually be red and vice versa.) Training may mitigate this confusion, but the confusion is guaranteed if your authoring tools can't even express the distinction. (Markdown, for example, doesn't express the difference between italic and emphasis.)
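To make the Markdown case concrete, here is a minimal Python sketch. The `Span` type, its `intent` field, and `to_markdown` are hypothetical names invented for this illustration; they don't come from any Markdown tool. The sketch shows two spans with different intents serializing to identical Markdown, which is how purple and red collapse into magenta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical inline span; intent is 'emphasis' (purple) or 'italic' (red)."""
    text: str
    intent: str


def to_markdown(span: Span) -> str:
    # Markdown has one notation for both intents, so `intent` is dropped here.
    return f"*{span.text}*"


stressed = Span("never", intent="emphasis")  # purple: say this with stress
styled = Span("never", intent="italic")      # red: set this in italic type

# The spans differ, but their Markdown output is the same string.
print(stressed != styled)
print(to_markdown(stressed) == to_markdown(styled))
```

Once the text is in Markdown, nothing is left to tell a later tool which of the two the author meant.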
I think we have enough building blocks in place to start applying this taxonomy to real-world markup languages.
I'm not sure how many of these I need to make the case, but I'll start with two. HTML:
<html><!-- yellow -->
  <head></head><!-- yellow -->
  <body><!-- yellow -->
    <section><!-- yellow -->
      <p><!-- yellow -->
        Text
        <span>Text</span><!-- yellow -->
        <b>Text</b><!-- red (magenta) -->
        <pre>Text</pre><!-- red (magenta) -->
        <em>Text</em><!-- purple (magenta) -->
        <code>Text</code><!-- blue -->
        <cite>Text</cite><!-- blue -->
        Text
      </p>
    </section>
  </body>
</html>
And Markdown:
# yellow

Text `magenta` *magenta* _magenta_ ~magenta~.

## yellow

| magenta | magenta | magenta |
|---------|---------|---------|
| magenta | magenta | magenta |

Exciting code block:

    blue
Note: Whitespace is an essential component of some unlabeled typographic conventions in Markdown (among other languages). Line breaks are similarly significant in mdoc (I'm not putting an mdoc/troff example in here because it's less familiar, tedious to comment, and would require a fair amount of explaining). Stick a pin in this.
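The labels in the HTML example can also be applied mechanically. The following is a rough Python sketch, not a real tool: the `TAG_COLORS` table, `ColorLabeler`, and `label_colors` are hypothetical names, and the table only restates the comments from the HTML example (using the real element names `b` and `em`, with the red/purple guesses collapsed into magenta). It relies only on `html.parser` from the standard library.

```python
from html.parser import HTMLParser

# Assumed color table, restating the comments in the HTML example.
TAG_COLORS = {
    "html": "yellow", "head": "yellow", "body": "yellow",
    "section": "yellow", "p": "yellow", "span": "yellow",
    "b": "magenta", "pre": "magenta", "em": "magenta",
    "code": "blue", "cite": "blue",
}


class ColorLabeler(HTMLParser):
    """Collects a (tag, color) pair for each start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        # Tags missing from the table are reported as "unknown", not guessed.
        self.labels.append((tag, TAG_COLORS.get(tag, "unknown")))


def label_colors(html: str):
    labeler = ColorLabeler()
    labeler.feed(html)
    labeler.close()
    return labeler.labels


sample = "<p>Text <b>Text</b> <em>Text</em> <code>Text</code></p>"
for tag, color in label_colors(sample):
    print(f"{tag}: {color}")
```

A lookup table like this can only say magenta for `b` and `em`; deciding whether a given use is really red or purple still takes a human reading the content.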
My initial definitions for yellow and blue markup didn't really dissect two questionable terms: document and content.
I may be projecting, but for a long time now I think it's been most natural to think of a document on the Web as its own thing that gets nestled into a container in a larger HTML layout. It's rare to build a site where there's a 1-to-1 relationship between complete HTML files saved on the server and the list of documents as the site's author sees them.
Call this inner document--the one closest to the document in the eyes of the author--an abstract-document. The abstract-document has no format/media-specific idioms; the author could write it with a different set of tools without affecting its form. Its structural elements (things like titles, sections, paragraphs, stanzas, and lines) mostly refer back to storied spoken/written/printed forms. Call the non-structural parts of the document the content (because some of these parts could theoretically be re-used in other formats that don't share structural idioms).
Let's make a simple abstract document:
<document>
  <title>My abstract document</title>
  <section>
    <title>An incredibly comprehensive introduction</title>
    <paragraph>...</paragraph>
    ...
    <section>
      <title>Painfully comprehensive</title>
      ...
    </section>
  </section>
</document>
HTML wasn't designed for my ~modern assumption about what the document is. It was designed to be the document (call this an html-native document). We can keep this tucked away behind the curtain if we're writing our documents in HTML (or in something designed to map directly to html-native idioms), but it erupts into view if we try to compose our abstract document into an HTML template:
<html>
  <head>
    <title>
      <!--
        Where do I come from? We'll need a template
        language or JavaScript to reach across these
        boundaries...
      -->
    </title>
  </head>
  <body>
    <header>
      <nav>...</nav>
    </header>
    <section>
      <article>
        <!--
          We intend to nestle our document here, but
          the title will have to break out of this
          container and go into the head :(
        -->
      </article>
    </section>
    <footer>...</footer>
  </body>
</html>
If we drop the abstract document in here, we end up with HTML that is similar to our document, but it's also a weird hybrid of things that aren't our document (like the layout of the site). I'll call this an html-concrete document just to make it harder to forget that this isn't the kind of HTML-native document HTML was designed for!
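Here is a small Python sketch of that composition step, assuming the abstract document is stored as XML like the example earlier. The `LAYOUT` template, `render_section`, and `compose` are hypothetical names made up for this illustration; real template engines differ in the details. The thing to watch is the title, which has to be pulled out of the document and written in two places.

```python
import xml.etree.ElementTree as ET
from html import escape

ABSTRACT = """
<document>
  <title>My abstract document</title>
  <section>
    <title>An incredibly comprehensive introduction</title>
    <paragraph>...</paragraph>
  </section>
</document>
"""

# Hypothetical site layout; the two placeholders are where document data lands.
LAYOUT = """<html>
  <head><title>{head_title}</title></head>
  <body>
    <header><nav>...</nav></header>
    <section><article>{article}</article></section>
    <footer>...</footer>
  </body>
</html>"""


def render_section(section, level):
    """Map abstract elements to HTML; heading depth follows section nesting."""
    parts = []
    for child in section:
        if child.tag == "title":
            parts.append(f"<h{level}>{escape(child.text or '')}</h{level}>")
        elif child.tag == "paragraph":
            parts.append(f"<p>{escape(child.text or '')}</p>")
        elif child.tag == "section":
            parts.append(f"<section>{render_section(child, level + 1)}</section>")
    return "".join(parts)


def compose(abstract_xml):
    doc = ET.fromstring(abstract_xml)
    title = doc.findtext("title", default="")
    # The title breaks out of the <article> container to fill <head>.
    return LAYOUT.format(head_title=escape(title), article=render_section(doc, 1))


page = compose(ABSTRACT)
print(page)
```

In the output, `section` means "part of the site layout" in one place and "part of my document" in another, and nothing in the markup says which is which.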
HTML isn't just serving two masters with different sets of tags, here. The section and title elements (among others) can have semi-independent purposes and meanings at the layout and document level. I'll add three final colors that help stake out some differences between abstract and concrete documents:
- Green markup is the abstract twin of yellow markup, so I'm basically splitting the previous definition of yellow/structural markup into two halves. Green markup describes the structure of the abstract/ideal document; yellow markup describes the structure of a specific concrete document format. (Maybe a contrived example helps: an ideal document has no marquee element, but specific versions of HTML have had it.)
- Chartreuse markup is ultimately green or yellow--but we're going to have to puzzle out which. This color just reflects our fundamental uncertainty about any part of a concrete document format that mirrors part of the ideal form.
- Orange markup is an abuse of blue/ontological markup for red/presentational purposes. Concrete formats (like HTML) invite this abuse when they attach special presentation/behavior to ontological markup (such as the code element).
This abuse is pretty common. For example, I'm authoring "notes" in this post as a block quote in Markdown just because it gives me the presentation I want. 🤷‍♂️
In the trenches of Getting Things Done it feels like we've settled on seeing HTML's blend of layout and document as a problem we have tools to solve. Go get a template engine, a preprocessor, an x2y converter, a component-based web framework, and so on. But this is where I missed the forest: the separation of concerns here is broken for most modern uses of HTML. We've been trying to mend it with all of these patches, but it's more like a broken rake than a broken leg.
HTML isn't unique here, but I focus on it because a lot of publishing and documentation workflows touch HTML or HTML-facing LWMLs (like Markdown) at some point--and the HTML/CSS/JS stack is often used as an example of separation of concerns. The ~original-sin of making HTML a markup language for both web layouts and documents spawned multiple separation-of-concerns problems.
For individual documents, I think it's still defensible to tease out layers for structure/content, style, and ~behavior. But when we're composing documents into a logically-independent site layout (i.e., a component-oriented use of the web stack...), the structure/content, style, and behavior of the layout get separately muddled with those of the document.
This explosion in complexity is a reason component-oriented web frameworks have been fruitful--they mostly square the web stack's separation of concerns with how we think about the interfaces we build. It also helps explain what people get out of a ~utility CSS framework like Tailwind: they can depend on a style layer that they don't have to explicitly maintain, and unify ~interface style+structure+content to beat back complexity. Arguments about the ~functional CSS approach vs traditional DRY CSS with thoughtful class names boil down to a value system conflict:
- Style/markup separation makes it trivial to re-style a document with a different style sheet--and doing this with documents makes a lot of idiomatic sense (e.g., for adapting old documents to new identity guidelines).
- Style and markup in interface layouts and components are tightly coupled. Forcibly separating them inflicts pain for minimal benefit and makes little idiomatic sense.
Web components ~try to fix this, but the JS requirement only further muddles the separation-of-concerns problems by entangling markup and JS (and requiring critical markup outside of the scope of the component).
The colors I've picked are a modest color-temperature ~mnemonic for avoiding markup antipatterns some language ecosystems encourage. Here they are plotted out on a color star:
My next post will get a little more specific (and critical) about how I think this applies to various markup ecosystems, so for now I'll just lay out three basic heuristics that I think lead in the right direction:
- Warm markup (which is coupled to specific media, formats, and technologies) is harder to reuse.
- If you need highly reusable content or documents, you need cool markup.
- If you need cool markup, it's best to avoid authoring in languages with warm markup (or use tooling to keep it from creeping in).
Most of our existing markup language ecosystems fail these tests, so I think there's a lot of room for innovation here.
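As a rough sketch of how the heuristics could be checked, here is a short Python example. Everything in it is an assumption for illustration: the `WARM`/`COOL` grouping follows the usual color wheel rather than a definition given in this post, magenta and chartreuse are treated as ambiguous, and the labeled markup in `labels` is made up.

```python
# Assumed temperature split, based on the usual color wheel.
WARM = {"red", "orange", "yellow"}
COOL = {"green", "blue", "purple"}


def temperature(color):
    if color in WARM:
        return "warm"
    if color in COOL:
        return "cool"
    # Magenta and chartreuse land here: someone has to puzzle out which way they go.
    return "ambiguous"


def reuse_report(labels):
    """Group (markup, color) pairs by temperature, preserving input order."""
    report = {"warm": [], "cool": [], "ambiguous": []}
    for markup, color in labels:
        report[temperature(color)].append(markup)
    return report


# Hypothetical labels for a handful of markup constructs.
labels = [("stanza", "green"), ("person", "blue"), ("bold", "red"), ("*text*", "magenta")]
report = reuse_report(labels)
print(report)
```

Anything listed under warm or ambiguous is a candidate for rework before the content gets reused in another format.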