microformats2-parsing: Difference between revisions
(expand header per feedback re: status, feedback) |
(add note backward compatibility details to help clarify backcompat parsing as defined in the algorithm) |
Revision as of 18:22, 25 November 2015
microformats2 parsing specification
microformats2 is a simple, open format for marking up data in HTML. The microformats2 parsing specification describes how to implement a microformats2 parser, independent of any specific vocabularies.
- Status
- This is a Living Specification with several interoperable implementations
- Participate
- Wiki (Questions, Open issues)
- IRC: #microformats on Freenode
- License
- Per CC0, to the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work. In addition, as of 2025-01-25, the editors have made this specification available under the Open Web Foundation Agreement Version 1.0.
algorithm
parse a document for microformats
To parse a document for microformats, follow the HTML parsing rules and do the following:
- start with an empty JSON "items" array and "rels" & "rel-urls" hashes:
{
"items": [],
"rels": {},
"rel-urls": {}
}
- parse the root element for class microformats, adding to the JSON items array accordingly
- parse all hyperlink (<link> <a>) elements for rel microformats, adding to the JSON rels & rel-urls hashes accordingly
- return the resulting JSON
Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).
parse an element for class microformats
To parse an element for class microformats:
- parse element class for root class name(s) "h-*" and if none, backcompat root classes
- if none found, parse child elements for microformats (depth first, doc order)
- else if found, start parsing a new microformat
- keep track of whether the root class name(s) was from backcompat
- create a new { } structure with:
type: [array of microformat "h-*" type(s) on the element],
properties: { }
- to be filled in when that element itself is parsed for microformats properties
- parse child elements (document order) by:
- if parsing a backcompat root, parse child element class name(s) for backcompat properties
- else parse a child element class for property class name(s) "p-*,u-*,dt-*,e-*"
- if such class(es) are found, it is a property element
- add properties found to current microformat's properties: { } structure
- parse a child element for microformats (recurse)
- if that child element itself has a microformat ("h-*" or backcompat roots) and is a property element, add it into the array of values for that property as a { } structure, add to that { } structure:
value:
- if it's a p-* property element, use the first p-name of the h-* child
- else if it's an e-* property element, re-use its { } structure with existing value: inside.
- else if it's a u-* property element and the h-* child has a u-url, use the first such u-url
- else use the parsed property value per p-*,u-*,dt-* parsing respectively
- else add found elements that are microformats to the "children" array
- imply properties for the found microformat (see below)
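The class-name tests and the embedded-microformat value rules above can be sketched as follows. This is a minimal illustration, not a full parser: `classify` and `embedded_value` are hypothetical helper names, class names are matched by prefix only, and the fallback to the parsed property value when the child has no name is an assumption.

```python
PROPERTY_PREFIXES = ("p-", "u-", "dt-", "e-")


def classify(class_attr):
    """Split a class attribute into root types and (prefix, name) property pairs."""
    roots, props = [], []
    for name in class_attr.split():
        if name.startswith("h-") and len(name) > 2:
            roots.append(name)
            continue
        for prefix in PROPERTY_PREFIXES:
            if name.startswith(prefix) and len(name) > len(prefix):
                props.append((prefix[:-1], name[len(prefix):]))
                break
    return roots, props


def embedded_value(prefix, child, parsed_value):
    """Pick the "value" for a property element that is itself a microformat.

    `child` is the already parsed { } structure of the nested microformat.
    The e-* case (re-using the existing value) is not handled here.
    """
    props = child["properties"]
    if prefix == "p" and props.get("name"):
        return props["name"][0]
    if prefix == "u" and props.get("url"):
        return props["url"][0]
    return parsed_value
```

For example, an element with `class="p-author h-card"` is both a property element (`p`, `author`) and a root (`h-card`), so its parsed structure goes into the `author` array with a `value` taken from the first name of the nested h-card.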
parse an element for properties
parsing a p- property
To parse an element for a p-x property value whether explicit "p-*" or backcompat equivalent:
- parse the element for the value-class-pattern, if a value is found then return it.
- if abbr.p-x[title], then return the title attribute
- else if data.p-x[value] or input.p-x[value], then return the value attribute
- else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
- else return the textContent of the element, replacing any nested <img> elements with their alt attribute if present, or otherwise their src attribute if present, resolving any relative URLs, and removing all leading/trailing whitespace.
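The p- steps can be sketched in Python as below. This is a rough illustration under stated assumptions: the markup is well-formed enough for the standard library's `xml.etree.ElementTree` to stand in for a real HTML DOM, the value-class-pattern step is omitted, and `parse_p`, `text_with_img_alt` and `base_url` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def text_with_img_alt(el, base_url):
    """textContent of el, with nested <img> replaced by alt, else by resolved src."""
    parts = [el.text or ""]
    for child in el:
        if child.tag == "img":
            if "alt" in child.attrib:
                parts.append(child.get("alt"))
            elif "src" in child.attrib:
                parts.append(urljoin(base_url, child.get("src")))
        else:
            parts.append(text_with_img_alt(child, base_url))
        parts.append(child.tail or "")
    return "".join(parts)


def parse_p(el, base_url=""):
    # The value-class-pattern step is omitted in this sketch.
    if el.tag == "abbr" and "title" in el.attrib:
        return el.get("title")
    if el.tag in ("data", "input") and "value" in el.attrib:
        return el.get("value")
    if el.tag in ("img", "area") and "alt" in el.attrib:
        return el.get("alt")
    return text_with_img_alt(el, base_url).strip()
```

With this sketch, `<span class="p-name"> Jane <img alt="Doe" src="d.png"/> </span>` yields `Jane Doe`.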
parsing a u- property
To parse an element for a u-x property value whether explicit "u-*" or backcompat equivalent:
- if a.u-x[href] or area.u-x[href], then get the href attribute
- else if img.u-x[src] or audio.u-x[src] or video.u-x[src] or source.u-x[src], then get the src attribute
- else if object.u-x[data], then get the data attribute
- if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
- else parse the element for the value-class-pattern, if a value is found then return it.
- else if abbr.u-x[title], then return the title attribute
- else if data.u-x[value] or input.u-x[value], then return the value attribute
- else return the textContent of the element after removing all leading/trailing whitespace.
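A possible sketch of the u- steps, again using `xml.etree.ElementTree` as a stand-in DOM. The names `parse_u` and `URL_ATTRS` are hypothetical, `base_url` is assumed to already reflect the page URL and any <base> element, and the value-class-pattern step is omitted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Which attribute carries the URL for each element type listed in the steps above.
URL_ATTRS = {
    "a": "href", "area": "href",
    "img": "src", "audio": "src", "video": "src", "source": "src",
    "object": "data",
}


def parse_u(el, base_url):
    attr = URL_ATTRS.get(el.tag)
    if attr and attr in el.attrib:
        # A gotten value is returned as an absolute URL.
        return urljoin(base_url, el.get(attr))
    # The value-class-pattern step is omitted in this sketch.
    if el.tag == "abbr" and "title" in el.attrib:
        return el.get("title")
    if el.tag in ("data", "input") and "value" in el.attrib:
        return el.get("value")
    return "".join(el.itertext()).strip()
```

Note that only values taken from href, src or data attributes are resolved against the base URL here, which mirrors the order of the steps above.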
parsing a dt- property
To parse an element for a dt-x property value whether explicit "dt-*" or backcompat equivalent:
- parse the element for the value-class-pattern including the date and time parsing rules, if a value is found then return it.
- if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
- else if abbr.dt-x[title], then return the title attribute
- else if data.dt-x[value] or input.dt-x[value], then return the value attribute
- else return the textContent of the element after removing all leading/trailing whitespace.
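The dt- steps follow the same shape. A minimal sketch, assuming an `xml.etree.ElementTree` element and leaving out the value-class-pattern step with its date and time parsing rules (`parse_dt` is a hypothetical name):

```python
import xml.etree.ElementTree as ET


def parse_dt(el):
    # The value-class-pattern step (with date and time rules) is omitted here.
    if el.tag in ("time", "ins", "del") and "datetime" in el.attrib:
        return el.get("datetime")
    if el.tag == "abbr" and "title" in el.attrib:
        return el.get("title")
    if el.tag in ("data", "input") and "value" in el.attrib:
        return el.get("value")
    return "".join(el.itertext()).strip()
```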
parsing an e- property
To parse an element for an e-x property value whether explicit "e-*" or backcompat equivalent:
- return a dictionary with two keys:
html: the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm, with leading/trailing whitespace removed.
value: the textContent of the element, replacing any nested <img> elements with their alt attribute if present, or otherwise their src attribute if present, resolving the URL if it’s relative.
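The shape of the e- result can be illustrated as below. This is only an approximation: `ET.tostring(..., method="html")` is used in place of the HTML spec's fragment serialization algorithm, the <img> replacement in `value` is left out, and `parse_e` is a hypothetical name.

```python
import html
import xml.etree.ElementTree as ET


def parse_e(el):
    # innerHTML approximation: the element's leading text, then each child
    # serialized as HTML (ET.tostring includes the child's tail text).
    parts = [html.escape(el.text or "", quote=False)]
    for child in el:
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return {
        "html": "".join(parts).strip(),
        # Plain textContent; nested <img> replacement is omitted in this sketch.
        "value": "".join(el.itertext()),
    }
```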
parsing for implied properties
Imply properties only on explicit h-x class name root microformat element (no backcompat roots)
- if no explicit "name" property,
- then imply by:
- if img.h-x or area.h-x, then use its alt attribute for name
- else if abbr.h-x[title] then use its title attribute for name
- else if .h-x>img:only-child[alt]:not[.h-*] then use that img alt for name
- else if .h-x>area:only-child[alt]:not[.h-*] then use that area alt for name
- else if .h-x>abbr:only-child[title] then use that abbr title for name
- else if .h-x>:only-child>img:only-child[alt]:not[.h-*] then use that img alt for name
- else if .h-x>:only-child>area:only-child[alt]:not[.h-*] then use that area alt for name
- else if .h-x>:only-child>abbr:only-child[title] then use that abbr title for name
- else use the textContent of the .h-x for name
- drop all leading and trailing white-space from name
- if no explicit "photo" property,
- then imply by:
- if img.h-x[src] then use src for photo
- else if object.h-x[data] then use data for photo
- else if .h-x>img[src]:only-of-type:not[.h-*] then use that img src for photo
- else if .h-x>object[data]:only-of-type:not[.h-*] then use that object data for photo
- else if .h-x>:only-child>img[src]:only-of-type:not[.h-*] then use that img src for photo
- else if .h-x>:only-child>object[data]:only-of-type:not[.h-*] then use that object data for photo
- if there is a gotten photo value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
- if no explicit "url" property,
- then imply by:
- if a.h-x[href] or area.h-x[href] then use that [href] for url
- else if .h-x>a[href]:only-of-type:not[.h-*] then use that [href] for url
- else if .h-x>area[href]:only-of-type:not[.h-*] then use that [href] for url
- if there is a gotten url value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
Note: The same markup for a property should not be causing that property to occur in both a microformat and one embedded inside - such a property should only be showing up on one of them. The parsing algorithm has details to prevent that, such as the :not[.h-*] tests above.
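As an illustration of how these selector-style rules translate into code, here is a sketch of the implied "url" rules only. It assumes an `xml.etree.ElementTree` element for the root, and `implied_url` and `has_root_class` are hypothetical names; the name and photo rules would follow the same pattern.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def has_root_class(el):
    """The :not[.h-*] test: true if any class name starts with "h-"."""
    return any(c.startswith("h-") for c in el.get("class", "").split())


def implied_url(root, base_url):
    # a.h-x[href] or area.h-x[href]
    if root.tag in ("a", "area") and "href" in root.attrib:
        return urljoin(base_url, root.get("href"))
    # .h-x>a[href]:only-of-type:not[.h-*], then the same test for area
    for tag in ("a", "area"):
        same_type = [child for child in root if child.tag == tag]
        if len(same_type) == 1:
            child = same_type[0]
            if "href" in child.attrib and not has_root_class(child):
                return urljoin(base_url, child.get("href"))
    return None  # no url is implied
```

A link that is itself a nested microformat (e.g. `<a class="h-card" href="/me">`) is skipped, which is how the :not[.h-*] test keeps the same markup from producing a url on both the outer and the embedded microformat.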
parse a hyperlink element for rel microformats
To parse a hyperlink element (e.g. a or link) for rel microformats, use the following algorithm or one that produces equivalent results:
- if the "rel" attribute of the element is empty then exit
- set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
- treat the "rel" attribute of the element as a space-separated set of rel values
- for each rel value (rel-value)
- if there is no key rel-value in the rels hash then create it with an empty array as its value
- if url is not in the array of the key rel-value in the rels hash then add url to the array
- end for
- if there is no key with name url in the top-level "rel-urls" hash then add a key with name url there, with an empty hash value
- add keys to the hash of the key with name url for each of these attributes (if present) and key not already set:
- "hreflang": the value of the "hreflang" attribute
- "media": the value of the "media" attribute
- "title": the value of the "title" attribute
- "type": the value of the "type" attribute
- "text": the text content of the element if any
- if there is no "rels" key in that hash, add it with an empty array value
- set the value of that "rels" key to an array of all unique items in the set of rel values unioned with the current array value of the "rels" key
rel parse examples
Here are some examples to show how parsed rels may be reflected into the JSON (empty items key).
E.g. parsing this markup:
<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/b">author b</a>
<a rel="in-reply-to" href="http://example.com/1">post 1</a>
<a rel="in-reply-to" href="http://example.com/2">post 2</a>
<a rel="alternate home"
href="http://example.com/fr"
media="handheld"
hreflang="fr">French mobile homepage</a>
Would generate this JSON:
{
"items": [],
"rels": {
"author": [ "http://example.com/a", "http://example.com/b" ],
"in-reply-to": [ "http://example.com/1", "http://example.com/2" ],
"alternate": [ "http://example.com/fr" ],
"home": [ "http://example.com/fr" ]
},
"rel-urls": {
"http://example.com/a": {
"rels": ["author"],
"text": "author a"
},
"http://example.com/b": {
"rels": ["author"],
"text": "author b"
},
"http://example.com/1": {
"rels": ["in-reply-to"],
"text": "post 1"
},
"http://example.com/2": {
"rels": ["in-reply-to"],
"text": "post 2"
},
"http://example.com/fr": {
"rels": ["alternate", "home"],
"media": "handheld",
"hreflang": "fr",
"text": "French mobile homepage"
}
}
}
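The rel algorithm can be sketched with Python's standard `html.parser` module as below. This is one possible reading rather than a reference implementation: `RelParser` and `parse_rels` are hypothetical names, elements without an href are skipped (an assumption, since the steps do not address that case), <base> handling is left to the caller via `base_url`, and rel values keep first-seen order.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rels = {}
        self.rel_urls = {}
        self._open = None  # (url, text parts) of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").split()
        if not rel_values or attrs.get("href") is None:
            return  # empty rel: exit; missing href: skipped by assumption
        url = urljoin(self.base_url, attrs["href"])
        for rel in rel_values:
            urls = self.rels.setdefault(rel, [])
            if url not in urls:
                urls.append(url)
        info = self.rel_urls.setdefault(url, {})
        for key in ("hreflang", "media", "title", "type"):
            if attrs.get(key) is not None and key not in info:
                info[key] = attrs[key]
        current = info.setdefault("rels", [])
        for rel in rel_values:
            if rel not in current:
                current.append(rel)
        if tag == "a":
            self._open = (url, [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            url, parts = self._open
            text = "".join(parts)
            if text and "text" not in self.rel_urls[url]:
                self.rel_urls[url]["text"] = text
            self._open = None


def parse_rels(markup, base_url):
    parser = RelParser(base_url)
    parser.feed(markup)
    parser.close()
    return {"items": [], "rels": parser.rels, "rel-urls": parser.rel_urls}
```

Calling `parse_rels` on the example markup above with any base URL should give the same "rels" and "rel-urls" hashes as the JSON shown, since all hrefs in the example are already absolute.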
what do the CSS selector expressions mean
This section is non-normative.
Use SelectORacle to expand any of the above CSS selector expressions into longform English prose.
Exception:
- :not[.h-*] is not a valid CSS selector but is used here to mean:
- does not have any class names that start with "h-"
note HTML parsing rules
This section is non-normative.
microformats2 parsers are expected to follow HTML parsing rules, which includes for example:
- ignore <template> elements - content between <template> tags doesn't end up in the DOM
- test-case in the wild: http://sixtwothree.org/blog/now-accepting-webmentions/
note backward compatibility details
The parsing algorithm and details refer to "backcompat root classes" (backcompat roots for short) and "backcompat properties". These conditions and steps in the algorithm document how to parse pre-microformats2 microformats which all defined their own specific root class names and explicit sets of properties.
Some details to be aware of (these are stated explicitly in the algorithm; this is just an informal summary):
- If an element has one or more microformats2 root class name(s) (h-*)
- all backcompat root class names are ignored on that element.
- all backcompat properties, without an intervening root class name, are ignored inside that element
- If an element has only a backcompat root class name (or names)
- all microformats2 property class names (p-* u-* dt-* e-*), without an intervening element with root class name, are ignored inside that element
- there is no implied property value parsing (p-name, u-url, u-photo) for that element
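The first rule of each case can be sketched as a small mode check on an element's class attribute. The names here are hypothetical, and `BACKCOMPAT_ROOTS` is only an illustrative subset of classic root class names, not the full list a parser would need.

```python
# Illustrative subset only: root class names from hCard, hAtom and hCalendar.
BACKCOMPAT_ROOTS = {"vcard", "hentry", "vevent"}


def root_mode(class_attr):
    """Return ("mf2" | "backcompat" | None, matching root class names)."""
    names = class_attr.split()
    mf2 = [n for n in names if n.startswith("h-")]
    if mf2:
        # Any h-* root wins: backcompat roots on the same element are ignored.
        return "mf2", mf2
    legacy = [n for n in names if n in BACKCOMPAT_ROOTS]
    if legacy:
        return "backcompat", legacy
    return None, []
```

A parser following this summary would then parse only p-*/u-*/dt-*/e-* properties and imply properties when the mode is "mf2", and parse only backcompat properties, with no implied properties, when the mode is "backcompat".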
questions
See the FAQ:
issues
See the issues page:
implementations
There are open source microformats2 parsers available for JavaScript, Node.js, PHP, Ruby and Python.
test suite
See:
Ports to/for other languages encouraged.
see also
- microformats2
- microformats2-parsing-faq
- microformats2-parsing-issues
- microformats2-parsing-brainstorming - for background, thinking, exploring possibilities
- microformats2-parsing-rdf
- microformats2-implied-properties