The Magic of XPath
What is it?
XPath (XML Path Language) is an oft-overlooked, highly expressive standard for navigating and extracting information from XML-like documents (think HTML). If you’re used to CSS-style selectors (things like querySelector(".my-class")) but wish you could build in conditionals or complex queries, then the XPath language is probably what you’re looking for. XPath allows us to query documents like so:
//div[contains(text(), "Next")]/a/@href (extracts the href attribute info from all anchor tags in a div containing the text “Next”).
In the browser, JavaScript supports XPath natively through document.evaluate(...), meaning you get all the expressive power of the XPath language without having to resort to external libraries. If you’re using a tool that interacts with HTML, you should make it a point to check if XPath is supported.
If you’re interested in exploring XPath in more depth, I recommend the MDN XPath docs for JS, and keeping Mila Nic’s XPath reference at hand.
Let’s say I show you the following “selector”:
/Users/niko/Documents/*
Well - it’s pretty self-explanatory, right? We all recognize this as a file system path, describing which folders to navigate through in order to return a selection of files. In this case: start at the root /, find the folder named Users, find the folder named niko, find the folder named Documents, and finally select all the files (or “nodes”) inside this last element.
Now let’s look at an XPath selector:
/html/body/div/p/a
It’s surprisingly similar, no? Start at the root (context) /, find the child element html, then a child body, a child div, a child p, and finally a child a. What gets returned depends on your implementation, but it will usually contain the selection along with details like its name, contents, etc. - in JS, for example, we get back an XPathResult object. These objects have contextual info we can examine; more importantly, the nodes they contain can serve as the starting point for another XPath “search”.
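To make the path-walking concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree, which understands only a small subset of XPath. The page markup is made up for illustration. ElementTree evaluates paths relative to an element, so the absolute path /html/body/div/p/a is written as ./body/div/p/a starting from the html element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this illustration.
PAGE = """<html>
  <body>
    <div>
      <p><a href="/next">With Link</a></p>
      <p>No link</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/body/div/p/a, expressed relative to <html>.
links = root.findall("./body/div/p/a")
link_summary = [(a.tag, a.text, a.get("href")) for a in links]

# A node we got back can be the context for another search.
div = root.find("./body/div")
same_links = div.findall("./p/a")

print(link_summary)
print(same_links == links)
```

The second search starts from the div rather than from the document, which is the same idea as passing a context node to an XPath evaluator.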
How about a slightly more complex selector:
//div[@id="my-id" and not(@class="not-this-class")]/a[contains(text(), "Next Page")]
Pretty cool, right? You can probably intuit exactly what this selector returns. The important part is to realize just how expressive we can be compared to more standard CSS-style selectors.
The XPath syntax is remarkably simple. There are only a few main concepts:
/ (root) - conceptually, this “root” is where all absolute XPaths start. Imagine it as the “document” as a whole, sitting one level above the html element. You’ll rarely want to return the root itself, but it’s useful to keep in mind as your “starting point” or “context”.
name - a step, wherein we select all the available nodes at the current level, and then filter out those whose name is not name. Note that this functions as a condition: you’re filtering from an available set. This means that when we type /html/body/div, what we’re doing is filtering all the available nodes first for an html node (which is really the only available node at the root of an html document), then within that node filtering all available nodes looking for a body node, and finally within that, any node with a name of div. This means that we always get back a list of nodes, because we’re never selecting, only filtering!
name[predicate] - we can further specify how we filter nodes by providing a predicate. There are a bunch of predicates we can use, and I’ve listed the main ones in a section below. A predicate simply means “take all the nodes you’ve identified with name and further filter them using the rules we pass in”.
name/text() - best understood alongside the category above: XPath includes some functions that let us access components of a given node, and some of them, like text(), can be used outside a predicate. I’ve listed a couple below as well.
axis::name - axis specifiers allow you to move laterally, as well as up and down the node tree, and to select specific attributes of a given tag. This is where XPath’s power lies: you can construct a complex query that then serves as a base point to navigate a document in an arbitrary number of ways, like selecting the sibling link to a paragraph tag containing the word “Next”. Technically, axes are specified everywhere: the default axis for any step is child::. This is why the explicit axis syntax is called “unabbreviated”, meaning /html is actually /child::html.
//name - syntactic sugar for /descendant-or-self::node()/name: allows us to match name at any depth without having to spell out the whole path from the root element.
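The "never selecting, only filtering" idea from the list above can be sketched in a few lines of Python. This is a toy evaluator for plain child steps only, not how any real XPath engine is implemented; the walk helper and the markup are hypothetical, and the result is compared against ElementTree's own (limited) path support.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
PAGE = """<html>
  <body>
    <div><p>One</p><p><a>Three</a></p></div>
    <div><p><a>Four</a></p></div>
  </body>
</html>"""


def walk(root_element, names):
    """Evaluate /names[0]/names[1]/... using child steps only."""
    # At the document root, the only available node is the root element.
    available = [root_element]
    selected = []
    for name in names:
        # A step keeps the available nodes whose name matches.
        selected = [node for node in available if node.tag == name]
        # The next step filters the children of everything we kept.
        available = [child for node in selected for child in node]
    return selected


root = ET.fromstring(PAGE)
manual = walk(root, ["html", "body", "div", "p", "a"])
builtin = root.findall("./body/div/p/a")   # same path, relative to <html>
anywhere = root.findall(".//a")            # the //a shorthand

print([a.text for a in manual])
```

Every step produces a list, even when it holds a single node, which is why the final result is a list as well.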
Returns: [<a>With Link</a>]
Returns: [<a>With Link</a>]
//div/p/a - Always greedy
Returns: [<a>Three</a>, <a>Four</a>]
Returns: [<p>One</p>, <p>Two</p>, <p><a>Three</a></p>, <p><a>Four</a></p>]
Returns: [<p><a>Three</a></p>, <p><a>Four</a></p>]
//p | //a - Combine two paths
Returns: [<p>One</p>, <p>Two</p>, <a>Three</a>, <a>Four</a>]
//div[@class] - Selects only elements with a class attribute.
Sugar for: //div[attribute::class]
Returns: [<div class="a_class">One</div>, <div class="a_class">Four</div>]
this is different from //div/@class, which will return the actual value of every div’s class attribute.
//div[@*] - Selects elements with any attribute.
Sugar for: //div[attribute::*]
Returns: [<div class="a_class">One</div>, <div data-id="123">Four</div>]
//div[@class="the_class"] - Selects only elements where class == "the_class".
Sugar for: //div[attribute::class="the_class"]
Returns: [<div class="the_class">Four</div>]
this is different from //div/@class="the_class", which returns a single boolean: true if any div’s class attribute equals "the_class".
//div[@id and @class="the_class"] - Combining predicates.
Sugar for: //div[attribute::id and attribute::class="the_class"]
Returns: [<div id="2" class="the_class">Four</div>]
//div[@id and not(@class="the_class")] - Negating predicates.
Returns: [<div id="1" class="a_class">One</div>]
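The attribute predicates above can be tried out with Python's xml.etree.ElementTree, with some caveats. ElementTree supports [@attr] and [@attr='value'] but, as far as I know, not @*, and, not(), or a trailing /@class step, so those are emulated here with ordinary Python. The markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
DOC = """<body>
  <div id="1" class="a_class">One</div>
  <div>Two</div>
  <div data-id="123">Three</div>
  <div id="2" class="the_class">Four</div>
</body>"""

root = ET.fromstring(DOC)


def texts(nodes):
    return [node.text for node in nodes]


# //div[@class]
has_class = root.findall(".//div[@class]")

# //div[@*], emulated: keep divs with a non-empty attribute dict.
any_attr = [d for d in root.iter("div") if d.attrib]

# //div[@class="the_class"]
exact = root.findall(".//div[@class='the_class']")

# //div/@class, emulated: read the attribute off each matching div.
class_values = [d.get("class") for d in has_class]

# //div[@id and @class="the_class"]: chained predicates act like "and".
both = root.findall(".//div[@id][@class='the_class']")

# //div[@id and not(@class="the_class")], with the negation done in Python.
negated = [d for d in root.findall(".//div[@id]")
           if d.get("class") != "the_class"]

print(texts(has_class), texts(any_attr), texts(exact))
print(class_values, texts(both), texts(negated))
```

The negated filter also keeps a div with an id and no class at all, which matches how not(@class="the_class") behaves.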
//div[contains(text(), "Next")] - Does the node’s text (text() is the inner text of the node being tested) contain "Next"?
Returns: [<div id="2">Next</div>, <div id="4">Next</div>]
//div[contains(name(), "A")] - Does the element’s name (name() is the element name) contain "A"?
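Python's ElementTree has no contains() function, so the sketch below spells out roughly what //div[contains(text(), "Next")] tests. It assumes that an element's text attribute (the text before its first child element) is a fair stand-in for the first text node, which holds for simple markup like this hypothetical example but not in every case.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
DOC = """<body>
  <div id="1">Previous</div>
  <div id="2">Next</div>
  <div id="3"><a>Next</a></div>
  <div id="4">Next</div>
</body>"""

root = ET.fromstring(DOC)


def own_text_contains(node, needle):
    # node.text is None when the element starts with a child element,
    # so fall back to an empty string.
    return needle in (node.text or "")


matches = [d for d in root.iter("div") if own_text_contains(d, "Next")]
matched_ids = [d.get("id") for d in matches]

print(matched_ids)
```

The div with id 3 is skipped because the word sits inside its child a element, not in the div's own text.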
//child::div/child::a - Weird, but we’ve already seen this!
/descendant-or-self::div/a - We’ve also seen this!
//div/descendant::* - Grab all descendants
Returns: [<p>Two</p>, <a><h1>Three</h1></a>, <a>Four</a>]
//h1/parent::* - Grab all parents (ONE level up)
//h1/ancestor::* - Grab all ancestors (ALL levels up)
//div[@id="1"]/following-sibling::* - Grab all siblings after
Returns: [<div id="2">Two</div>, <div id="3">Three</div>, <div id="4">Four</div>]
//div[@id="4"]/preceding-sibling::* - Grab all siblings before
Returns: [<div id="1">One</div>, <div id="2">Two</div>, <div id="3">Three</div>]
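Python's ElementTree only understands the .. parent step, so the sibling and ancestor axes above are emulated here with a child-to-parent map. This is a rough sketch of what the axes mean rather than a real XPath engine; the helper names and the markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration.
DOC = """<body>
  <section>
    <div id="1">One</div>
    <div id="2">Two</div>
    <div id="3">Three</div>
    <div id="4">Four</div>
  </section>
</body>"""

root = ET.fromstring(DOC)

# Elements do not store their parent, so build the lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def following_siblings(node):
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]


def preceding_siblings(node):
    siblings = list(parent_of[node])
    return siblings[:siblings.index(node)]


def ancestors(node):
    # Walk upward; results are listed nearest ancestor first.
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found


first = root.find(".//div[@id='1']")
last = root.find(".//div[@id='4']")

after_ids = [d.get("id") for d in following_siblings(first)]
before_ids = [d.get("id") for d in preceding_siblings(last)]
ancestor_tags = [n.tag for n in ancestors(first)]
parent = root.find(".//div[@id='1']/..")  # parent::*, one level up

print(after_ids, before_ids, ancestor_tags, parent.tag)
```

Note that the ancestors helper lists the nearest ancestor first, whereas XPath implementations typically hand back node sets in document order.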