The Magic of XPath

Basics | Syntax | Paths | Predicates | Axes

What is it?

XPath (XML Path Language) is an oft-overlooked, highly expressive standard for navigating and extracting information from XML-like documents (think HTML). If you’re used to CSS-style selectors (things like querySelector(".my-class")), but wish you were able to build in conditionals, or complex queries, then the XPath language is probably what you’re looking for. XPath allows us to query documents like so: //div[contains(text(), "Next")]/a/@href (extracts the href attribute info from all anchor tags in a div containing the text “Next”).

It’s built into many tools and languages that deal with XML. In JavaScript, for example, you can run XPath queries via document.evaluate(...), meaning you get all the expressive power of the XPath language without having to resort to external libraries. If you’re using a tool that interacts with HTML, make a point of checking whether XPath is supported.

If you’re interested in looking at XPath more in depth, I recommend looking at the MDN XPath docs for JS, and keeping Mila Nic’s XPath reference at hand.

Basics

Let’s say I show you the following “selector”:

/Users/niko/Documents/*

Well - it’s pretty self-explanatory, right? We all recognize this as a system path, describing which folders to navigate through in order to return a selection of files. In this case: start at the root /, find the folder named Users, find the folder named niko, find the folder named Documents, and finally select all the files (or “nodes”) inside this last element.

Now let’s look at an XPath selector:

/html/body/div/p/a

It’s surprisingly similar, no? Start at the root (context) /, find the child element html, then a child body, a child div, a child p, and finally a child a. What gets returned depends on your implementation, but it will usually contain the selection along with attributes, like its name, contents, etc. - in JS, for example, we get back an XPathResult object. These objects have contextual info we can examine; more importantly, we can also use the selected nodes as the starting point for another XPath “search”.
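To make the walk concrete, here is a minimal sketch using Python’s standard-library xml.etree.ElementTree, which only understands a limited subset of XPath. The sample document and the /next link are made up for illustration. One assumption to keep in mind: ElementTree hands us the html element as the starting point and does not accept absolute paths on it, so /html/body/div/p/a is written as ./body/div/p/a here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this sketch.
DOC = """
<html>
  <body>
    <div><p><a href="/next">Next</a></p></div>
  </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element itself

# ElementTree queries are relative to the element they are called on,
# so /html/body/div/p/a becomes ./body/div/p/a when starting from <html>.
links = root.findall("./body/div/p/a")

# Each result is a node we can inspect: name, contents, attributes.
print([(a.tag, a.text, a.get("href")) for a in links])  # [('a', 'Next', '/next')]
```

Each item in links is a full node, so it can serve as the starting point for another query.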

How about a slightly more complex selector:

//div[@id="my-id" and not(@class="not-this-class")]/a[contains(text(), "Next Page")]

Pretty cool right? You probably can intuit exactly what this selector returns. The important part is to realize just how expressive we can be compared to more standard CSS-style selectors.

Syntax

The XPath syntax is remarkably simple. There are only a few main concepts:

/ (root) - conceptually, this “root” is where all absolute XPaths start. Imagine it as the “document” as a whole, sitting one level above the html element. On its own, / selects that document node rather than any element, which is rarely what you want returned, but it’s useful to keep in mind as your “starting point” or “context”.

name - a step, wherein we select all the available nodes at the current level, and then filter out those whose name is not name. Note that this functions as a condition: you’re filtering from an available set. This means that when we type /html/body/div, what we’re doing is filtering all the available nodes first for an html node (which is really the only available element at the root of an HTML document), then within that node filtering all available nodes looking for a body node, and finally within that, any node with a name of div. This means that we always get back a list of nodes, because we’re never selecting, only filtering!
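The “filtering, not selecting” idea can be sketched in a few lines of plain Python. The step helper and the sample document are hypothetical, written only for this illustration; real XPath engines are far more general. The last line compares the hand-rolled result with ElementTree’s own (limited) XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this sketch.
DOC = "<html><body><h1>Title</h1><div>One</div><p>Text</p><div>Two</div></body></html>"
root = ET.fromstring(DOC)  # the <html> element


def step(nodes, name):
    """One path step: collect the children of every node, keep those named `name`."""
    return [child for node in nodes for child in node if child.tag == name]


# /html/body/div, starting from <html>: filter for body, then for div.
bodies = step([root], "body")
divs = step(bodies, "div")

print([d.text for d in divs])  # ['One', 'Two']
print(divs == root.findall("./body/div"))  # True: same nodes as the built-in query
```

Because every step takes a list and returns a list, an empty result simply flows through the remaining steps instead of raising an error.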

name[predicate] - we can further specify how we filter nodes by providing a predicate. There are a bunch of predicates we can use, and I’ve listed the main ones in a section below. Predicates simply mean “take all the nodes you’ve identified with name and further filter them using the rules we pass in”.

name/text() - easier understood as part of the above category, XPath includes some functions that allow us to access components of a given node (some of them we can use outside the predicate, like text()). I’ve listed a couple below as well.

axis::name - axis specifiers allow you to move laterally, as well as up and down the given node tree, and select specific attributes of a given tag: this is where XPath’s power lies - you can construct a complex query that then serves as a base point to navigate a document in an arbitrary number of ways, like selecting the sibling link to a paragraph tag containing the word “Next”. Technically axes are specified everywhere: the default axis for any condition is child:: - this is why axis syntax is called “unabbreviated”, meaning /html is actually /child::html.

//name - syntactic sugar for /descendant-or-self::node()/name: it matches name at any depth, which allows us to specify a path without having to spell out every step from the root element.

Paths

/html/body/h3/a

<html>
  <body>
    <h1>First</h1>
    <h2>Second</h2>
    <h3>Third</h3>
    <h3><a>With Link</a></h3><!-- <a>With Link</a> IS SELECTED -->
  </body>
</html>

Returns: [<a>With Link</a>]

//div/p/a

<html>
  <body>
    <div><p>One</p></div>
    <div><p>Two</p></div>
    <div><p><a>Three</a></p></div> <!-- <a>Three</a> IS SELECTED -->
  </body>
</html>

Returns: [<a>Three</a>]

//div/p/a - Always greedy

<html>
  <body>
    <div><p>One</p></div>
    <div><p>Two</p></div>
    <div><p><a>Three</a></p></div> <!-- <a>Three</a> IS SELECTED -->
    <div><p><a>Four</a></p></div> <!-- <a>Four</a> IS SELECTED -->
  </body>
</html>

Returns: [<a>Three</a>, <a>Four</a>]

//div/*

<html>
  <body>
    <div><p>One</p></div><!-- <p>One</p> IS SELECTED -->
    <div><p>Two</p></div><!-- <p>Two</p> IS SELECTED -->
    <div><p><a>Three</a></p></div> <!-- <p><a>Three</a></p> IS SELECTED -->
    <div><p><a>Four</a></p></div> <!-- <p><a>Four</a></p> IS SELECTED -->
  </body>
</html>

Returns: [<p>One</p>, <p>Two</p>, <p><a>Three</a></p>, <p><a>Four</a></p>]
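If you want to try the last two selectors yourself, here is a small sketch with Python’s standard-library ElementTree. It assumes the same sample document as above; the leading . is needed because ElementTree starts from the html element rather than from the document root.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <div><p>One</p></div>
    <div><p>Two</p></div>
    <div><p><a>Three</a></p></div>
    <div><p><a>Four</a></p></div>
  </body>
</html>
"""
root = ET.fromstring(DOC)

# //div/p/a: every matching <a>, not just the first one.
links = root.findall(".//div/p/a")
print([a.text for a in links])  # ['Three', 'Four']

# //div/*: every direct child of every <div>, whatever its name.
children = root.findall(".//div/*")
print([c.tag for c in children])  # ['p', 'p', 'p', 'p']
```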

results = content.xpath('//div')
results.xpath('./p/a')  # Similar to a system path, "." means the current node; a path starting with "/" would go back to the root.

<html>
  <body>
    <div><p>One</p></div>
    <div><p>Two</p></div>
    <div><p><a>Three</a></p></div> <!-- <a>Three</a> IS SELECTED -->
    <div><p><a>Four</a></p></div> <!-- <a>Four</a> IS SELECTED -->
  </body>
</html>

Returns: [<a>Three</a>, <a>Four</a>]
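The same “query, then query again from the results” pattern can be sketched with the standard library. This is an assumption-laden stand-in for the snippet above: ElementTree’s findall returns a plain list, so we loop over the div nodes ourselves and run the relative path on each one.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <div><p>One</p></div>
    <div><p>Two</p></div>
    <div><p><a>Three</a></p></div>
    <div><p><a>Four</a></p></div>
  </body>
</html>
"""
root = ET.fromstring(DOC)

# First query: all <div> elements in the document.
divs = root.findall(".//div")

# Second query: "./p/a" is evaluated relative to each <div> in turn.
links = [a for div in divs for a in div.findall("./p/a")]

print(len(divs))  # 4
print([a.text for a in links])  # ['Three', 'Four']
```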

//p | //a - Combine two paths

<html>
  <body>
    <div><p>One</p></div><!-- <p>One</p> IS SELECTED -->
    <div><p>Two</p></div><!-- <p>Two</p> IS SELECTED -->
    <div><p><a>Three</a></p></div> <!-- <p><a>Three</a></p> AND <a>Three</a> ARE SELECTED -->
    <div><p><a>Four</a></p></div> <!-- <p><a>Four</a></p> AND <a>Four</a> ARE SELECTED -->
  </body>
</html>

Returns: [<p>One</p>, <p>Two</p>, <p><a>Three</a></p>, <a>Three</a>, <p><a>Four</a></p>, <a>Four</a>]

Predicates

//div[@class] - Selects only elements with a class attribute. Sugar for: //div[attribute::class]

<html>
  <body>
    <div class="a_class">One</div><!-- <div class="a_class">One</div> IS SELECTED -->
    <div>Two</div>
    <div>Three</div>
    <div class="a_class">Four</div> <!-- <div class="a_class">Four</div> IS SELECTED -->
  </body>
</html>

Returns: [<div class="a_class">One</div>, <div class="a_class">Four</div>]

NOTE:
this is different from //div/@class, which returns the class attribute itself (its value, not the div) for every div that has one.

//div[@*] - Selects elements with any attribute.
Sugar for: //div[attribute::*]

<html>
  <body>
    <div class="a_class">One</div><!-- <div class="a_class">One</div> IS SELECTED -->
    <div>Two</div>
    <div>Three</div>
    <div data-id="123">Four</div> <!-- <div data-id="123">Four</div> IS SELECTED -->
  </body>
</html>

Returns: [<div class="a_class">One</div>, <div data-id="123">Four</div>]

//div[@class="the_class"] - Selects only elements where class == "the_class". Sugar for: //div[attribute::class="the_class"]

<html>
  <body>
    <div class="a_class">One</div>
    <div>Two</div>
    <div>Three</div>
    <div class="the_class">Four</div> <!-- <div class="the_class">Four</div> IS SELECTED -->
  </body>
</html>

Returns: [<div class="the_class">Four</div>]

NOTE:
this is different from //div/@class="the_class", which returns a single boolean for the whole expression: true if at least one div has a class equal to "the_class", false otherwise.

//div[@id and @class="the_class"] - Combining predicates. Sugar for: //div[attribute::id and attribute::class="the_class"]

<html>
  <body>
    <div id="1" class="a_class">One</div>
    <div>Two</div>
    <div class="the_class">Three</div>
    <div id="2" class="the_class">Four</div> <!-- <div id="2" class="the_class">Four</div> IS SELECTED -->
  </body>
</html>

Returns: [<div id="2" class="the_class">Four</div>]
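Attribute predicates are among the few that Python’s standard-library ElementTree supports, so they are easy to try out. A sketch under one stated assumption: ElementTree has no and keyword, so the combined test is written as two chained predicates, [@id][@class='the_class'], which filters the same way for non-positional tests like these.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <div id="1" class="a_class">One</div>
    <div>Two</div>
    <div class="the_class">Three</div>
    <div id="2" class="the_class">Four</div>
  </body>
</html>
"""
root = ET.fromstring(DOC)

# //div[@class]: has a class attribute, whatever its value.
has_class = root.findall(".//div[@class]")
print([d.text for d in has_class])  # ['One', 'Three', 'Four']

# //div[@class="the_class"]: class must equal the given value.
exact = root.findall(".//div[@class='the_class']")
print([d.text for d in exact])  # ['Three', 'Four']

# //div[@id and @class="the_class"], written as chained predicates.
both = root.findall(".//div[@id][@class='the_class']")
print([d.text for d in both])  # ['Four']
```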

//div[@id and not(@class="the_class")] - Negating predicates.

<html>
  <body>
    <div id="1" class="a_class">One</div> <!-- <div id="1" class="a_class">One</div> IS SELECTED -->
    <div>Two</div>
    <div class="the_class">Three</div>
    <div id="2" class="the_class">Four</div>
  </body>
</html>

Returns: [<div id="1" class="a_class">One</div>]

//div[contains(text(), "Next")] - Selects elements whose text contains "Next" (text() selects the text directly inside the node being tested).

<html>
  <body>
    <div id="1">Previous</div> 
    <div id="2">Next</div><!-- <div id="2">Next</div> IS SELECTED -->
    <div id="3">Previous</div>
    <div id="4">Next</div><!-- <div id="4">Next</div> IS SELECTED -->
  </body>
</html>

Returns: [<div id="2">Next</div>, <div id="4">Next</div>]
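Functions like contains() and not() are not part of the XPath subset in Python’s standard-library ElementTree, so the sketch below emulates the two predicates with ordinary Python filters instead of running them as XPath. The sample document is made up, and el.text is used as a rough stand-in for text() (it is the text that appears before the element’s first child).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this sketch.
DOC = """
<html>
  <body>
    <div id="1" class="a_class">Previous</div>
    <div id="2">Next</div>
    <div id="3" class="the_class">Previous</div>
    <div id="4" class="the_class">Next</div>
  </body>
</html>
"""
root = ET.fromstring(DOC)

# Emulates //div[contains(text(), "Next")].
with_next = [d for d in root.iter("div") if "Next" in (d.text or "")]
print([d.get("id") for d in with_next])  # ['2', '4']

# Emulates //div[@id and not(@class="the_class")].
not_the_class = [
    d for d in root.findall(".//div[@id]") if d.get("class") != "the_class"
]
print([d.get("id") for d in not_the_class])  # ['1', '2']
```

Note that not(@class="the_class") also keeps elements with no class attribute at all, which is why the div with id 2 shows up in the second result.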

//*[contains(name(), "A")] - Selects elements whose name contains "A" (name() is the element name). Note the * step: a div step would only ever test div elements.

<html>
  <body>
    <AB>Previous</AB><!-- <AB>Previous</AB> IS SELECTED -->
    <BC>Next</BC>
    <CD>Previous</CD>
    <DE>Next</DE>
  </body>
</html>

Returns: [<AB>Previous</AB>]

Axes

//child::div/child::a - Weird, but we’ve already seen this!
Sugarized: //div/a

<html>
  <body>
    <div>One</div> 
    <div>Two</div>
    <div>Three</div>
    <div><a>Four</a></div><!-- <a>Four</a> IS SELECTED -->
  </body>
</html>

Returns: [<a>Four</a>]

/descendant-or-self::div/a - We’ve also seen this!
Sugarized: //div/a

<html>
  <body>
    <div>One</div> 
    <div>Two</div>
    <div><a>Three</a></div><!-- <a>Three</a> IS SELECTED -->
    <div><a>Four</a></div><!-- <a>Four</a> IS SELECTED -->
  </body>
</html>

Returns: [<a>Three</a>, <a>Four</a>]

//div/descendant::* - Grab all descendants

<html>
  <body>
    <div>One</div> 
    <div><p>Two</p></div><!-- <p>Two</p> IS SELECTED -->
    <div><a><h1>Three</h1></a></div><!-- <a><h1>Three</h1></a> AND <h1>Three</h1> ARE SELECTED -->
    <div><a>Four</a></div><!-- <a>Four</a> IS SELECTED -->
  </body>
</html>

Returns: [<p>Two</p>, <a><h1>Three</h1></a>, <h1>Three</h1>, <a>Four</a>]

//h1/parent::* - Grab all parents (ONE level up)
Sugarized: //h1/..

<html>
  <body>
    <div>One</div> 
    <div><p>Two</p></div>
    <div><a><h1>Three</h1></a></div><!-- <a><h1>Three</h1></a> IS SELECTED -->
    <div><a>Four</a></div>
  </body>
</html>

Returns: [<a><h1>Three</h1></a>]

//h1/ancestor::* - Grab all ancestors (ALL levels up)

<html>
  <body>
    <div>One</div> 
    <div><p>Two</p></div>
    <div><a><h1>Three</h1></a></div><!-- <html>, <body>, this <div> AND <a><h1>Three</h1></a> ARE SELECTED -->
    <div><a>Four</a></div>
  </body>
</html>

Returns: [<html>...</html>, <body>...</body>, <div><a><h1>Three</h1></a></div>, <a><h1>Three</h1></a>]
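Moving upwards can be sketched with Python’s standard-library ElementTree too. It understands the .. shorthand for the parent, but has no ancestor:: axis, so the ancestors function below is a hypothetical helper that walks a child-to-parent map by hand. It returns the nodes in document order (outermost first), which is how the ancestors are listed above.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <div>One</div>
    <div><p>Two</p></div>
    <div><a><h1>Three</h1></a></div>
    <div><a>Four</a></div>
  </body>
</html>
"""
root = ET.fromstring(DOC)

# //h1/parent::* (or //h1/..): exactly one level up.
parents = root.findall(".//h1/..")
print([p.tag for p in parents])  # ['a']

# ElementTree nodes do not know their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Emulates ancestor::*: every element above `node`, outermost first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain[::-1]


h1 = root.find(".//h1")
print([a.tag for a in ancestors(h1)])  # ['html', 'body', 'div', 'a']
```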

//div[@id="1"]/following-sibling::* - Grab all siblings after

<html>
  <body>
    <div id="1">One</div> 
    <div id="2">Two</div><!-- <div id="2">Two</div> IS SELECTED -->
    <div id="3">Three</div><!-- <div id="3">Three</div> IS SELECTED -->
    <div id="4">Four</div><!-- <div id="4">Four</div> IS SELECTED -->
  </body>
</html>

Returns: [<div id="2">Two</div>, <div id="3">Three</div>, <div id="4">Four</div>]

//div[@id="4"]/preceding-sibling::* - Grab all siblings before

<html>
  <body>
    <div id="1">One</div><!-- <div id="1">One</div> IS SELECTED -->
    <div id="2">Two</div><!-- <div id="2">Two</div> IS SELECTED -->
    <div id="3">Three</div><!-- <div id="3">Three</div> IS SELECTED -->
    <div id="4">Four</div>
  </body>
</html>

Returns: [<div id="1">One</div>, <div id="2">Two</div>, <div id="3">Three</div>]
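The sibling axes are not available in Python’s standard-library ElementTree either, but the idea is easy to emulate: find the node’s parent, then slice the parent’s list of children. The helper names below are hypothetical and exist only in this sketch.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <div id="1">One</div>
    <div id="2">Two</div>
    <div id="3">Three</div>
    <div id="4">Four</div>
  </body>
</html>
"""
root = ET.fromstring(DOC)

# Child-to-parent lookup, since ElementTree nodes do not know their parent.
parent_of = {child: parent for parent in root.iter() for child in parent}


def following_siblings(node):
    """Emulates following-sibling::*: same parent, later in the document."""
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]


def preceding_siblings(node):
    """Emulates preceding-sibling::*: same parent, earlier in the document."""
    siblings = list(parent_of[node])
    return siblings[:siblings.index(node)]


first = root.find(".//div[@id='1']")
last = root.find(".//div[@id='4']")

print([d.get("id") for d in following_siblings(first)])  # ['2', '3', '4']
print([d.get("id") for d in preceding_siblings(last)])  # ['1', '2', '3']
```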