Scraping Real Estate Listings With Scrapy
Scrapy Primer | Our Target | Our Scraper | Parsing the Response | The Scrapy Shell | Extracting the Item | Passing Arguments | Adding a Cache | Post-Processing Data | Exporting | Conclusion | Github Repo
If you’re ever fortunate enough to be considering purchasing property, scraping a few property-listing sites is a great way to collect a dataset that you can then play around with, or use as a reference when evaluating potential purchases. Scraping for data is all the more important if you’re not intimately familiar with the area you’re looking at, or you want an objective, data-driven “second opinion” to go along with your broker’s recommendations.
In this writeup, I’ll be showing you how I scraped around 10,000 records from a local real-estate listing website, integrated currency conversion into the data cleanup as well as a small cache mechanism for future scraping rounds, and exported everything as a sanitized CSV file.
You can hand-roll your scraper, or lean on a more robust library. I’ll be using Scrapy, a Python-based scraping and crawling framework. Note that I mention crawling (traversing links while on a site) - this is vital if you’re scraping more “mainstream” sites, given that you’ll probably have to deal with pagination, and doing so manually is a bore.
Scrapy Primer
To create a new Scrapy project, first make sure you’ve installed Scrapy onto your local machine, and then run:
scrapy startproject real_estate
When starting a new project, Scrapy populates a folder with a bunch of new files, all of which adhere to a simple lexicon and exist in a pipeline of sorts:
Spiders - Appropriately named, given that these classes allow us to crawl and scrape a site. These are the core of any Scrapy program, and where our most complex logic lives. A spider should return or yield items as they are scraped from a site.
Items - What spiders return, usually a dict filled with information from a page you’ve scraped.
Middlewares - Hooks that let us inject code in between Scrapy’s call to a website and its response.
Pipelines - Where we handle data cleaning and post-processing.
We could technically write our entire project in a single spider file, but the whole point of separating out our concerns is to help us deal with the complexity inherent in pulling a ton of data out of hundreds of pages. It’s good practice to try to adhere to Scrapy’s structure.
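For reference, startproject generates roughly the following layout (details can vary slightly between Scrapy versions):

real_estate/
    scrapy.cfg
    real_estate/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py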
Our Target
In my case, this whole exercise came about while looking at properties in Costa Rica, but to generalize a bit, I’ve adapted the examples to scrape Trulia instead (US property listings). I’d suggest you look up a site relevant to your geographic location, otherwise this whole exercise won’t be nearly as fun.
From each listing, we’ll be extracting the listing’s title, id, city, location, size, number of bathrooms, bedrooms, parking spaces, and a few other data points that might be useful.
As you can see, this is a non-trivial amount of information - a manual equivalent to what we want to do would be possible, albeit extremely tedious.
Our Scraper
Let’s get started writing our scraper! First, create a real_estate_spider.py file in the spiders/ folder (or name it whatever you like). I called mine trulia_spider.py after the name of the site we’ll be crawling.
Go ahead and write out the following code in our spider file:
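A minimal first version might look like this - the class name is up to you, and the two print statements are just placeholders so we can see the spider run:

import scrapy


class TruliaSpider(scrapy.Spider):
    name = "trulia"

    def start_requests(self):
        url = "https://www.trulia.com/NY/New_York/"
        # Ask Scrapy to fetch the url and hand the response to self.parse
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response)
        print("Parsed the listings page")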
Our first scraper is nothing to be too proud of - it doesn’t even “scrape” anything yet. What it does is illustrate how the Spider class works - it needs a start_requests method that gets called when the Spider is first started; this method then calls a scrapy.Request with a url to pull and a callback function used to parse the results of calling this url. If you’re wondering about the yield keyword, you should check out this Stack Overflow thread.
If you now run scrapy crawl trulia, you should see a bunch of debug statistics, along with our two print statements at the very end. The <200 https://www.trulia.com/NY/New_York/> output is the response object that Scrapy gives us to play around with, and includes the status of the request.
Let’s flesh out our start_requests a bit. I’m going to add a map of some city names to urls I want to scrape, so I can then iterate through these and call the scraper. After all, I want information on both New York and Brooklyn:
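Here’s a sketch of the updated start_requests (the Brooklyn URL is my stand-in - use whichever locations you care about):

    def start_requests(self):
        urls = {
            "new_york": "https://www.trulia.com/NY/New_York/",
            "brooklyn": "https://www.trulia.com/NY/Brooklyn/",
        }
        for _city, url in urls.items():
            yield scrapy.Request(url=url, callback=self.parse)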
We’ll deal with passing the city name as an argument to our callback a bit later. Now, every entry in our urls dict gets called, and the result gets passed into self.parse.
Parsing the response
The key to the scraping game is being able to describe, with selectors, how to actually extract information we care about from a given HTML document. Scrapy facilitates two methods to do so: CSS selectors (which you’ll be familiar with if you’ve ever written CSS), and XPath (an expressive language used to navigate XML-like documents). In this tutorial I’m primarily using XPath, which I’ve covered extensively here: The Magic of XPath.
We’ve called a scrapy.Request, which has passed the result of the request into self.parse. Because we’re iterating through a list of cities/URLs, the parse function will receive two response objects, one for New York and the other for Brooklyn. Each of these contains the first page of property listings - we want to get the links to each individual listing contained in the page, as well as the “Next Page” link, so that we can crawl the list until the very end.
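In parse, that looks something like the sketch below. The home-card-sale selector is the one we’ll verify in the shell later; the rel/aria-label values on the next-page link are assumptions you should double-check against the live markup:

    def parse(self, response):
        # Every listing card on the current results page
        listings = response.xpath('//div[contains(@data-testid, "home-card-sale")]')
        # The href of the pagination link (None if no such link exists)
        next_page_link = response.xpath(
            '//a[@rel="next" and contains(@aria-label, "Next")]/@href'
        ).get()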
We’re using XPath to select all div tags that have a data-testid attribute containing “home-card-sale” as its value. This gives us a list of all the listings featured on this page. Next, we’re pulling out the href value from the link with a given rel and aria-label attribute (our next-page link). Whenever the selector cannot find a matching node, extracting it will simply return None, meaning we can check for truthiness when deciding whether to continue parsing.
Next, let’s iterate through every listing link, go to the page in question, and extract the information we’re concerned with.
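A sketch - the anchor selector inside each card is an assumption, so confirm it in the shell:

        for listing in listings:
            listing_link = listing.xpath(".//a/@href").get()
            if not listing_link:
                continue
            # response.urljoin turns a relative href into an absolute url
            yield scrapy.Request(
                url=response.urljoin(listing_link),
                callback=self.parse_listing,
            )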
Often, links in websites are relative to the current location - Scrapy provides a convenience function to make the url absolute, meaning we can then pass it into a new Request. Finally, we call a new parsing function, self.parse_listing, that will deal with the particular listing we’ve just extracted. We’ll write this new function in a bit.
In the meantime - what do we do about that next_page_link we defined earlier? We want to navigate to it as well, no? Let’s make sure we do that:
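Still inside parse, right after the listing loop:

        if next_page_link:
            # Recurse: the next results page is parsed exactly like this one
            yield scrapy.Request(
                url=response.urljoin(next_page_link),
                callback=self.parse,
            )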
Notice that we’re engaging in a bit of good old recursion here. Because we assume the next page of listings is going to be pretty much the same as the one we’re currently parsing, we can simply call our parse function again with the result of following the “Next Page” link. If the link happens to be None, meaning we’ve reached the end, Scrapy will simply stop recursing.
Now we’re ready to actually begin extracting the object that we’ll be saving as we parse the site.
The Scrapy Shell
Before we begin extracting the information we’re concerned with, let’s imagine we weren’t quite sure how to get to it. How annoying would it be to have to run and re-run our program endlessly until we got our selectors right? Can you imagine the number of print statements we’d need to debug our info?
Luckily, the Scrapy designers have thought of this already, and provide us with a brilliant little tool: the Scrapy shell. The Scrapy shell allows us to fetch the contents of a given url, and interactively figure out our selectors, without having to ping the site a million times every time we run the program.
Let’s open the shell in our terminal:
scrapy shell
We can fetch an example of a document we’ll be scraping:
fetch('https://trulia.com/NY/New_York/')
If the fetch is successful, the result gets assigned to the global variable response, meaning now we can play with our selectors to our heart’s content.
Try it - how about selecting a div with a given attribute?
response.xpath('//div[contains(@data-testid, "home-card-sale")]')
The Scrapy shell returns a list of results it’s found. It’s an incredibly useful tool (this is how I found the XPath selectors we used before), and can be used for much more than just navigating the page - for now we’ll limit ourselves to using it to make sure we’ve got our selectors right.
Extracting the “Item”
This part is easy if we use the shell - we just try and try until we get our selectors right, and throw them in a dict structure we return from the function (we’ll put all this functionality in the self.parse_listing function we declared before but haven’t yet written). Navigate to any of the individual listing pages and try to start identifying, using the Scrapy shell, the selectors you’ll need.
We can iterate between the shell and our code in order to build up our selectors. Let’s finally write our listing-page scraper function:
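Mine ended up looking roughly like this - every selector below is a placeholder to confirm (or replace) in the shell, and you can add as many fields as you care about:

    def parse_listing(self, response):
        yield {
            "title": response.xpath("//h1//text()").get(),
            "price": response.xpath('//*[@data-testid="price"]//text()').get(),
            "bedrooms": response.xpath('//*[@data-testid="bed"]//text()').get(),
            "bathrooms": response.xpath('//*[@data-testid="bath"]//text()').get(),
            "size": response.xpath('//*[@data-testid="sqft"]//text()').get(),
            "url": response.url,
        }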
We’re also going to add a regular expression to our Spider class so we can pull out the given listing’s ID number, which we’ll use when we write a cache for our Spider:
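Something along these lines, as a class attribute - the pattern assumes the numeric id sits at the end of the listing URL, so adjust it to your target site:

import re


class TruliaSpider(scrapy.Spider):
    name = "trulia"
    # Pulls the trailing numeric id out of a listing URL (an assumption
    # about the URL scheme - tweak the pattern for your site)
    listing_id_regex = re.compile(r"(\d+)/?$")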
A few things to note: MAKE SURE YOU TURN OFF JS WHEN YOU’RE LOOKING AT WHAT TO SCRAPE. When Scrapy makes a request, it receives only the HTML - you might be tricked into thinking that there’s something on the page when really it’s being added dynamically by a JS script after the page-load. Google how to turn off Javascript in your current browser session and save yourself a couple of hours of debugging.
Also note that all we’re doing in this function is yielding a dict with the information we want - nothing special at all. In fact, that’s pretty much all there is to this game. We’re about to run our scraper properly for the first time, but first, let’s incorporate a little counter so that we don’t actually scrape the entire Trulia site (we don’t want to be pinging the server unnecessarily):
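One simple way to do it (the limit of 3 pages is arbitrary):

    page_limit = 3  # arbitrary cap while testing

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pages_crawled = 0

And in parse, only follow the “Next Page” link while we’re under the limit:

        self.pages_crawled += 1
        if next_page_link and self.pages_crawled < self.page_limit:
            yield scrapy.Request(url=response.urljoin(next_page_link), callback=self.parse)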
This way, we’ll be able to limit the pages we parse while testing.
Let’s run our scraper to make sure it’s working. In the root folder of the Scrapy project, run:
scrapy crawl trulia -o output.jl
Make sure that trulia is the name of the spider you declared in self.name way back at the beginning of the tutorial as part of the Spider class. What this command does is dump every dict the parser yields into a .jl file, which is nothing more than a bunch of JSON structs separated by newlines. Once the command is done, you can open it up in an editor and check out all the information we’ve managed to scrape!
Passing arguments to our callbacks
Often you’ll want to pass an argument into a particular parser - in our case, I’d like to pass the city information into self.parse, and into our self.parse_listing. Doing this is fairly straightforward - let’s refactor both our self.parse and our self.parse_listing functions to accept these arguments. Let’s also change _city to city in our start_requests function, given that now we’re using it:
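A sketch of the refactor:

    def start_requests(self):
        urls = {
            "new_york": "https://www.trulia.com/NY/New_York/",
            "brooklyn": "https://www.trulia.com/NY/Brooklyn/",
        }
        for city, url in urls.items():
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                cb_kwargs={"city": city},
            )

    def parse(self, response, city):
        ...

    def parse_listing(self, response, city, listing_id):
        ...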
And now let’s actually do something with the variables:
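Inside parse, a sketch:

        for listing in listings:
            listing_link = listing.xpath(".//a/@href").get()
            if not listing_link:
                continue
            match = self.listing_id_regex.search(listing_link)
            listing_id = match.group(1) if match else None
            yield scrapy.Request(
                url=response.urljoin(listing_link),
                callback=self.parse_listing,
                cb_kwargs={"city": city, "listing_id": listing_id},
            )

Inside parse_listing, you can now add city and listing_id to the dict you yield.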
Here we’re passing both the city and each listing’s id (which we’ve extracted from the listing link with our regular expression) into our parse_listing function. We do this by passing a cb_kwargs parameter to our scrapy.Request, and making sure we include these parameters in the signature of the callback we’re calling.
At this point we’re pretty much done with all the major functionality of our scraper. We’re going to do one last thing, which is add a very naive cache so that every time we scrape, we add onto our output file only if there are new entries. Otherwise, we skip requesting the info: this saves us from pinging the site too often, and makes it easy to deploy the scraper as a service that doesn’t hog the line every time we run it.
Adding a cache
Let’s add two new functions to our spider file (the one we’ve been working in):
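A minimal sketch, using a JSON file on disk (the filename is arbitrary):

import json
import os

CACHE_FILE = "listing_cache.json"


def load_cache(default=None):
    # Return the cached dict if the file exists, otherwise the default
    if not os.path.exists(CACHE_FILE):
        return default if default is not None else {}
    with open(CACHE_FILE) as cache_file:
        return json.load(cache_file)


def save_cache(cache):
    # Serialize the cache dict back out to disk
    with open(CACHE_FILE, "w") as cache_file:
        json.dump(cache, cache_file)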
These two functions simply load and serialize/dump a dict of values, with a fallback we can declare in the default argument of load_cache. In our case, we want that dict to hold the id of each listing as its keys - we don’t care about the values, we only want a performant lookup of whether an id exists in our cache. Yes - it’s a very naive way of going about this, but for our intents and purposes it illustrates the point quite nicely.
Let’s actually implement the cache in our spider.
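Here’s how the pieces fit together in a sketch - note that I persist the cache in the spider’s closed hook, a simplification of saving after each location iteration:

    def start_requests(self):
        # Load the cache once when the spider starts (falls back to {})
        self.cache = load_cache(default={})
        urls = {
            "new_york": "https://www.trulia.com/NY/New_York/",
            "brooklyn": "https://www.trulia.com/NY/Brooklyn/",
        }
        for city, url in urls.items():
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs={"city": city})

    def parse(self, response, city):
        listings = response.xpath('//div[contains(@data-testid, "home-card-sale")]')
        for listing in listings:
            listing_link = listing.xpath(".//a/@href").get()
            match = self.listing_id_regex.search(listing_link) if listing_link else None
            listing_id = match.group(1) if match else None
            # Skip listings whose id we've already scraped on a previous run
            if not listing_id or listing_id in self.cache:
                continue
            yield scrapy.Request(
                url=response.urljoin(listing_link),
                callback=self.parse_listing,
                cb_kwargs={"city": city, "listing_id": listing_id},
            )
        # ...follow next_page_link here, as before...

    def parse_listing(self, response, city, listing_id):
        # Record the id - the value doesn't matter, only that the key exists
        self.cache[listing_id] = True
        # ...yield the item dict as before...

    def closed(self, reason):
        # Persist the cache once the crawl finishes
        save_cache(self.cache)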
What we’re doing here is simply loading our cache into a field of our spider class (the first time we load it, because the file doesn’t exist yet, we’ll get back the default cache object we just declared). Then, whenever we successfully call one of the parse_listing callbacks, we add a key to our cache dict. It doesn’t really matter what we set as the value, given that we only need Python to tell us whether the key exists or not. At the end of every location iteration we save out our current cache.
Now, every time we call our scraper again, we’ll only parse new listings, while retaining the old ones (as long as we’re appending to our original parse output file). If we want to scrape everything all over again, we can simply delete this file.
Post-processing Data
There are still a few things I’d like to do with the data we’re pulling - namely, turn strings into numbers where appropriate, and maybe prune some other data. We’ll build all of this in the dedicated pipelines.py file, which is meant exactly for this type of data cleaning/post-processing.
Open pipelines.py and type in the following code:
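Something like the following (the class name is what startproject generates for a project called real_estate; yours may differ):

class RealEstatePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        pass

    def close_spider(self, spider):
        # Called once when the spider finishes crawling
        pass

    def process_item(self, item, spider):
        # Called for every item the spider yields
        return item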
This is our basic Pipeline class - the functions we’ve declared are fairly intuitive: open_spider runs when the Spider is started, close_spider runs when the Spider is done crawling, and process_item actually processes the dict we yield from our parse_listing callback in our Spider class. In order to activate the Pipeline, we have to declare it in our settings.py file, by uncommenting the ITEM_PIPELINES section and making sure the declaration in the associated dict matches our class name. As you can see, you can have multiple Pipelines working together if you want to. The number after the pipeline simply puts the pipelines in order - you can ignore it for now.
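In settings.py, the uncommented section looks something like this:

ITEM_PIPELINES = {
    "real_estate.pipelines.RealEstatePipeline": 300,
}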
First things first, let’s add a helper function to parse out the numbers we receive (just add it outside the Pipeline class). At any point, we could receive a None value, a string meant to be an integer or a float, or an actual number (if we’ve decided to parse out numbers in our parse_listing function). Let’s handle all the edge-cases here.
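A sketch of such a helper - the regular expression and default handling reflect my assumptions about the edge-cases:

import re

NUMBER_REGEX = re.compile(r"^-?\d+(\.\d+)?$")


def parse_number(value, default=None):
    # Already a number? Return it as-is.
    if isinstance(value, (int, float)):
        return value
    if value is None:
        return default
    cleaned = str(value).replace(",", "").strip()
    # Only try to convert strings that actually look like numbers
    if not NUMBER_REGEX.match(cleaned):
        return default
    return float(cleaned) if "." in cleaned else int(cleaned)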
Nothing too fancy, just a bunch of edge-cases. Most importantly, we run a regular expression against the string to make sure it’s actually a number we’re going to try and parse, otherwise we make the decision to return a default value we can declare.
I’ll also add a helper function to extract the actual amount of a price given with a currency character in front of it, or a number in a string (like “2 Beds”).
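Again, a sketch, placed alongside parse_number in pipelines.py:

AMOUNT_REGEX = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_amount(value, default=None):
    # Pull the numeric portion out of strings like "$425,000" or "2 Beds"
    if value is None:
        return default
    match = AMOUNT_REGEX.search(str(value))
    if not match:
        return default
    return parse_number(match.group(0), default=default)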
Now we can actually flesh out process_item. You’ll recognize it’s very similar to our parse_listing function - it just adds one more step in the data processing.
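A sketch using the helpers above - the field names assume the keys yielded by parse_listing, so rename them to match your item dict:

from scrapy.exceptions import DropItem


class RealEstatePipeline:
    def process_item(self, item, spider):
        item["price"] = extract_amount(item.get("price"))
        item["bedrooms"] = extract_amount(item.get("bedrooms"))
        item["bathrooms"] = extract_amount(item.get("bathrooms"))
        item["size"] = extract_amount(item.get("size"))
        # No price means the listing isn't useful to us - drop it
        if not item["price"]:
            raise DropItem("Listing has no price")
        return item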
We parse all the values we’re concerned with, and raise a DropItem exception if the item doesn’t actually have an associated price. DropItem won’t stop the execution of our spider; rather, it’ll simply make it ignore this particular item.
This is also where I incorporated a currency-conversion function in my own, Costa Rica-specific crawler. In open_spider I declared a self.currency_rate field for the class and called the fixer.io API to retrieve the appropriate currency conversion rate, then added one more function to wrap the price fields in our data, converting them if the currency was Costa Rican Colones. Point being - if there’s anything else you want to do with your data, just tack it on at this stage.
Exporting
The last thing we probably want to do is export a CSV file with the data we’ve scraped, so we can pass it along to a statistical model, or open it with any application we like. To do so, we’re going to leverage the Pipeline again, though in theory Scrapy includes the concept of a Feed, which could also take care of this (we’ll skip it, given the length of the tutorial so far).
Let’s open a file when the Spider starts, to which we can write the CSV lines. Let’s also make sure to import the csv library.
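A sketch - the output filename and column list are placeholders to adapt:

import csv

COLUMNS = ["id", "city", "title", "price", "bedrooms", "bathrooms", "size", "url"]


class RealEstatePipeline:
    def open_spider(self, spider):
        self.csv_file = open("listings.csv", "w", newline="")
        self.writer = csv.writer(self.csv_file)
        # The first row acts as the header/column names
        self.writer.writerow(COLUMNS)

    def close_spider(self, spider):
        self.csv_file.close()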
Here we open a file in “write” mode when the Spider starts, and write in our first row, which will act as the table columns. When the Spider finishes working, at close_spider, we close our CSV file. Now let’s add a line to write our data into the CSV file as it’s processed.
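At the end of process_item, right before returning the item:

        # Write the cleaned values in the same order as the header columns
        self.writer.writerow([item.get(column) for column in COLUMNS])
        return item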
Make sure the order and number of items you’re writing out matches the order and number of columns! The next time you run the scraper (make sure you remove the cache first), you’ll write out the CSV file. If you want, you can add the same mechanism with a JSON Lines file (like the one we’ve been generating with the crawl command, except hardcoded), and have the CSV file be generated from the final output of the .jl file - that way you don’t have to clear your cache every time you want to export a new CSV file.
Conclusion
Hopefully this serves as a good starting point for building up your own crawler/scraper. This (relatively) modest amount of work has given you the ability to traverse entire sites automatically, extract pretty much any datapoint you want from individual pages (as long as it’s not generated by JS), clean and post-process the data, and write it out to a file. Scrapy is a fantastic framework on which to build even more complex use-cases, given its clean separation of concerns and the insane amount of utility functions it provides out of the box. Go forth and play!