Scraping Real Estate Listings With Scrapy
Scrapy Primer | Our Target | Our Scraper | Parsing the Response | The Scrapy Shell | Extracting the Item | Passing Arguments | Adding a Cache | Post-Processing Data | Exporting | Conclusion | Github Repo
If you’re ever fortunate enough to be considering purchasing property, scraping a few property-listing sites is a great way to collect a dataset that you can then play around with, or use as a reference when evaluating potential purchases. Scraping for data is all the more important if you’re not intimately familiar with the area you’re looking at, or you want an objective, data-driven “second opinion” to go along with your broker’s recommendations.
In this writeup, I’ll be showing you how I scraped around 10,000 records from a local real-estate listing website, integrated currency conversion into the data cleanup as well as a small cache mechanism for future scraping rounds, and exported everything as a sanitized CSV file.
Scrapy Primer
You can hand-roll your scraper, or lean on a more robust library. I’ll be using Scrapy, a Python-based scraping and crawling framework. Note that I mention crawling (traversing links while on a site) - this is vital if you’re scraping more “mainstream” sites, given that you’ll probably have to deal with pagination, and doing so manually is a bore.
To create a new Scrapy project, first make sure you’ve installed Scrapy onto your local machine, and then run:
scrapy startproject real_estate
When starting a new project, Scrapy populates a folder with a bunch of new files, all of which adhere to a simple lexicon, and exist in a pipeline of sorts:
Spiders - Appropriately named, given that these classes allow us to crawl and scrape a site. These are the core of any Scrapy program, and where our most complex logic lives. A spider should return or yield items as they are scraped from a site.
Items - What spiders return, usually a dict filled with information from a page you’ve scraped.
Middlewares - Hooks that let us inject code in between Scrapy’s call to a website and its response.
Pipelines - Where we handle data cleaning and post-processing.
We could technically write our entire project in the one
spiders.py file, but the whole point of separating out our concerns is to help us deal with the complexity inherent in pulling a ton of data out of hundreds of pages. It’s good practice to try and adhere to Scrapy’s structure.
Our Target
In my case, this whole exercise came about while looking at properties in Costa Rica, but to generalize a bit, I’ve adapted the examples to scrape Trulia instead (US property listings). I’d suggest you look up a site relevant to your geographic location, otherwise this whole exercise won’t be nearly as fun.
From each listing, we’ll be extracting the listing’s title, id, city, location, size, number of bathrooms, bedrooms, parking spaces, and a few other data points that might be useful.
As you can see, this is a non-trivial amount of information - a manual equivalent to what we want to do would be possible, albeit extremely tedious.
Our Scraper
Let’s get started writing our scraper! First, create a real_estate_spider.py file in the spiders/ folder (or name it whatever you like). I called mine trulia_spider.py after the name of the site we’ll be crawling.
Go ahead and write out the following code in our spider file:
Our first scraper is nothing to be too proud of - it doesn’t even “scrape” anything yet. What it does is illustrate how the Spider class works - it needs a start_requests method that gets called when the Spider is first started, and then this method yields a scrapy.Request with a url to pull, and a callback function to use to parse the results of calling this url. If you’re wondering about the yield keyword, you should check out this stackoverflow thread.
If you now run scrapy crawl trulia, you should see a bunch of debug statistics, along with our two print statements at the very end. The <200 https://www.trulia.com/NY/New_York/> output is the response object that Scrapy gives us to play around with, and includes the status of the request.
Let’s flesh out our start_requests a bit. I’m going to add a map of some city names to urls I want to scrape, so I can then iterate through these and call the scraper. After all, I want information on both New York and Brooklyn:
We’ll deal with passing the city name as an argument to our callback a bit later. Now, every entry in our urls dict gets requested, and the result gets passed into our parse callback.
Parsing the response
The key to the scraping game is being able to describe, with selectors, how to actually extract information we care about from a given HTML document. Scrapy facilitates two methods to do so: CSS selectors (which you’ll be familiar with if you’ve ever written CSS), and XPath (an expressive language used to navigate XML-like documents). In this tutorial I’m primarily using XPath, which I’ve covered extensively here: The Magic of XPath.
We’ve called a scrapy.Request, which has passed the result of the request into self.parse. Because we’re iterating through a list of cities/URLs, the parse function will receive two response objects, one for New York, and the other for Brooklyn. Each of these contains the first page of property listings - we’re going to want to get the links to each individual listing contained in the page, as well as the “Next Page” link, so that we can crawl the list until the very end.
We’re using XPath to select all div tags that have a data-testid attribute that contains “home-card-sale” as its value. This will give us a list of all the listings featured in this page. Next, we’re pulling out the href value from all links with a given aria-label attribute (our next-page link). Whenever XPath cannot find a given node, it will simply return None, meaning that we can check for truthiness when determining if we continue parsing.
Next, let’s iterate through every listing link, go to the page in question, and extract the information we’re concerned with.
Often links in websites are relative to the current location - Scrapy provides a convenience function to make the url absolute, meaning we can then pass it into a new Request. Finally, we call a new parsing function that will deal with this particular listing we’ve just extracted, called self.parse_listing. We’ll write this new function in a bit.
In the meantime - what do we do about that next_page_link we defined earlier? We want to navigate to it as well, no? Let’s make sure we do that:
Notice that we’re engaging in a bit of good old recursion here. Because we assume the next page of the listing is going to be pretty much the same as the one we’re currently parsing, we can simply call our parse function again with the result of calling the “Next Page” link. If the link happens to be None, meaning we’ve reached the end, then Scrapy will simply stop recursing.
Now we’re ready to actually begin extracting the object that we’ll be saving as we parse the site.
The Scrapy Shell
Before we begin extracting the information we’re concerned with, let’s imagine we weren’t quite sure how to get to it. How annoying would it be to have to run and re-run our program endlessly until we got our selectors right? Can you imagine the amount of wasted requests?
Luckily, the Scrapy designers have thought of this already, and provide us with a brilliant little tool: the Scrapy shell. The Scrapy shell allows us to fetch the contents of a given url, and interactively figure out our selectors, without having to ping the site a million times every time we run the program.
Let’s open the shell in our terminal:
scrapy shell
We can fetch an example of a document we’ll be scraping:
fetch("https://www.trulia.com/NY/New_York/")
If the fetch is successful, the result gets assigned to the global variable response, meaning now we can play with our selectors to our heart’s content.
Try it - how about selecting a div with a given class (substitute any class name you actually see in the page)?
response.xpath('//div[@class="container"]')
The Scrapy shell returns a list of results it’s found. It’s an incredibly useful tool (this is how I found the XPath selectors we used before), and can be used for much more than just navigating the page - for now we’ll limit ourselves to using it to make sure we’ve got our selectors right.
Extracting the “Item”
This part is easy if we use the shell - we just try and try until we get our selectors right, and throw them in a dict structure we return from the function (we’ll put all this functionality in the self.parse_listing function we declared before but haven’t yet written). Navigate to any of the individual listing pages and try to start identifying, using the Scrapy shell, the selectors you’ll need.
We can iterate between the shell and our code in order to build up our selectors. Let’s finally write our listing-page scraper function:
We’re also going to add a regular expression to our Spider class so we can pull out the given listing’s ID number, which we’ll use when we write a cache for our Spider:
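Something like this works, assuming the listing URL ends in a numeric ID (check a few real URLs first):

```python
import re

# Compiled once at class level in the real spider; pulls the trailing
# numeric ID out of a listing URL (the URL shape is an assumption)
listing_id_regex = re.compile(r"(\d+)/?$")


def extract_listing_id(url):
    # Returns the listing's ID as a string, or None if there's no match
    match = listing_id_regex.search(url)
    return match.group(1) if match else None
```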
Also note that all we’re doing in this function is yielding a dict with the information we want - nothing special at all. In fact, that’s pretty much all there is to this game. We’re about to run our scraper for the first time (properly), but first, let’s incorporate a little counter so that we don’t actually scrape the entire Trulia site (we don’t want to be pinging the server unnecessarily):
This way, we’ll be able to limit the pages we parse while testing.
Let’s run our scraper to make sure it’s working. In the root folder of the Scrapy project, run:
scrapy crawl trulia -o output.jl
Make sure that trulia is the name of the spider you declared in self.name way back at the beginning of the tutorial as part of the Spider class. What this command does is dump every dict the parser yields into a .jl file, which is nothing more than a bunch of JSON objects separated by newlines. Once the command is done, you can open it up in an editor and check out all the information we’ve managed to scrape!
Passing arguments to our callbacks
Often you’ll want to pass an argument into a particular parser - in our case, I’d like to pass the city information into self.parse, and into our self.parse_listing. Doing this is fairly straightforward - let’s refactor both our self.parse and our self.parse_listing functions to accept these arguments. Let’s also change city in our start_requests function, given that now we’re using it:
And now let’s actually do something with the variables:
Here we’re passing both the city and each listing’s id (which we’ve extracted from the listing url) into our parse_listing function. We do this by passing a cb_kwargs parameter to our scrapy.Request, and making sure we include these parameters in the callback we’re calling.
At this point we’re pretty much done with all the major functionality of our scraper. We’re going to do one last thing, which is add a very naive cache so that every time we scrape, we add onto our output file only if there are new entries. Otherwise, we skip requesting the info: this saves us from pinging the site too often, and makes it easy to deploy the scraper as a service that doesn’t hog the line every time we run it.
Adding a cache
Let’s add two new functions to our spider file (the one we’ve been working in):
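A minimal pair of helpers, using JSON on disk (the file format is my choice):

```python
import json
import os


def load_cache(path, default=None):
    # The first time around the file won't exist yet,
    # so hand back the provided default instead
    if not os.path.exists(path):
        return default if default is not None else {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_cache(path, cache):
    # Serialize/dump the cache dict out to disk
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f)
```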
These two functions simply load and serialize/dump a dict of values, which we can declare in the default argument of load_cache. In our case, we’re going to want that dict to hold as keys the id of each listing - we don’t care about the value, we only want a performant lookup of whether the id exists in our cache. Yes - it’s a very naive way of going about this, but for our purposes it illustrates the point quite nicely.
Let’s actually implement the cache in our spider.
What we’re doing here is simply loading our cache into a field of our spider class (the first time we load it, because the file doesn’t exist, we’ll instead pass in the default cache object we just declared). Then, whenever we successfully call one of the parse_listing callbacks, we add a key into our cache dict. It doesn’t really matter what we set as the value, given that we only need python to tell us if the key exists or not. At the end of every location iteration we save out our current cache.
Now, every time we call our scraper again, we’ll only parse new listings, while retaining the old ones (as long as we’re appending to our original parse output file). If we want to scrape everything all over again, we can simply delete this file.
Post-Processing Data
There are still a few things I’d like to do with the data we’re pulling - namely, turn strings into numbers where appropriate, and maybe prune some other data. We’ll build all of this in the dedicated pipelines.py file, which is meant exactly for this type of data cleaning/postprocessing. Open up pipelines.py and type in the following code:
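A skeleton to start from - the class name here is my choice; it just has to match what we register in settings.py:

```python
class RealEstatePipeline:
    def open_spider(self, spider):
        # Runs once, whenever the Spider is started
        pass

    def close_spider(self, spider):
        # Runs once, when the Spider is done crawling
        pass

    def process_item(self, item, spider):
        # Called for every dict the spider yields; must return the item
        # (or raise DropItem) so the next pipeline can receive it
        return item
```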
This is our basic Pipeline class - the functions we’ve declared are fairly intuitive:
open_spider runs whenever the Spider is started.
close_spider runs when the Spider is done crawling.
process_item actually processes the dict we yield from our parse_listing callback in our Spider class.
In order to activate the Pipeline, we have to make sure we’re declaring it in our settings.py file, by uncommenting the ITEM_PIPELINES section, and making sure the declaration in the associated dict matches our class name. As you can see, you can have multiple Pipelines working together if you want to. The number after the pipeline simply serves to put the pipelines in order (lower runs first) - with a single pipeline, you can ignore it.
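Assuming the project is called real_estate and our class RealEstatePipeline (both assumptions), the relevant settings.py entry would look something like:

```python
# settings.py
ITEM_PIPELINES = {
    # "module.path.ClassName": priority (lower numbers run first)
    "real_estate.pipelines.RealEstatePipeline": 300,
}
```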
First things first, let’s add a helper function to parse out numbers that we receive (just add it outside the Pipeline class). At any point, we could receive a None value, a string meant to be an integer, a float, or an actual number (if we’ve decided to parse out numbers in our parse_listing function). Let’s handle all the edge-cases here.
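A sketch - the regex and the overridable default are my choices:

```python
import re

# Matches an optionally-signed integer or decimal, e.g. "12" or "3.5"
NUMBER_REGEX = re.compile(r"-?\d+(\.\d+)?")


def parse_number(value, default=0):
    # Already a number? Hand it straight back
    if isinstance(value, (int, float)):
        return value
    # None (or anything else that isn't a string) gets the default
    if not isinstance(value, str):
        return default
    match = NUMBER_REGEX.search(value.replace(",", ""))
    if not match:
        return default
    number = match.group(0)
    # Preserve int vs float depending on what we matched
    return float(number) if "." in number else int(number)
```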
Nothing too fancy, just a bunch of edge-cases. Most importantly, we run a regular expression against the string to make sure it’s actually a number we’re going to try and parse, otherwise we make the decision to return a default value we can declare.
I’ll also add a helper function to extract the actual amount of a price given with a currency character in front of it, or a number in a string (like “2 Beds”).
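A sketch along the same lines:

```python
import re

# Grabs the first numeric chunk, commas allowed: "$1,250,000" or "2 Beds"
AMOUNT_REGEX = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_amount(text, default=0):
    if not isinstance(text, str):
        return default
    match = AMOUNT_REGEX.search(text)
    if not match:
        return default
    number = match.group(0).replace(",", "")
    return float(number) if "." in number else int(number)
```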
Now we can actually flesh out process_item. You’ll recognize it’s very similar to our parse_listing function - it just adds one more step in the data processing.
We parse all the values we’re concerned with, and raise a DropItem exception if the item doesn’t actually have an associated price. DropItem won’t stop the execution of our spider, rather it’ll simply make it ignore this item in particular.
This is also where I incorporated a currency-conversion function in my own, Costa Rica-specific crawler. In open_spider I declared a self.currency_rate field for the class, called the fixer.io API to retrieve the appropriate currency conversion rate, and added one more function to wrap the price fields in our data, converting them if the currency was Costa Rican colones. Point being - if there’s anything else you want to do with your data, just tack it on at this stage.
Exporting
The last thing we probably want to do is export a CSV file with the data that we’ve scraped, so we can pass it along to a statistical model, or open it with any application we like. To do so, we’re going to leverage the Pipeline again, though in theory Scrapy includes the concept of a Feed which could also take care of this (we’ll avoid it given the length of the tutorial so far).
Let’s open a file when the Spider starts, to which we can write the csv lines. Let’s also make sure to import the csv module at the top of pipelines.py:
Here we open a file in “write” mode when the Spider starts, and write in our first row, which will act as the table columns. When the Spider finishes working, at close_spider, we close our csv file. Now let’s add a line to write our data into the csv file as it’s processed:
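Assembled into a pipeline, assuming the column list matches the fields our spider yields:

```python
import csv

# The header row; the order here must match the order of values per row
CSV_COLUMNS = ["id", "city", "title", "price", "bedrooms", "bathrooms"]


class CsvExportPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the Spider starts,
        # and write out the column headers
        self.file = open("output.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(CSV_COLUMNS)

    def process_item(self, item, spider):
        # Write each item's values in the same order as the columns
        self.writer.writerow([item.get(column) for column in CSV_COLUMNS])
        return item

    def close_spider(self, spider):
        # Close our csv file once the Spider is done
        self.file.close()
```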
Make sure the order and number of items you’re writing out matches the order and number of columns! The next time you run the scraper (make sure you remove the cache first), you’ll write out the CSV file. If you want, you can add the same mechanism except with a JSON Lines file (like the one we’ve been generating with the -o flag, except hardcoded), and have the CSV file only be generated based on the final output of the .jl file - that way you don’t have to clear your cache every time you want to export a new CSV file.
Conclusion
Hopefully this serves as a good starting point for building up your own crawler/scraper. This (relatively) modest amount of work has given you the ability to traverse entire sites automatically, extract pretty much any datapoint you want from individual pages (as long as it’s not generated by JS), clean and post-process the data, and write it out to a file. Scrapy is a fantastic framework on which to build even more complex use-cases, given its clean separation of concerns, and the insane amount of utility functions it provides out of the box. Go forth and play!