Scraping Real Estate Listings With Scrapy


Scrapy Primer | Our Target | Our Scraper | Parsing the Response | The Scrapy Shell | Extracting the Item | Passing Arguments | Adding a Cache | Post-Processing Data | Exporting | Conclusion | Github Repo

If you’re ever fortunate enough to be considering purchasing property, scraping a few property-listing sites is a great way to collect a dataset that you can then play around with, or use as a reference when evaluating potential purchases. Scraping for data is all the more important if you’re not intimately familiar with the area you’re looking at, or you want an objective, data-driven “second opinion” to go along with your broker’s recommendations.

In this writeup, I’ll be showing you how I scraped around 10,000 records from a local real-estate listing website, integrated currency conversion into the data cleanup as well as a small cache mechanism for future scraping rounds, and exported everything as a sanitized CSV file.

You can hand-roll your scraper, or lean on a more robust library. I’ll be using Scrapy, a Python-based scraping and crawling framework. Note that I mention crawling (traversing links while on a site) - this is vital if you’re scraping more “mainstream” sites, given that you’ll probably have to deal with pagination, and doing so manually is a bore.

Scrapy Primer


To create a new Scrapy project, first make sure you’ve installed Scrapy onto your local machine, and then run:

scrapy startproject real_estate

When starting a new project, Scrapy populates a folder with a bunch of new files, all of which follow a simple naming scheme and fit into a pipeline of sorts:

real_estate/

	scrapy.cfg # Global scrapy configuration - you’ll only really have to touch this if deploying Scrapy
	
	real_estate/ # Notice the nested folder with the same name - this can be confusing at first

		items.py # Optionally define the structure of the items you’ll be dealing with (i.e. returned from Spiders)

		middlewares.py # Hook into the responses Scrapy receives when calling a site

		pipelines.py # Handle data transformation and cleaning after it’s been scraped

		settings.py # Project settings

		spiders/

			spiders.py # We'll create this file soon, our spider/s live here!


Spiders - Appropriately named, given that these classes allow us to crawl and scrape a site. These are the core of any Scrapy program, and where our most complex logic lives. A spider should return or yield items as they are scraped from a site.

Items - What spiders return, usually a dict filled with information from a page you’ve scraped.

Middlewares - Hooks that let us inject code in between Scrapy’s call to a website and its response.

Pipelines - Where we handle data cleaning and post-processing.

We could technically write our entire project in the one spiders.py file, but the whole point of separating out our concerns is to help us deal with the complexity inherent in pulling a ton of data out of hundreds of pages. It’s good practice to try and adhere to Scrapy’s structure.
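
We’ll stick to plain dicts throughout this tutorial, but for reference, a minimal items.py could look something like this (ListingItem and its fields are just an illustration mirroring the data we’ll scrape later):

import scrapy

class ListingItem(scrapy.Item):
  city = scrapy.Field()
  location = scrapy.Field()
  price = scrapy.Field()
  details = scrapy.Field()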

Our Target


In my case, this whole exercise came about while looking at properties in Costa Rica, but to generalize a bit, I’ve adapted the examples to scrape Trulia instead (US property listings). I’d suggest you look up a site relevant to your geographic location, otherwise this whole exercise won’t be nearly as fun.

From each listing, we’ll be extracting the listing’s title, id, city, location, size, number of bathrooms, bedrooms, parking spaces, and a few other data points that might be useful.

As you can see, this is a non-trivial amount of information - a manual equivalent to what we want to do would be possible, albeit extremely tedious.

Our Scraper

Let’s get started writing our scraper! First, create a real_estate_spider.py file in the spiders/ folder (or name it whatever you like). I called mine trulia_spider.py after the name of the site we’ll be crawling.

Go ahead and write out the following code in our spider file:

import scrapy


class TruliaSpider(scrapy.Spider):
  name = "trulia" # The name of your spider!

  def start_requests(self):
    url = 'https://www.trulia.com/NY/New_York/'
    yield scrapy.Request(url=url, callback=self.parse)

  def parse(self, response):
    print('I got a response!')
    print(response)


Our first scraper is nothing to be too proud of - it doesn’t even “scrape” anything yet. What it does is illustrate how the Spider class works - it needs a start_requests method that gets called when the Spider is first started, and this method then yields a scrapy.Request with a url to pull and a callback function used to parse the result of calling this url. If you’re wondering about the yield keyword, you should check out this stackoverflow thread.

If you now run scrapy crawl trulia, you should see a bunch of debug statistics, along with our two print statements at the very end. The <200 https://www.trulia.com/NY/New_York/> output is the response object that Scrapy gives us to play around with, and includes the status of the request.
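
If you want to poke at the response a little more, a few of its attributes are worth knowing about - here’s a small variation on our parse method that prints some of them:

  def parse(self, response):
    print(response.status)    # HTTP status code, e.g. 200
    print(response.url)       # the URL that was actually fetched
    print(response.headers)   # the response headers
    # response.text holds the raw HTML, and response.xpath() / response.css()
    # let us query it - more on that shortly.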

Let’s flesh out our start_requests a bit. I’m going to add a map of some city names to urls I want to scrape, so I can then iterate through these and call the scraper. After all, I want information on both New York and Brooklyn:

  def start_requests(self):
    urls = {
      'New York': "https://www.trulia.com/NY/New_York/",
      'Brooklyn': "https://www.trulia.com/NY/Brooklyn/"
    }

    for _city, url in urls.items():
      yield scrapy.Request(url=url, callback=self.parse)


We’ll deal with passing the city name as an argument to our callback a bit later. Now, every entry in our urls dict gets called, and the result gets passed into self.parse.

Parsing the response


The key to the scraping game is being able to describe, with selectors, how to actually extract information we care about from a given HTML document. Scrapy facilitates two methods to do so: CSS selectors (which you’ll be familiar with if you’ve ever written CSS), and XPath (an expressive language used to navigate XML-like documents). In this tutorial I’m primarily using XPath, which I’ve covered extensively here: The Magic of XPath.
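
To give a flavor of the difference, here’s the same selection written both ways - the XPath version is the one we’ll be using in a moment:

# CSS: attribute-contains selector
response.css('div[data-testid*="home-card-sale"]')

# XPath: the equivalent expression
response.xpath('//div[contains(@data-testid, "home-card-sale")]')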

We’ve called a scrapy.Request, which has passed the result of the request into self.parse. Because we’re iterating through a list of cities/URLs, the parse function will receive two response objects, one for New York, and the other for Brooklyn. Each of these contains the first page of property listings - we’re going to want to get the links to each individual listing contained in the page, as well as the “Next Page” link, so that we can crawl the list until the very end.

    def parse(self, response):
      listing_links = response.xpath('//div[contains(@data-testid, "home-card-sale")]')
      next_page_link = response.xpath('//a[@rel="next" and @aria-label="Next Page"]/@href').get()


We’re using XPath to select all div tags whose data-testid attribute contains “home-card-sale”. This will give us a list of all the listings featured on this page. Next, we’re pulling out the href value from the link with the given rel and aria-label attributes (our next-page link). Whenever a selector finds no matching node, calling .get() on it simply returns None, meaning we can check for truthiness when deciding whether to keep parsing.
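
A couple of related selector methods come in handy here: .get() returns the first match (and accepts a default), while .getall() returns every match as a list of strings.

next_href = response.xpath('//a[@rel="next" and @aria-label="Next Page"]/@href').get(default='')
listing_cards = response.xpath('//div[contains(@data-testid, "home-card-sale")]').getall()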

Next, let’s iterate through every listing link, go to the page in question, and extract the information we’re concerned with.

    def parse(self, response):
      #…
      for link in listing_links:
        relative_url = link.xpath('.//a/@href').get()
        absolute_url = response.urljoin(relative_url)
        yield scrapy.Request(absolute_url, callback=self.parse_listing)


Often links in websites are relative to the current location - Scrapy provides a convenience function to make the url absolute, meaning we can then pass it into a new Request. Finally, we call a new parsing function that will deal with this particular listing we’ve just extracted, called self.parse_listing. We’ll write this new function in a bit.
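
As an aside, Scrapy also provides response.follow, which accepts relative URLs directly, so the urljoin step can be skipped if you prefer:

      for link in listing_links:
        relative_url = link.xpath('.//a/@href').get()
        yield response.follow(relative_url, callback=self.parse_listing)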

In the meantime - what do we do about that next_page_link we defined earlier? We want to navigate to it as well, no? Let’s make sure we do that:

    def parse(self, response):
      #…
      #…for link in listing_links:
        #...
      if next_page_link:
        absolute_next_page_link = response.urljoin(next_page_link)
        yield scrapy.Request(absolute_next_page_link, callback=self.parse)


Notice that we’re engaging in a bit of good old recursion here. Because we assume the next page of listings is going to look pretty much the same as the one we’re currently parsing, we can simply call our parse function again with the result of following the “Next Page” link. If the link happens to be None, meaning we’ve reached the last page, the if check means no new request gets scheduled, and the recursion stops there.

Now we’re ready to actually begin extracting the object that we’ll be saving as we parse the site.

The Scrapy Shell


Before we begin extracting the information we’re concerned with, let’s imagine we weren’t quite sure how to get to it. How annoying would it be to have to run and re-run our program endlessly until we got our selectors right? Can you imagine the number of print statements we’d need to debug our way through?

Luckily, the Scrapy designers have thought of this already, and provide us with a brilliant little tool: the Scrapy shell. The Scrapy shell allows us to fetch the contents of a given url, and interactively figure out our selectors, without having to ping the site a million times every time we run the program.

Let’s open the shell in our terminal:

scrapy shell

We can fetch an example of a document we’ll be scraping:

fetch('https://trulia.com/NY/New_York/')

If the fetch is successful, the result gets assigned to the global variable response, meaning now we can play with our selectors to our heart’s content.

Try it - how about selecting a div with a given class?

response.xpath('//div[contains(@data-testid, "home-card-sale")]')

The Scrapy shell returns a list of results it’s found. It’s an incredibly useful tool (this is how I found the XPath selectors we used before), and can be used for much more than just navigating the page - for now we’ll limit ourselves to using it to make sure we’ve got our selectors right.
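
A few other shell niceties worth knowing about (I’ve left the output out, since it depends on the page you fetched):

response.status   # the HTTP status of the last fetch
response.xpath('//a[@rel="next" and @aria-label="Next Page"]/@href').get()
view(response)    # opens the HTML Scrapy downloaded in your browser
fetch('https://www.trulia.com/NY/Brooklyn/')   # replaces `response` with a fresh page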

Extracting the “Item”


This part is easy if we use the shell - we just try and try until we get our selectors right, and throw them in a dict structure we return from the function (we’ll put all this functionality in the self.parse_listing function we declared before but haven’t yet written). Navigate to any of the individual listing pages and try to start identifying, using the Scrapy shell, the selectors you’ll need.

We can iterate between the shell and our code in order to build up our selectors. Let’s finally write our listing-page scraper function:

def parse_listing(self, response):
  yield {
    'location': response.xpath('//a[@data-testid="neighborhood-link"]/text()').get(),
    'price': response.xpath('//h3[@data-testid="on-market-price-details"]/div[contains(text(), "$")]/text()').get(),
    'details': {
      'bedrooms': response.xpath('//div[contains(text(), "Beds") and string-length(text())<10]/text()').get(),
      'bathrooms': response.xpath('//div[contains(text(), "Baths") and string-length(text())<10]/text()').get(),
      'size': response.xpath('//li[@data-testid="floor"]//div[contains(text(), "sqft")]/text()').get(),
      'year': response.xpath('//div[text()="Year Built"]/following-sibling::div/text()').get(),
      'parking': response.xpath('//div[text()="Parking"]/following-sibling::div/text()').get(),
      'heating': response.xpath('//div[text()="Heating"]/following-sibling::div/text()').get(),
      'cooling': response.xpath('//div[text()="Cooling"]/following-sibling::div/text()').get()
    }
  } 


We’re also going to add a regular expression to our Spider class so we can pull out the given listing’s ID number, which we’ll use when we write a cache for our Spider:

import re

class TruliaSpider(scrapy.Spider):
  name = 'trulia'

  def start_requests(self):
    self.url_id_regex = re.compile('[0-9]+$')
    #...

  def parse_listing(self, response):
    id_match_result = self.url_id_regex.search(response.url)
    listing_id = id_match_result.group()
    yield {
      'id': listing_id,
      #...
    }


A few things to note: MAKE SURE YOU TURN OFF JS WHEN YOU’RE LOOKING AT WHAT TO SCRAPE. When Scrapy makes a request, it receives only the raw HTML - you might be tricked into thinking that something is on the page when really it’s being added dynamically by a JS script after the page load. Google how to turn off JavaScript in your current browser session and save yourself a couple of hours of debugging.
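
The Scrapy shell is also the quickest sanity check here - fetch the page and look for the markup you expect directly in the raw HTML, which has had no JavaScript run against it:

fetch('https://www.trulia.com/NY/New_York/')
'home-card-sale' in response.text   # True only if the markup exists before any JS runs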

Also note that all we’re doing in this function is yielding a dict with the information we want - nothing special at all. In fact, that’s pretty much all there is to this game. We’re about to run our scraper for the first time (properly), but first, let’s incorporate a little counter so that we don’t actually scrape all of Trulia (we don’t want to be pinging the server unnecessarily):

def start_requests(self):
  self.page_count = 0
  self.max_page_count = 3
  #...

def parse(self, response):
  #...
  if next_page_link and self.page_count < self.max_page_count:
    self.page_count += 1
    # make sure you move these lines into this new 'if' block:
    absolute_next_page_link = response.urljoin(next_page_link)
    yield scrapy.Request(absolute_next_page_link, callback=self.parse)


This way, we’ll be able to limit the pages we parse while testing.

Let’s run our scraper to make sure it’s working. In the root folder of the Scrapy project, run:

scrapy crawl trulia -o output.jl

Make sure that trulia matches the name you declared at the top of the Spider class way back at the beginning of the tutorial. What this command does is dump every dict the parser yields into a .jl file, which is nothing more than a bunch of JSON objects separated by newlines (the “JSON Lines” format). Once the command is done, you can open it up in an editor and check out all the information we’ve managed to scrape!
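
One flag worth knowing: -o appends to an existing file, while (in Scrapy 2.0 and later) a capital -O overwrites it - handy to keep in mind once we add a cache below.

scrapy crawl trulia -o output.jl    # append to output.jl
scrapy crawl trulia -O output.jl    # overwrite output.jl (Scrapy 2.0+)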

Passing arguments to our callbacks


Often you’ll want to pass an argument into a particular parser - in our case, I’d like to pass the city information into self.parse and into self.parse_listing. Doing this is fairly straightforward - let’s refactor both our self.parse and our self.parse_listing functions to accept these arguments. Let’s also change _city to city in our start_requests function, given that now we’re using it:

    def start_requests(self):
      for city, url in urls.items():
        #...

    def parse(self, response, city):
      #...

    def parse_listing(self, response, city):
      #...


And now let’s actually do something with the variables:

    def start_requests(self):
      #...
      for city, url in urls.items():
        yield scrapy.Request(url=url, callback=self.parse, cb_kwargs=dict(city=city))

    def parse(self, response, city):
      #...
      for link in listing_links:
        #...
        yield scrapy.Request(absolute_url, callback=self.parse_listing, cb_kwargs=dict(city=city))

      if self.page_count < self.max_page_count:
        #...
        yield scrapy.Request(absolute_next_page_link, callback=self.parse, cb_kwargs=dict(city=city))

    def parse_listing(self, response, city):
      yield {
        'city': city,
        #...
      }


Here we’re passing the city into each of our callbacks (the listing’s id is still extracted inside parse_listing from the response URL). We do this by passing a cb_kwargs parameter to our scrapy.Request, and making sure the callback we’re calling accepts a matching keyword argument.

At this point we’re pretty much done with all the major functionality of our scraper. We’re going to do one last thing, which is add a very naive cache so that every time we scrape, we add onto our output file only if there are new entries. Otherwise, we skip requesting the info: this saves us from pinging the site too often, and makes it easy to deploy the scraper as a service that doesn’t hog the line every time we run it.

Adding a cache


Let’s add two new functions to our spider file (the one we’ve been working in):

import json

def load_cache(filename, default={}):
  try:
    with open(filename, 'r+') as cache:
      return json.load(cache)
  except (FileNotFoundError, json.JSONDecodeError) as _e:
    return default

def write_cache(filename, obj_to_write):
  with open(filename, 'w') as cache_out:
    json.dump(obj_to_write, cache_out)
    print(f'Successfully saved cache to {filename}')


These two functions simply load and serialize/dump a dict of values, which we can declare in the default argument of load_cache.

In our case, we’re going to want that dict to hold as keys the id of each listing - we don’t care about the value, we only want a performant lookup of whether the id exists in our cache. Yes - it’s a very naive way of going about this, but for our intents and purposes it illustrates the point quite nicely.
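
Concretely, visited.json will end up looking something like this (the ids here are made up, just to show the shape):

{
  "New York": {"1234567": true, "7654321": true},
  "Brooklyn": {}
}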

Let’s actually implement the cache in our spider.

class TruliaSpider(scrapy.Spider):
  #...
  def start_requests(self):
    self.urls_traversed = load_cache('./visited.json', {'New York': {}, 'Brooklyn': {}})
    #...

  def parse_listing(self, response, city):
    search_result = self.url_id_regex.search(response.url)
    listing_id = search_result.group()
    if not self.urls_traversed[city].get(listing_id, False):
      self.urls_traversed[city][listing_id] = True
      data_to_yield = {
        'city': city,
        'id': listing_id,
        #...
      }
      write_cache('./visited.json', self.urls_traversed)
      yield data_to_yield
    # If the listing id is already in the cache, we simply yield nothing for it.


What we’re doing here is simply loading our cache into a field of our spider class (the first time we load it, because the file doesn’t exist, we’ll instead get the default cache object we just declared). Then, whenever we successfully parse a new listing in the parse_listing callback, we add a key into our cache dict. It doesn’t really matter what we set as the value, given that we only need Python to tell us whether the key exists or not. After every newly parsed listing we save out our current cache.

Now, every time we call our scraper again, we’ll only parse new listings, while retaining the old ones (as long as we’re appending to our original parse output file). If we want to scrape everything all over again, we can simply delete this file.

Post-processing Data


There’s still a few things I’d like to do with the data we’re pulling - namely, turn strings into numbers where appropriate, and maybe prune some other data. We’ll build all of this in the dedicated pipelines.py file, which is meant exactly for this type of data cleaning/postprocessing.

Open pipelines.py and type in the following code:

class TruliaScrapePipeline: # Go ahead and rename this

  #def open_spider(self, spider):

  #def close_spider(self, spider):

  def process_item(self, item, spider):
    return item


This is our basic Pipeline class - the functions we’ve declared are fairly intuitive: open_spider runs when the Spider is started, close_spider runs when the Spider is done crawling, and process_item processes each dict we yield from our parse_listing callback in our Spider class. In order to activate the Pipeline, we have to declare it in our settings.py file, by uncommenting the ITEM_PIPELINES section and making sure the entry in the associated dict matches our class name. As you can see, you could have multiple Pipelines working together if you wanted to. The number after each pipeline just determines the order in which pipelines run (lower runs first) - with a single pipeline you can leave it as it is.
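
For reference, the relevant part of settings.py would look something like this, assuming the project is called real_estate and the pipeline class is named TruliaScrapePipeline:

# settings.py
ITEM_PIPELINES = {
    'real_estate.pipelines.TruliaScrapePipeline': 300,
}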

First things first, let’s add a helper function to parse out the numbers we receive (just add it outside the Pipeline class). At any point we could receive a None value, a string representing an integer or a float, or an actual number (if we’ve decided to parse out numbers in our parse_listing function). Let’s handle all the edge cases here.

import re

num_reg = re.compile(r'[0-9.]+')

def parse_num(num_string, default=None):
    if num_string is None:
        return default
    if isinstance(num_string, (int, float)):
        # Already a number - nothing to parse.
        return num_string
    if num_reg.match(num_string):
        try:
            return int(num_string)
        except ValueError:
            return float(num_string)
    return default

Nothing too fancy, just a bunch of edge-cases. Most importantly, we run a regular expression against the string to make sure it’s actually a number we’re going to try and parse, otherwise we make the decision to return a default value we can declare.
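
A few quick sanity checks on how it behaves:

parse_num("1200")       # -> 1200
parse_num("3.5")        # -> 3.5
parse_num(42)           # -> 42
parse_num("N/A", 0)     # -> 0 (no digits at the start, so we fall back to the default)
parse_num(None, 0)      # -> 0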

I’ll also add a helper function to extract the numeric part of a price string with a currency symbol in front of it, or a number embedded in a string (like “2 Beds”).

def extract_number(number_string, default=None):
    amount_regex = re.compile('[0-9,]+')
    if number_string is not None:
        try:
            return amount_regex.search(number_string).group().replace(',', '')
        except AttributeError:
            return default
    return default
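
And the same kind of sanity check for extract_number - note that it returns a string, and parse_num is what turns that into a number:

extract_number("$1,250,000")   # -> "1250000"
extract_number("2 Beds")       # -> "2"
extract_number(None, 0)        # -> 0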


Now we can actually flesh out process_item. You’ll recognize it’s very similar to our parse_listing function - it just adds one more step in the data processing.

# At the top of pipelines.py:
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    if item['price']:
        item['price'] = parse_num(extract_number(item['price']))
    else:
        raise DropItem(f"Missing price for listing {item['id']}")

    item['details']['bedrooms'] = parse_num(extract_number(item['details']['bedrooms']))
    item['details']['bathrooms'] = parse_num(extract_number(item['details']['bathrooms']))
    item['details']['size'] = parse_num(extract_number(item['details']['size']))
    item['details']['year'] = parse_num(extract_number(item['details']['year']))

    return item


We parse all the values we’re concerned with, and raise a DropItem exception if the item doesn’t actually have an associated price. DropItem won’t stop the execution of our spider, rather it’ll simply make it ignore this item in particular.

This is also where I incorporated a currency-conversion step in my own, Costa Rica-specific crawler. In open_spider I declared a self.currency_rate field on the class, called the fixer.io API to retrieve the appropriate conversion rate, and added one more step that wraps the price fields in our data, converting them if the currency was Costa Rican colones. Point being - if there’s anything else you want to do with your data, just tack it on at this stage.
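
If you want to do something similar, here’s a rough sketch of what that could look like. The fixer.io endpoint, parameters, and response shape shown here are assumptions on my part, so double-check them against their documentation, and note that requests is an extra dependency:

import requests  # extra dependency - not part of Scrapy

class TruliaScrapePipeline:

  def open_spider(self, spider):
    #... open the CSV file, etc.
    # Hypothetical request - verify the endpoint, params, and response shape
    # against the fixer.io docs before relying on this.
    resp = requests.get(
      'http://data.fixer.io/api/latest',
      params={'access_key': 'YOUR_API_KEY', 'symbols': 'USD,CRC'})
    rates = resp.json().get('rates', {})
    # The free tier quotes rates against EUR, so derive a CRC -> USD rate.
    if 'USD' in rates and 'CRC' in rates:
      self.crc_to_usd = rates['USD'] / rates['CRC']
    else:
      self.crc_to_usd = None

  def to_usd(self, amount_in_crc):
    # Only convert when we have both a rate and a numeric amount.
    if self.crc_to_usd is None or amount_in_crc is None:
      return amount_in_crc
    return round(amount_in_crc * self.crc_to_usd, 2)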

Exporting


The last thing we probably want to do is export a CSV file with the data we’ve scraped, so we can pass it along to a statistical model, or open it with any application we like. To do so, we’re going to leverage the Pipeline again, though Scrapy’s built-in feed exports could also take care of this (we’ll skip them given the length of the tutorial so far).

Let’s open a file when the Spider starts, to which we can write the csv lines. Let’s also make sure to import the csv library.

import csv

class TruliaScrapePipeline:

  def open_spider(self, spider):
    self.csv_output = open('output.csv', 'w', newline='')
    self.csv_writer = csv.writer(self.csv_output, delimiter=',')
    self.csv_writer.writerow([
      'city', 
      'location', 
      'price', 
      'bedrooms', 
      'bathrooms', 
      'sqft',
      'year'])
  
  def close_spider(self, spider):
    self.csv_output.close()


Here we open a file in “write” mode when the Spider starts, and write in our first row, which will act as the table columns. When the Spider finishes working, at close_spider, we close our csv file. Now let’s add a line to write in our data as it’s processed into the csv file.

class TruliaScrapePipeline:

  def process_item(self, item, spider):
    #... Our item transforms
    
    self.csv_writer.writerow([
      item['city'], 
      item['location'], 
      item['price'], 
      item['details']['bedrooms'], 
      item['details']['bathrooms'], 
      item['details']['size'], 
      item['details']['year']])
    
    return item


Make sure the order and number of items you’re writing out matches the order and number of columns! The next time you run the scraper (make sure you remove the cache first), you’ll write out the CSV file. If you want, you can add the same mechanism for a JSON Lines file (like the one we’ve been generating with the -o flag, except with the filename hardcoded), and have the CSV file be generated from the final contents of the .jl file - that way you don’t have to clear your cache every time you want to export a new CSV file.
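
If you’d rather go that route, a minimal standalone script along those lines might look like this (assuming the field names we’ve been using and an output.jl sitting in the same folder):

import csv
import json

# Rebuild output.csv from the accumulated JSON Lines output,
# without having to clear the cache and re-scrape anything.
with open('output.jl') as jl_file, open('output.csv', 'w', newline='') as csv_file:
  writer = csv.writer(csv_file)
  writer.writerow(['city', 'location', 'price', 'bedrooms', 'bathrooms', 'sqft', 'year'])
  for line in jl_file:
    item = json.loads(line)
    details = item.get('details', {})
    writer.writerow([
      item.get('city'),
      item.get('location'),
      item.get('price'),
      details.get('bedrooms'),
      details.get('bathrooms'),
      details.get('size'),
      details.get('year')])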

Conclusion


Hopefully this serves as a good starting point for building up your own crawler/scraper. This (relatively) modest amount of work has given you the ability to traverse entire sites automatically, extract pretty much any datapoint you want from individual pages (as long as it isn’t generated by JS), clean and post-process the data, and write it all out to a file. Scrapy is a fantastic framework on which to build even more complex use-cases, given its clean separation of concerns and the insane amount of utility it provides out of the box. Go forth and play!