C# — 58.com Branded Apartments Crawler

Over the weekend I found a crawler tutorial on Zhihu: “Amap API + Python to solve renting”. I was looking for a new place at the time, so I tried it right away. The tutorial is hosted on Shiyanlou with step-by-step instructions.

The project breaks down into two steps:

  1. Use Python to crawl data and generate a data file
  2. Import the data file, display the listings on a map, pick your workplace, and automatically calculate routes and commute times

After trying the tutorial, I found it too rough for practical use, and it only covers Beijing while I live in Shanghai. You could change the data source in the Python script and adjust the front-end JS, but the result still felt clunky, so I decided to build my own.

First, the original Python 2 script (excerpt):

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
from urlparse import urljoin
import requests
import csv

url = "http://bj.58.com/pinpaigongyu/pn/{page}/?minprice=2000_4000"

page = 0
csv_file = open("rent.csv","wb") 
csv_writer = csv.writer(csv_file, delimiter=',')

while True:
    page += 1
    response = requests.get(url.format(page=page))
    html = BeautifulSoup(response.text)
    house_list = html.select(".list > li")
    if not house_list:
        break
    for house in house_list:
        house_title = house.select("h2")[0].string.encode("utf8")
        house_url = urljoin(url, house.select("a")[0]["href"])
        house_info_list = house_title.split()
        if "公寓" in house_info_list[1] or "青年社区" in house_info_list[1]:
            house_location = house_info_list[0]
        else:
            house_location = house_info_list[1]
        house_money = house.select(".money")[0].select("b")[0].string.encode("utf8")
        csv_writer.writerow([house_title, house_location, house_money, house_url])

csv_file.close()

It pages through http://bj.58.com/pinpaigongyu/pn/{page}/?minprice=2000_4000 until a page returns no listings, and writes each listing's title, location, price, and URL to rent.csv for downstream use. Each listing is an li element:

<li>
  <a href="/pinpaigongyu/..." tongji_label="listclick">
    <div class="des">
      <h2>【合租】菊园新区 柳湖景庭 3室次卧</h2>
    </div>
    <div class="money"><span><b>1100</b>元/月</span></div>
  </a>
</li>

The Python script selects the li elements, but the key information actually sits inside the a[tongji_label="listclick"] element within each one. Instead of regular expressions, I used HtmlAgilityPack, a .NET library for parsing and manipulating HTML.
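
To see that selection rule in isolation, here is a minimal sketch in Python that uses only the standard library's html.parser and runs against the sample markup above. ListingParser and parse_listings are illustrative names, not part of the tutorial or the project, and the sketch assumes every listing follows the same structure as the sample.

```python
from html.parser import HTMLParser

# Sample markup copied from the listing shown above (href left elided as in the post).
SAMPLE = """<li>
  <a href="/pinpaigongyu/..." tongji_label="listclick">
    <div class="des">
      <h2>【合租】菊园新区 柳湖景庭 3室次卧</h2>
    </div>
    <div class="money"><span><b>1100</b>元/月</span></div>
  </a>
</li>"""


class ListingParser(HTMLParser):
    """Collects href, title (<h2>) and price (<b>) for each listclick anchor."""

    def __init__(self):
        super().__init__()
        self.listings = []    # finished listings
        self._current = None  # listing being built while inside the anchor
        self._field = None    # "title" or "price" while inside <h2> or <b>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("tongji_label") == "listclick":
            self._current = {"href": attrs.get("href", ""), "title": "", "price": ""}
        elif self._current is not None and tag == "h2":
            self._field = "title"
        elif self._current is not None and tag == "b":
            self._field = "price"

    def handle_data(self, data):
        # Only keep text that appears inside <h2> or <b> of a listclick anchor.
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("h2", "b"):
            self._field = None
        elif tag == "a" and self._current is not None:
            self.listings.append(self._current)
            self._current = None


def parse_listings(html):
    """Return a list of dicts with href, title and price for each listing."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.listings


if __name__ == "__main__":
    for listing in parse_listings(SAMPLE):
        print(listing["title"], listing["price"], listing["href"])
```

Anchoring on the tongji_label attribute means the surrounding li and div wrappers can change without affecting the extraction.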

Install via NuGet:

Install-Package HtmlAgilityPack

Controller snippet (core logic abbreviated):

public ActionResult Get58CityRoomData(int costFrom, int costTo, string cnName)
{
    // ... validate params ...
    var lstHouse = new List<HouseInfo>();
    // {0} is the page number, filled in per request via string.Format.
    string tempURL = "http://" + cnName + ".58.com/pinpaigongyu/pn/{0}/?minprice=" + costFrom + "_" + costTo;
    Uri uri = new Uri(tempURL);
    var htmlResult = HTTPHelper.GetHTMLByURL(string.Format(tempURL, 1));
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(htmlResult);
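    // The first page carries the total result count, e.g. <span class="listsum"><em>1813</em>条结果</span>.
    var countNodes = htmlDoc.DocumentNode.SelectSingleNode(".//span[contains(@class,'listsum')]");
    int pageCount = 10; // fallback when the count element is missing
    if (countNodes != null && countNodes.HasChildNodes)
    {
        int total = Convert.ToInt32(countNodes.ChildNodes[0].InnerText);
        if (total == 0) { return Json(new { IsSuccess = false, Error = "No results in this price range." }); }
        pageCount = (total + 19) / 20; // 20 listings per page; round up to include a partial last page
    }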
    for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++)
    {
        htmlResult = HTTPHelper.GetHTMLByURL(string.Format(tempURL, pageIndex));
        htmlDoc.LoadHtml(htmlResult);
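        var roomList = htmlDoc.DocumentNode.SelectNodes(".//a[contains(@tongji_label,'listclick')]");
        if (roomList == null) { break; } // SelectNodes returns null when nothing matches
        foreach (var room in roomList)
        {
            var houseTitle = room.SelectSingleNode(".//h2").InnerHtml;
            // Resolve the relative href against the site so the URL keeps its scheme.
            var houseURL = new Uri(uri, room.Attributes["href"].Value).AbsoluteUri;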
            var house_info_list = houseTitle.Split(' ');
            var house_location = (house_info_list[1].Contains("公寓") || house_info_list[1].Contains("青年社区"))
               ? house_info_list[0] : house_info_list[1];
            var money = room.SelectSingleNode(".//b").InnerHtml;
            lstHouse.Add(new HouseInfo { HouseTitle = houseTitle, HouseLocation = house_location, HouseURL = houseURL, Money = money });
        }
    }
    return Json(new { IsSuccess = true, HouseInfos = lstHouse });
}
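
Both the Python script and the controller use the same heuristic to pull a location out of the title: split on whitespace and take the second token, unless that token names an apartment brand (it contains 公寓 or 青年社区), in which case take the first. A small sketch of that rule on its own, in Python; extract_location is an illustrative name, and the guard for titles with fewer than two tokens is my own assumption since neither version handles that case.

```python
def extract_location(title):
    """Pick the location token from a listing title using the post's heuristic."""
    parts = title.split()
    # Assumption: fall back gracefully instead of raising on very short titles.
    if len(parts) < 2:
        return parts[0] if parts else ""
    # If the second token is a brand name, the location is the first token.
    if "公寓" in parts[1] or "青年社区" in parts[1]:
        return parts[0]
    return parts[1]


if __name__ == "__main__":
    # Title taken from the sample listing in this post.
    print(extract_location("【合租】菊园新区 柳湖景庭 3室次卧"))
    # Hypothetical title whose second token is a brand name.
    print(extract_location("某街道 某某公寓 2室1厅"))
```

Pulling the rule into one function makes it easy to test against real titles, since it depends entirely on how 58.com happens to format them.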

Two key points:

  • The first page contains a hidden total count such as <span class="listsum"><em>1813</em>条结果</span> (条结果 means "results"), so the number of pages can be derived from it at 20 listings per page.
  • Target the a with tongji_label="listclick"; inside it, h2 has the title/location, and b inside .money has the price.
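
The page-count arithmetic from the first point deserves a quick check. Plain integer division of 1813 by 20 gives 90, which would leave the last 13 listings unfetched, while rounding up gives 91. A minimal sketch in Python, assuming 20 listings per page as stated above; page_count is an illustrative helper name.

```python
def page_count(total, page_size=20):
    """Number of pages needed to cover `total` listings, rounding up."""
    if total <= 0:
        return 0
    # Adding page_size - 1 before dividing rounds a partial last page up.
    return (total + page_size - 1) // page_size


if __name__ == "__main__":
    total = 1813  # the count from the listsum example above
    print(total // 20)        # truncating division: 90
    print(page_count(total))  # rounded up: 91
```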

That covers the backend. The frontend (Amap/Gaode integration) will have to wait for another day; it is time to play games with my partner :-)

Source code: https://github.com/liguobao/58HouseSearch

Live: 58 Apartment Map Search (China): https://woyaozufang.live
