A Case Study to Get You Started with Web Scraping, Part 2: Crawling Multiple Page Levels on Qfang.com
In the previous article, we only crawled the information shown on the listing (search results) pages. If the data you need is only visible on a listing's detail page, the crawler must do more than crawl the listing pages: it must also extract each detail page's URL from the listing page and then crawl that URL (the detail page) for the relevant data.
1. Analyzing the Detail Page Crawl
Suppose that, on top of the data we crawled last time, we also want fields such as the ownership years and mortgage information. These fields are only visible on the detail page, as shown below:

As you can see, the "transaction attributes" panel contains the ownership years, mortgage information, and so on. This information can only be captured by crawling the detail page, so we need to extract each listing's detail page URL from the listing page.

From the screenshot above, the analysis is straightforward: to get the detail page URL, we just parse out the value of the href attribute with XPath and prepend http://shenzhen.qfang.com to it. That gives us the complete detail page URL, which we can then request.
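For instance, here is a minimal sketch of this URL construction (the href value is made up purely for illustration):

from urllib.parse import urljoin

# A hypothetical relative href, as it might appear on the listing page
href = '/sale/house/100123456.html'

# Plain string concatenation, as done in this article's spider
house_url = 'http://shenzhen.qfang.com' + href

# urljoin from the standard library gives the same result here, and also
# handles hrefs that are already absolute
assert house_url == urljoin('http://shenzhen.qfang.com', href)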
2. Implementing the Detail Page Crawl
First, import the required packages and define the constants we need, such as the User-Agent header and the URL prefix.
import requests
from lxml import etree
import csv
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}
pre_url = 'https://shenzhen.qfang.com/sale/f'
Because this crawler needs to fetch two kinds of pages, the listing page and the detail page, we define a dedicated download function so the code can be reused. It simply downloads a page with requests and returns a parsed lxml tree that we can run XPath queries against.
def download(url):
    html = requests.get(url, headers=headers)
    time.sleep(2)
    return etree.HTML(html.text)
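If the site occasionally times out or returns an error page, a slightly more defensive variant (a sketch, not part of the original article, reusing the headers defined above) could check the status code and retry:

def download_safe(url, retries=3):
    # Like download, but retries on network errors and bad status codes
    for attempt in range(retries):
        try:
            html = requests.get(url, headers=headers, timeout=10)
            html.raise_for_status()  # raise on 4xx/5xx responses
            time.sleep(2)
            return etree.HTML(html.text)
        except requests.RequestException:
            time.sleep(2 * (attempt + 1))  # back off, then try again
    return None  # the caller must handle a failed download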
Next, define the save function, which was covered in the previous article.
def data_writer(item):
    with open('qfang_shenzhen_ershou.csv', 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(item)
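Because the file is opened in append mode, every run keeps adding rows to the same CSV. If you want a header row, one option (a sketch; the original code writes none) is to emit it once before crawling, with column names mirroring the fields collected below:

import os

def write_header():
    # Write the header only if the CSV does not exist yet
    if not os.path.exists('qfang_shenzhen_ershou.csv'):
        data_writer(['title', 'apartment', 'area', 'decoration_type',
                     'cenggao', 'orientation', 'build_finishtime',
                     'location', 'total_price', 'house_years',
                     'mortgage_info'])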
Now for the core crawl function. It parses the detail page URLs out of the listing page, requests each one, and extracts the fields we are after, such as the ownership years and mortgage information.
def spider(list_url):
    # Download the listing page
    selector = download(list_url)
    house_list = selector.xpath("//div[@id='cycleListings']/ul//li[@class='clearfix']")
    # Parse each listing in turn
    for house in house_list:
        title = house.xpath("div[1]/p[1]/a/text()")[0]
        apartment = house.xpath("div[1]/p[2]/span[2]/text()")[0]
        area = house.xpath("div[1]/p[2]/span[4]/text()")[0]
        decoration_type = house.xpath("div[1]/p[2]/span[6]/text()")[0]
        cenggao = house.xpath("div[1]/p[2]/span[8]/text()")[0].strip()
        orientation = house.xpath("div[1]/p[2]/span[10]/text()")[0]
        build_finishtime = house.xpath("div[1]/p[2]/span[12]/text()")[0]
        location = house.xpath("div[1]/p[3]/span[2]/a/text()")[0]
        total_price = house.xpath(".//div[@class='show-price']/text()")[0].strip()
        # Parse the relative link and build the full detail page URL
        house_url = 'http://shenzhen.qfang.com' + house.xpath("div[1]/p[1]/a/@href")[0]
        # Download the detail page
        sel = download(house_url)
        time.sleep(1)
        house_years = sel.xpath("//div[@class='housing-info']/ul/li[2]/div/ul/li[3]/div/text()")[0]
        mortgage_info = sel.xpath("//div[@class='housing-info']/ul/li[2]/div/ul/li[5]/div/text()")[0]
        item = [title, apartment, area, decoration_type, cenggao, orientation,
                build_finishtime, location, total_price, house_years, mortgage_info]
        print('Crawling', title)
        data_writer(item)
The code above extracts each listing's detail page URL into house_url, downloads it with the same download function, and then runs XPath queries against the returned sel tree to extract the data.
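One caveat: every xpath(...)[0] raises an IndexError as soon as a field is missing from a particular listing, which aborts the whole run. A small defensive helper (a sketch; the original code assumes every field is always present) can keep the crawl going:

def first_or_default(nodes, default=''):
    # Return the first XPath match, stripped, or a default when nothing matched
    return nodes[0].strip() if nodes else default

# Used inside the loop like this:
# title = first_or_default(house.xpath("div[1]/p[1]/a/text()"))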
Finally, define the main entry point and run the crawler.
if __name__ == '__main__':
    for x in range(1, 100):
        spider(pre_url + str(x))
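Here range(1, 100) covers listing pages 1 through 99. If you would rather not hard-code the page count, a small variant (the command-line argument is an assumption, not from the original article) makes it configurable:

import sys

if __name__ == '__main__':
    # Optionally pass the number of pages on the command line,
    # e.g. python qfang_spider.py 20; defaults to 99 pages
    pages = int(sys.argv[1]) if len(sys.argv) > 1 else 99
    for x in range(1, pages + 1):
        spider(pre_url + str(x))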
That completes this crawl, a simple extension of the previous article. The complete code is as follows:
import requests
from lxml import etree
import csv
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}
pre_url = 'https://shenzhen.qfang.com/sale/f'

def spider(list_url):
    # Download the listing page
    selector = download(list_url)
    house_list = selector.xpath("//div[@id='cycleListings']/ul//li[@class='clearfix']")
    # Parse each listing in turn
    for house in house_list:
        title = house.xpath("div[1]/p[1]/a/text()")[0]
        apartment = house.xpath("div[1]/p[2]/span[2]/text()")[0]
        area = house.xpath("div[1]/p[2]/span[4]/text()")[0]
        decoration_type = house.xpath("div[1]/p[2]/span[6]/text()")[0]
        cenggao = house.xpath("div[1]/p[2]/span[8]/text()")[0].strip()
        orientation = house.xpath("div[1]/p[2]/span[10]/text()")[0]
        build_finishtime = house.xpath("div[1]/p[2]/span[12]/text()")[0]
        location = house.xpath("div[1]/p[3]/span[2]/a/text()")[0]
        total_price = house.xpath(".//div[@class='show-price']/text()")[0].strip()
        # Parse the relative link and build the full detail page URL
        house_url = 'http://shenzhen.qfang.com' + house.xpath("div[1]/p[1]/a/@href")[0]
        # Download the detail page
        sel = download(house_url)
        time.sleep(1)
        house_years = sel.xpath("//div[@class='housing-info']/ul/li[2]/div/ul/li[3]/div/text()")[0]
        mortgage_info = sel.xpath("//div[@class='housing-info']/ul/li[2]/div/ul/li[5]/div/text()")[0]
        item = [title, apartment, area, decoration_type, cenggao, orientation,
                build_finishtime, location, total_price, house_years, mortgage_info]
        print('Crawling', title)
        data_writer(item)

def download(url):
    html = requests.get(url, headers=headers)
    time.sleep(2)
    return etree.HTML(html.text)

def data_writer(item):
    with open('qfang_shenzhen_ershou.csv', 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(item)

if __name__ == '__main__':
    for x in range(1, 100):
        spider(pre_url + str(x))