CRAWLER - Extracting hotel reviews from a review aggregation website

The crawler structure is shown in Figure 2 and closely follows the page structure of the site described previously. The crawler starts from a country page that lists the cities in the country and follows each link to a city page that lists the hotels in that city. On each city page, the crawler selects the links to hotel description pages and modifies them to obtain the review page for each hotel listed in the city. The links on each page are inside HTML elements that are uniquely identified by CSS classes and ids, so they can be selected and extracted with XPath. On a hotel’s review page, the crawler edits the review filters to get access to all of the reviews and then crawls every page of reviews. The crawled review pages are passed on to the scraper, which extracts the review data from them.

Figure 2: Parts of the crawler

3.4.1 Finding all review pages

Country-level pages contain a list of cities in that country that have listed hotels on Booking.com. City-level pages contain a list of all hotels available on Booking.com in that city. The process for extracting links from both country and city pages is similar. The items on both pages are listed alphabetically into sections, with each section being inside a division tag (div). Inside each section is an unordered list element (ul) that contains anchor tags (a) with the link the scraper needs. All sections are inside a single element that does not contain anything else, so the scraper doesn’t require logic for excluding other links on the page, such as airports or attractions.
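The link selection described above can be sketched with a single XPath expression. The following is a minimal, library-free illustration that uses Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. The markup, the container id `listing`, and the link targets are hypothetical stand-ins, because the real class names and ids are not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real ids and class names on the site
# are not stated in the text; "listing" is an assumption for illustration.
SAMPLE = """
<html><body>
  <a href="/airport/fi/example-airport.html">Example Airport</a>
  <div id="listing">
    <div class="section"><h3>H</h3>
      <ul>
        <li><a href="/city/fi/helsinki.html">Helsinki</a></li>
      </ul>
    </div>
    <div class="section"><h3>T</h3>
      <ul>
        <li><a href="/city/fi/tampere.html">Tampere</a></li>
        <li><a href="/city/fi/turku.html">Turku</a></li>
      </ul>
    </div>
  </div>
</body></html>
"""


def extract_links(markup, container_id="listing"):
    """Return the href of every anchor inside the section lists of the container."""
    root = ET.fromstring(markup.strip())
    # container div -> section divs -> ul -> any anchor below the ul
    path = f".//div[@id='{container_id}']/div/ul//a"
    return [a.get("href") for a in root.findall(path)]


links = extract_links(SAMPLE)
print(links)
```

Because the expression is rooted at the single container element, the airport link outside it is never selected, which mirrors the point that no exclusion logic is needed. In a Scrapy spider the same kind of expression would typically be passed to `response.xpath(...)`.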

A city page has links to hotel description pages, and each hotel also has a dedicated page that lists its reviews. Hotel pages use the following URL structure:

booking.com/hotel/<country code>/<hotel name>.<language code>.html

URL structure for a review page is as follows:

booking.com/reviews/<country code>/hotel/<hotel name>.<language code>.html

where country code is a code for the country the hotel is in, hotel name is an identifier for the hotel, and language code is a code for the translation language. The country code is “fi” for all hotels in Finland. The hotel name is a unique identifier that is extracted from the links on a city page. The language code can be, for example, “en-gb” for the English translation.

The URLs for the hotel description page and the hotel review page are similar, so the review page URL for a hotel can be constructed by replacing “hotel/<country code>” in the hotel description page URL with “reviews/<country code>/hotel”.
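The URL rewrite can be expressed as a single regular-expression substitution. The sketch below assumes a two-letter lowercase country code, as in the “fi” example; the function name `review_url` and the hotel identifier `example-hotel` are hypothetical.

```python
import re

# Matches "/hotel/<country code>/" with a two-letter country code (assumption).
HOTEL_PATH = re.compile(r"/hotel/([a-z]{2})/")


def review_url(hotel_url):
    """Turn a hotel description page URL into the matching review page URL."""
    new_url, count = HOTEL_PATH.subn(r"/reviews/\1/hotel/", hotel_url, count=1)
    if count == 0:
        raise ValueError(f"not a hotel description URL: {hotel_url}")
    return new_url


example = review_url("https://www.booking.com/hotel/fi/example-hotel.en-gb.html")
print(example)
```

Raising an error for URLs that do not match keeps unexpected links, such as those that are not hotel pages, from silently entering the crawl.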

3.4.2 Crawling filtered reviews

By default, Booking.com shows “Featured reviews” in the language of the user’s browser on a hotel’s review page. The review page also contains a form with three dropdown menus: review language, traveler type, and sorting. The review language option filters reviews by the language they are written in. Traveler type limits the reviews to the selected traveler type, the options being all travelers, business travelers, couples, families, groups of friends, or solo travelers. Sorting determines the order in which the reviews are shown, the options being date and score, each ascending or descending. Sorting also has the “Featured reviews” option, which instead of sorting shows only the reviews that Booking.com has determined to be “featured”.

Booking.com shows all reviews for the hotel when “All languages” is selected as the language, “All travelers” as the traveler type, and any option other than “Featured reviews”, for example a date order, as the sorting.

A hotel may not have any featured reviews, in which case the page at first contains only the form for selecting other reviews. In either case, the crawler selects the appropriate options from the form, makes a POST request with the form data, and passes the response, along with the hotel, city, and country names, to the part of the program that handles pagination.
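As an illustration of the form submission, the sketch below builds (but does not send) a POST request with the standard library. The form field names and values (`language`, `traveler_type`, `sort`) are hypothetical, since the actual field names of the form are not given here, and `example-hotel` is a placeholder identifier.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical field names and values: the real form fields are not
# documented in the text and would have to be read from the page itself.
ALL_REVIEWS_FILTER = {
    "language": "all",        # "All languages"
    "traveler_type": "all",   # "All travelers"
    "sort": "date_desc",      # any date order instead of "Featured reviews"
}


def build_filter_request(review_page_url, filters=ALL_REVIEWS_FILTER):
    """Build a POST request carrying the selected filter options as form data."""
    body = urlencode(filters).encode("ascii")
    return Request(review_page_url, data=body, method="POST")


request = build_filter_request(
    "https://www.booking.com/reviews/fi/hotel/example-hotel.en-gb.html"
)
print(request.get_method(), request.data)
```

In a Scrapy-based crawler the same step would more naturally be written with a form request whose callback receives the hotel, city, and country names as metadata.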

3.4.3 Pagination

Review pages have page links under the reviews if the hotel has more than 25 reviews. Booking.com handles pagination with a GET variable “page”. There is also a variable “rows”, which determines how many reviews each skipped page is assumed to contain. For example, for a request with the “page” parameter set to 2 and the “rows” parameter set to 75, Booking.com shows reviews 76 to 100: it skips the first 75 reviews and shows reviews from 76 onward. It acts as if the first page had 75 reviews on it even though one review page only has 25 reviews. This means that by default, just using the “next page” link skips two-thirds of the reviews. Changing the “rows” parameter to 25 fixes this.
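The arithmetic can be checked with a short sketch. It assumes that the number of skipped reviews is (page - 1) × rows, which is inferred from the single example above rather than from any documented behavior; the function names are hypothetical.

```python
PAGE_SIZE = 25  # reviews actually shown on one review page (from the text)


def shown_reviews(page, rows, page_size=PAGE_SIZE):
    """Return (first, last) review numbers shown for the given parameters.

    Assumes the site skips (page - 1) * rows reviews, as the example suggests.
    """
    skipped = (page - 1) * rows
    return skipped + 1, skipped + page_size


def coverage(total_reviews, rows, page_size=PAGE_SIZE):
    """Fraction of reviews seen when visiting pages 1, 2, 3, ... with fixed rows."""
    seen = 0
    page = 1
    while (page - 1) * rows < total_reviews:
        first, last = shown_reviews(page, rows, page_size)
        seen += min(last, total_reviews) - first + 1
        page += 1
    return seen / total_reviews


print(shown_reviews(2, 75))   # page 2 with rows=75
print(shown_reviews(2, 25))   # page 2 with rows=25
print(coverage(300, 75), coverage(300, 25))
```

Under this assumption, a hotel with 300 reviews yields only a third of its reviews with “rows” set to 75, and all of them with “rows” set to 25.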

The crawler can handle pagination in one of two ways. It can alter the “page” parameter in the URL, send requests for different pages, and use some logic to determine when it has reached the last page. Alternatively, it can follow the ‘Next page’ link on each page until there is no longer a ‘Next page’ link.

The latter approach has the advantage that Booking.com has already determined how many pages there are, so the crawler needs no logic for deciding whether there are pages left to crawl. This makes the crawler simpler and requires fewer requests. Scrapy's built-in duplicate filter handles the duplication that could result from, for example, following both ‘Next page’ links on a page.
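The link-following approach can be illustrated without Scrapy. The sketch below is not the crawler's actual code: it imitates the effect of a duplicate filter with a set of already-requested URLs, and the `fetch` and `find_next_links` callables, as well as the in-memory pages, are hypothetical stand-ins for real requests and XPath selection.

```python
from collections import deque


def crawl_review_pages(start_url, fetch, find_next_links):
    """Follow 'Next page' links until none remain, requesting each URL once."""
    seen = {start_url}          # plays the role of a duplicate filter
    queue = deque([start_url])
    pages = []
    while queue:
        url = queue.popleft()
        page = fetch(url)
        pages.append(page)
        for next_url in find_next_links(page):
            if next_url not in seen:   # drop the duplicate 'Next page' link
                seen.add(next_url)
                queue.append(next_url)
    return pages


# Hypothetical in-memory site: each page carries two identical 'Next page'
# links, and the last page carries none.
FAKE_SITE = {
    "reviews?page=1": {"reviews": ["r1", "r2"], "next": ["reviews?page=2"] * 2},
    "reviews?page=2": {"reviews": ["r3", "r4"], "next": ["reviews?page=3"] * 2},
    "reviews?page=3": {"reviews": ["r5"], "next": []},
}

pages = crawl_review_pages(
    "reviews?page=1",
    fetch=FAKE_SITE.__getitem__,
    find_next_links=lambda page: page["next"],
)
all_reviews = [review for page in pages for review in page["reviews"]]
print(len(pages), all_reviews)
```

The loop ends on its own when the last page offers no further link, so no page count or stopping rule has to be computed by the crawler.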
