Relative X-Path HTML Scraper
For Kayak Airfares
By: Anthony Kilde
Main Controller
Importing the schedule library lets the scraper run periodically throughout the day while also checking the current date, so the script stops automatically when needed.
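A rough sketch of that control loop follows; the cutoff date, interval, and job name here are hypothetical stand-ins, not taken from the original script.

import schedule
import time
from datetime import date

END_DATE = date(2020, 6, 1)  # hypothetical cutoff; the real script checks its own date

def run_scrape():
    pass  # placeholder for the scraping job

# the deck later mentions a 4-hour scrape cadence
schedule.every(4).hours.do(run_scrape)

while date.today() < END_DATE:
    schedule.run_pending()
    time.sleep(60)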
I used Selenium, a Python library with built-in functions for driving several webdrivers, including Chrome, Chromium, Internet Explorer, and Firefox. The library is predominantly used to “crawl” over webpages and interact with their elements.
Driver Methods
Chrome’s webdriver accepts multiple developer arguments that you can add for the functionality you want. Returning the webdriver gives flightScraper.py access to it. We define any arguments in the class initializer or in the class-scope arguments list.
Driver Arguments Added
Headless is an argument that starts the Chrome driver in the background, with no visible browser window; it is paired here with the argument that disables the GPU when opening Chrome.
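A minimal sketch of that setup, assuming a hypothetical helper class (the original keeps the arguments in the class initializer or a class-scope list):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class DriverFactory:
    # class-scope argument list, as described above
    arguments = ["--headless", "--disable-gpu"]

    def get_driver(self):
        options = Options()
        for arg in self.arguments:
            options.add_argument(arg)
        # returning the webdriver gives flightScraper.py access to it
        return webdriver.Chrome(options=options)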
Opening Elements
Data within Kayak’s webpage populates dynamically when the selected element is expanded. Without expanding the element, the HTML source will be missing most of the information we need to pull. To combat this, I opened each element before trying to scrape the data.
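A sketch of that expand-before-scrape step; the XPath below is a hypothetical stand-in for whatever locator matches Kayak’s flight rows.

from selenium.webdriver.common.by import By

def expand_element(driver, index):
    # click the flight row so its details populate in the HTML source
    row = driver.find_element(
        By.XPATH, f"(//div[contains(@class, 'resultWrapper')])[{index}]"
    )
    row.click()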
Kayak includes two elements for every flight: one at the top of the page in the main container, and one at the bottom in another container. To ensure we don’t pull data we have already collected, and to make sure we grab every flight, the scraper reads the integer between the parentheses in “Saved (6)”.
Inspecting the element and finding a keyword to access “Saved (6)”. The parsed integer becomes the iteration counter: dynCounter = 6.
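One way to pull that integer out of the label text; the helper name is hypothetical, and the regex simply captures the digits between the parentheses.

import re

def saved_count(label_text):
    # "Saved (6)" -> 6
    match = re.search(r"\((\d+)\)", label_text)
    return int(match.group(1)) if match else 0

dynCounter = saved_count("Saved (6)")  # dynCounter = 6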
Chrome Scraper
Creating a loop, the scraper iterates through the webpage HTML based on how many flights you have saved (dynCounter), finding one of the few keywords within the source code that does not dynamically update. The keyword “Flight” is our word of choice; from its XPath location we create a data list filled with the relevant information.
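A sketch of that loop; the relative XPath keyed on the stable keyword “Flight” is illustrative, not the exact expression from flightScraper.py.

from selenium.webdriver.common.by import By

def scrape_flights(driver, dyn_counter):
    flights = []
    for i in range(1, dyn_counter + 1):
        # anchor on the non-dynamic keyword "Flight", then walk up to the row
        xpath = f"(//span[contains(text(), 'Flight')])[{i}]/ancestor::div[3]"
        raw = driver.find_element(By.XPATH, xpath).text
        flights.append(raw.split("\n"))  # one data list per saved flight
    return flights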
At this point, the data needs to be cleaned up so that it contains only the variables needed for the assignment. The scraped data contains the same extraneous fields in the same positions every time, so we can remove the redundant parts by creating a new list and sending it over to the csvWriter.
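Since the redundant fields land in the same positions on every scrape, the cleanup can be position-based; the indices below are assumptions for illustration.

def clean_row(raw_row):
    # keep only the fields the assignment needs (positions are illustrative)
    keep = (0, 2, 3, 5)
    return [raw_row[i] for i in keep if i < len(raw_row)]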
Organizing Data
All the data is organized into a nested list so we can easily insert our IDs before writing rows neatly to Excel.
Preprocessed Data
Adding Identifiers for DB
The Flight ID is based on the destination, time, departure airline number, and return airline number.
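A sketch of how such an ID could be assembled; the field positions and separator are assumptions, not the original scheme.

def make_flight_id(row):
    # hypothetical field positions: destination, time, departure and return airline numbers
    destination, dep_time, dep_airline_no, ret_airline_no = row[:4]
    return f"{destination}-{dep_time}-{dep_airline_no}-{ret_airline_no}"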
Organizing Data
Data is appended to an Excel sheet once every 4 hours for the duration of the set time frame. Data can also be written to a local MySQL database once a day.
If data is written to the local database, the Excel sheet is deleted and a fresh one is created ONLY if the previous file was deleted properly.
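A hedged sketch of the daily database write, assuming the mysql-connector-python package and a made-up table layout (the original schema is not shown).

import mysql.connector

def write_to_db(rows):
    # hypothetical connection details and table; rows are (flight_id, price) pairs
    conn = mysql.connector.connect(
        host="localhost", user="scraper", password="...", database="flights"
    )
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO airfares (flight_id, price) VALUES (%s, %s)", rows
    )
    conn.commit()
    conn.close()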
Excel Data Layout
CSV Helper Class
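A minimal sketch of what that helper class might look like; the names and methods are assumptions based on the description above.

import csv
import os

class CsvWriter:
    def __init__(self, path):
        self.path = path

    def append_rows(self, rows):
        # append the nested list of flight data to the sheet
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerows(rows)

    def reset(self):
        # recreate the sheet only if the previous file deleted properly
        os.remove(self.path)
        if not os.path.exists(self.path):
            open(self.path, "w").close()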
Raspberry Pi Setup
I did not want to keep my desktop on 24/7, so I bought a Raspberry Pi along with a 7-inch touch-screen monitor. The Raspberry Pi was kept on for nearly 2 months scraping data. The script also works on macOS and Windows.
5 consecutive scrapes on Raspbian OS
How I connected the Pi
Front-end Development
The front end (written with tkinter) is functional and ready to connect to the backend, but that connection has yet to be made. The idea is to let users save the data for each flight found in “Saved Flights” and visually inspect the database cells/rows for each airfare they saved.
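A bare-bones sketch of that idea in tkinter; the widget layout and sample row are placeholders, not the actual front end.

import tkinter as tk

root = tk.Tk()
root.title("Saved Flights")
listbox = tk.Listbox(root, width=60)
listbox.pack(padx=10, pady=10)
# placeholder airfare row in the hypothetical ID format above
listbox.insert(tk.END, "LAX-08:00-DL1234-DL5678")
root.mainloop()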
