Relative X-Path HTML Scraper
For Kayak Airfares
By: Anthony Kilde
Main Controller
Importing the schedule library lets the scraper run periodically throughout the day while also checking the current date, so the script stops automatically when needed.
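A rough sketch of that control loop follows; the cutoff date, interval, and job name here are hypothetical stand-ins, not taken from the original script.

import schedule
import time
from datetime import date

END_DATE = date(2020, 6, 1)  # hypothetical cutoff; the real script checks its own date

def run_scrape():
    pass  # placeholder for the scraping job

# the deck later mentions a 4-hour scrape cadence
schedule.every(4).hours.do(run_scrape)

while date.today() < END_DATE:
    schedule.run_pending()
    time.sleep(60)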
I used Selenium, a Python library with built-in functions for driving several webdrivers, including Chrome, Chromium, Internet Explorer, and Firefox. The library is predominantly used to “crawl” over webpages and interact with their elements.
Driver Methods
Chrome’s webdriver accepts multiple developer arguments that you can add for the functionality you want. Returning the webdriver gives flightScraper.py access to it. We define any arguments in the class initializer or in the class-scope arguments list.
Driver Arguments Added
Headless is an argument that starts the Chrome driver in the background, with no visible browser window; it is paired here with the argument that disables the GPU when opening Chrome.
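A minimal sketch of that setup, assuming a hypothetical helper class (the original keeps the arguments in the class initializer or a class-scope list):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class DriverFactory:
    # class-scope argument list, as described above
    arguments = ["--headless", "--disable-gpu"]

    def get_driver(self):
        options = Options()
        for arg in self.arguments:
            options.add_argument(arg)
        # returning the webdriver gives flightScraper.py access to it
        return webdriver.Chrome(options=options)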
Opening Elements
Data within Kayak’s webpage populates dynamically when the selected element is expanded. Without expanding the element, the HTML source will be missing most of the information we need to pull. To combat this, I opened each element before trying to scrape the data.
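A sketch of that expand-before-scrape step; the XPath below is a hypothetical stand-in for whatever locator matches Kayak’s flight rows.

from selenium.webdriver.common.by import By

def expand_element(driver, index):
    # click the flight row so its details populate in the HTML source
    row = driver.find_element(
        By.XPATH, f"(//div[contains(@class, 'resultWrapper')])[{index}]"
    )
    row.click()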
Kayak includes two elements for every flight: one at the top of the page in the main container, and one at the bottom in another container. To ensure we don’t pull data we have already collected, and to make sure we grab every flight, the scraper reads the integer between the parentheses in “Saved (6)”.
Inspecting the element and finding a keyword to access “Saved (6)”. The parsed integer becomes the iteration counter: dynCounter = 6.
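One way to pull that integer out of the label text; the helper name is hypothetical, and the regex simply captures the digits between the parentheses.

import re

def saved_count(label_text):
    # "Saved (6)" -> 6
    match = re.search(r"\((\d+)\)", label_text)
    return int(match.group(1)) if match else 0

dynCounter = saved_count("Saved (6)")  # dynCounter = 6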
Chrome Scraper
Creating a loop, the scraper iterates through the webpage HTML based on how many flights you have saved (dynCounter), finding one of the few keywords within the source code that does not dynamically update. The keyword “Flight” is our word of choice; from its XPath location we create a data list filled with the relevant information.
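A sketch of that loop; the relative XPath keyed on the stable keyword “Flight” is illustrative, not the exact expression from flightScraper.py.

from selenium.webdriver.common.by import By

def scrape_flights(driver, dyn_counter):
    flights = []
    for i in range(1, dyn_counter + 1):
        # anchor on the non-dynamic keyword "Flight", then walk up to the row
        xpath = f"(//span[contains(text(), 'Flight')])[{i}]/ancestor::div[3]"
        raw = driver.find_element(By.XPATH, xpath).text
        flights.append(raw.split("\n"))  # one data list per saved flight
    return flights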
At this point, the data needs to be cleaned up so that it contains only the variables needed for the assignment. The scraped data contains the same extraneous fields in the same positions every time, so we can remove the redundant parts by creating a new list and sending it over to the csvWriter.
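Since the redundant fields land in the same positions on every scrape, the cleanup can be position-based; the indices below are assumptions for illustration.

def clean_row(raw_row):
    # keep only the fields the assignment needs (positions are illustrative)
    keep = (0, 2, 3, 5)
    return [raw_row[i] for i in keep if i < len(raw_row)]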
Organizing Data
All the data is organized into a nested list so we can easily insert our IDs before writing rows neatly to Excel.
Preprocessed Data
Adding Identifiers for DB
The Flight ID is based on the destination, time, departure airline number, and return airline number.
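A sketch of how such an ID could be assembled; the field positions and separator are assumptions, not the original scheme.

def make_flight_id(row):
    # hypothetical field positions: destination, time, departure and return airline numbers
    destination, dep_time, dep_airline_no, ret_airline_no = row[:4]
    return f"{destination}-{dep_time}-{dep_airline_no}-{ret_airline_no}"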
Organizing Data
Data is appended to an Excel sheet once every 4 hours for the duration of the set time frame. Data can also be written to a local MySQL database once a day.
If data is written to the local database, the Excel sheet is deleted and a fresh one is created ONLY if the previous file was deleted properly.
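A hedged sketch of the daily database write, assuming the mysql-connector-python package and a made-up table layout (the original schema is not shown).

import mysql.connector

def write_to_db(rows):
    # hypothetical connection details and table; rows are (flight_id, price) pairs
    conn = mysql.connector.connect(
        host="localhost", user="scraper", password="...", database="flights"
    )
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO airfares (flight_id, price) VALUES (%s, %s)", rows
    )
    conn.commit()
    conn.close()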
Excel Data Layout
CSV Helper Class
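A minimal sketch of what that helper class might look like; the names and methods are assumptions based on the description above.

import csv
import os

class CsvWriter:
    def __init__(self, path):
        self.path = path

    def append_rows(self, rows):
        # append the nested list of flight data to the sheet
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerows(rows)

    def reset(self):
        # recreate the sheet only if the previous file deleted properly
        os.remove(self.path)
        if not os.path.exists(self.path):
            open(self.path, "w").close()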
Raspberry Pi Setup
I did not want to keep my desktop on 24/7, so I bought a Raspberry Pi along with a 7-inch touch-screen monitor. The Raspberry Pi was kept on for nearly 2 months scraping data. The script also works on macOS and Windows.
5 consecutive scrapes on Raspbian OS
How I connected the Pi
Front-end Development
The front end (written with tkinter) is functional and ready to connect to the backend, but that connection has yet to be made. The idea is to let users save the data for each flight found in “Saved Flights” and visually inspect the database cells/rows for each airfare they saved.
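A bare-bones sketch of that idea in tkinter; the widget layout and sample row are placeholders, not the actual front end.

import tkinter as tk

root = tk.Tk()
root.title("Saved Flights")
listbox = tk.Listbox(root, width=60)
listbox.pack(padx=10, pady=10)
# placeholder airfare row in the hypothetical ID format above
listbox.insert(tk.END, "LAX-08:00-DL1234-DL5678")
root.mainloop()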
