MUKUND WAGH et al.
ISSN (O): 2348-4098
Date of Publication: May 05, 2015 Volume 3 Issue 2: 2015 ISSN (P): 2395-4752
A WEB SCRAPER FOR EXTRACTING ALUMNI INFORMATION FROM SOCIAL
NETWORKING WEBSITES
1MUKUND WAGH, 2SUPARNIKA MOHATA, 3SHANTANU KAMDI, 4RASHMI MOHOD, 5DIKSHA PACHE, 6HIMANSHU KAUTKAR
1Student, Department of Computer Technology, YCCE, Nagpur, India, mukundwagh@gmail.com
2Student, Department of Computer Technology, YCCE, Nagpur, India, suparnika.mohata@gmail.com
3Student, Department of Computer Technology, YCCE, Nagpur, India, kamdishantanu@gmail.com
4Student, Department of Computer Technology, YCCE, Nagpur, India, rashmimohod@gmail.com
5Student, Department of Computer Technology, YCCE, Nagpur, India, dikshapache18@gmail.com
6Student, Department of Computer Technology, YCCE, Nagpur, India, kautkarhimanshu37@gmail.com
ABSTRACT
Scrapers extract information from a repository of web pages so that it can be stored in a well‐defined structure and
used for various purposes. This project concerns the design of an efficient web scraper that uses keywords such as college
name, batch and year to extract YCCE alumni information from the social networking websites Facebook and LinkedIn, together
with the design of a database for the extracted information. Reports generated through the user interface can be used for
various purposes.
Index Terms: Web Scraper, Social Networking Websites, Extraction, Keyword, Parsing, Profiles.
1. INTRODUCTION

The computer world generates an extensive amount of data every day, and the concept of the scraper emerged to search this data for what is required. In general terms, web scraping is a way of fetching data from web pages using an HTML parsing technique: either a regular expression over tags is used, or the tag is searched for directly. In this paper, we present the algorithmic and application-based representation of a scraper that extracts alumni information from social networking websites and creates a database to store the scraped information, which can be accessed through a user interface for batch-wise or name-wise retrieval of records. We use the Graph API of Facebook for scraping alumni profiles from Facebook, while HTML parsing is used for scraping LinkedIn profiles.

2. RELATED WORK

From the study of related work in data mining techniques, it is clear that scraping of websites can be implemented in a variety of ways.

Ramakrishnan R. [1] implemented advanced multimedia answer generation by scraping information through the web. It uses a novel multimedia question answering (MMQA) approach. This technique can enrich community‐contributed textual answers in cQA with appropriate media data. It consists of three components: answer medium selection, query generation for multimedia search, and multimedia data selection and presentation. The algorithms used to create the system are a stemming algorithm with stop word removal, Naive Bayes, bigram text classification, and POS histogram.

Vidya V.L. [2] surveyed various information extraction techniques such as SoftMealy, OLERA, IEPAD, RoadRunner, EXALG, NET and FiVaTech. The survey also compares the extraction techniques on the basis of supervision type, learning algorithm and the number of pages. Among these web data extraction techniques, some extract flat records and others try to extract nested records.

V. B. Kadam [3] analyzed HTML-aware web scraping techniques. The techniques discussed include RoadRunner, W4F (Wysiwyg Web Wrapper Factory), XWRAP, IEPAD, FiVaTech, DELA (Data Extraction and Label Assignment for Web Databases), DEPTA (Data Extraction based on Partial Tree Alignment), ViPER (Visual perception based Extraction of Records), ViNTs (Visual information and Tag structure based wrapper generator), CTVS (Data Extraction and Alignment using Combining Tag and Data Value Similarity), Mining Data Records in Web Pages, and MSE (Multiple Section Extraction).

Vasani Krunal A [4] introduced a solution to the tree edit distance problem, which is related to semantic analysis, with the aim of improving the performance of tree edit distance computation. The work also focuses on the upper-bound time complexity of this algorithm.

Shridevi Swami [5] used an approach for atomic data extraction in a web scraping framework that combines tag and value similarity to automatically extract data from query result pages. Such a web data extraction system automatically and repeatedly extracts data from dynamic web pages and can deliver the extracted data to a database or some other application.

Govind M. Upadhyay [6] focused on the value of web content mining. The paper gives an insight into its techniques, processes and applications in the current cut‐throat business environment as well as in research and in extracting contents for educational purposes. It further explains how web content mining plays an integral role by gathering a rich set of contents and using those
International Journal of Science, Engineering and Technology- www.ijset.in 445
contents in decision making in the corporate environment, education and research.

Anthony J. Dreyer [7] analyzed the legal framework surrounding scraping, addressing the grounds for potential claims against scrapers. Common theories of liability arising from scraping are copyright infringement, trespass to chattels, breach of contract, and violation of the Computer Fraud and Abuse Act (CFAA). The article discusses the leading cases applying these legal theories to website scraping, and concludes that the most effective way to create potential claims against scrapers is through carefully drafted prohibitions in a website’s terms of use.

A. Shingate [8] explained the development of a website where users can get an optimized result for the different opinions on products, events or services expressed on different social networking websites. The project mainly deals with checking different opinions so that a quick idea of users' views can be obtained.

P. P. Singh Bedi [9] introduced a technique for designing a web scraper using Prolog Server Pages. The authors attempt to establish a technique to scrape HTML pages and utilize the result as required by the data and its data type. The paper also describes how a GUI can help to access the information extracted from web pages.

Richard Baron Penman [10] described a tool called Site Scraper that aims to address problems that occur while extracting the content of web pages. Using Site Scraper allows the user to focus on the content rather than on the structure of the web page.

3. METHODOLOGY

3.1 HTML Parsing

Many websites consist of large sets of pages generated dynamically from a database. Data of the same type are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its information and translates it into a relational form is called a wrapper or scraper.

3.2 Graph API

The Graph API is the primary way to get data from Facebook’s platform. The Graph API was launched in March 2010 with the intention of replacing the older REST API. There are three methods of using the Graph API: requesting data, posting data, and deleting data. Some data can be requested without authentication of the user, while most data depends on authentication. Developers can retrieve profile information only if the user grants permission to access it.

3.3 Proposed Algorithm

Algorithm for scraping Facebook profiles:
1. Log in to Facebook with user credentials.
2. Fire the search query and load all the elements of the page by infinite scrolling.
3. Scrape the URL elements of each person from the HTML document.
4. Store all the scraped elements in an Excel sheet and replace each URL with the Graph API URL.
5. Store the profession of each person from the HTML document.
6. Store the JSON array of each searched person from graph.Facebook.com.
7. Convert all records stored in the Excel sheet.
8. Load all the scraped data from Excel into the Access database.
9. Access the database using the UI.

Algorithm for scraping LinkedIn profiles:
1. Log in to LinkedIn with user credentials.
2. Fire the search query in the given text box and load all elements with the next-page click event.
3. Scrape the profiles of alumni through the HTML class of each element.
4. Put the inner text of the tags into variables.
5. Store the values of the variables in the Access file using SQL queries.

The Graph API of Facebook lets developers open the profiles of Facebook users from the HTML document and retrieve the users' public data. HTML parsing allows us to store the text fields present in the inner tags as the information of each user.

3.4 Flowchart

Figure 1: Work flow chart for scraper

The work flow of both modules is similar in certain respects. The user enters login credentials, which are filled in automatically in the web browser window navigated to the social networking site. The search for profiles is based on a particular key constraint; for example, here
the keyword is “people who studied at Yeshwantrao Chavan College of Engineering”.

The keyword is passed to the search field of the website, which then returns the profiles of alumni. The profiles are fetched as records by the procedure given in the algorithm above for the respective website. The information is stored in an Access database and can be retrieved directly through the user interface designed for it.

4. DERIVED RESULTS

The scraper selection form shown in Figure 2 allows the user to choose the scraper for LinkedIn or Facebook. The click event of the LinkedIn or Facebook button forwards the user to the chosen social networking website.

Figure 2: Scraping UI for LinkedIn / Facebook

The keywords for searching alumni records are received by the search box of the Facebook page, as shown in Figure 3. The click event of the search button on this web page is then invoked, which causes the required profiles to be displayed on the web page. The web page in Figure 3 shows the list of matched profiles.

Figure 3: Dynamic keyword based searching of alumni profiles from Facebook

The Excel sheet shown in Figure 4 contains all the alumni information fetched from Facebook. The fetched information includes fields such as First Name, Last Name, Gender, Work, Facebook ID and URL. The information fetched from Facebook stays within the bounds of privacy: information that the user keeps private is not fetched during scraping.

Figure 4: Storing of Alumni complete information fetched from Facebook

After successful login to LinkedIn, we enter the keyword for searching and scraping alumni profiles. The keywords appear in the search box of the LinkedIn web page after we click the ‘Search’ button in the web browser. The matched results are displayed in the search result container.

Figure 5 shows the search container elements of the LinkedIn search result. Clicking the “Scrap” button stores the inner text of the required HTML tags in the Access file. The Access file is checked for redundant data, and only unique records are stored. The LinkedIn scraper scrapes the “Full Name”, “Location”, “Past Profile” and “Current Profile” of the alumni.

Figure 5: Keyword based searching of alumni profiles from LinkedIn

Figure 6 shows the information of about 25 alumni fetched from LinkedIn profiles. The Access file records shown in the figure include the “Full Name”, “Location”, “Past Profile” and “Current Profile” of the alumni.
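The LinkedIn storage step described above (inner text of the result tags written to the Access file, with redundant records skipped) might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it uses Python's standard html.parser and sqlite3 modules in place of Microsoft Access, and the HTML class names (result, name, location, past, current), the table name alumni and the sample markup are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical mapping from HTML class names to the four scraped fields.
# Real search-result pages use different class names.
FIELD_CLASSES = {
    "name": "full_name",
    "location": "location",
    "past": "past_profile",
    "current": "current_profile",
}
COLUMNS = ["full_name", "location", "past_profile", "current_profile"]


class ResultParser(HTMLParser):
    """Collects the inner text of tags whose class marks a scraped field."""

    def __init__(self):
        super().__init__()
        self.records = []   # one dict per search-result container
        self._field = None  # field whose text is expected next, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "result" in classes:          # start of a new search result
            self.records.append({})
        for cls in classes:
            if cls in FIELD_CLASSES and self.records:
                self._field = FIELD_CLASSES[cls]

    def handle_data(self, data):
        # Store the first non-blank text that follows a field tag.
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()
            self._field = None


def store_unique(conn, records):
    """Insert records with SQL, ignoring rows that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alumni ("
        "full_name TEXT, location TEXT, past_profile TEXT, current_profile TEXT, "
        "UNIQUE(full_name, location, past_profile, current_profile))"
    )
    for rec in records:
        conn.execute(
            "INSERT OR IGNORE INTO alumni VALUES (?, ?, ?, ?)",
            [rec.get(col, "") for col in COLUMNS],
        )
    conn.commit()


# Made-up sample markup containing the same profile twice.
sample = """
<div class="result"><span class="name">A. Student</span>
<span class="location">Nagpur</span><span class="past">Intern</span>
<span class="current">Engineer</span></div>
<div class="result"><span class="name">A. Student</span>
<span class="location">Nagpur</span><span class="past">Intern</span>
<span class="current">Engineer</span></div>
"""

parser = ResultParser()
parser.feed(sample)

conn = sqlite3.connect(":memory:")   # in-memory stand-in for the Access file
store_unique(conn, parser.records)
row_count = conn.execute("SELECT COUNT(*) FROM alumni").fetchone()[0]
```

In this sketch the uniqueness check is delegated to the database through a UNIQUE constraint, so the duplicate profile in the sample is parsed but not stored a second time. The same effect could be obtained by querying for an existing row before each insert.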
Figure 6: Alumni Information fetched from LinkedIn

The merged database of LinkedIn and Facebook data is shown in Figure 7. Some fields retrieved from Facebook and from LinkedIn are not the same, so a field that is missing for a record is filled with the dummy value 0.

Figure 7: Sample record from the merged database of LinkedIn and Facebook data

5. CONCLUSION

This web scraper is able to extract the profiles of alumni and fetch relevant information from social networking sites such as Facebook and LinkedIn. The algorithm specifies the working of the scraper for LinkedIn and Facebook; both HTML parsing and the Graph API are used for retrieval of information. The information retrieved through the scraper is useful for various purposes. The scraper does not violate any authentication requirement, and the scraped information is used to create an alumni database.

REFERENCES

[1]. Ramakrishnan R., Jayalakshmi A., Priyadharshani S., “Advanced Multimedia Answer Generation by Scraping Information through Web”, International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297: 2007 Certified Organization), Vol. 2, Issue 12, December 2014.
[2]. Vidya V.L., “A Survey of Web Data Extraction Techniques”, International Journal of Advance Research in Computer Science and Management Studies, Volume 2, Issue 9, September 2014.
[3]. Vinayak B. Kadam, Ganesh K. Pakle, “A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique”, International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014.
[4]. Vasani Krunal A, “Content evocation using web scraping and semantic illustration”, IOSR Journal of Computer Engineering, Volume 16, Issue 3, May‐Jun. 2014.
[5]. Shridevi Swami, Pujashree Vidap, “Web Scraping Framework based on Combining Tag and Value Similarity”, International Journal of Computer Science Issues, Vol. 10, Issue 6, No 2, November 2013.
[6]. Govind Murari Upadhyay, Kanika Dhingra, “Web Content Mining: Its Techniques and Uses”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 11, November 2013.
[7]. Anthony J. Dreyer and Jamie Stockton, “Internet Data Scraping, a Primer for Counseling Clients”, New York Law Journal Special Section, July 15, 2013.
[8]. Abhinav Shingate, Nayan Tayade, Rahul More, Parag Zaware, “Opinion Mining: Opinion Extractor from Social Networking Sites [Single Page Result].”, International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 4, April 2012.
[9]. Parminder Pal Singh Bedi, Sumit Kumar, “Web scraping and implementation using prolog server pages in semantic web”, International Journal of Research in Engineering & Applied Sciences, Volume 2, Issue 2, February 2012.
[10]. Richard Baron Penman, Timothy Baldwin, David Martinez, “Web Scraping Made Simple with Site Scraper”, International Journal of Research in Engineering & Applied Sciences, Volume 4, Issue 3, May 2000.