MUKUND WAGH et al.

ISSN (O): 2348-4098

Date of Publication: May 05, 2015 Volume 3 Issue 2: 2015 ISSN (P): 2395-4752

A WEB SCRAPER FOR EXTRACTING ALUMNI INFORMATION FROM SOCIAL NETWORKING WEBSITES

1MUKUND WAGH, 2SUPARNIKA MOHATA, 3SHANTANU KAMDI, 4RASHMI MOHOD, 5DIKSHA PACHE, 6HIMANSHU KAUTKAR
1Student, Department of Computer Technology, YCCE, Nagpur, India, mukundwagh@gmail.com
2Student, Department of Computer Technology, YCCE, Nagpur, India, suparnika.mohata@gmail.com
3Student, Department of Computer Technology, YCCE, Nagpur, India, kamdishantanu@gmail.com
4Student, Department of Computer Technology, YCCE, Nagpur, India, rashmimohod@gmail.com
5Student, Department of Computer Technology, YCCE, Nagpur, India, dikshapache18@gmail.com
6Student, Department of Computer Technology, YCCE, Nagpur, India, kautkarhimanshu37@gmail.com

ABSTRACT

Scrapers extract information from web pages so that it can be stored in a well-defined structure and reused.
This project covers the design of a web scraper that uses keywords such as college name, batch, and year
to extract YCCE alumni information from the social networking websites Facebook and LinkedIn, together with the
design of a database for the extracted information. Reports generated through the user interface support batch-wise or name-wise retrieval of alumni records.

Index Terms: Web Scraper, Social Networking Websites, Extraction, Keyword, Parsing, Profiles.

1. INTRODUCTION

The computing world generates an extensive amount of data every day, and the concept of the scraper emerged to search this data for what is required. In general terms, web scraping is a way of fetching data from web pages using HTML parsing techniques: either a regular expression written in terms of tags is matched, or a specific tag is searched for directly. In this paper, we present an algorithmic and application-based description of a scraper that extracts alumni information from social networking websites and creates a database to store the scraped information, which can be accessed through a user interface for batch-wise or name-wise retrieval of records. We use the Graph API of Facebook to scrape alumni profiles from Facebook, while HTML parsing is used to scrape LinkedIn profiles.

2. RELATED WORK

From the study of related work on data mining techniques, it is clear that website scraping can be implemented in a variety of ways.

Ramakrishnan R. [1] implemented advanced multimedia answer generation by scraping information from the web. The work uses a novel multimedia question answering (MMQA) approach that can enrich community-contributed textual answers in cQA with appropriate media data. It consists of three components: answer medium selection, query generation for multimedia search, and multimedia data selection and presentation. The algorithms used to build the system are stemming and stop-word removal, Naive Bayes, bigram text classification, and POS histogram.

Vidya V.L. [2] surveyed various information extraction techniques such as SoftMealy, OLERA, IEPAD, RoadRunner, EXALG, NET, and FiVaTech. The survey also compares the extraction technologies on the basis of supervision type, learning algorithm, and number of pages. Among these web data extraction techniques, some extract flat records while others attempt to extract nested records.

V. B. Kadam [3] analyzed HTML-aware web scraping techniques. The techniques discussed include RoadRunner, W4F (Wysiwyg Web Wrapper Factory), XWRAP, IEPAD, FiVaTech, DELA (Data Extraction and Label Assignment for Web Databases), DEPTA (Data Extraction based on Partial Tree Alignment), ViPER (Visual Perception based Extraction of Records), ViNTs (Visual information and Tag structure based wrapper generator), CTVS (Data Extraction and Alignment using Combining Tag and Data Value Similarity), Mining Data Records in Web Pages, and MSE (Multiple Section Extraction).

Vasani Krunal A [4] introduced a solution to the tree edit distance problem, which is related to semantic analysis, and to improving the performance of tree edit distance computation. The work also focuses on the upper-bound time complexity of the algorithm.

Shridevi Swami [5] used an approach for atomic data extraction in a web scraping framework that combines tag and value similarity to automatically extract data from query result pages. A web data extraction system automatically and repeatedly extracts data from dynamic web pages and can deliver the extracted data to a database or another application.

Govind M. Upadhyay [6] focused on the value of web content mining. The paper gives insight into its techniques, processes, and applications in the current cut-throat business environment as well as in research and in extracting content for educational purposes. It further explains how web content mining plays an integral role by gathering a rich set of contents and using those
contents in decision making in the corporate environment, education, and research.

Anthony J. Dreyer [7] analyzed the legal framework surrounding scraping, addressing the grounds for potential claims against scrapers. Common theories of liability arising from scraping are copyright infringement, trespass to chattels, breach of contract, and violation of the Computer Fraud and Abuse Act (CFAA). The article discusses the leading cases applying these legal theories to website scraping and concludes that the most effective way to create potential claims against scrapers is through carefully drafted prohibitions in a website’s terms of use.

A. Shingate [8] explained the development of a website where users can get an optimized result for the different opinions on products, events, or services across different social networking websites. The project mainly deals with checking different opinions so that one can get a quick idea of what different users think.

P. P. Singh Bedi [9] introduced a technique for designing a web scraper using Prolog Server Pages. The authors attempt to establish a technique to scrape HTML pages and utilize the result according to the requirements of the data and its data type. The paper also describes how a GUI can help to access the information extracted from web pages.

Richard Baron Penman [10] presented a tool called Site Scraper that aims to address problems that occur while extracting the content of web pages. Using Site Scraper allows the user to focus on content rather than on the structure of the web page.

3. METHODOLOGY

3.1 HTML Parsing

Many websites consist of large sets of pages generated dynamically from a database. Data of the same kind are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts the information, and translates it into a relational form is called a wrapper or scraper.

3.2 Graph API

The Graph API is the primary way to get data from Facebook’s platform. It was launched in March 2010 with the intention of replacing the older REST API. There are three methods of using the Graph API: requesting data, posting data, and deleting data. Some data can be requested without user authentication, while most data depends on authentication. Developers can access profile information only if the user grants permission.

3.3 Proposed Algorithm

Algorithm for scraping Facebook profiles:
1. Log in to Facebook with user credentials.
2. Fire the search query and load all the elements of the page by infinite scrolling.
3. Scrape the URL element of each person from the HTML document.
4. Store all the scraped elements in an Excel sheet and replace each URL with its Graph API equivalent.
5. Store the profession of each person from the HTML document.
6. Store the JSON array of each searched person from graph.facebook.com.
7. Convert all records stored in the Excel sheet.
8. Load all the scraped data from Excel into the Access database.
9. Access the database using the UI.

Algorithm for scraping LinkedIn profiles:
1. Log in to LinkedIn with user credentials.
2. Fire the search query in the given text box and load all elements using the next-page click event.
3. Scrape the profiles of alumni through the HTML class of each element.
4. Put the inner text of the tags into variables.
5. Store the values of the variables in the Access file using SQL queries.

The Graph API of Facebook lets developers open the profiles of Facebook users and retrieve each user's public data. HTML parsing allows us to store the text fields present in the inner tags as the information of each user.

3.4 Flowchart

Figure 1: Work flow chart for scraper

The work flow of both modules is similar in certain respects. The user enters login credentials, which are filled in automatically in the navigated web browser window of the social networking site. The profile search is based on a particular key constraint; for example, here
the keyword is “people who studied at Yeshwantrao Chavan College of Engineering”.

The keyword is transferred to the search field of the website, which starts returning the profiles of alumni. The profiles are then fetched as records by the procedure given in the algorithm above for the respective website. The information is stored in the Access database and can be retrieved directly through the user interface designed for it.

4. DERIVED RESULTS

The scraper selection form shown in Figure 2 allows the user to choose the scraper for LinkedIn or Facebook. The click event of the LinkedIn or Facebook button forwards control to the chosen social networking website.

Figure 2: Scraping UI for LinkedIn / Facebook

The keywords for searching alumni records are received by the search box of the Facebook page, as shown in Figure 3. The click event of the search button on this web page is then invoked, which displays the required profiles on the web page. Figure 3 shows the list of matched profiles.

Figure 3: Dynamic keyword based searching of alumni profiles from Facebook

The Excel sheet shown in Figure 4 contains all the alumni information that has been fetched from Facebook. The fetched information includes fields such as First Name, Last Name, Gender, Work, Facebook ID, and URL. The information fetched from Facebook stays within the bounds of privacy: information that the user keeps private is not fetched during scraping.

Figure 4: Storing of Alumni complete information fetched from Facebook

After a successful login to LinkedIn, we need to enter the keyword for searching and scraping alumni profiles. The keywords are reflected in the search box of the LinkedIn web page after we click the ‘Search’ button in the web browser. The matched results are displayed in the search result container.

Figure 5 shows the search container elements of the LinkedIn search result. Clicking the “Scrap” button stores the inner text of the required HTML tags in the Access file. The Access file is checked for redundant data, and only unique records are stored. The LinkedIn scraper scrapes the “Full Name”, “Location”, “Past Profile”, and “Current Profile” of the alumni.

Figure 5: Keyword based searching of alumni profiles from LinkedIn

Figure 6 shows the information of about 25 alumni fetched from LinkedIn profiles. The Access file records shown in Figure 6 include the “Full Name”, “Location”, “Past Profile”, and “Current Profile” of alumni.
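
The LinkedIn storage behaviour described in this section (inner text of class-selected tags placed into variables, then written with SQL so that only unique records remain) could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `result`, `name`, `location`, `past`, and `current`, the sample markup, and the table name `alumni` are all hypothetical, and SQLite stands in for the Access file.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical mapping from CSS class to column; the real LinkedIn
# markup is not given in the paper.
FIELD_CLASSES = {
    "name": "full_name",
    "location": "location",
    "past": "past_profile",
    "current": "current_profile",
}


class ProfileParser(HTMLParser):
    """Collects the inner text of tags whose class maps to a known field."""

    def __init__(self):
        super().__init__()
        self.records = []    # one dict per search result
        self._record = None  # record currently being filled
        self._field = None   # column that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        css = dict(attrs).get("class")
        if css == "result":  # hypothetical container class
            self._record = dict.fromkeys(FIELD_CLASSES.values(), "")
            self.records.append(self._record)
        elif css in FIELD_CLASSES and self._record is not None:
            self._field = FIELD_CLASSES[css]

    def handle_data(self, data):
        if self._field:
            self._record[self._field] += data.strip()

    def handle_endtag(self, tag):
        self._field = None


def store_unique(conn, records):
    """Insert records, skipping exact duplicates; return the stored row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alumni ("
        "full_name TEXT, location TEXT, past_profile TEXT, current_profile TEXT, "
        "UNIQUE (full_name, location, past_profile, current_profile))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO alumni VALUES "
        "(:full_name, :location, :past_profile, :current_profile)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM alumni").fetchone()[0]


# Made-up sample markup: the first two results are deliberate duplicates.
SAMPLE_HTML = """
<ol>
  <li class="result"><a class="name">Asha Rao</a>
    <span class="location">Nagpur</span>
    <p class="past">Intern at ExampleSoft</p>
    <p class="current">Engineer at ExampleCorp</p></li>
  <li class="result"><a class="name">Asha Rao</a>
    <span class="location">Nagpur</span>
    <p class="past">Intern at ExampleSoft</p>
    <p class="current">Engineer at ExampleCorp</p></li>
  <li class="result"><a class="name">Ravi Patil</a>
    <span class="location">Pune</span>
    <p class="past">Trainee at ExampleSoft</p>
    <p class="current">Analyst at ExampleCorp</p></li>
</ol>
"""

parser = ProfileParser()
parser.feed(SAMPLE_HTML)
parser.close()

conn = sqlite3.connect(":memory:")
stored = store_unique(conn, parser.records)
print(f"{len(parser.records)} scraped, {stored} stored")
```

The `UNIQUE` constraint combined with `INSERT OR IGNORE` is one way to obtain the "store only unique records" behaviour; the paper does not say how the Access file performs its redundancy check.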

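For the Facebook records summarized under Figure 4, steps 4 to 6 of the Facebook algorithm (replace the profile URL with its Graph API form, then keep the JSON for each person) might look like the sketch below. It makes no network request. The profile URL shape, the JSON keys `first_name`, `last_name`, `gender`, `id`, and `link`, and the sample payload are assumptions for illustration; the paper does not list the keys it reads, and the profession is passed in separately because the algorithm takes it from the HTML document rather than from the JSON.

```python
import json
from urllib.parse import urlsplit, urlunsplit

GRAPH_HOST = "graph.facebook.com"  # host named in step 6 of the algorithm

# Column in the Excel sheet -> assumed key in the JSON response.
COLUMNS = {
    "First Name": "first_name",
    "Last Name": "last_name",
    "Gender": "gender",
    "Facebook ID": "id",
    "URL": "link",
}


def to_graph_url(profile_url):
    """Swap the host of a profile URL for the Graph API host, dropping the query."""
    parts = urlsplit(profile_url)
    return urlunsplit((parts.scheme, GRAPH_HOST, parts.path, "", ""))


def to_row(payload, work=""):
    """Turn one JSON payload into a flat row; missing keys become empty strings."""
    data = json.loads(payload)
    row = {column: data.get(key, "") for column, key in COLUMNS.items()}
    row["Work"] = work  # scraped from the HTML page, not from the JSON
    return row


# Hypothetical payload standing in for a stored Graph API response.
SAMPLE_PAYLOAD = json.dumps({
    "id": "1000",
    "first_name": "Asha",
    "last_name": "Rao",
    "gender": "female",
    "link": "https://www.facebook.com/asha.rao.example",
})

graph_url = to_graph_url("https://www.facebook.com/asha.rao.example?fref=ts")
row = to_row(SAMPLE_PAYLOAD, work="Engineer at ExampleCorp")
print(graph_url)
print(row)
```

Using `data.get(key, "")` keeps the row shape fixed even when a user has not made a field public, which matches the statement that private information is simply not fetched.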

Figure 6: Alumni Information fetched from LinkedIn

The merged database of LinkedIn and Facebook data is shown in Figure 7. Some fields retrieved from Facebook and from LinkedIn do not coincide, so we have filled the missing fields with a dummy value, i.e. 0.

Figure 7: Sample record from the merged database of LinkedIn and Facebook data

5. CONCLUSION

This web scraper is able to extract the profiles of alumni and fetch relevant information from social networking sites such as Facebook and LinkedIn. The algorithm specifies the working of the scraper for LinkedIn and Facebook; both HTML parsing and the Graph API are used for retrieval of information. The information retrieved through the scraper is useful for various purposes. The scraper does not violate any authentication requirement, and the scraped information is used for the creation of an alumni database.

REFERENCES
[1]. Ramakrishnan R., Jayalakshmi A., Priyadharshani S., “Advanced Multimedia Answer Generation by Scraping Information through Web”, International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297: 2007 Certified Organization), Vol. 2, Issue 12, December 2014.
[2]. Vidya V.L., “A Survey of Web Data Extraction Techniques”, International Journal of Advance Research in Computer Science and Management Studies, Volume 2, Issue 9, September 2014.
[3]. Vinayak B. Kadam, Ganesh K. Pakle, “A Survey on HTML Structure Aware and Tree Based Web Data Scraping Technique”, International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014.
[4]. Vasani Krunal A, “Content evocation using web scraping and semantic illustration”, IOSR Journal of Computer Engineering, Volume 16, Issue 3, May-Jun. 2014.
[5]. Shridevi Swami, Pujashree Vidap, “Web Scraping Framework based on Combining Tag and Value Similarity”, International Journal of Computer Science Issues, Vol. 10, Issue 6, No 2, November 2013.
[6]. Govind Murari Upadhyay, Kanika Dhingra, “Web Content Mining: Its Techniques and Uses”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 11, November 2013.
[7]. Anthony J. Dreyer and Jamie Stockton, “Internet Data Scraping, a Primer for Counseling Clients”, New York Law Journal Special Section, July 15, 2013.
[8]. Abhinav Shingate, Nayan Tayade, Rahul More, Parag Zaware, “Opinion Mining: Opinion Extractor from Social Networking Sites [Single Page Result]”, International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 4, April 2012.
[9]. Parminder Pal Singh Bedi, Sumit Kumar, “Web scraping and implementation using prolog server pages in semantic web”, International Journal of Research in Engineering & Applied Sciences, Volume 2, Issue 2, February 2012.
[10]. Richard Baron Penman, Timothy Baldwin, David Martinez, “Web Scraping Made Simple with Site Scraper”, International Journal of Research in Engineering & Applied Sciences, Volume 4, Issue 3, May 2000.

International Journal of Science, Engineering and Technology- www.ijset.in 448
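
The merge of Facebook and LinkedIn records into one table (Figure 7), with 0 filling the fields a source does not provide, could be sketched as below. The merged column list is an assumption: it is simply the Facebook fields followed by the LinkedIn fields named in Section 4, since the paper does not give the merged schema. The sample rows are made up.

```python
# Field names as listed in Section 4 for each source.
FACEBOOK_FIELDS = ["First Name", "Last Name", "Gender", "Work", "Facebook ID", "URL"]
LINKEDIN_FIELDS = ["Full Name", "Location", "Past Profile", "Current Profile"]

# Assumed merged schema: the union of both field lists.
MERGED_FIELDS = FACEBOOK_FIELDS + LINKEDIN_FIELDS

PLACEHOLDER = 0  # the dummy value named in the paper


def to_merged(record):
    """Widen a record to the merged schema, padding absent fields with 0."""
    return {field: record.get(field, PLACEHOLDER) for field in MERGED_FIELDS}


# Hypothetical sample rows, one per source.
facebook_rows = [{
    "First Name": "Asha", "Last Name": "Rao", "Gender": "female",
    "Work": "Engineer at ExampleCorp", "Facebook ID": "1000",
    "URL": "https://www.facebook.com/asha.rao.example",
}]
linkedin_rows = [{
    "Full Name": "Ravi Patil", "Location": "Pune",
    "Past Profile": "Trainee at ExampleSoft",
    "Current Profile": "Analyst at ExampleCorp",
}]

merged = [to_merged(record) for record in facebook_rows + linkedin_rows]
for row in merged:
    print(row)
```

Every merged row has the same ten columns, so the result can be loaded into a single database table without per-source handling.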
