BANK DATA ANALYSIS
You are provided with two datasets (structured & semi-structured) containing information about bank deposit details.
The high-level activities to be performed are:
1. Load all the structured data to MYSQL
2. Ingest structured data into the HDFS environment using SQOOP (Extract)
3. Ingest Semi-structured data into HDFS environment (Extract)
4. Perform ETL operation on JSON data using PIG scripts (Transform)
5. Perform ETL and Load sqoop output data into HIVE (Load)
6. Analyse data using HQL (Analyse)
INPUT FILES
The bank deposit input files are available in the “~/Desktop/Project/wingst2-banking-challenge/” folder.
Structured Data:- Chase_Bank.csv
Semi-structured Data:- Chase_Bank_1.json
Important Instructions to be followed:-
HDFS Directories to store output files.
a. SQOOP output should be stored in the hdfs:/user/labuser/sqoop_bank directory.
b. PIG output should be stored in the hdfs:/user/labuser/bank1 directory.
Output files of your assessment (***.txt) should be present in the local challenge folder (/home/labuser/Desktop/Project/wings-xx-challenge).
Follow the steps below to complete the assessment:-
Step 1:- Loading Data to MYSQL
Log in to MySQL:
Username: root
Password: labuserbdh
Create a new database and table using MySQL commands to load the structured data.
DB Name:- bank_db
Table:- bank
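For reference, the database can be created and selected in the MySQL client before running the table script given below (a minimal sketch using the names above; adjust if your environment differs):
create database bank_db;
use bank_db;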
The create table and load scripts are given below for your reference.
create table bank (Id int , Institution_Name varchar(2000), Branch_Name varchar(2000), Branch_Number int, City varchar(2000),
County varchar(2000), State varchar(2000), Zipcode int, 2010_Deposits int,
2011_Deposits int, 2012_Deposits int, 2013_Deposits int, 2014_Deposits int,
2015_Deposits int, 2016_Deposits int);
Load Chase_Bank.csv data into the table bank:
load data local infile '/home/labuser/Desktop/Project/wingst2-banking-challenge/Chase_Bank.csv' into table bank fields terminated by ',' lines terminated by '\n' ignore 1 rows (Id,
Institution_Name, Branch_Name, Branch_Number, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits,
2016_Deposits);
Step 2:- Import Structured data to HDFS using SQOOP
Load data from the MySQL table bank to HDFS (/user/labuser/sqoop_bank) using Sqoop based on the following conditions.
Columns to be imported: Id, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits
Import only the records whose City is NOT IN the below-mentioned cities:
"Rochester", "Austin", "Chicago", "Indianapolis"
Number of mappers should be 1
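The exact Sqoop command is left to you; the sketch below is one possible form, assuming the MySQL server runs on localhost and that the connector accepts the column names as written (they begin with digits, so quoting or renaming may be needed in some setups):
sqoop import \
  --connect jdbc:mysql://localhost/bank_db \
  --username root --password labuserbdh \
  --table bank \
  --columns "Id,City,County,State,Zipcode,2010_Deposits,2011_Deposits,2012_Deposits,2013_Deposits,2014_Deposits,2015_Deposits,2016_Deposits" \
  --where "City NOT IN ('Rochester','Austin','Chicago','Indianapolis')" \
  --target-dir /user/labuser/sqoop_bank \
  -m 1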
Run the below command to copy the sqoop output from HDFS to the sqoop_output.txt file:
hdfs dfs -cat /user/labuser/sqoop_bank/* > sqoop_output.txt
Note:- Make sure that your output files are available in the challenge folder.
You will be loading and analysing the Sqoop output data in Hive using HQL in further steps.
Step 3:- Cleansing Semi-structured (Json) data Using PIG
You will be cleansing the JSON data (Chase_Bank_1.json) which is available in the challenge input folder.
Load this JSON data into PIG using Pig Latin scripts.
Note:- You can either load this data directly from the challenge input folder, or use the required commands to copy it to HDFS and then load it into Pig.
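If you choose the HDFS route, a copy command of the following shape could be used (the target directory /user/labuser/ is an assumption; any HDFS path you then reference from Pig will work):
hdfs dfs -put ~/Desktop/Project/wingst2-banking-challenge/Chase_Bank_1.json /user/labuser/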
Use Pig Latin scripts to find the minimum number of deposits in 2016 (i.e., MIN(Deposits_2016)) for each county. Assign “minimum_dep” as the column name.
Sort the output in descending order based on “minimum_dep” and read only the first 50 records.
(A script sketch covering these steps is given after the STORE step below.)
Sample output format:-
{"group":"Gillespie","minimum_dep":212776}{"group":"Imperial","minimum_dep":148284}
STORE the result in the HDFS directory (/user/labuser/pigoutput).
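A minimal Pig Latin sketch of the above is shown here. The JsonLoader schema is an assumption (only County and Deposits_2016 are named by the task), and the load path assumes the JSON was copied to /user/labuser/ as in the earlier step:
-- Sketch only: if the built-in JsonLoader does not accept a partial schema,
-- list all fields of Chase_Bank_1.json in file order.
raw = LOAD '/user/labuser/Chase_Bank_1.json'
      USING JsonLoader('County:chararray, Deposits_2016:int');
grp = GROUP raw BY County;
-- MIN over each county's bag, named minimum_dep as required
min_dep = FOREACH grp GENERATE group, MIN(raw.Deposits_2016) AS minimum_dep;
sorted = ORDER min_dep BY minimum_dep DESC;
top50 = LIMIT sorted 50;
STORE top50 INTO '/user/labuser/pigoutput' USING JsonStorage();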
Use the required commands to copy the PIG output to the challenge folder in the file pig_output.txt.
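Following the same pattern as the Sqoop copy command above (run from the challenge folder), something like this should work, using the STORE directory from the previous step:
hdfs dfs -cat /user/labuser/pigoutput/* > pig_output.txt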
Step 4:- Loading and Analysing Data in Hive
Now, you have to load sqoop output to HIVE tables.
Create the below database & tables in Hive.
Database: hive_db
Partition Table: bank_part
Columns: Id, City, County, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits
Partition should be based on the column State.
Read the records which satisfy the below conditions & load them into the bank_part table:
City in Bronx, NewYorkCity, Dallas, Houston, Columbus
State in "NY","OH","TX"
Hint:- Create a temporary table to load Sqoop output and then load data to partitioned table with necessary filters.
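One possible shape for this step is sketched below. It assumes the Sqoop output is comma-delimited text (Sqoop's default for text imports) and uses an external staging table over the import directory; column names beginning with digits are backquoted so Hive accepts them:
-- Sketch only; adjust delimiters/paths if your Sqoop output differs.
CREATE DATABASE IF NOT EXISTS hive_db;
USE hive_db;

-- Staging table over the Sqoop import directory
CREATE EXTERNAL TABLE bank_temp (
  Id INT, City STRING, County STRING, State STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT,
  `2013_Deposits` INT, `2014_Deposits` INT, `2015_Deposits` INT,
  `2016_Deposits` INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/labuser/sqoop_bank';

-- Partitioned target table (State is the partition column)
CREATE TABLE bank_part (
  Id INT, City STRING, County STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT,
  `2013_Deposits` INT, `2014_Deposits` INT, `2015_Deposits` INT,
  `2016_Deposits` INT)
PARTITIONED BY (State STRING);

-- Dynamic partitioning so each State value lands in its own partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE bank_part PARTITION (State)
SELECT Id, City, County, Zipcode,
       `2010_Deposits`, `2011_Deposits`, `2012_Deposits`,
       `2013_Deposits`, `2014_Deposits`, `2015_Deposits`, `2016_Deposits`,
       State
FROM bank_temp
WHERE City IN ('Bronx','NewYorkCity','Dallas','Houston','Columbus')
  AND State IN ('NY','OH','TX');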
Analysing Data using HQL
Use the following command before executing your hive queries to remove the WARNING messages from the HIVE output:
export HIVE_SKIP_SPARK_ASSEMBLY=true
Write a HQL query to fetch the records which satisfy the below criteria: 2014_Deposits is greater than 50000, 2015_Deposits is greater than 60000, 2016_Deposits is greater than 70000, and City is in NewYorkCity, Dallas, Houston.
Columns required:- City, County, State, 2014_Deposits, 2015_Deposits, 2016_Deposits.
Save the output in the file hive_output.txt.
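A sketch of such a query (column names backquoted as in the table definitions above) could look like the following; wrap it in the hive -S -e format shown in the note below to redirect the result into hive_output.txt:
SELECT City, County, State,
       `2014_Deposits`, `2015_Deposits`, `2016_Deposits`
FROM bank_part
WHERE `2014_Deposits` > 50000
  AND `2015_Deposits` > 60000
  AND `2016_Deposits` > 70000
  AND City IN ('NewYorkCity','Dallas','Houston');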
Note:- Given below is the sample format to copy output to a file from the terminal.
hive -S -e "use hive_db; select count(1) from bank_part;" > output.txt
VALIDATION :
Before closing the environment, ensure that all the output files are available in the local directory “~/Desktop/Project/wingst2-banking-challenge/”:
sqoop_output.txt
hive_output.txt
pig_output.txt
Click on the SUBMIT button & validation will take place at the backend.