BANK DATA ANALYSIS
You are provided with two datasets (structured & semi-structured) containing information about bank deposit details.
The high-level activities to be performed are:
1. Load all the structured data to MYSQL
2. Ingest structured data into the HDFS environment using SQOOP (Extract)
3. Ingest Semi-structured data into HDFS environment (Extract)
4. Perform ETL operation on JSON data using PIG scripts (Transform)
5. Perform ETL and Load sqoop output data into HIVE (Load)
6. Analyse data using HQL (Analyse)
INPUT FILES
The bank deposit input files are available in the “~/Desktop/Project/wingst2-banking-challenge/” folder.
Structured Data:- Chase_Bank.csv
Semi-structured Data:- Chase_Bank_1.json
Important Instructions to be followed:-
HDFS Directories to store output files.
a. SQOOP output should be stored in the hdfs:/user/labuser/sqoop_bank directory.
b. PIG output should be stored in the hdfs:/user/labuser/bank1 directory.
Output files of your assessment (***.txt) should be present in the local challenge folder (/home/labuser/Desktop/Project/wings-xx-challenge).
Follow the steps below to complete the assessment:-
Step 1:- Loading Data to MYSQL
Log in to MySQL:
Username: root
Password: labuserbdh
Create a new database and table using MySQL commands to load the structured data.
DB Name:- bank_db
Table:- bank
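For reference, the database can be created and selected in the MySQL client before running the table script given below (a minimal sketch using the names above; adjust if your environment differs):
create database bank_db;
use bank_db;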
The create table and load scripts are given below for your reference.
create table bank (Id int , Institution_Name varchar(2000), Branch_Name varchar(2000), Branch_Number int, City varchar(2000),
County varchar(2000), State varchar(2000), Zipcode int, 2010_Deposits int,
2011_Deposits int, 2012_Deposits int, 2013_Deposits int, 2014_Deposits int,
2015_Deposits int, 2016_Deposits int);
Load Chase_Bank.csv data into the table bank:
load data local infile '/home/labuser/Desktop/Project/wingst2-banking-challenge/Chase_Bank.csv' into table bank fields terminated by ',' lines terminated by '\n' ignore 1 rows (Id,
Institution_Name, Branch_Name, Branch_Number, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits,
2016_Deposits);
Step 2:- Import Structured data to HDFS using SQOOP
Load data from the MySQL table bank to HDFS (/user/labuser/sqoop_bank) using Sqoop based on the following conditions.
Columns to be imported: Id, City, County, State, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits
Import only the records whose City is NOT IN the below-mentioned cities:
"Rochester", "Austin", "Chicago", "Indianapolis"
Number of mappers should be 1
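The exact Sqoop command is left to you; the sketch below is one possible form, assuming the MySQL server runs on localhost and that the connector accepts the column names as written (they begin with digits, so quoting or renaming may be needed in some setups):
sqoop import \
  --connect jdbc:mysql://localhost/bank_db \
  --username root --password labuserbdh \
  --table bank \
  --columns "Id,City,County,State,Zipcode,2010_Deposits,2011_Deposits,2012_Deposits,2013_Deposits,2014_Deposits,2015_Deposits,2016_Deposits" \
  --where "City NOT IN ('Rochester','Austin','Chicago','Indianapolis')" \
  --target-dir /user/labuser/sqoop_bank \
  -m 1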
Run the below command to copy the sqoop output from HDFS to the sqoop_output.txt file:
hdfs dfs -cat /user/labuser/sqoop_bank/* > sqoop_output.txt
Note:- Make sure that your output files are available in the challenge folder.
You will be loading and analysing the Sqoop output data in Hive using HQL in further steps.
Step 3:- Cleansing Semi-structured (Json) data Using PIG
You will be cleansing the JSON data (Chase_Bank_1.json) which is available in the challenge input folder.
Load this JSON data into PIG using Pig Latin scripts.
Note:- You can either load this data directly from the challenge input folder, or use the required commands to copy it to HDFS and then load it into Pig.
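If you choose the HDFS route, a copy command of the following shape could be used (the target directory /user/labuser/ is an assumption; any HDFS path you then reference from Pig will work):
hdfs dfs -put ~/Desktop/Project/wingst2-banking-challenge/Chase_Bank_1.json /user/labuser/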
Use Pig Latin scripts to find the minimum number of deposits in 2016 (i.e., MIN(Deposits_2016)) for each county. Assign “minimum_dep” as the column name.
Sort the output in descending order based on “minimum_dep” and read only the first 50 records.
(A script sketch covering these steps is given after the STORE step below.)
Sample output format:-
{"group":"Gillespie","minimum_dep":212776}{"group":"Imperial","minimum_dep":148284}
STORE the result in the HDFS directory (/user/labuser/pigoutput).
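A minimal Pig Latin sketch of the above is shown here. The JsonLoader schema is an assumption (only County and Deposits_2016 are named by the task), and the load path assumes the JSON was copied to /user/labuser/ as in the earlier step:
-- Sketch only: if the built-in JsonLoader does not accept a partial schema,
-- list all fields of Chase_Bank_1.json in file order.
raw = LOAD '/user/labuser/Chase_Bank_1.json'
      USING JsonLoader('County:chararray, Deposits_2016:int');
grp = GROUP raw BY County;
-- MIN over each county's bag, named minimum_dep as required
min_dep = FOREACH grp GENERATE group, MIN(raw.Deposits_2016) AS minimum_dep;
sorted = ORDER min_dep BY minimum_dep DESC;
top50 = LIMIT sorted 50;
STORE top50 INTO '/user/labuser/pigoutput' USING JsonStorage();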
Use the required commands to copy the PIG output to the challenge folder in the file pig_output.txt.
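Following the same pattern as the Sqoop copy command above (run from the challenge folder), something like this should work, using the STORE directory from the previous step:
hdfs dfs -cat /user/labuser/pigoutput/* > pig_output.txt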
Step 4:- Loading and Analysing Data in Hive
Now, you have to load sqoop output to HIVE tables.
Create the below database & tables in Hive.
Database: hive_db
Partition Table: bank_part
Columns: Id, City, County, Zipcode, 2010_Deposits, 2011_Deposits, 2012_Deposits, 2013_Deposits, 2014_Deposits, 2015_Deposits, 2016_Deposits
Partition should be based on the column State.
Read the records which satisfy the below conditions & load them into the bank_part table:
City in Bronx, NewYorkCity, Dallas, Houston, Columbus
State in "NY","OH","TX"
Hint:- Create a temporary table to load Sqoop output and then load data to partitioned table with necessary filters.
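One possible shape for this step is sketched below. It assumes the Sqoop output is comma-delimited text (Sqoop's default for text imports) and uses an external staging table over the import directory; column names beginning with digits are backquoted so Hive accepts them:
-- Sketch only; adjust delimiters/paths if your Sqoop output differs.
CREATE DATABASE IF NOT EXISTS hive_db;
USE hive_db;

-- Staging table over the Sqoop import directory
CREATE EXTERNAL TABLE bank_temp (
  Id INT, City STRING, County STRING, State STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT,
  `2013_Deposits` INT, `2014_Deposits` INT, `2015_Deposits` INT,
  `2016_Deposits` INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/labuser/sqoop_bank';

-- Partitioned target table (State is the partition column)
CREATE TABLE bank_part (
  Id INT, City STRING, County STRING, Zipcode INT,
  `2010_Deposits` INT, `2011_Deposits` INT, `2012_Deposits` INT,
  `2013_Deposits` INT, `2014_Deposits` INT, `2015_Deposits` INT,
  `2016_Deposits` INT)
PARTITIONED BY (State STRING);

-- Dynamic partitioning so each State value lands in its own partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE bank_part PARTITION (State)
SELECT Id, City, County, Zipcode,
       `2010_Deposits`, `2011_Deposits`, `2012_Deposits`,
       `2013_Deposits`, `2014_Deposits`, `2015_Deposits`, `2016_Deposits`,
       State
FROM bank_temp
WHERE City IN ('Bronx','NewYorkCity','Dallas','Houston','Columbus')
  AND State IN ('NY','OH','TX');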
Analysing Data using HQL
Use the following command before executing your hive queries to remove the WARNING messages from the HIVE output:
export HIVE_SKIP_SPARK_ASSEMBLY=true
Write a HQL query to fetch the records which satisfy the below criteria: 2014_Deposits is greater than 50000, 2015_Deposits is greater than 60000, 2016_Deposits is greater than 70000, and City is in NewYorkCity, Dallas, Houston.
Columns required:- City, County, State, 2014_Deposits, 2015_Deposits, 2016_Deposits.
Save the output in the file hive_output.txt.
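A sketch of such a query (column names backquoted as in the table definitions above) could look like the following; wrap it in the hive -S -e format shown in the note below to redirect the result into hive_output.txt:
SELECT City, County, State,
       `2014_Deposits`, `2015_Deposits`, `2016_Deposits`
FROM bank_part
WHERE `2014_Deposits` > 50000
  AND `2015_Deposits` > 60000
  AND `2016_Deposits` > 70000
  AND City IN ('NewYorkCity','Dallas','Houston');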
Note:- Given below is the sample format to copy output to a file from the terminal.
hive -S -e "use hive_db; select count(1) from bank_part;" > output.txt
VALIDATION :
Before closing the environment, ensure that all the output files are available in the local directory “~/Desktop/Project/wingst2-banking-challenge/”:
sqoop_output.txt
hive_output.txt
pig_output.txt
Click on the SUBMIT button & validation will take place at the backend.