Resume
Professional Summary
8+ years of experience in IT, including 4+ years as a Data Engineer and Data Analyst designing, developing, and implementing data models for enterprise-level applications. Strong in systems analysis, ER and dimensional modeling, database design, and implementing RDBMS-specific features.
Hands-on experience migrating on-premises ETL workloads to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
Experience working with NoSQL databases (HBase, Cassandra, and MongoDB), including database performance tuning and data modeling.
Expertise in writing Hadoop Jobs to analyze data using MapReduce, Apache Crunch, Hive, Pig, and Splunk.
Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
Experience moving data between GCP and Azure using Azure Data Factory.
Strong knowledge of Software Development Life Cycle (SDLC) and expertise in detailed design documentation.
Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
Knowledge and working experience on big data tools like Hadoop, Azure Data Lake, AWS Redshift.
Experience handling Python and Spark contexts when writing PySpark programs for ETL.
Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
Implemented database cloning using Python and built backend support for applications using shell scripts.
Experience with data analytics, web scraping, and data extraction in Python.
Experience developing data models for both OLTP and OLAP systems.
Hands-on experience building data pipelines in Python/PySpark/HiveQL/Presto.
Exposure to both the Kimball and Inmon data warehousing approaches.
Good knowledge of designing and developing POCs using Scala, Spark SQL, and MLlib.
Experience developing MapReduce jobs for data cleaning and data manipulation as required by the business.
Experienced in designing conceptual, logical, and physical data models.
Worked with users on access management, ecosystem alerts, technical issues, questions, and reports.
Monitored the health of the Hadoop platform, improving, upgrading and securing the platform.
Good in BI, ETL, Data Integration, Data Profiling, Data Cleansing, and Data Integrity.
Experienced in creating cloud-based Hadoop Cluster sandbox installation along with Ambari server.
Working knowledge of UNIX/Linux systems, including shell and Bash scripting.
Worked with IDEs such as Eclipse and Jupyter Notebook.
Experience in UNIX shell scripting, Perl scripting, and automation of ETL processes.
Expertise in designing complex mappings, with strong skills in performance tuning and in slowly changing dimension tables and fact tables.
Work Experience:
Sr Data Engineer
IHG – Atlanta, GA November 2018 – Present
Responsibilities:
Designed and Developed data integration/engineering workflows on big data technologies and platforms
(Hadoop, Spark, MapReduce, Hive, HBase).
Imported data into HDFS and Hive using Sqoop; created Hive tables, loaded data, and wrote Hive queries.
Tuned Hive and Spark jobs through partitioning/bucketing of Parquet data and executor/driver memory settings.
Developed Hive queries and migrated data from RDBMS to Hadoop staging area.
Handled importing of data from various data sources, performed transformations using Spark, and loaded data
into S3.
Handled large datasets using partitions, Spark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations during the ingestion process itself.
Processed S3 data, created external tables using Hive, and developed reusable scripts to ingest and repair tables across the project (see the sketch at the end of this role).
Built Power BI reports on Azure Analysis Services for better performance.
Used the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.
Developed dataflows and processes for data processing using SQL (Spark SQL and DataFrames).
Developed Spark programs using Python APIs to compare Spark performance with Hive and SQL, and generated reports on a daily and monthly basis.
Designed and developed MapReduce programs to analyze and evaluate multiple solutions on flight historical data, considering multiple cost factors across the business as well as operational impact.
Involved in planning process of iterations under the Agile Scrum methodology.
Worked on Hive metastore backups and on partitioning and bucketing techniques in Hive to improve performance; tuned Spark and Scala jobs.
Worked closely with the data science team to clarify requirements and created Hive tables on HDFS.
Developed Spark scripts using Python as per requirements.
Scheduled Spark/Scala jobs using Oozie workflows on the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Python, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, PySpark, MapR, Java, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Cloudera Manager, Zookeeper, Oracle, Kerberos and RedHat 6.5
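The following is a minimal, illustrative PySpark sketch of the ingestion pattern described in this role (raw data cleansed, written as partitioned Parquet to S3, and exposed as a repairable external Hive table). The bucket paths, database, table, and column names are hypothetical placeholders, not the actual project assets.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical application name; Hive support is needed for the spark.sql DDL below.
spark = (SparkSession.builder
         .appName("s3-ingest-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read raw CSV landed in S3 (hypothetical bucket/prefix).
raw = spark.read.option("header", "true").csv("s3a://example-landing/bookings/")

# Light cleansing and partition-column derivation during ingestion.
cleaned = (raw
           .dropDuplicates(["booking_id"])
           .withColumn("booking_date", F.to_date("booking_ts"))
           .withColumn("year", F.year("booking_date"))
           .withColumn("month", F.month("booking_date")))

# Write partitioned Parquet to the curated zone in S3.
(cleaned.write
 .mode("overwrite")
 .partitionBy("year", "month")
 .parquet("s3a://example-curated/bookings/"))

# Register an external Hive table over the curated data and repair partitions
# so it is immediately queryable (assumes a 'curated' database exists).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS curated.bookings (
        booking_id STRING,
        booking_ts STRING,
        booking_date DATE
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET
    LOCATION 's3a://example-curated/bookings/'
""")
spark.sql("MSCK REPAIR TABLE curated.bookings")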
Data Engineer
JP Morgan Chase and Company – Houston, TX November 2017 to October 2018
Responsibilities:
Extensive working knowledge of structured query language (SQL), Python, Spark, Hadoop, HDFS, AWS, RDBMS, data warehouses, and document-oriented NoSQL databases.
Automated the process of downloading raw data into the data lake from various source systems such as SFTP/FTP/S3 using shell scripting, enabling business users to consume the data as job-as-a-service and query-as-a-service.
Developed Hive scripts to parse raw data on EMR, stored the results in S3, and ingested them into the data warehouse (Snowflake) used by enterprise customers.
Designed ETL jobs to process the raw data using Spark and Python in Glue, EMR, and Databricks.
Implemented Spark jobs in Python in AWS Glue that process and transform semi-processed data into processed data consumed by data scientists.
Implemented connectors in Python to pull raw data from sources such as Google DCM, DBM, AdWords, Facebook, Twitter, Yahoo, and Tubular; the data was parsed with the Spark framework and ingested into Hive tables.
Downloaded BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities.
Responsible for creating data pipeline flows, scheduling jobs programmatically as DAGs in the Airflow workflow engine (see the sketch at the end of this role), and supporting the scheduled jobs.
Implemented MapReduce-style programs using PySpark to parse the raw data per business user requirements and stored the results in the data lake (AWS S3).
Implemented several data pipeline jobs to pull raw data from different sources into an AWS S3 bucket, processed it using PySpark on an EMR cluster, and stored the processed data back in AWS S3.
Created Spark jobs per business requirements; the jobs run on EMR and are triggered by Lambda.
Implemented AWS integration services such as SQS and SNS to notify engineers about job state.
Regularly interacted with management and product owners on project status, priority setting, and sprint timeframes.
Environment: Hadoop YARN, Spark-Core 2.0, Spark-Streaming, Spark-SQL, Scala 2.10.4, Python, Kafka 1.1.0, Hive
2.2.0, Sqoop, Amazon AWS, Oozie, Impala, Cassandra, Cloudera, MySQL, Informatica Power Center 9.6.1, Linux,
Zookeeper, AWS EMR, EC2.
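Below is a minimal sketch of the kind of Airflow DAG described in this role (programmatic scheduling of extract, PySpark transform, and warehouse load steps). The DAG id, schedule, script paths, and the Airflow 2.x import path are illustrative assumptions, not the production pipeline.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path (assumption)

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="raw_to_snowflake_daily",      # hypothetical pipeline name
    default_args=default_args,
    start_date=datetime(2018, 1, 1),      # placeholder start date
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Pull raw files from SFTP/FTP/S3 into the data lake (placeholder script).
    extract = BashOperator(
        task_id="extract_raw",
        bash_command="python /opt/jobs/extract_raw.py --date {{ ds }}",
    )

    # Transform with PySpark via spark-submit (placeholder job).
    transform = BashOperator(
        task_id="transform_pyspark",
        bash_command="spark-submit /opt/jobs/transform.py --date {{ ds }}",
    )

    # Load curated output into the warehouse (placeholder loader).
    load = BashOperator(
        task_id="load_warehouse",
        bash_command="python /opt/jobs/load_snowflake.py --date {{ ds }}",
    )

    extract >> transform >> load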
Data Engineer
Cadence Design Systems – San Jose, CA October 2016 to October 2017
Responsibilities:
Responsible for validating and cleansing the data.
Designed and documented operational problems, following standards and procedures, using the issue-tracking tool JIRA.
Created/modified shell scripts for scheduling various data cleansing scripts and the ETL loading process.
Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
Implemented a batch process from RDS to downstream systems using AWS Glue, S3 and Python.
Imported/exported data from Oracle databases using Sqoop and loaded it into Hive managed tables, which were then used for Tableau visualization.
Extended Hive core functionality using custom User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs).
Created Hive tables and implemented Partitioning, Dynamic Partitions, Buckets on the tables.
Managed Hive tables in a big data environment while facilitating data transfer between HDFS and RDBMS in both directions.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
Used PySpark programming to transform the logs and ingest the data into Hive tables and RDBMS (see the sketch at the end of this role).
Exported the result set from Hive to MySQL and PostgreSQL using Shell scripts.
Experience with version control tools such as Git.
Used Spark DataFrames to ingest data from different databases.
Worked with ETL tools such as SSIS for SQL Server and reporting tools such as SSRS, Power BI, and Tableau.
Involved in full development life cycle including requirements analysis, high-level design, coding, testing, and
deployment.
Developed and executed detailed ETL-related functional, performance, integration, and regression tests, along with documentation.
Environment: Hadoop, HDFS, Map Reduce, Hive, HBase, Zookeeper, Impala, Java (jdk1.6), Cloudera, Oracle, SQL
Server, UNIX Shell Scripting, Flume, Oozie, Scala, Spark, ETL, Sqoop, Python, Kafka, PySpark, AWS, S3, MongoDB.
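A minimal illustrative PySpark sketch of the log-transformation work described in this role: parse raw logs, persist them to a Hive table, and push a summary to an RDBMS over JDBC as one possible variant of the export step. The log format, table names, JDBC URL, and credentials are hypothetical, and the JDBC write assumes the MySQL driver is on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("log-transform-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Raw logs, one line per record, e.g. "2017-03-01 10:15:22 INFO service-a ..."
logs = spark.read.text("hdfs:///data/raw/app_logs/")

# Derive structured columns with regular expressions (hypothetical layout).
parsed = logs.select(
    F.regexp_extract("value", r"^(\S+ \S+)", 1).alias("event_ts"),
    F.regexp_extract("value", r"^\S+ \S+ (\S+)", 1).alias("level"),
    F.regexp_extract("value", r"^\S+ \S+ \S+ (\S+)", 1).alias("service"),
)

# Persist the structured logs to a Hive table for downstream analysis
# (assumes an 'analytics' database exists).
parsed.write.mode("append").saveAsTable("analytics.app_logs_structured")

# Push a summary to an RDBMS for reporting (hypothetical MySQL target).
summary = parsed.groupBy("service", "level").count()
(summary.write
 .format("jdbc")
 .option("url", "jdbc:mysql://reporting-db:3306/metrics")
 .option("dbtable", "log_level_counts")
 .option("user", "etl_user")
 .option("password", "REDACTED")
 .mode("append")
 .save())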
Hadoop Developer
AT&T – Dallas, TX October 2015 to September 2016
Responsibilities:
Developed custom data ingestion adapters to extract log data and clickstream data from external systems and load them into HDFS using Spark/Scala.
Developed Spark programs using Scala APIs to compare the performance of Spark with HQL.
Created Hive tables, loaded data, and wrote Hive queries for building analytical datasets.
Worked on real time data ingestion and processing using Spark Streaming, and HBase.
Used Spark and Spark-SQL to read the data and create the tables in hive using the Scala API.
Implemented Spark using Scala, DataFrames, and the Spark SQL API for faster testing and processing of data.
Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
Developed Kafka producer and Spark Streaming consumer to read the stream of events as per business rules.
Designed and developed Job flows using TWS.
Developed Sqoop commands to pull the data from Teradata.
Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
Used Avro and Parquet file formats and Snappy compression throughout the project.
Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
Configured Spark Streaming to receive ongoing information from the source and store the stream data in HDFS.
Used various Spark transformations and actions for cleansing the input data.
Developed shell scripts to generate Hive CREATE statements from the data and load the data into the tables.
Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
Wrote custom MapReduce programs using the Java API for data processing.
Created Hive tables as internal or external tables per requirements, defined with appropriate static/dynamic partitions and bucketing for efficiency.
Loaded and transformed large sets of structured and semi-structured data using Hive.
Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the sketch at the end of this role).
Used Spark and Spark SQL to read the Parquet data and create the tables in Hive using the Scala API.
Developed Hive queries for the analysts.
Performed analytics and visualization on the log data to estimate the error rate and study the probability of future errors using regression models.
Environment: Pig, Sqoop, Kafka, Apache Cassandra, Elasticsearch, Oozie, Impala, Cloudera, AWS, AWS EMR, Redshift, Flume, Apache Hadoop, HDFS, Hive, MapReduce, Zookeeper, MySQL, Eclipse, DynamoDB, PL/SQL and Python.
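A hedged sketch of the Kafka-to-HDFS streaming flow described in this role, written with the PySpark Streaming (DStream) API rather than the Scala code actually used. The broker, topic, batch interval, and output path are hypothetical, and the pre-Spark-3 KafkaUtils direct-stream API plus the spark-streaming-kafka package are assumed to be available.

import json

from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # requires the spark-streaming-kafka package (assumption)

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=30)  # 30-second micro-batches

# Direct stream from a hypothetical Kafka topic/broker.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["clickstream"],
    kafkaParams={"metadata.broker.list": "broker1:9092"},
)

def save_batch(time, rdd):
    # Convert one micro-batch of JSON events to a DataFrame and append it as Parquet.
    if rdd.isEmpty():
        return
    rows = rdd.map(lambda kv: Row(**json.loads(kv[1])))  # kv = (key, JSON value)
    spark.createDataFrame(rows).write.mode("append").parquet(
        "hdfs:///data/clickstream/parquet/")

stream.foreachRDD(save_batch)

ssc.start()
ssc.awaitTermination()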
Hadoop Developer
Ramco Systems - Chennai, Tamil Nadu May 2013 to April 2014
Responsibilities:
Developed Big Data Solutions that enabled the business and technology teams to make data-driven decisions
on the best ways to acquire customers and provide them business solutions.
Installed and configured Apache Hadoop, Hive, and HBase.
Worked on a Hortonworks cluster, which provides an open-source platform based on Apache Hadoop for analyzing, storing, and managing big data.
Developed custom data ingestion adapters to extract log data and clickstream data from external systems and load them into HDFS.
Used Spark as an ETL tool to perform complex transformations, de-normalization, enrichment, and some pre-aggregations.
Created Hive tables, loaded data, and wrote Hive queries for building analytical datasets.
Developed a working prototype for real time data ingestion and processing using Kafka, Spark Streaming, and
HBase.
Developed Kafka producer and Spark Streaming consumer to read the stream of events as per business rules.
Designed and developed Job flows using Oozie.
Developed simple and complex MapReduce programs in Java for Data Analysis on different data formats.
Developed multiple MapReduce jobs in Java for data cleaning and processing (see the sketch at the end of this role).
Used Hive to create partitions on Hive tables and analyzed the data to compute various metrics for reporting.
Developed Linux shell scripts for creating reports from Hive data.
Used Pig as an ETL tool to perform transformations, joins, and pre-aggregations before loading data into HDFS.
Worked on large sets of structured, semi-structured, and unstructured data.
Responsible for managing data coming from different sources.
Installed and configured Hive and developed Hive UDFs to extend its core functionality.
Responsible for loading data from UNIX file systems to HDFS.
Environment: Hadoop 3.0, Kafka 2.0.0, Pig 0.17, Hive 2.3, MVC, Scala 2.12, JDBC, Oracle 12c, POC, Sqoop 2.0,
Zookeeper 3.4, Python, Spark 2.3, HDFS, EC2, MySQL, Agile.
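As a hedged illustration of the data-cleaning MapReduce work in this role (which itself was implemented in Java), the sketch below shows an equivalent Hadoop Streaming pair in Python: the mapper drops malformed tab-delimited records and normalizes fields, and the reducer de-duplicates on the sorted key. The field layout, file names, and paths are hypothetical.

# --- mapper.py (hypothetical) ---
import sys

EXPECTED_FIELDS = 5  # hypothetical record width

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        continue                                  # drop malformed records
    record_id = fields[0].strip()
    cleaned = [f.strip().lower() for f in fields[1:]]
    print(record_id + "\t" + "\t".join(cleaned))

# --- reducer.py (hypothetical) ---
import sys

last_key = None
for line in sys.stdin:
    key, _, _value = line.rstrip("\n").partition("\t")
    if key != last_key:                           # keys arrive sorted; keep the first per key
        print(line.rstrip("\n"))
        last_key = key

# Submitted with the standard Hadoop Streaming jar, for example:
# hadoop jar hadoop-streaming.jar -input /data/raw -output /data/clean \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py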
Java Developer
Yash Technologies – Hyderabad, India July 2011 to April 2013
Responsibilities:
Involved in developing, testing and implementation of the system using Struts, JSF, and Hibernate.
Developed, modified, fixed, reviewed, tested, and migrated Java, JSP, XML, Servlet, SQL, and JSF components.
Updated user-interactive web pages from JSP and CSS to HTML5, CSS, and JavaScript for the best user experience.
Developed Servlets and Session and Entity Beans handling business logic and data.
Created enterprise deployment strategy and designed the enterprise deployment process to deploy Web
Services, J2EE programs on more than 7 different SOA/WebLogic instances across development, test and
production environments.
Designed user interfaces using HTML, Swing, CSS, XML, JavaScript, and JSP.
Implemented the presentation using a combination of Java Server Pages (JSP) to render the HTML and well-
defined API interface to allow access to the application services layer.
Used Enterprise JavaBeans (EJBs) extensively in the application; developed and deployed Session Beans to perform user authentication.
Involved in requirement analysis, design, code testing and debugging, and implementation activities.
Involved in the performance tuning of the database and Informatica; improved performance by identifying and rectifying performance bottlenecks.
Understood how to apply technologies to solve big data problems and to develop innovative big data solutions.
Designed and developed Job flows using Oozie.
Developed Sqoop commands to pull the data from Teradata.
Collected data from distributed sources into Avro models, applied transformations and standardizations, and loaded the data into HBase for further processing.
Wrote PL/SQL Packages and Stored procedures to implement business rules and validations.
Environment: JDK 1.3, J2EE, JDBC, Servlets, JSP, XML, XSL, CSS, HTML, DHTML, JavaScript, UML, Eclipse 3.0,
Tomcat 4.1, MySQL.
Technical Skills:
Big Data Ecosystems: HDFS, MapReduce, Hive, YARN, HBase, Pig, Sqoop, Kafka, Storm, Flume, Oozie, ZooKeeper, Apache Spark, Apache Tez, Impala, NiFi.
NoSQL Databases: HBase, Cassandra, MongoDB
Programming Languages: Java, Scala, Python, SQL, PL/SQL, Hive-QL, Pig Latin
Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL, JMS, JSP, Servlets, EJB
Frameworks: MVC, Struts, Spring, Hibernate
Operating Systems: RedHat Linux, Ubuntu Linux and Windows XP/Vista/7/8/10
Web Technologies: HTML, XML, AJAX, WSDL, SOAP
Web/Application servers: Apache Tomcat, WebLogic.
Version control: SVN, CVS, GIT
Databases: Oracle 9i/10g/11g, DB2, SQL Server, MySQL, Teradata
Cloud Technologies: Amazon Web Services (Amazon RedShift, S3), Microsoft Azure Insight