SANDEEP J
Location: Auburn Hills, MI – 48326
Email: jsandeep.data@gmail.com Mobile: (502) 442-2296 LinkedIn: www.linkedin.com/in/sandeep-j-03mar
Professional Summary:
Over 10 years of comprehensive experience in software development, with expertise in all phases of the
software development lifecycle, including analysis, design, development, debugging, and deployment.
7+ years of hands-on experience in Big Data technologies, specializing in the Hadoop ecosystem (HDFS,
MapReduce, Hive, Pig, Spark, Kafka, Flume) for scalable data solutions and performance optimization.
Proficient in cloud-based technologies with over 5 years of experience on Azure Cloud and Google Cloud
Platform (GCP), and hands-on expertise in executing POCs for migrating data from on-premises servers to AWS.
Expertise in functional programming (immutability, closures) using Scala, Spark, and Python on platforms like
Databricks for distributed data processing.
Skilled in designing end-to-end data architectures using Azure components, including Azure Blob Storage, Virtual
Machines, Azure Data Factory, Azure Synapse Analytics, HDInsight, Event Grid, Service Bus, Azure Data Migration
Service (DMS), Event Hubs, and Azure Functions for efficient data processing and orchestration.
Experience with real-time data processing using tools like Azure Functions, Event Hubs, and Apache Kafka, along
with batch processing using Apache Spark and Azure Databricks.
Proficient in managing ETL workflows using Azure Data Factory, Azure Data Lake, and Terraform for
orchestration, automation, and environment isolation.
Strong experience with data warehouse solutions using Snowflake and integrating with ETL and BI tools such as Informatica and Power BI for seamless data processing.
Demonstrated ability to troubleshoot and optimize data pipelines and processing workflows, ensuring data
quality, integrity, and scalability across platforms.
Delivered rapid prototyping and proof-of-concept (POC) development, and provided accurate level-of-effort and budget forecasts for numerous high-visibility, complex projects.
Proficient in creating and managing user stories, epics, and tasks in Jira and Azure Scrum Board, facilitating
project tracking and team collaboration.
Strong interpersonal and communication skills, with a proven ability to solve complex problems, quickly adapt to
new technologies, and contribute as an effective team member.
Professional Experience:
Client: PODS Enterprises – Clearwater, FL December 2023 – Present
Role: Cloud Engineer
Responsibilities:
Worked extensively with Microsoft Azure Cloud services such as Application Gateways, Load Balancing, Virtual
Machines, Virtual Networks, Subnets, Express Route, Azure Active Directory (AD), Azure Resource Manager
(ARM), Blob Storage, and SQL Database.
Well-versed with Azure Networking solutions including Virtual Networks (VNET), Gateways for point-to-site and
site-to-site VPN connectivity, Load Balancers, and Application Gateways.
Responsible for configuring Alert notifications to monitor Heartbeat, CPU, and Memory Alerts in Azure Monitor
and integrating them with PagerDuty to notify the team.
Worked extensively on creating and managing Azure Key Vault Keys, Secrets, and Certificates along with
assigning Access Policies for Service principals to access these Key Vaults.
Volunteered to enable Microsoft Defender for Containers and implemented advanced threat protection features for Virtual Machines, SQL Databases, Containers, and Web applications.
Participated in migrating on-premises workloads to Microsoft Azure by building an Azure disaster recovery environment, Azure Recovery Services Vault, and Azure Backups from the ground up using PowerShell scripts.
Assisted in setting up end-to-end pipelines in Azure DevOps using YAML scripts and configuring service
connections across multiple Projects in Azure DevOps Organization.
Extensive experience using Azure DevOps pipelines to deploy all microservice builds to the Docker registry, followed by Kubernetes deployment, pod creation, and Kubernetes management.
Extensive experience in developing Azure DevOps pipelines to build lightweight Alpine base images using Dockerfiles, tag the custom images, and push them to Docker Hub.
Focused on creating resources in Microsoft Azure such as VNETs, Virtual Machines, Application Gateways,
Event Hubs, Storage Accounts, Azure Kubernetes Cluster, Key Vaults, and PostgreSQL across all environments
using reusable Terraform Modules.
Resolved Terraform state lock issues on production state files stored in Azure, and imported existing Azure resources into Terraform state files to resolve state drift issues.
Played a significant role in using Terraform along with Jenkins and Packer tools to create custom machine
Images in Azure Compute Galleries and used Ansible to install the software dependencies once the
infrastructure was provisioned.
Used Kafka to obtain near real-time data.
Created Databricks Notebooks to streamline data and curate data for various business use cases.
Created many PySpark and Spark SQL scripts in Synapse notebooks for data transformations per business requirements.
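Illustrative sketch of this kind of notebook transformation; the paths, tables, and columns below are assumptions, not client specifics:
```python
# Minimal sketch of a Synapse-notebook style PySpark / Spark SQL transformation
# (storage path, table, and column names are hypothetical).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_curation").getOrCreate()

raw = spark.read.parquet("abfss://raw@storageacct.dfs.core.windows.net/orders/")  # assumed source path

curated = (
    raw.filter(F.col("order_status") == "COMPLETED")
       .withColumn("order_date", F.to_date("order_ts"))
       .groupBy("customer_id", "order_date")
       .agg(F.sum("order_amount").alias("daily_spend"))
)

# Spark SQL step on top of the curated DataFrame
curated.createOrReplaceTempView("daily_spend")
spark.sql(
    "SELECT customer_id, AVG(daily_spend) AS avg_daily_spend "
    "FROM daily_spend GROUP BY customer_id"
).write.mode("overwrite").saveAsTable("curated.customer_daily_spend")
```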
Developed complex stored procedures and views and incorporated them into SSIS packages; implemented slowly changing dimensions while transforming data in SSIS.
Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
Worked on developing a PySpark script to encrypt raw data using hashing algorithms on client-specified columns.
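A minimal sketch of that column-hashing pattern (strictly one-way masking via SHA-256); the column list and paths are placeholders:
```python
# Sketch: one-way hashing of client-specified columns with SHA-256
# (source/target paths and column names are illustrative).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii_hashing").getOrCreate()
df = spark.read.parquet("/mnt/raw/customers")          # assumed source path
pii_columns = ["ssn", "email", "phone_number"]         # hypothetical client-specified columns

for col_name in pii_columns:
    # sha2 with a 256-bit digest replaces the original value with an irreversible hash
    df = df.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

df.write.mode("overwrite").parquet("/mnt/curated/customers_masked")
```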
Created various pipelines to load the data from Azure data lake into Staging SQLDB followed by Azure SQL DB.
Created external tables with partitions using Hive, and Azure SQL.
Created pipelines to load data from Lake to Databricks and Databricks to Azure SQL DB.
Developed various automated scripts for DI (Data Ingestion) and DL (Data Loading) using PySpark.
Migrated large datasets to Databricks (Spark), created and administered clusters, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines.
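A sketch of the ADLS Gen2 access pattern used when loading data into Databricks; the storage account, container, secret scope, and table names are placeholders:
```python
# Sketch: reading ADLS Gen2 data in a Databricks notebook.
# `spark` and `dbutils` are the globals Databricks notebooks provide.
storage_account = "examplestorageacct"   # placeholder
container = "landing"                    # placeholder

# Account-key auth shown for brevity; a service-principal (OAuth) setup is the usual production choice.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="adls-scope", key="storage-account-key"),
)

path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/sales/"
df = spark.read.parquet(path)
df.write.mode("append").saveAsTable("bronze.sales")
```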
Transformed data by running a Python activity in Azure Databricks.
Implemented the data warehousing solution in Azure Synapse Analytics.
Created UNIX shell scripts and automation of the ETL processes using UNIX shell scripting.
Developed Sqoop scripts to extract the data from MySQL and load it into HDFS.
Performed data purging and applied changes using Databricks and Spark data analysis.
Involved in Hive-HBase integration by creating hive external tables and specifying storage as HBase format.
Created dashboards for analyzing POS data using Power BI.
Utilized Docker containers to create application images and dynamically provisioned Jenkins agents for CI/CD pipelines.
Wrote Python automation scripts for continuous flow of data into S3 buckets and scheduled cron jobs for
events.
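Illustrative sketch of such an automation script (bucket name and directories are placeholders, not the actual environment):
```python
# Sketch: automation script that pushes newly landed files to S3, intended to run under cron,
# e.g.  0 * * * * /usr/bin/python3 /opt/jobs/push_to_s3.py   (paths/bucket are placeholders)
import os
import boto3

BUCKET = "example-landing-bucket"   # hypothetical bucket
LOCAL_DIR = "/data/outbound"        # hypothetical staging directory

s3 = boto3.client("s3")

for name in os.listdir(LOCAL_DIR):
    local_path = os.path.join(LOCAL_DIR, name)
    if os.path.isfile(local_path):
        s3.upload_file(local_path, BUCKET, f"incoming/{name}")
        os.remove(local_path)  # remove after a successful upload so the next run only sends new files
```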
Involved in working on Docker Hub, creating Docker images, and handling multiple images primarily for
middleware installations and domain configurations.
Created and managed a Docker deployment pipeline for custom application images in the cloud using Jenkins.
Created Cloud Functions and provisioned Google Compute Engine instances, implemented firewall rules, and administered Google Virtual Private Cloud (VPC) at the development stage for the POC.
Developed Ansible playbooks at the development stage for the POC, using Python and SSH as wrappers for managing Google Compute Engine configurations, and tested the playbooks on GCP instances.
Worked on Cloud Data Fusion in the Dev environment for POC purposes to configure data loads from Google Cloud Storage to BigQuery.
Implemented POC level Cloud Workflows and Cloud Functions to automate and orchestrate tasks such as
publishing data to Google Cloud Storage, training machine learning models, and deploying them for predictions.
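A POC-level sketch of a Cloud Function that publishes data to Google Cloud Storage; the bucket name and object layout are assumptions:
```python
# Sketch: an HTTP-triggered Cloud Function (Python) that writes a JSON payload to GCS.
# Bucket name and object naming are placeholders.
import json
import datetime

import functions_framework
from google.cloud import storage

BUCKET = "example-poc-bucket"  # hypothetical bucket

@functions_framework.http
def publish_to_gcs(request):
    payload = request.get_json(silent=True) or {}
    blob_name = f"events/{datetime.datetime.utcnow():%Y/%m/%d/%H%M%S}.json"
    storage.Client().bucket(BUCKET).blob(blob_name).upload_from_string(
        json.dumps(payload), content_type="application/json"
    )
    return {"written": blob_name}, 200
```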
Client: FANUC America Corporation – Rochester Hills, MI March 2022 – November 2023
Role: Cloud Data Engineer
Responsibilities:
Responsible for building Deployment Manager templates for Google Identity and Access Management (IAM),
Google Cloud Storage (GCS), Cloud SQL, Cloud Monitoring, and integrating with Google Service Catalog.
Led the migration of critical datasets from Hadoop to Google Cloud Storage (GCS) in a large-scale environment,
utilizing Dataflow and Cloud Data Fusion to ensure regulatory compliance and data integrity.
Built end-to-end data pipelines using Google Cloud Dataproc, including data extraction, transformation, and
loading (ETL) processes.
Designed and set up an Enterprise Data Lake using GCP to support various use cases, including storing,
processing, analytics, and reporting of high-volume, dynamic data with services such as BigQuery, GCS, and
Cloud Pub/Sub.
Created VPCs, subnets (both private and public), and NAT gateways within a multi-region, multi-zone
infrastructure landscape to manage global operations.
Created data models for BigQuery and Hive from dimensional data models.
Worked on setting up data streaming with Kafka on Google Cloud Pub/Sub and monitored data processing to
GCS and BigQuery.
Built a data pipeline in Apache NiFi to process large datasets and configured lookups for data validation and
integrity.
Utilized Google Cloud Dataproc security features, including encryption, network security, and compliance with
industry standards.
Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming
in Google Cloud Dataproc.
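A minimal sketch of that mini-batch pattern with Spark Streaming (DStreams); the socket source and host are stand-ins for the actual ingestion endpoint:
```python
# Sketch: mini-batch RDD transformations with Spark Streaming on Dataproc
# (source host/port are placeholders).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="minibatch_poc")
ssc = StreamingContext(sc, batchDuration=30)        # 30-second mini-batches

lines = ssc.socketTextStream("stream-host", 9999)   # placeholder source

# RDD-style transformations applied to every mini-batch
counts = (lines.flatMap(lambda line: line.split(","))
               .map(lambda field: (field, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```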
Involved in creating system roles, custom roles, and role hierarchies in BigQuery.
Created data warehouses, databases, schemas, and tables in BigQuery, writing SQL queries to validate data
feeds from source systems.
Leveraged Apigee for real-time data integration between internal systems and external platforms.
Enabled data transformation and enrichment processes using Apigee policies.
Configured continuous data loading into BigQuery using the BigQuery Data Transfer Service and Google Cloud
Storage integration.
Worked extensively with Google Cloud Data Fusion, including data transformations, custom connectors, and
migrating data pipelines to higher environments using Deployment Manager.
Created stored procedures in BigQuery to load data into fact and dimension tables.
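For illustration, a sketch of invoking such a load procedure from Python; the project, dataset, and procedure names are hypothetical:
```python
# Sketch: calling a BigQuery stored procedure that loads a fact table
# (project, dataset, and procedure names are placeholders).
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

job = client.query(
    "CALL `example-project.warehouse.sp_load_fact_orders`(DATE '2022-01-31')"
)
job.result()  # block until the procedure finishes
```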
Designed and developed data ingestion scripts using PySpark and Spark SQL in Dataproc, orchestrating them
through Cloud Composer (Airflow) and Data Fusion.
Ingested data from on-premises databases to GCP and processed it in Dataproc and Cloud Data Fusion.
Developed ETL scripts using PySpark and Spark Scala in Dataproc, enabling ad-hoc data transformations and
exploratory data analysis. Leveraged Data Fusion for orchestrating data workflows and integrating with other
GCP services.
Imported data from file-based systems and relational databases into Google Cloud Storage in standard file
formats like Apache Parquet using Cloud Data Fusion and Dataproc.
Implemented machine learning algorithms using Python to predict quantities users might order for specific items
and automated suggestions using BigQuery ML and GCS.
Utilized Google Cloud Workflows to schedule and automate batch jobs, integrating applications, Data Fusion
pipelines, and other services like HTTP requests and email triggers.
Used Google Cloud Transfer Service to move data between Oracle/Teradata databases and GCS, and Cloud
Pub/Sub to stream log data from servers.
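A sketch of the Pub/Sub log-streaming piece; project, topic, and log path are placeholders:
```python
# Sketch: streaming server log lines to Cloud Pub/Sub
# (project ID, topic name, and log file path are placeholders).
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "server-logs")

with open("/var/log/app/access.log", "rb") as log_file:
    for line in log_file:
        # publish() returns a future; result() blocks until the message is accepted
        publisher.publish(topic_path, data=line.strip()).result()
```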
Designed dimensional modeling using Google BigQuery BI Engine for end-user analysis, creating hierarchies and
defining dimension relationships.
Created reports and dashboards in Google Data Studio as per business requirements, connecting with various
data sources.
Migrated data into Google Cloud’s data pipeline using Dataproc, Spark SQL, and Scala.
Utilized Dataproc and GCS encryption to secure data with server-side encryption.
Created Power BI reports and dashboards as per the business requirement using different data sources.
Migrated data into the RV data pipeline using Databricks, Spark SQL, and Scala.
Used Databricks for encrypting data using server-side encryption.
Developed a reusable framework to be leveraged for future migrations that automate ETL from Oracle systems
to the Data Lake utilizing Spark Data Sources and Hive data objects.
Used HiveQL for the analysis of the data and validating the data.
Created Hive Load Queries for loading the data from HDFS.
Configured and scheduled job tasks in Airflow, defining dependencies, and managing task execution and retries.
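An illustrative Airflow DAG showing the dependency and retry configuration pattern; the DAG id, schedule, and callables are assumptions:
```python
# Sketch: an Airflow DAG with task dependencies and retry handling
# (DAG id, schedule, and task callables are illustrative).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-eng",
    "retries": 3,                          # retry failed tasks up to 3 times
    "retry_delay": timedelta(minutes=10),  # wait 10 minutes between attempts
}

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    extract >> transform >> load  # dependency chain: extract -> transform -> load
```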
Worked on Hive to create External and Internal tables and did some analysis of the data.
Worked on the Oozie scheduler to run the jobs to import and export the data to/from the HDFS.
Client: Humana – Louisville, KY January 2019 – March 2022
Role: Lead Data Engineer
Responsibilities:
Migrated data from on-premises data lakes to Google Cloud Storage and BigQuery in a large-scale environment,
utilizing Cloud Data Fusion and Database Migration Service (DMS) to ensure regulatory compliance and data
integrity.
Created Cloud Functions and provisioned Google Compute Engine instances, implemented Firewall Rules, and
administered Google Virtual Private Cloud (VPC).
Involved in the development of Ansible playbooks with Python and SSH as wrappers for managing Google
Compute Engine configurations and testing playbooks on GCP instances.
Designed and developed a security framework to provide fine-grained access to objects in Google Cloud Storage
using Cloud Functions and Cloud Spanner.
Worked on Cloud Data Fusion to configure data loads from Google Cloud Storage to BigQuery.
Implemented Cloud Workflows and Cloud Functions to automate and orchestrate tasks such as publishing data
to Google Cloud Storage, training machine learning models, and deploying them for predictions.
Worked on App Engine for fast deployment of various applications developed in Java and Python on customer
service application servers.
Used Database Migration Service (DMS) to migrate tables from homogeneous and heterogeneous databases
from on-premises to the Google Cloud Platform.
Created data ingestion modules using Cloud Data Fusion for loading data into various layers in Google Cloud
Storage and for reporting using Looker Studio.
Handled BigQuery performance tuning and query optimization (EXPLAIN plans, partitioning, clustering,
indexing).
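A sketch of the partitioning/clustering pattern used for that tuning work, plus a dry-run cost check; dataset and table names are placeholders:
```python
# Sketch: creating a partitioned, clustered BigQuery table and estimating query cost with a dry run
# (project, dataset, table, and column names are hypothetical).
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

ddl = """
CREATE TABLE IF NOT EXISTS `example-project.warehouse.claims`
PARTITION BY DATE(claim_ts)
CLUSTER BY member_id, plan_code AS
SELECT * FROM `example-project.staging.claims_raw`
"""
client.query(ddl).result()

# Dry run reports estimated bytes scanned without executing the query
dry = client.query(
    "SELECT member_id, SUM(paid_amount) FROM `example-project.warehouse.claims` "
    "WHERE DATE(claim_ts) = '2021-06-01' GROUP BY member_id",
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)
print(f"Bytes processed estimate: {dry.total_bytes_processed}")
```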
Copied fact/dimension table data and aggregated output from Google Cloud Storage to BigQuery for historical
data analysis using Looker Studio.
Developed PySpark code for Dataproc jobs and Dataflow.
Designed and developed ETL processes in Cloud Data Fusion to migrate data from external sources such as
Google Cloud Storage, Parquet, and text files into BigQuery.
Worked with Dataproc for fast and efficient processing of big data.
Implemented Dataflow pipelines to read from and write to Google Cloud Storage, enabling efficient data
processing and analytics on large-scale data lakes in GCP.
Created Apache Spark jobs that process the source files, performing various transformations on the source data
using Spark DataFrame, Dataset, and Spark SQL API.
Implemented one-time data migration of multistate-level data from SQL Server to BigQuery using Python and
Cloud SQL.
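A minimal sketch of the one-time copy in Python, assuming an ODBC connection and the BigQuery client; connection string, table, and dataset names are placeholders:
```python
# Sketch: one-time SQL Server -> BigQuery copy using pandas and the BigQuery client
# (connection string, source table, and target dataset are placeholders).
import pandas as pd
import pyodbc
from google.cloud import bigquery

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=onprem-sql;DATABASE=claims;Trusted_Connection=yes;"
)
df = pd.read_sql("SELECT * FROM dbo.state_enrollment", conn)

client = bigquery.Client(project="example-project")
job = client.load_table_from_dataframe(df, "example-project.staging.state_enrollment")
job.result()
print(f"Loaded {job.output_rows} rows")
```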
Performed ETL data translation using Informatica and functional requirements to create source-to-target data
mappings to support large datasets in Cloud SQL and BigQuery.
Created resource monitoring and alert systems in BigQuery using Cloud Monitoring.
Developed big data jobs to load heavy volumes of data into Google Cloud Storage and then into BigQuery.
Created Dataflow pipelines for continuous data load from staged data residing on cloud gateway servers.
Implemented real-time data streaming solutions using Dataflow for processing high-volume, high-velocity data
streams, enabling timely insights and actions based on streaming events.
Optimized Scala code for performance and scalability in distributed computing environments using Dataproc
clusters.
Developed ETL solutions using Spark SQL in Dataproc for data extraction, transformation, and aggregation from
multiple file formats and sources to uncover insights into customer usage patterns.
Collaborated with data modeling teams to create Hive tables for performance optimization and converted Hive
queries into Spark transformations using RDDs.
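An illustrative example of re-expressing a Hive aggregation as RDD transformations; the query and table are hypothetical:
```python
# Sketch: a HiveQL aggregation converted to Spark RDD transformations.
# HiveQL original:  SELECT member_id, COUNT(*) FROM claims GROUP BY member_id
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("hive_to_rdd").getOrCreate()

claims_rdd = spark.table("claims").rdd            # Hive table rows as an RDD of Row objects

counts = (claims_rdd.map(lambda row: (row["member_id"], 1))
                    .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
```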
Integrated and automated data workloads to the Snowflake warehouse by analyzing existing SQL scripts and designing and implementing the solution in PySpark.
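A sketch of the Spark-to-Snowflake write pattern via the Spark-Snowflake connector; all connection options and the source DataFrame are assumptions:
```python
# Sketch: writing a curated DataFrame to Snowflake with the Spark-Snowflake connector.
# `curated_df` is an existing DataFrame; connection options are placeholders and credentials
# would normally come from a secret store, not literals.
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "<from-secret-store>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "CURATED",
    "sfWarehouse": "ETL_WH",
}

(curated_df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "CUSTOMER_DAILY_SPEND")
    .mode("overwrite")
    .save())
```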
Performed logical and physical data structure designs and DDL generation to facilitate the implementation of
database tables and columns out to the Snowflake environment.
Created Azure Data Lake data sources for ad-hoc querying and business dashboarding using Power BI.
Used Databricks notebooks for interactive analysis, leveraging Spark APIs.
Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark.
Used PyCharm IDE for Python/PySpark development and Git for version control and repository management.
Used Sqoop to export data to Netezza from Hive and to import data from Netezza to Hive.
Created shell scripts for extracting files from local systems to the edge node and Python scripts to perform data
transformations for loading data into Netezza.
Wrote Oozie workflows and job.properties files for managing Oozie jobs, and configured all MapReduce, Hive, and Sqoop jobs in Oozie workflows.
Created databases and tables in Netezza for storing data from Hive.
Extensively used Spark to create RDDs and Hive SQL for aggregating data.
Validated Hadoop jobs such as MapReduce and Oozie using the CLI, and handled jobs in Hue as well.
Worked on data ingestion from Oracle to Hive and transferred data from Hive to Netezza.
Wrote Hive SQL scripts for creating complex tables with partitioning, clustering, and skewing to enhance
performance.
Created views and indexed views to simplify database complexities for end users.
Client: Vanguard Groups - Malvern, PA January 2018 – December 2018
Role: Sr. Hadoop Developer
Responsibilities:
Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs.
Used Hive queries to analyze the partitioned and bucketed data and compute various metrics for reporting on
the dashboard.
Worked with data serialization formats such as Avro, Parquet, and CSV for converting complex objects into compact byte sequences.
Involved in creating and designing data ingest pipelines using technologies such as Kafka.
Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop
jobs such as Java MapReduce, Hive, and Sqoop as well as system-specific jobs.
Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
Developed Kafka producer and consumers, Spark, and Hadoop MapReduce jobs.
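A minimal sketch of the producer/consumer pattern using kafka-python; the broker address and topic are placeholders:
```python
# Sketch: a simple Kafka producer and consumer with kafka-python
# (broker address, topic, and payload are illustrative).
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("trade-events", {"symbol": "VTI", "qty": 100})
producer.flush()

consumer = KafkaConsumer(
    "trade-events",
    bootstrap_servers="broker:9092",
    group_id="spark-ingest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)   # hand off to downstream Spark / MapReduce processing
    break
```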
Imported data from different sources like HDFS/HBase into Spark RDDs.
Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Scala.
Used the version control system GIT to access the repositories and used it in coordinating with CI tools.
Integrated maven with GIT to manage and deploy project-related tags.
Experience with AWS S3 services creating buckets, configuring buckets with permissions, logging, versioning,
and tagging.
Implemented a Continuous Integration and Continuous Deployment framework using Jenkins and Maven in a Linux environment.
Client: CenturyLink Inc - Denver, CO January 2017 - December 2017
Role: Hadoop Developer
Responsibilities:
Worked with the systems engineering team to plan and deploy new Hadoop environments and expand
existing Hadoop clusters with agile methodology.
Monitored multiple Hadoop cluster environments and monitored workload, job performance, and capacity
planning using Cloudera Manager.
Loaded and transformed large sets of structured, semi-structured, and unstructured data.
Used Flume to collect, aggregate, and store web log data from sources such as web servers, mobile devices, and network devices, and pushed it to HDFS.
Worked extensively with Sqoop to import and export data from HDFS to Relational Database
systems/mainframe and vice-versa.
Developed Oozie workflow for scheduling Pig and Hive Scripts.
Configured the Hadoop Ecosystem components like YARN, Hive, Pig, HBase, and Impala.
Analyzed the web log data using HiveQL to extract the number of unique visitors per day and visit duration.
Involved in creating the workflow to run multiple Hive and Pig jobs, which run independently with time and data
availability.
Developed Pig Latin scripts to do operations of sorting, joining, and filtering source data.
Performed MapReduce programs on log data to transform into a structured way to find Customer names, age
groups, etc.
Proactively monitored systems and services, architecture design and implementation of Hadoop deployment,
configuration management, backup, and disaster recovery systems and procedures.
Executed test cases in automation tool, Performed System, Regression, and Integration Testing, reviewed
results, and logged defects.
Client: Metaminds Software Solutions Ltd - Hyderabad, Telangana July 2013 – December 2015
Role: Java/J2EE Developer (Intern, July 2013 – April 2014)
Role: Java/J2EE Developer (Full-time, May 2014 – December 2015)
Responsibilities:
Analyzed project requirements for this product and was involved in designing using UML infrastructure.
Interacted with the system analysts and business users for design and requirement clarification.
Extensive use of HTML5 with AngularJS, JSTL, JSP, jQuery, and Bootstrap for the presentation layer, along with JavaScript for client-side validation.
Handled Java multithreading in backend components.
Developed HTML reports for various modules as per the requirement.
Developed web services using SOAP, SOA, WSDL, and Spring MVC, and developed DTDs and XSD schemas for XML parsing, processing, and design to communicate with an Active Directory application using RESTful APIs.
Created multiple RESTful web services using the Jersey 2 framework.
Used Aqua Logic BPM (Business Process Management) for workflow management.
Developed the application using NoSQL with MongoDB for storing data on the server.
Developed the complete business tier with stateful session beans and CMP entity beans using EJB 2.0.
Developed integration services using SOA, Web Services, SOAP, and WSDL.
Designed, developed, and maintained the data layer using the ORM framework in Hibernate.
Used the Spring framework's JMS support for writing to JMS queues and HibernateDaoSupport for interfacing with the database, and integrated Spring with JSF.
Responsible for managing the Sprint production test data with the help of tools like Telegance, CRM, etc. for
tweaking the test data during the IAT / UAT Testing.
Involved in writing Unit test cases using JUnit and involved in integration testing.
Technical Skills:
Cloud Technologies: Azure, AWS, GCP
Streaming Tools: Kafka, NiFi, StreamSets
Big Data Technologies: Hive, Impala, Sqoop, Spark, Oozie, Zookeeper, MapReduce, Tez
Programming Languages: Java, Python, Shell Scripting, Scala
Data Warehouse Tools: Snowflake, Teradata, Redshift
NoSQL Databases: MongoDB, HBase, Netezza
Databases: Oracle 10g, MySQL, MSSQL
Data Governance Tools: Collibra
IDE/Tools: Eclipse, Visual Studio
Version Control: Git, SVN
Platforms: Windows, Unix, Linux
BI Tools: Tableau, MS Excel, Power BI
Hadoop Distributions: Cloudera (CDH5 to CDH7.1), Hortonworks
Education Details:
University | Course | Year Completed | State/City
University of the Cumberlands | Master's in Information Technology and Management | Dec 2021 | Kentucky
Frostburg State University | Master's in Computer Science | May 2017 | Maryland
Jawaharlal Nehru Technological University | Bachelor's in Computer Science | May 2014 | Hyderabad