Data Engg
...rules (e.g., schema enforcement, data quality checks via Great Expectations or Deequ)
...provenance tracking (e.g., using OpenLineage, Apache Atlas) to trace input sources and transformations
...dependency analysis and DAG evaluation (e.g., Airflow, dbt) to check for unused/critical transformations
...cataloging (e.g., using DataHub or Amundsen) with data risk annotation and sensitivity tagging
...posture (e.g., CI/CD for ETL jobs, signed data artifacts, pipeline execution attestation)
...lifecycle: signed data snapshots, access-controlled storage, encryption at rest/in-transit
...data output (e.g., data quality checks, anomaly detection, and output validation on BI dashboards)
...engineering / fault injection (e.g., simulate malicious input or broken sources to test resilience)
...services with secure images (e.g., scanning Airflow/Spark containers with Trivy or Clair)
...access layers (e.g., enforcing authZ/authN on Presto/Trino endpoints, query firewalls)
...platforms (e.g., Soda Cloud, Monte Carlo, Datafold test suite management for pipelines)
...tests for ETL jobs and transformations (e.g., using pytest, dbt tests)
...s and data pipelines (e.g., simulate concurrent access on warehouses or test latency in data...
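The schema enforcement and data quality checks named in the list above can be sketched in plain Python. This is a minimal illustration, not how Great Expectations or Deequ work internally; the `SCHEMA` mapping, the column names, and the non-negative `amount` rule are all assumptions made up for the example.

```python
from typing import Any

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA: dict[str, type] = {"order_id": int, "amount": float, "country": str}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of human-readable violations for one record."""
    errors = []
    # Schema enforcement: every expected column must exist with the right type.
    for column, expected_type in SCHEMA.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            errors.append(f"wrong type for {column}: {type(row[column]).__name__}")
    # Schema enforcement: columns outside the schema are rejected too.
    for column in row:
        if column not in SCHEMA:
            errors.append(f"unexpected column: {column}")
    # Data quality rule (assumed for this sketch): amounts must be non-negative.
    amount = row.get("amount")
    if isinstance(amount, float) and amount < 0:
        errors.append("amount must be non-negative")
    return errors


def split_valid_invalid(rows: list[dict[str, Any]]):
    """Separate clean rows from rejected rows, keeping the reasons for rejection."""
    valid, rejected = [], []
    for row in rows:
        errors = validate_row(row)
        if errors:
            rejected.append((row, errors))
        else:
            valid.append(row)
    return valid, rejected
```

In a pipeline, the rejected rows would typically be routed to a quarantine location with their error lists rather than silently dropped, so that broken sources are visible.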
CloudSec vs. Data Engineering: Technique Mapping Table
CloudSec Practice / Tool | Similar/Applied Technique in Data Engineering
CSPM (Cloud Security Posture Management) | Data Infra Posture Management (via Infra-as-Code)
CWPP (Cloud Workload Protection Platform) | Data Workload Protection (runtime/data services security)
CIEM (Cloud Infrastructure Entitlement Management) | Data Access Governance / RBAC Analysis
Threat Intelligence (Cloud-based) | Data Threat Modeling and Anomaly Detection
Policy-as-Code (OPA, Rego) | Access/Data Quality Policies as Code
Pipeline Scanning (IaC security scanning) | Scanning Data Pipelines for Risky Patterns (e.g., secrets, PII leaks)
Workload Protection (runtime behavior monitoring) | Real-time Monitoring of ETL Job Behavior / Data API Access
Policy Enforcement (e.g., deny risky deployments) | Data Platform Governance (deny unsafe schema or access changes)
Compliance Rules (e.g., HIPAA, GDPR via cloud config analysis) | Data Compliance (e.g., GDPR, CCPA via data tagging and lineage)
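One row of the mapping, scanning data pipelines for risky patterns such as secrets and PII leaks, can be illustrated with a small regex-based scanner. This is only a sketch: the three patterns in `PATTERNS` are assumptions chosen for the example, and dedicated scanners ship far larger and more careful rule sets.

```python
import re

# Illustrative rules only (assumed for this sketch).
PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted literal assigned to a key named "password".
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]+['\"]"),
    # A rough email matcher, used here as a stand-in for PII detection.
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every rule that matches a line."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings
```

A check like this would usually run in CI over pipeline configs and DAG definitions, failing the build when `scan_text` returns any findings.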
How It's Applied in Data Engineering
CloudSec Practice / Tool | How It's Applied in Data Engineering
CSPM | Tools like Steampipe, Open Policy Agent (OPA), or Checkov are used to evaluate cloud data environments (e.g., S3, RDS, BigQuery) against compliance/security policies.
CWPP | Tools like Falco and Trivy monitor containers running data workloads (e.g., Spark, Flink, Kafka) for abnormal behavior and image vulnerabilities.
CIEM | Tools and practices like IAM analysis, policy-as-code (OPA), or even Sentry/custom tools to monitor and enforce access to data warehouses (Snowflake, Redshift).
Threat Intelligence | Integration of threat feeds (e.g., IP/domain reputation) into data ingestion pipelines to filter malicious data sources or detect suspicious access patterns.
Policy-as-Code | Used to enforce row/column-level data access control, schema evolution policies, data retention, etc., often via Rego, YAML, or native tooling in dbt/Snowflake.
Pipeline Scanning | Check code repositories and Airflow/dbt configs for hardcoded credentials, PII, or insecure data sink configurations using tools like GitLeaks, Checkov, or TruffleHog.
Workload Protection | Use Prometheus/Grafana, Falco, or custom alerting to detect anomalies like excessive reads, failed jobs, or data exfiltration patterns.
Policy Enforcement | Governance rules block schema drift, disallow sensitive data export to untrusted targets, etc., often enforced via CI/CD pipelines or dbt's CI checks.
Compliance Rules | Automated tagging of sensitive data, lineage tracing, audit logging for data access. Tools: Collibra, DataHub, BigID, custom audits in dbt or cloud-native tools.
🔄 Explanation: How These Techniques Translate
CSPM in Data Engineering → It's about auditing cloud resources used by data pipelines: buckets, DBs, secrets managers. You check if they're public, encrypted, etc.
CWPP → Think of it as securing the runtime of your Spark jobs, data containers, and even Lambda functions that move/process data.
CIEM → Who can access which data? Track and prune over-permissioned roles in data platforms (IAM/ACLs/RBAC).
Threat Intel → Feed known malicious indicators into your ingestion filters or alerting systems in pipelines.
Policy-as-code & Compliance → Automate enforcement of your org's data policies, such as not pushing PII to logs or test environments.
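The CIEM idea above, tracking and pruning over-permissioned roles, comes down to comparing what each role is granted with what it actually uses. A minimal sketch follows, assuming the grants and the observed usage have already been exported into plain dictionaries; the role names and permission strings are hypothetical, and no real IAM or warehouse API is called.

```python
def find_unused_grants(
    granted: dict[str, set[str]],
    used: dict[str, set[str]],
) -> dict[str, list[str]]:
    """Map each role to the permissions it holds but never exercised."""
    report = {}
    for role, permissions in granted.items():
        # A role missing from the usage data is treated as having used nothing.
        unused = permissions - used.get(role, set())
        if unused:
            report[role] = sorted(unused)
    return report


# Hypothetical inputs: grants from the platform, usage from access logs.
granted = {
    "etl_job": {"read:raw", "write:curated", "delete:raw"},
    "analyst": {"read:curated"},
    "intern": {"read:raw"},
}
used = {
    "etl_job": {"read:raw", "write:curated"},
    "analyst": {"read:curated"},
}
unused_by_role = find_unused_grants(granted, used)
```

Each entry in `unused_by_role` is a candidate for revocation. The result still needs human review, since a permission may be unused only within the observation window.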
Traditional IT Systems vs. Cloud Services (AWS, Azure, GCP)
Traditional IT Component | AWS | Azure | GCP | Description / Use
Physical Server | EC2 | Virtual Machines (VMs) | Compute Engine | Scalable virtual machines for compute tasks.
Virtualization Platform | EC2 with AMI, Auto Scaling | Azure Virtual Machine Scale Sets | Managed Instance Groups | Auto-scaling and load-balanced VM management.
Server OS & Images | Amazon Machine Images (AMI) | Azure Images | Custom Images / OS Images | Pre-configured OS and software templates.
Load Balancer | Elastic Load Balancer (ELB) | Azure Load Balancer / Application Gateway | Cloud Load Balancing | Distributes incoming traffic across multiple resources.
Storage (SAN, NAS) | EBS, EFS, S3 | Azure Disks, Azure Files, Blob Storage | Persistent Disk, Filestore, Cloud Storage | Block, file, and object storage options.
Networking (Switches, VLANs) | VPC, Subnets, Route Tables | Virtual Network (VNet), Subnets | VPC, Subnets | Virtual networks for securely isolating and connecting resources.
DNS Server | Route 53 | Azure DNS | Cloud DNS | Domain name resolution and routing services.
Firewall / Security | Security Groups, NACLs, WAF | NSGs, Azure Firewall, Azure WAF | Firewall Rules, VPC Firewall | Protect cloud resources with fine-grained traffic rules.
Directory Services (AD) | AWS Directory Service | Azure Active Directory | Cloud Identity, Managed Microsoft AD | Identity and access management services.
Monitoring / Logging | CloudWatch, CloudTrail | Azure Monitor, Log Analytics | Stackdriver (now Cloud Operations) | Observability: metrics, logs, alerts.
Backup & Recovery | AWS Backup, S3 Glacier | Azure Backup, Azure Site Recovery | Backup and DR via Cloud Storage | Data backup and disaster recovery solutions.
Databases (On-Prem SQL, etc.) | RDS, Aurora, DynamoDB | Azure SQL, Cosmos DB | Cloud SQL, Bigtable, Firestore | Managed relational and NoSQL databases.
Data Center / Colocation | AWS Outposts, Local Zones | Azure Stack, Azure Edge Zones | Anthos, GCP Edge | Hybrid cloud solutions bridging on-prem and cloud.
Enterprise Service Bus (ESB) | Amazon EventBridge, SQS, SNS | Azure Service Bus, Event Grid | Pub/Sub, Eventarc | Messaging and event-driven architecture support.
Application Hosting | Elastic Beanstalk, ECS, EKS | App Services, Azure Kubernetes Service | App Engine, GKE | Managed platforms for hosting applications.
CI/CD Tools | CodePipeline, CodeBuild | Azure DevOps, GitHub Actions | Cloud Build, Cloud Deploy | Continuous Integration and Delivery pipelines.
Cloud Services Commonly Used by Data Engineers
Stage | Function | AWS
Ingest | Real-time Data Streaming | Kinesis Data Streams / Firehose
Ingest | Batch Data Ingestion | AWS Glue, AWS Data Pipeline, S3 Uploads
Ingest | API / Webhook Ingestion | API Gateway + Lambda
Store | Object Storage (Raw Data) | Amazon S3
Store | File Storage (Semi-Structured) | Amazon EFS, FSx
Store | Relational DB | Amazon RDS, Aurora
Store | NoSQL / Semi-Structured DB | DynamoDB
Store | Data Lake | S3 + Lake Formation
Process | ETL/ELT | AWS Glue / EMR
Process | Batch Processing | AWS EMR (Spark, Hive)
Process | Real-time Processing | Kinesis Analytics / Lambda
Orchestrate | Job Scheduling & Workflow | AWS Step Functions / Managed Workflows
Orchestrate | Serverless Task Automation | AWS Lambda
Analyze | BI / SQL Engine | Amazon Athena / Redshift
Analyze | ML Integration | SageMaker
Secure | IAM & RBAC | IAM + Roles + Policies
Secure | Encryption | KMS + S3 Bucket Policies
Secure | Audit / Monitoring | CloudTrail / CloudWatch
DevOps / Infra | Infra as Code | AWS CloudFormation / Terraform
DevOps / Infra | Containerized Jobs | ECS / EKS / Fargate