4-day Instructor-led
Course Description
This four-day course equips data professionals with the knowledge and hands-on skills to design and operate scalable, resilient data engineering systems. Participants explore architectural patterns, performance and scaling strategies, fault-tolerant design, and distributed data processing using tools such as Apache Kafka, Apache Spark, SQL, and cloud object storage. Multiple labs each day reinforce real-world system design and operational problem-solving.
Key Takeaways
- Design end-to-end data architectures that scale with growing data volume and complexity
- Apply distributed computing principles to batch and streaming workloads
- Use message queues, distributed file systems, and compute engines effectively
- Build fault-tolerant systems with monitoring, retries, and failover strategies
- Translate functional requirements into infrastructure-ready data workflows
- Gain hands-on experience using open-source and cloud-native tools
Prerequisites
- Familiarity with Python or SQL scripting
- Basic understanding of data pipelines, relational databases, and cloud concepts
- Experience with Linux and command-line tools is a plus
Module 1: Designing the Data Pipeline Foundation
- Role of the Data Engineer: Architect vs. Operator
- Types of Pipelines: Batch, Micro-batch, and Streaming
- Storage Design: Object Storage, Data Lakes, and File Formats (Columnar Parquet, Row-Oriented Avro)
- Data Modeling and Schema Design for Scale
- Overview of ETL vs. ELT Architectures
- Hands-On Lab 1: Design a batch data ingestion pipeline using Apache Spark
- Hands-On Lab 2: Convert raw JSON into partitioned Parquet files in S3 or local HDFS (Labs 2 and 3 are previewed in the sketch after this list)
- Hands-On Lab 3: Implement basic schema validation and transformation with PySpark
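The sketch below previews Labs 2 and 3: reading raw JSON with an explicit schema, filtering out records that fail basic validation, and writing date-partitioned Parquet with PySpark. The input/output paths, column names, and schema are illustrative placeholders rather than the lab's exact dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Explicit schema instead of inference, so type drift in the raw JSON fails fast.
schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=False),
    StructField("user_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# Source and target paths are placeholders; swap in s3a:// or hdfs:// URIs as needed.
raw = spark.read.schema(schema).json("data/raw/events/")

# Basic validation: drop records missing required fields, derive a partition column.
clean = (
    raw.filter(F.col("event_id").isNotNull() & F.col("event_time").isNotNull())
       .withColumn("event_date", F.to_date("event_time"))
)

# Columnar, date-partitioned output keeps downstream scans cheap.
(clean.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("data/curated/events_parquet/"))
```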
Module 2: Building for Scalability and Throughput
- Scaling Compute: Horizontal vs. Vertical Scaling
- Partitioning, Bucketing, and File Optimization Techniques
- High-throughput Ingestion with Apache Kafka or Amazon Kinesis
- Stream Processing Frameworks (Spark Structured Streaming, Kafka Streams)
- Use of Watermarks and Windowed Aggregations
- Hands-On Lab 1: Build a Kafka ingestion pipeline with producer/consumer code (Labs 1 and 2 are previewed in the sketches after this list)
- Hands-On Lab 2: Develop a streaming job that performs aggregations over time windows
- Hands-On Lab 3: Apply partitioning and bucketing to improve read performance in Spark
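This sketch previews Lab 1: a minimal producer/consumer pair using the kafka-python client (an assumed library choice; the lab may use a different client). The broker address and topic name are illustrative placeholders.

```python
import json
import time

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"   # illustrative broker address
TOPIC = "clickstream"        # illustrative topic name

# Producer: JSON-serialize each event and wait for full broker acknowledgement.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    acks="all",
)

for i in range(100):
    producer.send(TOPIC, {"user_id": f"u{i % 10}", "action": "click", "ts": time.time()})
producer.flush()

# Consumer: read from the beginning of the topic and deserialize each record.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="clickstream-readers",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating once the topic goes quiet
)

for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```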
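This sketch previews Lab 2 and the watermark/window topics above: a Spark Structured Streaming job that reads the same Kafka topic, applies a watermark, and counts events in 5-minute tumbling windows. The topic, broker, and schema are illustrative, and the job assumes the spark-sql-kafka-0-10 connector package is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("windowed-aggregation").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw event stream from Kafka and parse the JSON payload.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "clickstream")
         .load()
         .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
         .select("e.*")
)

# The watermark bounds how long late events are accepted; the window groups
# events into 5-minute tumbling buckets per action type.
counts = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), F.col("action"))
          .count()
)

# Console sink is enough for the lab; production jobs would write to a durable sink.
query = (
    counts.writeStream
          .outputMode("update")
          .format("console")
          .option("truncate", "false")
          .start()
)
query.awaitTermination()
```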
Module 3: Ensuring Reliability and Fault Tolerance
- Common Failure Modes in Distributed Pipelines
- Retry Strategies and Idempotent Processing
- Dead Letter Queues (DLQ) and Error Logging Patterns
- High Availability (HA) Architecture Patterns: Quorum, Replication, Failover
- Testing and Validating Data Pipelines in Production
- Hands-On Lab 1: Inject failures and handle retry logic in a Spark job
- Hands-On Lab 2: Create a dead letter queue using Kafka for failed messages (see the sketch after this list)
- Hands-On Lab 3: Simulate node failure in a distributed environment and verify recovery
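The sketch below previews the retry and dead-letter-queue pattern from Labs 1 and 2, shown with a plain kafka-python consumer so the flow is easy to follow (the lab itself injects failures into a Spark job). Topic names, the broker address, and the process() rule are illustrative assumptions.

```python
import json
import time

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"          # illustrative broker address
SOURCE_TOPIC = "orders"             # illustrative topic names
DLQ_TOPIC = "orders-dlq"
MAX_RETRIES = 3

consumer = KafkaConsumer(
    SOURCE_TOPIC,
    bootstrap_servers=BROKERS,
    group_id="orders-processor",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: b,  # keep raw bytes so bad payloads don't crash the consumer
)
producer = KafkaProducer(bootstrap_servers=BROKERS)


def process(record: dict) -> None:
    """Stand-in for real business logic; raises to simulate a transient failure."""
    if record.get("amount", 0) < 0:
        raise ValueError("negative amount")


for msg in consumer:
    try:
        record = json.loads(msg.value)
    except json.JSONDecodeError:
        # Unparseable messages go straight to the dead letter queue for inspection.
        producer.send(DLQ_TOPIC, msg.value)
        continue

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            process(record)
            break
        except Exception:
            if attempt == MAX_RETRIES:
                # Retries exhausted: park the original message instead of blocking the stream.
                producer.send(DLQ_TOPIC, msg.value)
            else:
                time.sleep(2 ** attempt)  # simple exponential backoff between attempts
```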
Module 4: Monitoring, Observability, and End-to-End System Integration
- Logging, Metrics, and Tracing in Data Pipelines
- Alerting and Monitoring Tools: Prometheus, Grafana, CloudWatch
- CI/CD for Data Pipelines and Infrastructure as Code (IaC)
- Data Lineage, Governance, and Data Quality: OpenLineage, Great Expectations
- Hands-On Lab 1: Set up metric logging and alerts for pipeline performance (see the sketch after this list)
- Hands-On Lab 2: Add validation and data quality checks using Great Expectations
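This sketch previews Lab 1: exposing pipeline counters and latency histograms with the prometheus_client library so Prometheus can scrape them and Grafana or alert rules can sit on top. The metric names, labels, port, and the simulated batch step are illustrative assumptions, not the lab's exact pipeline.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric names and labels are placeholders; align them with your alerting rules.
RECORDS_PROCESSED = Counter(
    "pipeline_records_processed_total", "Records successfully processed", ["source"]
)
RECORDS_FAILED = Counter(
    "pipeline_records_failed_total", "Records that failed processing", ["source"]
)
BATCH_LATENCY = Histogram(
    "pipeline_batch_seconds", "Wall-clock time per batch", ["source"]
)


def process_batch(source: str) -> None:
    """Simulated batch step; replace with the real transformation."""
    with BATCH_LATENCY.labels(source).time():
        time.sleep(random.uniform(0.1, 0.5))
        if random.random() < 0.1:
            RECORDS_FAILED.labels(source).inc()
        else:
            RECORDS_PROCESSED.labels(source).inc()


if __name__ == "__main__":
    # Expose metrics on :8000/metrics for Prometheus to scrape.
    start_http_server(8000)
    while True:
        process_batch("orders")
```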