Instructor-led | 4 days
Course Description:
This hands-on, immersive 4-day bootcamp equips participants with the foundational and advanced skills required for a successful career in data engineering. The course blends theoretical concepts with real-world labs, enabling participants to build and manage data pipelines, integrate structured and unstructured data, implement ETL/ELT frameworks, and work with cloud-native data engineering tools.
The curriculum emphasizes practical implementation using tools such as Python, Apache Airflow, Apache Spark, SQL, and Docker, together with cloud platforms such as AWS and GCP. By the end of the course, attendees will be able to design scalable data workflows, optimize data infrastructure, and support advanced analytics initiatives.
Prerequisites:
- Basic understanding of Python programming
- Familiarity with SQL and relational databases
- General knowledge of cloud computing (AWS, GCP, or Azure)
- Basic Linux/command-line experience is helpful but not required
Key Takeaways:
- Understand the core principles and architecture of modern data engineering
- Build and deploy robust, scalable data pipelines using Apache Airflow and Spark
- Design and implement ETL/ELT workflows using real-world datasets
- Integrate data from APIs, databases, and cloud storage
- Apply best practices in data ingestion, transformation, and storage
- Optimize data pipelines for performance and reliability
- Gain hands-on experience with cloud services and containerization for data workflows
🟦 Module 1: Foundations of Data Engineering
Topics:
- Introduction to Data Engineering: Role, Scope, and Trends
- Data Lifecycle and Pipeline Architecture
- Batch vs. Stream Processing
- Ingestion Strategies (APIs, Files, Databases)
Hands-On Labs:
- Lab 1: Exploring a Data Pipeline – Ingest CSV and JSON files into a local PostgreSQL database
- Lab 2: Write Python scripts to extract data from REST APIs (a starter sketch follows this list)
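As a preview of Labs 1 and 2, here is a minimal extract-and-load sketch, not the lab solution: the REST endpoint, table schema, and connection details are placeholder assumptions that the lab environment will replace.

```python
"""Minimal extract-and-load sketch (illustrative only; endpoint and DB details are placeholders)."""
import requests
import psycopg2

API_URL = "https://api.example.com/users"  # hypothetical endpoint for illustration
DB_PARAMS = dict(host="localhost", dbname="labdb", user="lab", password="lab")


def extract(url: str) -> list[dict]:
    """Pull JSON records from the API, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def load(records: list[dict]) -> None:
    """Insert records into a simple staging table in PostgreSQL."""
    with psycopg2.connect(**DB_PARAMS) as conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS staging_users (id INT PRIMARY KEY, name TEXT)"
        )
        cur.executemany(
            "INSERT INTO staging_users (id, name) VALUES (%s, %s) "
            "ON CONFLICT (id) DO NOTHING",
            [(r["id"], r["name"]) for r in records],
        )


if __name__ == "__main__":
    load(extract(API_URL))
```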
🟨 Module 2: ETL/ELT Pipelines and Orchestration
Topics:
- ETL vs. ELT Explained
- Building ETL Pipelines with Python and Pandas
- Data Validation and Cleaning
- Workflow Orchestration with Apache Airflow
Hands-On Labs:
- Lab 3: Build an ETL pipeline that reads, cleans, and loads data into a database
- Lab 4: Use Apache Airflow to orchestrate a multi-step data workflow (a skeleton DAG follows this list)
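The skeleton below shows the shape of the DAG built in Lab 4, assuming Airflow 2.x with the TaskFlow API (the schedule parameter requires Airflow 2.4+). The task bodies are placeholders; the lab substitutes real extract, clean, and load logic.

```python
"""Skeleton Airflow DAG (illustrative only; task bodies are placeholders)."""
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl_workflow():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull rows from an API or file in the real lab.
        return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Placeholder cleaning step: keep only positive values.
        return [r for r in rows if r["value"] > 0]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder load step: print instead of writing to a database.
        print(f"Loaded {len(rows)} rows")

    # Chain the tasks: extract -> transform -> load.
    load(transform(extract()))


etl_workflow()
```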
🟩 Module 3: Big Data and Cloud-Native Data Engineering
Topics:
- Introduction to Big Data: Hadoop vs. Spark
- Working with Apache Spark (PySpark)
- Cloud Storage (S3, Google Cloud Storage, Azure Blob)
- Cloud ETL Tools Overview (Glue, Dataflow, Data Factory)
Hands-On Labs:
- Lab 5: Use PySpark to transform a large dataset and save it in Parquet format (a short sketch follows this list)
- Lab 6: Load and process data from AWS S3 using Python and Spark
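The sketch below previews the transform-to-Parquet pattern from Lab 5; the input path, column names, and derived column are placeholder assumptions, and the lab supplies its own dataset. Swapping the local path for an s3a:// URI (with the appropriate credentials configured) gives the Lab 6 variant.

```python
"""Minimal PySpark transform-to-Parquet sketch (paths and columns are placeholders)."""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lab5-transform").getOrCreate()

# Read a CSV dataset; for Lab 6 this path could instead be an s3a:// URI.
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Example transformation: drop incomplete rows and add a derived column.
cleaned = (
    orders
    .dropna(subset=["order_id", "amount"])
    .withColumn("amount_rounded", F.round(F.col("amount"), 2))
)

# Write the result in Parquet format.
cleaned.write.mode("overwrite").parquet("output/orders_parquet")

spark.stop()
```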
🟥 Module 4: Containerization, Deployment, and Optimization
Topics:
- Data Pipeline Deployment Strategies
- Using Docker for Data Engineering Workflows
- Monitoring and Logging Pipelines
- Pipeline Optimization and Fault Tolerance
Hands-On Labs:
- Lab 7: Containerize a data pipeline using Docker
- Lab 8: Monitor and log a running Airflow pipeline (a logging sketch follows this list)
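To illustrate the instrumentation practiced in Lab 8, here is a minimal sketch of structured logging inside a pipeline step using Python's standard logging module; inside an Airflow task the same calls surface in the task logs. The function and messages are placeholders, not the lab's actual pipeline.

```python
"""Minimal pipeline-logging sketch using the standard library (names are placeholders)."""
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
log = logging.getLogger("etl.load")


def load_batch(rows: list[dict]) -> None:
    """Placeholder load step that records row counts, failures, and duration."""
    start = time.monotonic()
    log.info("Starting load of %d rows", len(rows))
    try:
        # Real work (database writes, API calls) would go here.
        pass
    except Exception:
        log.exception("Load failed")  # keeps the traceback in the logs
        raise
    log.info("Load finished in %.2fs", time.monotonic() - start)


if __name__ == "__main__":
    load_batch([{"id": 1}, {"id": 2}])
```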