Instructor-led | 4 days
Course Description:
This hands-on, immersive 4-day bootcamp equips participants with the foundational and advanced skills required for a successful career in data engineering. The course blends theoretical concepts with real-world labs, enabling participants to build and manage data pipelines, integrate structured and unstructured data, implement ETL/ELT frameworks, and work with cloud-native data engineering tools.
The curriculum emphasizes practical implementation using tools such as Python, Apache Airflow, Apache Spark, SQL, and Docker, together with cloud platforms such as AWS and GCP. By the end of the course, attendees will be able to design scalable data workflows, optimize data infrastructure, and support advanced analytics initiatives.
Prerequisites:
- Basic understanding of Python programming
- Familiarity with SQL and relational databases
- General knowledge of cloud computing (AWS, GCP, or Azure)
- Basic Linux/command-line experience is helpful but not required
Key Takeaways:
- Understand the core principles and architecture of modern data engineering
- Build and deploy robust, scalable data pipelines using Apache Airflow and Spark
- Design and implement ETL/ELT workflows using real-world datasets
- Integrate data from APIs, databases, and cloud storage
- Apply best practices in data ingestion, transformation, and storage
- Optimize data pipelines for performance and reliability
- Gain hands-on experience with cloud services and containerization for data workflows
🟦 Module 1: Foundations of Data Engineering
Topics:
- Introduction to Data Engineering: Role, Scope, and Trends
- Data Lifecycle and Pipeline Architecture
- Batch vs. Stream Processing
- Ingestion Strategies (APIs, Files, Databases)
Hands-On Labs:
- Lab 1: Exploring a Data Pipeline – Ingest CSV and JSON files into a local PostgreSQL database
- Lab 2: Write Python scripts to extract data from REST APIs (a starter sketch follows this list)
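As a preview of Labs 1 and 2, here is a minimal extract-and-load sketch, not the lab solution: the REST endpoint, table schema, and connection details are placeholder assumptions that the lab environment will replace.

```python
"""Minimal extract-and-load sketch (illustrative only; endpoint and DB details are placeholders)."""
import requests
import psycopg2

API_URL = "https://api.example.com/users"  # hypothetical endpoint for illustration
DB_PARAMS = dict(host="localhost", dbname="labdb", user="lab", password="lab")


def extract(url: str) -> list[dict]:
    """Pull JSON records from the API, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def load(records: list[dict]) -> None:
    """Insert records into a simple staging table in PostgreSQL."""
    with psycopg2.connect(**DB_PARAMS) as conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS staging_users (id INT PRIMARY KEY, name TEXT)"
        )
        cur.executemany(
            "INSERT INTO staging_users (id, name) VALUES (%s, %s) "
            "ON CONFLICT (id) DO NOTHING",
            [(r["id"], r["name"]) for r in records],
        )


if __name__ == "__main__":
    load(extract(API_URL))
```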
🟨 Module 2: ETL/ELT Pipelines and Orchestration
Topics:
- ETL vs. ELT Explained
- Building ETL Pipelines with Python and Pandas
- Data Validation and Cleaning
- Workflow Orchestration with Apache Airflow
Hands-On Labs:
- Lab 3: Build an ETL pipeline that reads, cleans, and loads data into a database
- Lab 4: Use Apache Airflow to orchestrate a multi-step data workflow (a skeleton DAG follows this list)
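The skeleton below shows the shape of the DAG built in Lab 4, assuming Airflow 2.x with the TaskFlow API (the schedule parameter requires Airflow 2.4+). The task bodies are placeholders; the lab substitutes real extract, clean, and load logic.

```python
"""Skeleton Airflow DAG (illustrative only; task bodies are placeholders)."""
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl_workflow():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull rows from an API or file in the real lab.
        return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Placeholder cleaning step: keep only positive values.
        return [r for r in rows if r["value"] > 0]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder load step: print instead of writing to a database.
        print(f"Loaded {len(rows)} rows")

    # Chain the tasks: extract -> transform -> load.
    load(transform(extract()))


etl_workflow()
```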
🟩 Module 3: Big Data and Cloud-Native Data Engineering
Topics:
- Introduction to Big Data: Hadoop vs. Spark
- Working with Apache Spark (PySpark)
- Cloud Storage (S3, Google Cloud Storage, Azure Blob)
- Cloud ETL Tools Overview (Glue, Dataflow, Data Factory)
Hands-On Labs:
- Lab 5: Use PySpark to transform a large dataset and save it in Parquet format (a short sketch follows this list)
- Lab 6: Load and process data from AWS S3 using Python and Spark
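The sketch below previews the transform-to-Parquet pattern from Lab 5; the input path, column names, and derived column are placeholder assumptions, and the lab supplies its own dataset. Swapping the local path for an s3a:// URI (with the appropriate credentials configured) gives the Lab 6 variant.

```python
"""Minimal PySpark transform-to-Parquet sketch (paths and columns are placeholders)."""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lab5-transform").getOrCreate()

# Read a CSV dataset; for Lab 6 this path could instead be an s3a:// URI.
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Example transformation: drop incomplete rows and add a derived column.
cleaned = (
    orders
    .dropna(subset=["order_id", "amount"])
    .withColumn("amount_rounded", F.round(F.col("amount"), 2))
)

# Write the result in Parquet format.
cleaned.write.mode("overwrite").parquet("output/orders_parquet")

spark.stop()
```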
🟥 Module 4: Containerization, Deployment, and Optimization
Topics:
- Data Pipeline Deployment Strategies
- Using Docker for Data Engineering Workflows
- Monitoring and Logging Pipelines
- Pipeline Optimization and Fault Tolerance
Hands-On Labs:
- Lab 7: Containerize a data pipeline using Docker
- Lab 8: Monitor and log a running Airflow pipeline (a logging sketch follows this list)
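To illustrate the instrumentation practiced in Lab 8, here is a minimal sketch of structured logging inside a pipeline step using Python's standard logging module; inside an Airflow task the same calls surface in the task logs. The function and messages are placeholders, not the lab's actual pipeline.

```python
"""Minimal pipeline-logging sketch using the standard library (names are placeholders)."""
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
log = logging.getLogger("etl.load")


def load_batch(rows: list[dict]) -> None:
    """Placeholder load step that records row counts, failures, and duration."""
    start = time.monotonic()
    log.info("Starting load of %d rows", len(rows))
    try:
        # Real work (database writes, API calls) would go here.
        pass
    except Exception:
        log.exception("Load failed")  # keeps the traceback in the logs
        raise
    log.info("Load finished in %.2fs", time.monotonic() - start)


if __name__ == "__main__":
    load_batch([{"id": 1}, {"id": 2}])
```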