PG Diploma in Data Engineering

Our Data Engineering program prepares you to excel from day one with skills in Data Management, Modeling, Distributed Systems, and Cloud Computing. Gain hands-on experience through real-world projects and industry use cases, mastering Big Data platforms and cloud technologies. Learn from experts and fast-track your career in data engineering. Start your journey today!

Technologies You Will Learn

Study at Your Own Pace

Boost Your Career by Learning Skills in High Demand

Program Highlights

300+ Hours

of immersive learning, with in-class and online options

Expert Faculty

A pool of academicians and industry practitioners

Campus Placement

Opportunities 

Structured Program

Program created with experts

Learning Format

Available both online and offline

Membership

Membership to the PST IT Solutions Alumni Network

Course Content

Fundamentals of Linux
Managing files and directories
Linux architecture
Hardware and system information commands
Basic Linux commands from scratch
Types of editors
Compression and extraction commands
Process monitoring and types of processes
System-level monitoring commands
User and group commands
Networking information
Networking commands
Advanced file-permission commands in Linux
Advanced Linux
Shell script operators
Conditional statements and loop statements
Job scheduling methods and commands
Shell programming and script examples

Python installation and environment setup
Python basic syntax
Python variables
Data types
Python operators
Python control flow: decision-making/conditional statements
Python control flow: loop statements
Python control flow: branching statements
Lists
String handling in Python
Python data structures
Python functions
Python modules
Python I/O (input and output)
Python regular expressions
Python exception handling
Python object-oriented programming
Advanced OOP
NumPy, Pandas, and Matplotlib
Mini project
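As a taste of the Python module, here is a small self-contained sketch combining several of the topics above: functions, regular expressions, and exception handling. The log format and function names are invented for illustration.

```python
import re

# Hypothetical log format used for illustration: "LEVEL: message"
LOG_PATTERN = re.compile(r"^(INFO|WARN|ERROR): (.+)$")

def parse_log_line(line):
    """Parse one log line, raising ValueError on malformed input."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"malformed log line: {line!r}")
    level, message = match.groups()
    return level, message

def count_errors(lines):
    """Count ERROR lines, skipping malformed ones via exception handling."""
    errors = 0
    for line in lines:
        try:
            level, _ = parse_log_line(line)
        except ValueError:
            continue  # exception handling in practice: ignore bad lines
        if level == "ERROR":
            errors += 1
    return errors

sample = ["INFO: started", "ERROR: disk full", "garbage", "ERROR: timeout"]
print(count_errors(sample))  # prints 2
```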

1. Introduction to Data Engineering (5 hours)
– Overview of Data Engineering
– Key Tools and Technologies
– Introduction to Big Data

2. Hadoop Ecosystem Overview (10 hours)
– Introduction to Hadoop Architecture
– Components of Hadoop Ecosystem

3. HDFS (Hadoop Distributed File System) (15 hours)
– HDFS Architecture and Design
– Data Storage and Block Management
– HDFS Commands and Hands-on

4. YARN (10 hours)
– YARN Architecture and Resource Management
– Running Applications on YARN
– Scheduling and Resource Allocation in YARN

5. Hive (15 hours)
– Introduction to Hive
– Data Warehousing Concepts
– HiveQL and Query Optimization
– Hands-on with Hive Queries and Data Analysis

6. Apache Spark (20 hours)
– Spark Core Architecture and RDDs
– Transformations and Actions
– Optimizing Spark Jobs
– Hands-on with Core Spark Applications
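Spark's core model, lazy transformations followed by an eager action, can be previewed without a cluster. The sketch below uses plain-Python generators to mimic the idea, with rough PySpark equivalents noted in comments; it is an analogy, not the Spark API itself.

```python
# A plain-Python preview of Spark's transformation/action model.
# Transformations (map, filter) are lazy; the action (sum) triggers work.

data = range(1, 11)  # in PySpark: rdd = sc.parallelize(range(1, 11))

# Lazy transformations: nothing is computed yet, just like RDD lineage.
squared = (x * x for x in data)             # rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)  # .filter(lambda x: x % 2 == 0)

# Eager action: forces evaluation, like rdd.sum() or rdd.collect().
total = sum(evens)
print(total)  # prints 220 (4 + 16 + 36 + 64 + 100)
```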

7. Spark SQL and DataFrames (15 hours)
– Introduction to Spark SQL
– DataFrames and Datasets
– SQL Queries with Spark
– Integrating Spark SQL with Hive and HDFS

8. Spark Streaming (15 hours)
– Spark Streaming Architecture
– Processing Real-time Data Streams
– Window Operations and State Management
– Use Cases and Hands-on with Streaming Data
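One idea covered here, window operations, can be sketched in plain Python. The example below illustrates the sliding-window aggregation concept, not the Spark Streaming API; the micro-batch data is invented.

```python
from collections import deque

# A plain-Python sketch of the sliding-window idea behind Spark Streaming's
# window operations: keep only the last `window_size` micro-batches and
# aggregate over them each time a new batch arrives.

def windowed_sums(batches, window_size):
    """Yield the sum over a sliding window of the most recent batches."""
    window = deque(maxlen=window_size)  # old batches fall out automatically
    for batch in batches:
        window.append(batch)
        yield sum(sum(b) for b in window)

# Each inner list is one micro-batch of numeric events (invented data).
batches = [[1, 2], [3], [4, 5], [6]]
print(list(windowed_sums(batches, window_size=2)))  # prints [3, 6, 12, 15]
```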

9. Kafka (15 hours)
– Introduction to Kafka
– Kafka Architecture and Key Components
– Producer, Consumer, and Broker Roles
– Real-time Data Processing with Kafka and Spark
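The producer/consumer pattern at the heart of Kafka can be previewed with an in-memory queue. This is a plain-Python analogy only: a real Kafka broker persists, partitions, and replicates the log, and the record values here are invented.

```python
from queue import Queue

# A plain-Python sketch of Kafka's producer/consumer pattern: a producer
# appends records to a topic-like buffer; a consumer reads them in order.

topic = Queue()  # stands in for a single-partition Kafka topic

def produce(records):
    for record in records:
        topic.put(record)  # analogous to producer.send(topic, record)

def consume(n):
    return [topic.get() for _ in range(n)]  # analogous to polling a consumer

produce(["evt-1", "evt-2", "evt-3"])
messages = consume(3)
print(messages)  # prints ['evt-1', 'evt-2', 'evt-3']
```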

10. AWS Cloud for Data Engineering (30 hours)
– AWS Basics (5 hours)
  – Introduction to AWS and Key Concepts
  – IAM, EC2, and S3 Basics
– AWS Glue (5 hours)
  – Data Ingestion with Glue
  – Data Cleaning and ETL Processes
– AWS Kinesis (5 hours)
  – Real-time Data Processing with Kinesis
  – Integrating Kinesis with Spark and Other Tools
– AWS Athena (5 hours)
  – Data Analysis Using Athena
  – Integrating Athena with S3
– AWS EMR (5 hours)
  – Running Spark on EMR
  – Optimizing EMR Performance for Big Data
– AWS Redshift (3 hours)
  – Data Warehousing with Redshift
  – Data Modeling and Analytics
– AWS DynamoDB (2 hours)
  – NoSQL Database Concepts
  – Integrating DynamoDB with Data Pipelines

11. Apache Airflow (15 hours)
– Introduction to Workflow Orchestration
– DAGs, Tasks, and Operators in Airflow
– Scheduling and Monitoring Pipelines
– Hands-on with Airflow for Orchestration
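At its core, an Airflow DAG is a set of tasks plus dependencies, executed in dependency order. The sketch below illustrates that ordering with Python's standard-library graphlib; the task names are invented and no Airflow API is used.

```python
from graphlib import TopologicalSorter

# A plain-Python sketch of the idea behind an Airflow DAG: tasks plus
# dependencies, run in an order that respects those dependencies.
# In Airflow, each task would be an Operator inside a DAG definition.

# Mapping: task -> the set of tasks it depends on (its upstream tasks).
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "transform": {"clean"},
    "load_warehouse": {"transform"},
    "report": {"load_warehouse"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # prints ['ingest', 'clean', 'transform', 'load_warehouse', 'report']
```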

12. Apache NiFi (15 hours)
– Introduction to NiFi for Data Flow
– NiFi Architecture and Processors
– Creating Data Flows and Automation
– Hands-on Data Ingestion with NiFi

13. Version Control with GitHub (5 hours)
– Git Basics and Workflow
– Branching, Merging, and Collaboration
– GitHub for Version Control in Data Engineering Projects

14. Jenkins for CI/CD (10 hours)
– Introduction to Continuous Integration and Deployment
– Configuring Jenkins Pipelines for Data Workflows
– Integrating Jenkins with GitHub and AWS

15. Data Visualization with Power BI (10 hours)
– Introduction to Power BI
– Creating Reports and Dashboards
– Connecting Power BI with Data Sources
– Real-world Visualization Scenarios

16. End-to-End Data Engineering Project (30 hours)
– Phase 1: Data Ingestion and Storage (5 hours)
  – Setting up Data Sources and Initial Ingestion with NiFi and Glue
– Phase 2: Data Cleaning and Transformation (5 hours)
  – Data Preparation with Spark on EMR
– Phase 3: Real-time Data Processing (5 hours)
  – Kafka and Spark Streaming Integration
– Phase 4: Data Warehousing and Analytics (5 hours)
  – Data Modeling in Redshift and Analysis with Athena
– Phase 5: Orchestration and Automation (5 hours)
  – Pipeline Orchestration Using Airflow and Jenkins
– Phase 6: Visualization and Reporting (5 hours)
  – Building Dashboards in Power BI and Insight Generation
