Our Data Engineering program prepares you to excel from day one with skills in Data Management, Modeling, Distributed Systems, and Cloud Computing. Gain hands-on experience through real-world projects and industry use cases, mastering Big Data platforms and cloud technologies. Learn from experts and fast-track your career in data engineering. Start your journey today!
Program Highlights
300+ hours
of immersive learning, with in-class or online options
Expert Faculty
A pool of academicians and industry practitioners
Campus Placement Opportunities
Structured Program
Program created with experts
Learning Format
Both online and offline formats available
Membership
Membership to the PST IT Solutions Alumni Network
Course Content
Linux
Fundamentals of Linux
Management of Files and Directories
Linux Architecture
Hardware & System Information Commands
Basic Linux Commands from Scratch
Editor Types
Compression & Extraction Commands
Process Monitoring & Process Types
System-Level Monitoring Commands
User & Group Commands
Networking Information & Commands
Advanced File Permission Commands
Advanced Linux
Shell Script Operators
Conditional and Loop Statements
Job Scheduling Methods and Commands
Shell Scripting Examples and Sample Programs
Python
Python Installation / Environment Setup
Python Basic Syntax
Python Variables
Data Types
Python Operators
Python Control Flow – Decision-Making/Conditional Statements
Python Control Flow – Loop Statements
Python Control Flow – Branching Statements
List
String handling in Python
Python Data Structures
Python Functions
Python Modules
Python IO (Input & Output)
Python Regular Expressions
Python Exception handling
Python Object-Oriented Programming
Advanced OOP
NumPy, Pandas, Matplotlib
Mini Project
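For illustration, a minimal sketch tying together several of the Python topics above (functions, conditional control flow, loops, lists, and exception handling); the names and thresholds are purely illustrative:

```python
# Illustrative sketch: functions, control flow, lists, and exception handling.

def classify_temperatures(readings):
    """Label each numeric reading; non-numeric values raise ValueError."""
    labels = []
    for r in readings:
        if not isinstance(r, (int, float)):
            raise ValueError(f"non-numeric reading: {r!r}")
        if r > 30:
            labels.append("hot")
        elif r >= 15:
            labels.append("mild")
        else:
            labels.append("cold")
    return labels

print(classify_temperatures([35, 22, 5]))  # ['hot', 'mild', 'cold']

# Exception handling in action:
try:
    classify_temperatures([35, "n/a"])
except ValueError as e:
    print("caught:", e)
```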
Hadoop, Hive, Apache Spark, SQL, Kafka
1. Introduction to Data Engineering (5 hours)
– Overview of Data Engineering
– Key Tools and Technologies
– Introduction to Big Data
2. Hadoop Ecosystem Overview (10 hours)
– Introduction to Hadoop Architecture
– Components of Hadoop Ecosystem
3. HDFS (Hadoop Distributed File System) (15 hours)
– HDFS Architecture and Design
– Data Storage and Block Management
– HDFS Commands and Hands-on
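HDFS stores a file as fixed-size blocks (128 MB by default) replicated across DataNodes, with the NameNode tracking only the block-to-node mapping. That bookkeeping can be sketched in plain Python; the block size and node names below are toy stand-ins, not the real API:

```python
# Toy model of HDFS block placement: split data into fixed-size blocks and
# assign each block to `replication` distinct DataNodes (round-robin).

BLOCK_SIZE = 4          # bytes, a stand-in for the real 128 MB default
DATANODES = ["dn1", "dn2", "dn3"]

def place_blocks(data: bytes, replication: int = 2):
    """Return a mapping of block index -> (block bytes, assigned DataNodes)."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        nodes = [DATANODES[(idx + r) % len(DATANODES)] for r in range(replication)]
        placement[idx] = (block, nodes)
    return placement

plan = place_blocks(b"hello hdfs!", replication=2)
for idx, (block, nodes) in plan.items():
    print(idx, block, nodes)
```

Replication on distinct nodes is what lets HDFS survive the loss of a DataNode without losing data.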
4. YARN (10 hours)
– YARN Architecture and Resource Management
– Running Applications on YARN
– Scheduling and Resource Allocation in YARN
5. Hive (15 hours)
– Introduction to Hive
– Data Warehousing Concepts
– HiveQL and Query Optimization
– Hands-on with Hive Queries and Data Analysis
6. Apache Spark (20 hours)
– Spark Core Architecture and RDDs
– Transformations and Actions
– Optimizing Spark Jobs
– Hands-on with Core Spark Applications
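Spark's RDD transformations (flatMap, filter, map) are lazy: nothing executes until an action (reduce, collect, count) forces the pipeline to run. The same idea can be sketched with pure-Python generators, which are also lazy; this is a conceptual stand-in requiring no cluster or PySpark install:

```python
from functools import reduce

# Lazy "transformations": generators build a pipeline without computing anything.
lines = ["spark core", "rdd basics", "spark actions"]
words = (w for line in lines for w in line.split())       # like rdd.flatMap
spark_words = (w for w in words if w == "spark")          # like rdd.filter
ones = ((w, 1) for w in spark_words)                      # like rdd.map

# "Action": reduce forces the whole pipeline to execute at once.
count = reduce(lambda acc, pair: acc + pair[1], ones, 0)  # like an rdd action
print(count)  # 2
```

In real Spark the same laziness lets the engine optimize and distribute the whole chain before any work happens.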
7. Spark SQL and DataFrames (15 hours)
– Introduction to Spark SQL
– DataFrames and Datasets
– SQL Queries with Spark
– Integrating Spark SQL with Hive and HDFS
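Spark SQL's core idea is running ordinary SQL over structured, table-like data (DataFrames registered as temporary views). That idea can be previewed with Python's built-in sqlite3 module as a stand-in; the table and values below are illustrative, and this is not the Spark API itself:

```python
import sqlite3

# Stand-in for Spark SQL: load structured rows, then query them with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# In Spark this aggregation would run via spark.sql(...) against a temp view.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```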
8. Spark Streaming (15 hours)
– Spark Streaming Architecture
– Processing Real-time Data Streams
– Window Operations and State Management
– Use Cases and Hands-on with Streaming Data
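Window operations group a stream of timestamped events into (possibly overlapping) time buckets. A plain-Python sketch of a sliding-window count over toy timestamps, to show the window/slide mechanics rather than the Spark API:

```python
# Toy sliding-window count over timestamped events: (seconds, value).
events = [(1, "a"), (2, "b"), (4, "c"), (5, "d"), (9, "e")]

def sliding_counts(events, window=4, slide=2, end=10):
    """Count events in each [start, start+window) bucket, advancing by `slide`."""
    counts = {}
    for start in range(0, end, slide):
        in_window = [v for t, v in events if start <= t < start + window]
        counts[(start, start + window)] = len(in_window)
    return counts

for bounds, n in sliding_counts(events).items():
    print(bounds, n)
```

Because window (4 s) is larger than slide (2 s), adjacent windows overlap and a single event can be counted in more than one bucket, just as in Spark Streaming's window operations.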
9. Kafka (15 hours)
– Introduction to Kafka
– Kafka Architecture and Key Components
– Producer, Consumer, and Broker Roles
– Real-time Data Processing with Kafka and Spark
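Kafka's key design is decoupling: producers publish to a topic held by a broker, and consumers read from it independently. The three roles can be sketched with Python's thread-safe queue standing in for a broker topic; this is conceptual only, not the Kafka client API:

```python
import queue
import threading

# A thread-safe queue stands in for one topic partition held by the broker.
topic = queue.Queue()

def producer():
    for i in range(5):
        topic.put(f"event-{i}")   # producer publishes to the topic
    topic.put(None)               # sentinel: no more messages

consumed = []

def consumer():
    while True:
        msg = topic.get()         # consumer polls the topic
        if msg is None:
            break
        consumed.append(msg)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(consumed)  # ['event-0', 'event-1', 'event-2', 'event-3', 'event-4']
```

Neither side knows about the other; both only know the topic. Real Kafka adds durability, partitioning, and consumer groups on top of this pattern.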
10. AWS Cloud for Data Engineering (30 hours)
– AWS Basics (5 hours)
  – Introduction to AWS and Key Concepts
  – IAM, EC2, and S3 Basics
– AWS Glue (5 hours)
  – Data Ingestion with Glue
  – Data Cleaning and ETL Processes
– AWS Kinesis (5 hours)
  – Real-time Data Processing with Kinesis
  – Integrating Kinesis with Spark and Other Tools
– AWS Athena (5 hours)
  – Data Analysis using Athena
  – Integrating Athena with S3
– AWS EMR (5 hours)
  – Running Spark on EMR
  – Optimizing EMR Performance for Big Data
– AWS Redshift (3 hours)
  – Data Warehousing with Redshift
  – Data Modeling and Analytics
– AWS DynamoDB (2 hours)
  – NoSQL Database Concepts
  – Integrating DynamoDB with Data Pipelines
11. Apache Airflow (15 hours)
– Introduction to Workflow Orchestration
– DAGs, Tasks, and Operators in Airflow
– Scheduling and Monitoring Pipelines
– Hands-on with Airflow for Orchestration
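Airflow models a pipeline as a DAG whose tasks run only after all their upstream dependencies succeed. The scheduling core, a topological ordering of tasks, can be sketched in plain Python; the task names here are illustrative, and real Airflow DAGs declare dependencies with operators rather than dicts:

```python
# Toy DAG: task -> list of upstream dependencies it must wait for.
dag = {
    "extract": [],
    "transform": ["extract"],
    "validate": ["extract"],
    "load": ["transform", "validate"],
}

def run_order(dag):
    """Return one valid execution order: a task runs only after its upstreams."""
    done, order = set(), []
    while len(order) < len(dag):
        for task, upstream in sorted(dag.items()):
            if task not in done and all(u in done for u in upstream):
                done.add(task)
                order.append(task)
    return order

print(run_order(dag))  # ['extract', 'transform', 'validate', 'load']
```

Note that "transform" and "validate" have no dependency on each other, so a real scheduler could run them in parallel; only "load" must wait for both.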
12. Apache NiFi (15 hours)
– Introduction to NiFi for Data Flow
– NiFi Architecture and Processors
– Creating Data Flows and Automation
– Hands-on Data Ingestion with NiFi
13. Version Control with GitHub (5 hours)
– Git Basics and Workflow
– Branching, Merging, and Collaboration
– GitHub for Version Control in Data Engineering Projects
14. Jenkins for CI/CD (10 hours)
– Introduction to Continuous Integration and Deployment
– Configuring Jenkins Pipelines for Data Workflows
– Integrating Jenkins with GitHub and AWS
15. Data Visualization with Power BI (10 hours)
– Introduction to Power BI
– Creating Reports and Dashboards
– Connecting Power BI with Data Sources
– Real-world Visualization Scenarios
16. End-to-End Data Engineering Project (30 hours)
– Phase 1: Data Ingestion and Storage (5 hours)
  – Setting up Data Sources and Initial Ingestion with NiFi and Glue
– Phase 2: Data Cleaning and Transformation (5 hours)
  – Data Preparation with Spark on EMR
– Phase 3: Real-time Data Processing (5 hours)
  – Kafka and Spark Streaming Integration
– Phase 4: Data Warehousing and Analytics (5 hours)
  – Data Modeling in Redshift and Analysis with Athena
– Phase 5: Orchestration and Automation (5 hours)
  – Pipeline Orchestration using Airflow and Jenkins
– Phase 6: Visualization and Reporting (5 hours)
  – Building Dashboards in Power BI and Generating Insights