Responsibilities:
Data Ingestion and Extraction:
Bringing data from various sources (structured, unstructured, real-time) into Azure.
Data Transformation and Cleaning:
Ensuring data quality and consistency through cleaning, transformation, and integration processes.
Data Storage:
Designing and implementing data storage solutions, including Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database.
Data Warehousing:
Building and maintaining data warehouses using Azure Synapse Analytics.
Data Pipeline Development:
Creating and managing automated data pipelines for efficient data movement and processing using Azure Data Factory or Azure Databricks.
Data Security and Compliance:
Implementing security measures (encryption, access control) and ensuring compliance with data privacy laws.
Performance Monitoring and Optimization:
Identifying and resolving performance bottlenecks in data systems.
Collaboration:
Working with data scientists, analysts, and business stakeholders to understand their needs and implement appropriate data solutions.
Azure Data Engineering Full Stack Course Curriculum
Azure Databricks
Day 1:
What is Big Data Analytics
Data Analytics Platform
- Storage
- Compute
Data Processing Paradigms
- Monolithic Computing
- Distributed Computing
Day 2:
Distributed Computing Frameworks
- Hadoop MapReduce
- Apache Spark
Big Data Analytics: Data Lakes
- Tightly Coupled Data Lake
- Loosely Coupled Data Lake
Day 3:
Big Data File Formats
- Row Storage Format
- Columnar Storage Format
Scalability
- Scale-Up (Vertical Scalability)
- Scale-Out (Horizontal Scalability)
Day 4: Introduction to Azure Databricks
- Core Databricks Concepts
- Workspace
- Notebooks
- Library
- Folder
- Repos
- Data
- Compute
- Workflows
Day 5: Introducing Spark Fundamentals
- What is Apache Spark
- Why Choose Apache Spark
- What are the Spark use cases
Day 6: Spark Architecture
- Spark Components
- Spark Driver
- SparkSession
- Cluster manager
- Spark Executors
Day 7: Create Databricks Workspace
- Workspace Assets
Day 8: Creating Spark Cluster
- All-Purpose Cluster
- Single Node Cluster
- Multi Node Cluster
Day 9: Databricks - Internal Storage
- Databricks File System (DBFS)
- Uploading Files to DBFS
Day 10: DBUTILS Module
- Interaction with DBFS
- %fs Magic Command
Day 11: Spark Data APIs
- RDD (Resilient Distributed Dataset)
- DataFrame
- Dataset
Day 12: Create Data Frame
- Using Python Collection
- Converting RDD to DataFrame
Day 13: Reading CSV data with Apache Spark
- Inferred Schema
- Explicit Schema
- Parsing Modes
Day 14: Reading JSON data with Apache Spark
- SingleLine JSON
- Multiline JSON
- Complex JSON
- explode() Function
Day 15: Reading XML Data with Apache Spark
- Install Spark-xml Library
- User Defined Schema
- DDL String Approach
- StructType() with StructFields()
Day 16: Reading Excel File With Apache Spark
- Single Sheet Reading
- Multiple Sheet Reading Using List object
Day 17: Reading Excel File With Apache Spark
- Multiple Excel Sheets with Same Structure
- Multiple Excel Sheets with Different Structures
Day 18: Reading Parquet Data with Apache Spark
- Uploading Parquet Data
- Viewing the Data in the DataFrame
- Viewing the Schema of the DataFrame
- Limitations of the Parquet Format
- Schema Evolution
Day 19: Introduction to Delta Lake
- Delta Lake Features
- Delta Lake Components
Day 20: Delta Lake Features
- DML Operations
- Time Travel Operations
Day 21: Delta Lake Features
- Schema Validation and Enforcement
- Schema Evolution
Day 22: Access Data from Azure Blob Storage
- Account Access Key
- Windows Azure Storage Blob driver (WASB)
- Read Operations
- Write Operation
Day 23: Access Data from Azure Data Lake Gen2
- Azure Service Principal
- Azure Blob Filesystem driver (ABFS)
- Read Operations
- Write Operation
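The service-principal flow comes down to five Spark configurations for the ABFS driver. A configuration sketch with placeholder values (the storage account, tenant, client ID, and secret are yours to supply; in practice the secret should come from a secret scope rather than plain text):

```python
# Placeholder values -- substitute your own identifiers.
storage_account = "<storage-account>"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net",
               "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# Reads and writes then go through the abfss:// scheme:
# df = spark.read.parquet(f"abfss://<container>@{storage_account}.dfs.core.windows.net/path")
```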
Day 24: Access Data from Azure Data Lake Gen2
- Shared access signatures (SAS)
- Azure Blob Filesystem driver (ABFS)
- Read Operations
- Write Operation
Day 25: Access Data from Azure SQL Database
- Configure a connection to SQL Server
Day 26: Access Data from Synapse Dedicated SQL Pool
- Configure storage account access key
- Read data from an Azure Synapse table
- Write Data to Azure Synapse table
Day 27: Access Data from Snowflake
- Reading Data
- Writing Data
Day 28: Create Mount Point to Azure Cloud Storages
- Azure Blob Storage
- Azure Data Lake Storage
Day 29: Introduction to Spark SQL Module
- Hive Metastore
- Spark Catalog
Day 30: Spark SQL - Create Global Managed Tables
- DataFrame API
- SQL API
Day 31: Spark SQL - Create Global Un-Managed Tables
- DataFrame API
- SQL API
Day 32: Spark SQL - Create Views
- Temporary Views
- Global Temporary Views
- DataFrame API
- SQL API
- Dropping Views
Day 33: Spark Batch Processing
- Reading Batch Data
- Writing Batch Data
Day 34: Spark Structured Streaming API
- Reading Streaming Data
- Writing Streaming Data
- Checkpoint Location
Day 35: Spark Structured Streaming API - outputModes
- Append
- Complete
- Update
Day 36: Spark Structured Streaming API - Triggers
- Unspecified Trigger (Default Behavior)
- trigger(availableNow = True)
- trigger(processingTime = "n minutes")
Day 37: Spark Structured Streaming API
- Data Processing
- Joins
- Aggregation
Day 38: Code Modularity of Notebooks
- %run Magic Command
Day 39: dbutils.notebook Utility
- run()
- exit()
Day 40: Widgets - Types of Widgets
- text
- dropdown
- multiselect
- combobox
Day 41: Parameterization of Notebooks
- History Load
- Incremental Load
Day 42: Trigger Notebook from Data Factory Pipeline
- Notebook Parameters
Day 43: Databricks Workflow
- Orchestration of Tasks
Day 44: Databricks Workflow
- Task Parameters
- Job Trigger
Day 45: Delta Lake Implementation
- SCD Type 0 Dimension
Day 46: Delta Lake Implementation
- SCD Type 1 Dimension
Day 47: Delta Lake Implementation
- SCD Type 2 Dimension
Day 48: Delta Lake Implementation
- SCD Type 3 Dimension
Day 49: PySpark Performance Optimization
- cache()
- persist()
Day 50: PySpark Performance Optimization
- repartition()
- coalesce()
Day 51: PySpark Performance Optimization
- Column Predicate Pushdown
- partitionBy()
Day 52: PySpark Performance Optimization
- bucketBy()
Day 53: PySpark Performance Optimization
- Broadcast Join
Day 54: Delta Lake - Performance Optimization
- OPTIMIZE
- ZORDER
Day 55: Delta Lake - Performance Optimization
- Delta Cache
Day 56: Delta Lake - Performance Optimization
- Liquid Clustering
Day 57: Delta Lake - Performance Optimization
- Partitioning
- Liquid Clustering
Day 58: Databricks Unity Catalog
- Metastore
- Catalog
- Schema
- Tables
- Volumes
- Views
Day 59: Databricks Unity Catalog
- Managed Tables
- External Tables
Day 60: Databricks Unity Catalog
- Managed Volumes
- External Volumes
Day 61: Databricks - Auto Loader
- Auto Loader file detection modes
- Directory Listing mode
- File Notification mode
- Schema Evolution with Auto Loader
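As a configuration sketch of the Auto Loader source (this runs only inside Databricks, since cloudFiles is a Databricks-only format, and the paths below are invented):

```python
# Auto Loader incrementally picks up new files from the source directory.
df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # schemaLocation lets Auto Loader persist and evolve the inferred schema
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
        .load("/mnt/raw/orders"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/orders")
   .option("mergeSchema", "true")   # accept new columns as the source evolves
   .toTable("bronze.orders"))
```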
Day 62: Delta Live Tables
- Simple Declarative SQL & Python APIs
- Automated Pipeline Creation
- Data Quality Checks