ETL stands for Extract, Transform, Load. Data is first extracted from source systems, then transformed (cleaned, enriched, aggregated) in a staging area, and finally loaded into a target data warehouse. It is the traditional approach for data integration where transformations happen before data reaches the warehouse.
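The three ETL stages can be sketched as plain functions. This is a minimal illustration, not any specific library's API; the function names and sample records are hypothetical.

```python
# Minimal ETL sketch: extract raw rows, clean them in a staging step,
# then load the transformed rows into the target store.

def extract():
    # Pull raw rows from a hypothetical source system.
    return [
        {"name": " alice ", "amount": "120.50"},
        {"name": "bob", "amount": "80.00"},
    ]

def transform(rows):
    # Clean and standardize before the data reaches the warehouse.
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    # Append transformed rows into the target store.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Alice', 'amount': 120.5}, {'name': 'Bob', 'amount': 80.0}]
```

Note that the transform step runs *before* the load, which is the defining property of ETL.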
ELT stands for Extract, Load, Transform. Unlike ETL, raw data is loaded directly into the target system (e.g., a cloud data warehouse) and transformations happen inside the warehouse using its compute power. ELT leverages the scalability of modern warehouses like BigQuery, Snowflake, and Redshift to handle transformations at scale.
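The order reversal can be shown with SQLite standing in for a cloud warehouse (the table names are illustrative): raw data lands first, then SQL running inside the "warehouse" does the transformation.

```python
import sqlite3

# ELT sketch: load raw data as-is, then transform it inside the target
# system with SQL. sqlite3 stands in for BigQuery/Snowflake/Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(" alice ", "120.50"), ("bob", "80.00")],
)

# The transform step uses the warehouse's own compute, after loading.
conn.execute("""
    CREATE TABLE orders AS
    SELECT TRIM(customer) AS customer, CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 200.5
```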
A data pipeline is an automated series of steps that moves data from one or more sources to a destination system. It typically includes stages for ingestion, transformation, validation, and loading. Pipelines can be batch-based, real-time, or a hybrid of both, and they are often orchestrated by tools like Apache Airflow or Prefect.
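The ingestion → transformation → validation → loading flow can be sketched as a chain of stage functions, where each stage's output feeds the next. All names here are illustrative, not from an orchestration framework.

```python
from functools import reduce

# Toy pipeline: stages are functions applied in sequence.

def ingest(_):
    # Pull raw values from a hypothetical source.
    return ["10", "20", "x", "30"]

def transform(raw):
    # Drop malformed records and convert types.
    return [int(v) for v in raw if v.isdigit()]

def validate(values):
    # Fail the run if a quality rule is violated.
    assert all(v >= 0 for v in values), "negative value found"
    return values

destination = []

def load(values):
    destination.extend(values)
    return destination

stages = [ingest, transform, validate, load]
result = reduce(lambda data, stage: stage(data), stages, None)
print(result)  # [10, 20, 30]
```

Orchestrators like Airflow or Prefect add scheduling, retries, and monitoring around exactly this kind of stage graph.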
A data warehouse is a centralized repository designed for analytical querying and reporting. It stores structured, historical data that has been cleaned and transformed from operational systems. Data warehouses use schema-on-write, meaning data is structured before being stored, and are optimized for read-heavy OLAP workloads.
A star schema is a data modeling technique where a central fact table (containing measurable metrics) is surrounded by dimension tables (containing descriptive attributes). It is called a "star" because the diagram resembles a star shape. Star schemas are denormalized for fast query performance and are the most common schema in data warehouses.
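A tiny star schema can be built and queried with SQLite (table and column names are illustrative): one fact table of metrics joined to a descriptive dimension table.

```python
import sqlite3

# Star-schema sketch: fact_sales holds measurable metrics; dim_product
# holds descriptive attributes. A typical query joins and aggregates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER,
                              quantity INTEGER, revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Tools'),
                                   (2, 'Gadget', 'Toys');
    INSERT INTO fact_sales  VALUES (1, 3, 30.0),
                                   (2, 1, 15.0),
                                   (1, 2, 20.0);
""")
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)  # [('Tools', 50.0), ('Toys', 15.0)]
```

In a snowflake schema, `category` would be normalized out of `dim_product` into its own table, adding one more join to the query above.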
A snowflake schema is a variation of the star schema where dimension tables are normalized into multiple related tables. For example, a product dimension might be split into product, category, and brand tables. This reduces data redundancy but requires more joins, which can impact query performance.
Apache Spark is a distributed computing framework for large-scale data processing. It provides APIs in Python (PySpark), Scala, Java, and R, and supports batch processing, stream processing, machine learning (MLlib), and graph processing. Spark uses in-memory computation, making it significantly faster than Hadoop MapReduce for many workloads.
Apache Kafka is a distributed event streaming platform used for building real-time data pipelines. It operates on a publish-subscribe model where producers write messages to topics and consumers read from them. Kafka provides high throughput, fault tolerance through replication, and message durability, making it ideal for streaming architectures.
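The publish-subscribe model can be illustrated with a toy in-memory broker. This mimics the topic/producer/consumer-group concepts only; it is not the Kafka API (real clients such as kafka-python or confluent-kafka talk to an actual broker).

```python
from collections import defaultdict

# Toy pub-sub broker: each topic is an append-only log, and each
# consumer group tracks its own read offset in that log.
class Broker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> list of messages
        self.offsets = defaultdict(int)   # (group, topic) -> next offset

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic):
        offset = self.offsets[(group, topic)]
        if offset >= len(self.topics[topic]):
            return None  # nothing new for this group
        self.offsets[(group, topic)] += 1
        return self.topics[topic][offset]

broker = Broker()
broker.produce("clicks", {"user": "alice", "page": "/home"})
broker.produce("clicks", {"user": "bob", "page": "/cart"})
first = broker.consume("analytics", "clicks")
print(first)  # {'user': 'alice', 'page': '/home'}
```

Per-group offsets are the key idea: two independent consumer groups can each read the full log at their own pace.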
A data lake is a centralized storage repository that holds vast amounts of raw data in its native format—structured, semi-structured, or unstructured. Unlike a data warehouse, a data lake uses schema-on-read, meaning structure is applied only when the data is queried. Common implementations use HDFS, Amazon S3, or Azure Data Lake Storage.
Batch processing handles data in large, discrete chunks at scheduled intervals (e.g., hourly or daily), while stream processing handles data continuously in near real-time as it arrives. Batch is suited for historical analytics and reporting; streaming is suited for real-time dashboards, fraud detection, and event-driven systems. Tools like Spark support both paradigms.
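The contrast can be shown with the same aggregation computed both ways (the event values are illustrative): once over a complete chunk, and once incrementally as each event arrives.

```python
events = [3, 7, 2, 8]

# Batch: process the whole chunk at once (e.g., an hourly job).
batch_total = sum(events)

# Streaming: maintain a running aggregate per incoming event.
def running_totals(stream):
    total = 0
    for value in stream:
        total += value
        yield total  # result is available immediately, per event

stream_totals = list(running_totals(events))
print(batch_total, stream_totals)  # 20 [3, 10, 12, 20]
```

Both reach the same final answer; the difference is latency — streaming exposes partial results as data arrives, batch only after the chunk completes.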
Data quality refers to the accuracy, completeness, consistency, timeliness, and validity of data. Poor data quality leads to incorrect analytics, bad business decisions, and compliance risks. Key practices include data validation, profiling, anomaly detection, and implementing quality checks at each stage of the pipeline using tools like Great Expectations or dbt tests.
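Quality checks of this kind can be hand-rolled as simple predicates over rows, in the spirit of Great Expectations or dbt tests. The check names and sample rows here are hypothetical.

```python
# Minimal data-quality sketch: each check returns the ids of failing rows.

rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]

def check_not_null(rows, column):
    # Completeness: the column must never be missing.
    return [r["id"] for r in rows if r[column] is None]

def check_in_range(rows, column, lo, hi):
    # Validity: values must fall inside a plausible range.
    return [r["id"] for r in rows
            if r[column] is not None and not lo <= r[column] <= hi]

failures = {
    "email_not_null": check_not_null(rows, "email"),
    "age_in_range": check_in_range(rows, "age", 0, 120),
}
print(failures)  # {'email_not_null': [2], 'age_in_range': [3]}
```

In a real pipeline, a non-empty failure set would typically fail the run or quarantine the offending rows before loading.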
Apache Airflow is an open-source workflow orchestration platform for authoring, scheduling, and monitoring data pipelines. Pipelines are defined as DAGs (Directed Acyclic Graphs) in Python code. Airflow provides a web UI for monitoring, supports retries and alerting, and integrates with cloud services, databases, and processing frameworks.
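The core idea — tasks plus upstream dependencies resolved into a valid run order — can be illustrated with the standard library's `graphlib`. This is a toy model of DAG scheduling, not the Airflow API (which defines DAGs with `DAG` and operator classes and runs them via a scheduler); the task names are illustrative.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks that must finish first
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "notify": {"load"},
}

# A topological sort yields an execution order that respects every
# dependency edge; a cycle would raise CycleError (hence "acyclic").
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)  # ['extract', 'transform', 'validate', 'load', 'notify']
```

An orchestrator layers scheduling, retries, and alerting on top of exactly this ordering logic, and can run independent branches of the graph in parallel.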