ETL stands for Extract, Transform, Load. Data is first extracted from source systems, then transformed (cleaned, enriched, aggregated) in a staging area, and finally loaded into a target data warehouse. It is the traditional approach for data integration where transformations happen before data reaches the warehouse.
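The three ETL stages can be sketched as plain functions. This is a minimal illustration, not any specific library's API; the function names and sample records are hypothetical.

```python
# Minimal ETL sketch: extract raw rows, clean them in a staging step,
# then load the transformed rows into the target store.

def extract():
    # Pull raw rows from a hypothetical source system.
    return [
        {"name": " alice ", "amount": "120.50"},
        {"name": "bob", "amount": "80.00"},
    ]

def transform(rows):
    # Clean and standardize before the data reaches the warehouse.
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, warehouse):
    # Append transformed rows into the target store.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Alice', 'amount': 120.5}, {'name': 'Bob', 'amount': 80.0}]
```

Note that the transform step runs *before* the load, which is the defining property of ETL.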
ELT stands for Extract, Load, Transform. Unlike ETL, raw data is loaded directly into the target system (e.g., a cloud data warehouse) and transformations happen inside the warehouse using its compute power. ELT leverages the scalability of modern warehouses like BigQuery, Snowflake, and Redshift to handle transformations at scale.
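The order reversal can be shown with SQLite standing in for a cloud warehouse (the table names are illustrative): raw data lands first, then SQL running inside the "warehouse" does the transformation.

```python
import sqlite3

# ELT sketch: load raw data as-is, then transform it inside the target
# system with SQL. sqlite3 stands in for BigQuery/Snowflake/Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(" alice ", "120.50"), ("bob", "80.00")],
)

# The transform step uses the warehouse's own compute, after loading.
conn.execute("""
    CREATE TABLE orders AS
    SELECT TRIM(customer) AS customer, CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 200.5
```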
A data pipeline is an automated series of steps that moves data from one or more sources to a destination system. It typically includes stages for ingestion, transformation, validation, and loading. Pipelines can be batch-based, real-time, or a hybrid of both, and they are often orchestrated by tools like Apache Airflow or Prefect.
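The ingestion → transformation → validation → loading flow can be sketched as a chain of stage functions, where each stage's output feeds the next. All names here are illustrative, not from an orchestration framework.

```python
from functools import reduce

# Toy pipeline: stages are functions applied in sequence.

def ingest(_):
    # Pull raw values from a hypothetical source.
    return ["10", "20", "x", "30"]

def transform(raw):
    # Drop malformed records and convert types.
    return [int(v) for v in raw if v.isdigit()]

def validate(values):
    # Fail the run if a quality rule is violated.
    assert all(v >= 0 for v in values), "negative value found"
    return values

destination = []

def load(values):
    destination.extend(values)
    return destination

stages = [ingest, transform, validate, load]
result = reduce(lambda data, stage: stage(data), stages, None)
print(result)  # [10, 20, 30]
```

Orchestrators like Airflow or Prefect add scheduling, retries, and monitoring around exactly this kind of stage graph.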
A data warehouse is a centralized repository designed for analytical querying and reporting. It stores structured, historical data that has been cleaned and transformed from operational systems. Data warehouses use schema-on-write, meaning data is structured before being stored, and are optimized for read-heavy OLAP workloads.
A star schema is a data modeling technique where a central fact table (containing measurable metrics) is surrounded by dimension tables (containing descriptive attributes). It is called a "star" because the diagram resembles a star shape. Star schemas are denormalized for fast query performance and are the most common schema in data warehouses.
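A tiny star schema can be built and queried with SQLite (table and column names are illustrative): one fact table of metrics joined to a descriptive dimension table.

```python
import sqlite3

# Star-schema sketch: fact_sales holds measurable metrics; dim_product
# holds descriptive attributes. A typical query joins and aggregates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER,
                              quantity INTEGER, revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Tools'),
                                   (2, 'Gadget', 'Toys');
    INSERT INTO fact_sales  VALUES (1, 3, 30.0),
                                   (2, 1, 15.0),
                                   (1, 2, 20.0);
""")
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)  # [('Tools', 50.0), ('Toys', 15.0)]
```

In a snowflake schema, `category` would be normalized out of `dim_product` into its own table, adding one more join to the query above.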
A snowflake schema is a variation of the star schema where dimension tables are normalized into multiple related tables. For example, a product dimension might be split into product, category, and brand tables. This reduces data redundancy but requires more joins, which can impact query performance.
Apache Spark is a distributed computing framework for large-scale data processing. It provides APIs in Python (PySpark), Scala, Java, and R, and supports batch processing, stream processing, machine learning (MLlib), and graph processing. Spark uses in-memory computation, making it significantly faster than Hadoop MapReduce for many workloads.
Apache Kafka is a distributed event streaming platform used for building real-time data pipelines. It operates on a publish-subscribe model where producers write messages to topics and consumers read from them. Kafka provides high throughput, fault tolerance through replication, and message durability, making it ideal for streaming architectures.
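The publish-subscribe model can be illustrated with a toy in-memory broker. This mimics the topic/producer/consumer-group concepts only; it is not the Kafka API (real clients such as kafka-python or confluent-kafka talk to an actual broker).

```python
from collections import defaultdict

# Toy pub-sub broker: each topic is an append-only log, and each
# consumer group tracks its own read offset in that log.
class Broker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> list of messages
        self.offsets = defaultdict(int)   # (group, topic) -> next offset

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic):
        offset = self.offsets[(group, topic)]
        if offset >= len(self.topics[topic]):
            return None  # nothing new for this group
        self.offsets[(group, topic)] += 1
        return self.topics[topic][offset]

broker = Broker()
broker.produce("clicks", {"user": "alice", "page": "/home"})
broker.produce("clicks", {"user": "bob", "page": "/cart"})
first = broker.consume("analytics", "clicks")
print(first)  # {'user': 'alice', 'page': '/home'}
```

Per-group offsets are the key idea: two independent consumer groups can each read the full log at their own pace.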
A data lake is a centralized storage repository that holds vast amounts of raw data in its native format—structured, semi-structured, or unstructured. Unlike a data warehouse, a data lake uses schema-on-read, meaning structure is applied only when the data is queried. Common implementations use HDFS, Amazon S3, or Azure Data Lake Storage.
Batch processing handles data in large, discrete chunks at scheduled intervals (e.g., hourly or daily), while stream processing handles data continuously in near real-time as it arrives. Batch is suited for historical analytics and reporting; streaming is suited for real-time dashboards, fraud detection, and event-driven systems. Tools like Spark support both paradigms.
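The contrast can be shown with the same aggregation computed both ways (the event values are illustrative): once over a complete chunk, and once incrementally as each event arrives.

```python
events = [3, 7, 2, 8]

# Batch: process the whole chunk at once (e.g., an hourly job).
batch_total = sum(events)

# Streaming: maintain a running aggregate per incoming event.
def running_totals(stream):
    total = 0
    for value in stream:
        total += value
        yield total  # result is available immediately, per event

stream_totals = list(running_totals(events))
print(batch_total, stream_totals)  # 20 [3, 10, 12, 20]
```

Both reach the same final answer; the difference is latency — streaming exposes partial results as data arrives, batch only after the chunk completes.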
Data quality refers to the accuracy, completeness, consistency, timeliness, and validity of data. Poor data quality leads to incorrect analytics, bad business decisions, and compliance risks. Key practices include data validation, profiling, anomaly detection, and implementing quality checks at each stage of the pipeline using tools like Great Expectations or dbt tests.
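Quality checks of this kind can be hand-rolled as simple predicates over rows, in the spirit of Great Expectations or dbt tests. The check names and sample rows here are hypothetical.

```python
# Minimal data-quality sketch: each check returns the ids of failing rows.

rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]

def check_not_null(rows, column):
    # Completeness: the column must never be missing.
    return [r["id"] for r in rows if r[column] is None]

def check_in_range(rows, column, lo, hi):
    # Validity: values must fall inside a plausible range.
    return [r["id"] for r in rows
            if r[column] is not None and not lo <= r[column] <= hi]

failures = {
    "email_not_null": check_not_null(rows, "email"),
    "age_in_range": check_in_range(rows, "age", 0, 120),
}
print(failures)  # {'email_not_null': [2], 'age_in_range': [3]}
```

In a real pipeline, a non-empty failure set would typically fail the run or quarantine the offending rows before loading.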
Apache Airflow is an open-source workflow orchestration platform for authoring, scheduling, and monitoring data pipelines. Pipelines are defined as DAGs (Directed Acyclic Graphs) in Python code. Airflow provides a web UI for monitoring, supports retries and alerting, and integrates with cloud services, databases, and processing frameworks.
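The core idea — tasks plus upstream dependencies resolved into a valid run order — can be illustrated with the standard library's `graphlib`. This is a toy model of DAG scheduling, not the Airflow API (which defines DAGs with `DAG` and operator classes and runs them via a scheduler); the task names are illustrative.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks that must finish first
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "notify": {"load"},
}

# A topological sort yields an execution order that respects every
# dependency edge; a cycle would raise CycleError (hence "acyclic").
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)  # ['extract', 'transform', 'validate', 'load', 'notify']
```

An orchestrator layers scheduling, retries, and alerting on top of exactly this ordering logic, and can run independent branches of the graph in parallel.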