An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
This project serves as a step-by-step guide to building a complete data engineering pipeline, covering everything from data ingestion to processing and storage. Here’s a detailed breakdown:
- Data Ingestion with Apache Airflow:
  - Task Orchestration: Apache Airflow is used to manage and schedule tasks in the pipeline. It fetches random user data from the randomuser.me API and stores it in a PostgreSQL database (a minimal DAG sketch follows this list).
  - Ease of Use: Airflow makes it easy to set up and monitor complex workflows, ensuring data is ingested reliably.
- Real-Time Data Streaming with Apache Kafka:
  - High-Throughput Streaming: Apache Kafka streams the data from PostgreSQL to the data processing engine. It handles large volumes of data efficiently and ensures that the data flows smoothly through the pipeline (see the producer sketch after this list).
  - Coordination with Zookeeper: Apache Zookeeper manages the Kafka brokers, providing distributed synchronization and ensuring the system is fault-tolerant.
- Data Processing with Apache Spark:
  - Scalable Processing: Apache Spark processes the streamed data, performing transformations and aggregations as needed. Spark's distributed computing capabilities allow it to handle large datasets and complex computations efficiently (see the streaming-job sketch after this list).
  - Master and Worker Nodes: Spark's architecture, with master and worker nodes, ensures that the processing tasks are distributed and managed effectively.
- Data Storage with Cassandra:
  - High Availability Storage: The processed data is stored in a Cassandra database. Cassandra is chosen for its high availability and scalability, making it suitable for storing large volumes of processed data (see the Cassandra sketch after this list).
  - Fast Data Retrieval: Cassandra provides fast read and write capabilities, ensuring that the stored data can be accessed quickly for analysis and reporting.
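The DAG sketch mentioned above: a minimal example of how an Airflow task could fetch from the randomuser.me API. The DAG id, schedule, and task id are illustrative assumptions, not the names used in this repository's `dags/kafka_stream.py`.

```python
# Illustrative only: DAG id, schedule, and task id are assumptions,
# not taken from this repository's dags/kafka_stream.py.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_random_user():
    """Fetch one random user record from the randomuser.me API."""
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    return response.json()["results"][0]


with DAG(
    dag_id="user_ingest",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_random_user",
        python_callable=fetch_random_user,
    )
```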
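For the Kafka streaming stage, a minimal producer sketch using the kafka-python client; the topic name, broker address, and sample record are assumptions.

```python
# Minimal producer sketch (kafka-python). Topic name, broker address,
# and the sample record are assumptions for illustration.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

user = {"first_name": "Jane", "last_name": "Doe", "email": "jane@example.com"}
producer.send("users_created", value=user)  # publish one JSON-encoded message
producer.flush()                            # block until the broker acknowledges
```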
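For the Spark processing stage, a sketch of a Structured Streaming job that reads the Kafka topic and parses the JSON payload into columns; the broker address, topic name, and schema are assumptions.

```python
# Structured Streaming sketch: read from Kafka and parse the JSON payload.
# Broker address, topic name, and schema are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("UserStreamProcessor").getOrCreate()

schema = StructType([
    StructField("email", StringType()),
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
])

users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "users_created")
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)
```

The resulting `users` DataFrame can then be written out to Cassandra, for example through the spark-cassandra-connector shown later in the run instructions.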
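For the Cassandra storage stage, a sketch using the DataStax cassandra-driver to create a keyspace and table, write a row, and read it back; the keyspace, table, and column names are assumptions and may differ from the schema created by `spark_stream.py`.

```python
# Sketch with the DataStax cassandra-driver. Keyspace, table, and column
# names are assumptions and may not match the schema used by spark_stream.py.
from cassandra.cluster import Cluster

cluster = Cluster(["localhost"], port=9042)
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS spark_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS spark_streams.created_users (
        email text PRIMARY KEY,
        first_name text,
        last_name text
    )
""")

# Fast write followed by a fast read, as described above.
session.execute(
    "INSERT INTO spark_streams.created_users (email, first_name, last_name) "
    "VALUES (%s, %s, %s)",
    ("jane@example.com", "Jane", "Doe"),
)
for row in session.execute("SELECT * FROM spark_streams.created_users"):
    print(row.email, row.first_name, row.last_name)
```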
- Clone the Repository:
git clone https://github.com/kaustav202/E2E-Data-Engineering-Pipeline.git
- Navigate to the Project Directory:
cd E2E-Data-Engineering-Pipeline
- Install Docker: Ensure you have Docker installed on your machine. You can download and install it from the official Docker website.
- Build and Run Containers: Use Docker Compose to build and run the containers.
docker-compose up --build
- Access Components:
  - Airflow Web Interface: http://localhost:8080
  - Kafka Control Center: http://localhost:9021
After completing the setup steps, the pipeline is ready to start streaming.
- Open the Airflow webserver UI at http://localhost:8080/login/ (user: airflow, password: airflow) and run the DAG.
- Alternatively, use the command line to run the application.
- Run the Producer:
cd app
python ./dags/kafka_stream.py
- Use spark-submit to run the Spark job:
docker exec -it spark-master bash
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 spark_stream.py
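The --packages options above pull in the Kafka source and the Cassandra connector. As a rough illustration of the connector's write path (the connection host, keyspace, and table name are assumptions, not taken from spark_stream.py):

```python
# Rough sketch of a batch write through the spark-cassandra-connector.
# Connection host, keyspace, and table name are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("CassandraWriteSketch")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("jane@example.com", "Jane", "Doe")],
    ["email", "first_name", "last_name"],
)

(
    df.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="spark_streams", table="created_users")
    .mode("append")
    .save()
)
```

In the actual job the source would be the streaming DataFrame read from Kafka rather than a hand-built one; the batch form here just keeps the example self-contained.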
Use cqlsh to execute CQL queries and batch scripts on the data loaded into Cassandra.
- Enter the Cassandra container:
docker exec -it cassandra bash
- Then launch the Cassandra Query Language shell:
cqlsh
- Start writing queries, for example to show the ingested data:
SELECT * FROM created_users;