
Photo by Author | Ideogram
Data is an asset that drives our work as data professionals. Without proper data, we cannot perform our work, and our business will fail to gain a competitive advantage. Thus, it is very important for any data professional to secure the appropriate data, and data pipelines are the systems designed for this purpose.
Data pipelines are systems designed to move and transform data from one source to another. They are part of the overall infrastructure of any business that relies on data, as they guarantee that our data is reliable and always ready to use.
Building a data pipeline may seem complicated, but a few simple tools are enough to create a reliable data pipeline with just a few lines of code. In this article, we will explore how to build a straightforward data pipeline using Python and Docker that you can apply to your everyday data work.
Let’s get into it.
Building the Data Pipeline
Before we create our data pipeline, let’s understand the concept of ETL, which stands for extract, transform, and load. ETL is a process where the data pipeline performs the following actions:
- Extract data from different sources.
- Transform the data into a valid format.
- Load the data into an accessible storage location.
ETL is the standard pattern for data pipelines, so what we build will follow this structure, as sketched below.
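Conceptually, every ETL pipeline boils down to three functions chained together. Here is a minimal, generic sketch (the function names and toy records are purely illustrative and not part of the project we build below):
def extract():
    # Pull raw records from a source (file, API, database, ...)
    return [{"name": " Alice ", "score": "10"}, {"name": "Bob", "score": None}]

def transform(records):
    # Clean and reshape the raw records into the desired format
    return [
        {"name": r["name"].strip(), "score": int(r["score"])}
        for r in records
        if r["score"] is not None
    ]

def load(records):
    # Write the cleaned records to their destination (here we simply print them)
    print(records)

load(transform(extract()))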
With Python and Docker, we can build a data pipeline around the ETL process with a simple setup. Python is a valuable tool for orchestrating any data flow activity, while Docker is useful for managing the application environment using containers.
Let’s set up our data pipeline with Python and Docker.
Step 1: Preparation
First of all, we need to have Python and Docker installed on our system (we will not cover the installation here).
For our example, we will use the Heart Attack dataset from Kaggle as the data source for our ETL process.
With everything in place, we will set up the project structure. Overall, the simple data pipeline will have the following skeleton:
simple-data-pipeline/
├── app/
│ └── pipeline.py
├── data/
│ └── Medicaldataset.csv
├── Dockerfile
├── requirements.txt
└── docker-compose.yml
We created a main project folder called simple-data-pipeline, which contains the following (a quick way to scaffold this layout is sketched right after the list):
- An app folder containing the pipeline.py file.
- A data folder containing the source data (Medicaldataset.csv).
- A requirements.txt file for the environment dependencies.
- A Dockerfile for the Docker configuration.
- A docker-compose.yml file to define and run our multi-container Docker application.
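If you prefer to create these folders and files from the terminal, an optional scaffolding sketch (the commands below are just one way to do it) looks like this:
mkdir -p simple-data-pipeline/app simple-data-pipeline/data
cd simple-data-pipeline
touch app/pipeline.py requirements.txt Dockerfile docker-compose.yml
# download Medicaldataset.csv from Kaggle and place it in the data/ folder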
We will first fill out the requirements.txt file, which contains the libraries required for our project.
In this case, we will use only the following library:
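The file itself is short; since pipeline.py only imports pandas (the os module ships with the Python standard library), requirements.txt needs just a single entry:
pandas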
In the next section, we will build the data pipeline using our sample data.
Step 2: Set Up the Pipeline
We will set up the pipeline.py file for the ETL process. In our case, we will use the following code:
import pandas as pd
import os
input_path = os.path.join("/data", "Medicaldataset.csv")
output_path = os.path.join("/data", "CleanedMedicalData.csv")
def extract_data(path):
    df = pd.read_csv(path)
    print("Data Extraction completed.")
    return df

def transform_data(df):
    df_cleaned = df.dropna()
    df_cleaned.columns = [col.strip().lower().replace(" ", "_") for col in df_cleaned.columns]
    print("Data Transformation completed.")
    return df_cleaned

def load_data(df, output_path):
    df.to_csv(output_path, index=False)
    print("Data Loading completed.")

def run_pipeline():
    df_raw = extract_data(input_path)
    df_cleaned = transform_data(df_raw)
    load_data(df_cleaned, output_path)
    print("Data pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()
The pipeline follows the ETL process: we load the CSV file, transform the data by dropping missing values and normalizing the column names, and load the cleaned data into a new CSV file. We wrap these steps in a single run_pipeline function that executes the whole process.
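If you want to sanity-check the transformation logic before containerizing anything, a small throwaway snippet like the one below (purely illustrative, not part of the project files) can be run from inside the app folder:
import pandas as pd
from pipeline import transform_data

# Toy DataFrame with messy column names and one missing value
df_sample = pd.DataFrame({"Heart Rate": [80, None], " Age": [63, 45]})
df_clean = transform_data(df_sample)
print(df_clean.columns.tolist())  # ['heart_rate', 'age']
print(len(df_clean))              # 1 row left after dropping missing values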
Step 3: Set Up Docker
With the pipeline file ready, we will fill in the Dockerfile to set up the Docker container, using the following code:
FROM python:3.10-slim
WORKDIR /app
COPY ./app /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
CMD ("python", "pipeline.py")In the aforementioned code, we make it clear that the container will use version 3.10 as its environment. Next, we configured the Container’s Working Directory /app And copy everything from our local app Folder in the container app Directory. We also copy requirements.txt File and follow the PIP installation inside the container. Finally, when we start the container, we order the script to run.
With the Dockerfile ready, we will prepare the docker-compose.yml file to manage the overall execution:
version: '3.9'
services:
  data-pipeline:
    build: .
    container_name: simple_pipeline_container
    volumes:
      - ./data:/data
When executed, the YAML file above will build a Docker image from the current directory using the available Dockerfile. We also mount the local data folder to the data folder inside the container, making the dataset accessible to our script.
Running the Pipeline
With all the files ready, we will run the data pipeline with Docker. Go to the project root folder and run the following command in your command prompt to build the Docker image and execute the pipeline:
docker compose up --build
If you run it successfully, you will see an informative log like the following:
✔ data-pipeline Built 0.0s
✔ Network simple_docker_pipeline_default Created 0.4s
✔ Container simple_pipeline_container Created 0.4s
Attaching to simple_pipeline_container
simple_pipeline_container | Data Extraction completed.
simple_pipeline_container | Data Transformation completed.
simple_pipeline_container | Data Loading completed.
simple_pipeline_container | Data pipeline completed successfully.
simple_pipeline_container exited with code 0
If everything executes successfully, you will find a new CleanedMedicalData.csv file in your data folder.
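To quickly inspect the result, an optional snippet like this (run from the project root) confirms that the cleaned file loads and that the column names were normalized:
import pandas as pd

df = pd.read_csv("data/CleanedMedicalData.csv")
print(df.shape)             # rows and columns after cleaning
print(df.columns.tolist())  # lowercase, underscore-separated column names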
Congratulations! You have just built a simple data pipeline with Python and Docker. Try using different data sources and ETL processes to see if you can handle a more complex pipeline.
Conclusion
Understanding data pipelines is crucial for every data professional, as they are necessary for obtaining the right data for their work. In this article, we explored how to build a simple data pipeline using Python and Docker and learned how to run it.
I hope this has helped!
Cornelius Yudha Wijaya is a data science assistant manager and data writer. While working full-time in Indonesia, he loves to share Python and data tips via social media and written media. Cornelius writes on a variety of AI and machine learning topics.