
Photo by Author | Ideogram
Data is an asset that drives our work as data professionals. Without proper data, we cannot perform our work, and our business will fail to gain a competitive advantage. Thus, it is very important for any data professional to secure the appropriate data, and data pipelines are the systems designed for this purpose.
Data pipelines are systems designed to move and transform data from one source to another. They are part of the overall infrastructure of any business that relies on data, as they guarantee that our data is reliable and always ready to use.
Building a data pipeline may seem complicated, but a few simple tools are enough to create a reliable data pipeline with just a few lines of code. In this article, we will explore how to build a straightforward data pipeline using Python and Docker that you can apply to your everyday data work.
Let’s get into it.
Building the Data Pipeline
Before we create our data pipeline, let’s understand the concept of ETL, which stands for extract, transform, and load. ETL is a process where the data pipeline performs the following actions:
- Extract data from different sources.
- Transform the data into a valid format.
- Load the data into an accessible storage location.
ETL is the standard pattern for data pipelines, so what we build will follow this structure, as sketched below.
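Conceptually, every ETL pipeline boils down to three functions chained together. Here is a minimal, generic sketch (the function names and toy records are purely illustrative and not part of the project we build below):
def extract():
    # Pull raw records from a source (file, API, database, ...)
    return [{"name": " Alice ", "score": "10"}, {"name": "Bob", "score": None}]

def transform(records):
    # Clean and reshape the raw records into the desired format
    return [
        {"name": r["name"].strip(), "score": int(r["score"])}
        for r in records
        if r["score"] is not None
    ]

def load(records):
    # Write the cleaned records to their destination (here we simply print them)
    print(records)

load(transform(extract()))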
With Python and Docker, we can build a data pipeline around the ETL process with a simple setup. Python is a valuable tool for orchestrating any data flow activity, while Docker is useful for managing the application environment using containers.
Let’s set up our data pipeline with Python and Docker.
Step 1: Preparation
First of all, we need to have Python and Docker installed on our system (we will not cover the installation here).
For our example, we will use the Heart Attack dataset from Kaggle as the data source for our ETL process.
With everything in place, we will set up the project structure. Overall, the simple data pipeline will have the following skeleton:
simple-data-pipeline/
├── app/
│ └── pipeline.py
├── data/
│ └── Medicaldataset.csv
├── Dockerfile
├── requirements.txt
└── docker-compose.yml
We created a main project folder called simple-data-pipeline, which contains the following (a quick way to scaffold this layout is sketched right after the list):
- An app folder containing the pipeline.py file.
- A data folder containing the source data (Medicaldataset.csv).
- A requirements.txt file for the environment dependencies.
- A Dockerfile for the Docker configuration.
- A docker-compose.yml file to define and run our multi-container Docker application.
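If you prefer to create these folders and files from the terminal, an optional scaffolding sketch (the commands below are just one way to do it) looks like this:
mkdir -p simple-data-pipeline/app simple-data-pipeline/data
cd simple-data-pipeline
touch app/pipeline.py requirements.txt Dockerfile docker-compose.yml
# download Medicaldataset.csv from Kaggle and place it in the data/ folder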
We will first fill out the requirements.txt file, which contains the libraries required for our project.
In this case, we will use only the following library:
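The file itself is short; since pipeline.py only imports pandas (the os module ships with the Python standard library), requirements.txt needs just a single entry:
pandas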
In the next section, we will build the data pipeline using our sample data.
Step 2: Set Up the Pipeline
We will set up the pipeline.py file for the ETL process. In our case, we will use the following code:
import pandas as pd
import os
input_path = os.path.join("/data", "Medicaldataset.csv")
output_path = os.path.join("/data", "CleanedMedicalData.csv")
def extract_data(path):
    df = pd.read_csv(path)
    print("Data Extraction completed.")
    return df

def transform_data(df):
    df_cleaned = df.dropna()
    df_cleaned.columns = [col.strip().lower().replace(" ", "_") for col in df_cleaned.columns]
    print("Data Transformation completed.")
    return df_cleaned

def load_data(df, output_path):
    df.to_csv(output_path, index=False)
    print("Data Loading completed.")

def run_pipeline():
    df_raw = extract_data(input_path)
    df_cleaned = transform_data(df_raw)
    load_data(df_cleaned, output_path)
    print("Data pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()
The pipeline follows the ETL process: we load the CSV file, transform the data by dropping missing values and normalizing the column names, and load the cleaned data into a new CSV file. We wrap these steps in a single run_pipeline function that executes the whole process.
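If you want to sanity-check the transformation logic before containerizing anything, a small throwaway snippet like the one below (purely illustrative, not part of the project files) can be run from inside the app folder:
import pandas as pd
from pipeline import transform_data

# Toy DataFrame with messy column names and one missing value
df_sample = pd.DataFrame({"Heart Rate": [80, None], " Age": [63, 45]})
df_clean = transform_data(df_sample)
print(df_clean.columns.tolist())  # ['heart_rate', 'age']
print(len(df_clean))              # 1 row left after dropping missing values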
Step 3: Set Up Docker
With the pipeline file ready, we will fill in the Dockerfile to set up the Docker container, using the following code:
FROM python:3.10-slim
WORKDIR /app
COPY ./app /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
CMD ("python", "pipeline.py")In the aforementioned code, we make it clear that the container will use version 3.10 as its environment. Next, we configured the Container’s Working Directory /app And copy everything from our local app Folder in the container app Directory. We also copy requirements.txt File and follow the PIP installation inside the container. Finally, when we start the container, we order the script to run.
With the Dockerfile ready, we will prepare the docker-compose.yml file to manage the overall execution:
version: '3.9'
services:
  data-pipeline:
    build: .
    container_name: simple_pipeline_container
    volumes:
      - ./data:/data
When executed, the YAML file above will build a Docker image from the current directory using the available Dockerfile. We also mount the local data folder to the data folder inside the container, making the dataset accessible to our script.
Running the Pipeline
With all the files ready, we will run the data pipeline with Docker. Go to the project root folder and run the following command in your command prompt to build the Docker image and execute the pipeline:
docker compose up --build
If you run it successfully, you will see an informative log like the following:
✔ data-pipeline Built 0.0s
✔ Network simple_docker_pipeline_default Created 0.4s
✔ Container simple_pipeline_container Created 0.4s
Attaching to simple_pipeline_container
simple_pipeline_container | Data Extraction completed.
simple_pipeline_container | Data Transformation completed.
simple_pipeline_container | Data Loading completed.
simple_pipeline_container | Data pipeline completed successfully.
simple_pipeline_container exited with code 0
If everything executes successfully, you will find a new CleanedMedicalData.csv file in your data folder.
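To quickly inspect the result, an optional snippet like this (run from the project root) confirms that the cleaned file loads and that the column names were normalized:
import pandas as pd

df = pd.read_csv("data/CleanedMedicalData.csv")
print(df.shape)             # rows and columns after cleaning
print(df.columns.tolist())  # lowercase, underscore-separated column names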
Congratulations! You have just built a simple data pipeline with Python and Docker. Try using different data sources and ETL processes to see if you can handle a more complex pipeline.
Conclusion
Understanding data pipelines is crucial for every data professional, as they are necessary for obtaining the right data for their work. In this article, we explored how to build a simple data pipeline using Python and Docker and learned how to run it.
I hope this has helped!
Cornelius Yudha Wijaya is a data science assistant manager and data writer. While working full-time in Indonesia, he loves to share Python and data tips via social media and written media. Cornelius writes on a variety of AI and machine learning topics.