
Photo by Author | Ideogram
You know that moment when you have data scattered across different formats and sources, and you need to make sense of all of it? That's what we're tackling today. Let's build an ETL pipeline that takes messy data and turns it into something useful.
In this article, I'll walk you through building a pipeline that processes e-commerce transactions. Nothing fancy, just practical code that gets the job done.
We'll take data from a CSV file (the kind you might download from an e-commerce platform), clean it up, and store it in a proper database for analysis.
🔗 Link to the code on GitHub
What Is an Extract, Transform, Load (ETL) Pipeline?
Every ETL pipeline follows the same pattern. You grab (extract) data from somewhere, clean it up and improve it (transform), then store it somewhere useful (load).

ETL Pipeline | Image by Author | diagrams.net (draw.io)
The process begins with the extract step, where data is retrieved from different source systems such as databases, APIs, files, or streaming platforms. During this phase, the pipeline identifies and pulls the relevant data while handling connections to systems with varying structures and formats.
Next, the transform phase is the core processing stage, where the extracted data goes through cleaning, validation, and restructuring. Here, data quality issues are resolved, business rules are applied, calculations are performed, and the data is converted into the desired format and structure. Common transformations include data type conversions, field mappings, aggregations, and the removal of duplicate or invalid records.
Finally, the load phase moves the transformed data into the target system. This can be a full load, where the entire dataset is replaced, or an incremental load, where only new or changed data is added. The loading strategy depends on factors such as data volume, system performance requirements, and business needs.
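To make the two loading strategies concrete, here is a minimal sketch (not part of this article's pipeline) using pandas and SQLite; the transactions table and transaction_date column are placeholder names for illustration:

import sqlite3
import pandas as pd

def full_load(df, conn, table_name="transactions"):
    # Full load: drop and rewrite the whole table on every run
    df.to_sql(table_name, conn, if_exists="replace", index=False)

def incremental_load(df, conn, table_name="transactions"):
    # Incremental load: append only rows newer than what is already stored
    try:
        last_loaded = pd.read_sql(
            f"SELECT MAX(transaction_date) AS last_date FROM {table_name}", conn
        )["last_date"].iloc[0]
    except Exception:
        last_loaded = None  # table does not exist yet, so load everything
    if pd.notna(last_loaded):
        df = df[pd.to_datetime(df["transaction_date"]) > pd.to_datetime(last_loaded)]
    df.to_sql(table_name, conn, if_exists="append", index=False)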
Step 1: Extract
The extract step is where we get our hands on the data. In the real world, you'd be downloading this CSV from your e-commerce platform's reporting dashboard, pulling it from an FTP server, or fetching it via an API. Here, we read from a local CSV file.
import pandas as pd

def extract_data_from_csv(csv_file_path):
    try:
        print(f"Extracting data from {csv_file_path}...")
        df = pd.read_csv(csv_file_path)
        print(f"Successfully extracted {len(df)} records")
        return df
    except FileNotFoundError:
        # Fall back to generated sample data when the file is missing
        print(f"Error: {csv_file_path} not found. Creating sample data...")
        csv_file = create_sample_csv_data()
        return pd.read_csv(csv_file)
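The helper create_sample_csv_data() isn't shown in this article; here is one possible sketch, assuming it writes a small raw_transactions.csv with the columns used later (including a few missing emails so the cleaning step has something to do) and returns the file path:

import pandas as pd

def create_sample_csv_data(csv_file="raw_transactions.csv"):
    # Hypothetical sample generator: columns match what transform_data expects
    sample = pd.DataFrame({
        "transaction_id": range(1, 6),
        "customer_email": ["a@example.com", None, "c@example.com", "d@example.com", None],
        "product": ["mug", "lamp", "desk", "pen", "chair"],
        "price": [12.50, 40.00, 150.00, 3.25, 85.00],
        "quantity": [2, 1, 1, 10, 2],
        "transaction_date": ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08", "2024-01-09"],
    })
    sample.to_csv(csv_file, index=False)
    return csv_file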
Now that we have raw data from its source (raw_transactions.csv), we need to turn it into something usable.
Step 2: Transform
This is where we make the data actually useful.
def transform_data(df):
    print("Transforming data...")
    df_clean = df.copy()

    # Remove records with missing emails
    initial_count = len(df_clean)
    df_clean = df_clean.dropna(subset=['customer_email'])
    removed_count = initial_count - len(df_clean)
    print(f"Removed {removed_count} records with missing emails")

    # Calculate derived fields
    df_clean['total_amount'] = df_clean['price'] * df_clean['quantity']

    # Extract date components
    df_clean['transaction_date'] = pd.to_datetime(df_clean['transaction_date'])
    df_clean['year'] = df_clean['transaction_date'].dt.year
    df_clean['month'] = df_clean['transaction_date'].dt.month
    df_clean['day_of_week'] = df_clean['transaction_date'].dt.day_name()

    # Create customer segments
    df_clean['customer_segment'] = pd.cut(df_clean['total_amount'],
                                          bins=[0, 50, 200, float('inf')],
                                          labels=['Low', 'Medium', 'High'])

    return df_clean
First, we drop rows with missing emails, because incomplete customer data isn't helpful for most analyses.
Then we calculate total_amount by multiplying price and quantity. This may seem obvious, but you'd be surprised how often such fields are missing from raw data.
The date extraction is straightforward. Instead of just a timestamp, we now have separate year, month, and day-of-week columns. This makes it easy to analyze patterns like "Do we sell more on weekends?"
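For instance, answering the weekend question becomes a few lines once those columns exist. A small sketch that assumes a DataFrame returned by transform_data:

import pandas as pd

def weekend_revenue_share(df_clean: pd.DataFrame) -> float:
    # Assumes the day_of_week and total_amount columns created by transform_data
    weekend = df_clean[df_clean["day_of_week"].isin(["Saturday", "Sunday"])]
    return weekend["total_amount"].sum() / df_clean["total_amount"].sum()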
Customer segmentation with pd.cut() can be especially useful. It automatically buckets customers into categories based on their spending. Now, instead of just raw transaction amounts, we have meaningful business segments.
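If you haven't used pd.cut() before, here is a tiny standalone example of the same bucketing on made-up amounts:

import pandas as pd

amounts = pd.Series([25.0, 120.0, 480.0])
segments = pd.cut(amounts, bins=[0, 50, 200, float("inf")], labels=["Low", "Medium", "High"])
print(segments.tolist())  # ['Low', 'Medium', 'High']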
Step 3: Load
In a real project, you might be loading into a data warehouse, sending to an API, or pushing to cloud storage.
Here, we load our clean data into an SQLite database.
import sqlite3

def load_data_to_sqlite(df, db_name="ecommerce_data.db", table_name="transactions"):
    print(f"Loading data to SQLite database '{db_name}'...")
    conn = sqlite3.connect(db_name)
    try:
        df.to_sql(table_name, conn, if_exists="replace", index=False)

        # Verify the load by counting the stored records
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        record_count = cursor.fetchone()[0]
        print(f"Successfully loaded {record_count} records to '{table_name}' table")

        return f"Data successfully loaded to {db_name}"
    finally:
        conn.close()
Now analysts can run SQL queries, connect BI tools, and actually use this data for decision-making.
SQLite works well here because it's lightweight, requires no setup, and creates a single file you can easily share or back up. The if_exists="replace" parameter means you can run this pipeline multiple times without worrying about duplicate data.
We've included a verification step so you know the load was successful. There's nothing worse than thinking your data is safely stored, only to find an empty table later.
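As a quick sanity check after the load, you can also query the table straight from Python. A short sketch, assuming the database and column names used above:

import sqlite3
import pandas as pd

conn = sqlite3.connect("ecommerce_data.db")
# Revenue and order counts per segment, using columns created in transform_data
summary = pd.read_sql(
    "SELECT customer_segment, COUNT(*) AS orders, SUM(total_amount) AS revenue "
    "FROM transactions GROUP BY customer_segment",
    conn,
)
print(summary)
conn.close()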
Running the ETL Pipeline
This function orchestrates the whole extract, transform, load workflow.
def run_etl_pipeline():
    print("Starting ETL Pipeline...")

    # Extract
    raw_data = extract_data_from_csv('raw_transactions.csv')

    # Transform
    transformed_data = transform_data(raw_data)

    # Load
    load_result = load_data_to_sqlite(transformed_data)

    print("ETL Pipeline completed successfully!")
    return transformed_data
Notice how it ties everything together: extract, transform, load. You can run it and see your processed data right away.
You can get the full code on GitHub.
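If you keep all of these functions in a single script, a minimal entry point (assuming they live in the same file) could look like this:

if __name__ == "__main__":
    processed = run_etl_pipeline()
    print(processed.head())  # quick look at the transformed records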
Wrapping Up
This pipeline takes raw transaction data and turns it into something an analyst or data scientist can actually work with. You have clean records, calculated fields, and meaningful segments.
Each function does one job well, and you can easily modify or extend any part without breaking the rest.
Now try running it yourself. Then try adapting it for another use case. Happy coding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. She also creates engaging resource overviews and coding tutorials.