

Photo by Author | Ideogram
1. Introduction
When you’re new to data analysis with Python, pandas is usually what most analysts learn and use first. But Polars has become very popular, and it’s faster and more efficient.
Built in Rust, Polars handles data processing workloads that would slow down other tools. It’s designed for speed, memory efficiency, and ease of use. In this beginner-friendly article, we’ll generate data for an imaginary coffee shop and analyze it to learn Polars. Sounds interesting? Let’s start!
🔗 Link to the code on GitHub
2. Setup
Before we dive into analyzing the data, let’s get the installation steps out of the way. First, install Polars:
!pip install polars numpy

Now, let’s import the libraries and modules:
import polars as pl
import numpy as np
from datetime import datetime, timedelta

We use pl as an alias for Polars.
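If you want to make sure the installation worked, here is a quick optional check (the versions printed will vary with your environment):

# Optional sanity check: print the installed library versions
print(pl.__version__)
print(np.__version__)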
3. Creating sample data
Imagine that you are managing a small coffee shop, “Bean There”, somewhere, and there are hundreds of receipts and related data to analyze. You want to understand which drinks sell best, which days bring in the most revenue, and similar questions. So yes, let’s start coding! ☕
To make this guide practical, let’s create a realistic dataset for the “Bean There” coffee shop. We will generate data that any small business owner would recognize:
# Set up for consistent results
np.random.seed(42)

# Create realistic coffee shop data
def generate_coffee_data():
    n_records = 2000

    # Coffee menu items with realistic prices
    menu_items = ['Espresso', 'Cappuccino', 'Latte', 'Americano', 'Mocha', 'Cold Brew']
    prices = [2.50, 4.00, 4.50, 3.00, 5.00, 3.50]
    price_map = dict(zip(menu_items, prices))

    # Generate dates over 6 months
    start_date = datetime(2023, 6, 1)
    dates = [start_date + timedelta(days=np.random.randint(0, 180))
             for _ in range(n_records)]

    # Randomly select drinks, then map the correct price for each selected drink
    drinks = np.random.choice(menu_items, n_records)
    prices_chosen = [price_map[d] for d in drinks]

    data = {
        'date': dates,
        'drink': drinks,
        'price': prices_chosen,
        'quantity': np.random.choice([1, 1, 1, 2, 2, 3], n_records),
        'customer_type': np.random.choice(['Regular', 'New', 'Tourist'],
                                          n_records, p=[0.5, 0.3, 0.2]),
        'payment_method': np.random.choice(['Card', 'Cash', 'Mobile'],
                                           n_records, p=[0.6, 0.2, 0.2]),
        'rating': np.random.choice([2, 3, 4, 5], n_records, p=[0.1, 0.4, 0.4, 0.1])
    }
    return data

# Create our coffee shop DataFrame
coffee_data = generate_coffee_data()
df = pl.DataFrame(coffee_data)

This produces a sample dataset with 2,000 coffee transactions. Each row represents a sale: what was ordered, when, how much it cost, and who bought it.
4. Looking at your data
Before analyzing any data, you need to understand what you are working with. Think of it like reading a new recipe before you start cooking:
# Take a peek at your data
print("First 5 transactions:")
print(df.head())
print("\nWhat types of data do we have?")
print(df.schema)
print("\nHow big is our dataset?")
print(f"We have {df.height} transactions and {df.width} columns") head() The method shows you the first few rows. The scheme tells you what kind of information is in each column (numbers, text, dates, etc.).
First 5 transactions:
shape: (5, 7)
┌─────────────────────┬────────────┬───────┬──────────┬───────────────┬────────────────┬────────┐
│ date ┆ drink ┆ price ┆ quantity ┆ customer_type ┆ payment_method ┆ rating │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 ┆ i64 ┆ str ┆ str ┆ i64 │
╞═════════════════════╪════════════╪═══════╪══════════╪═══════════════╪════════════════╪════════╡
│ 2023-09-11 00:00:00 ┆ Cold Brew ┆ 5.0 ┆ 1 ┆ New ┆ Cash ┆ 4 │
│ 2023-11-27 00:00:00 ┆ Cappuccino ┆ 4.5 ┆ 1 ┆ New ┆ Card ┆ 4 │
│ 2023-09-01 00:00:00 ┆ Espresso ┆ 4.5 ┆ 1 ┆ Regular ┆ Card ┆ 3 │
│ 2023-06-15 00:00:00 ┆ Cappuccino ┆ 5.0 ┆ 1 ┆ New ┆ Card ┆ 4 │
│ 2023-09-15 00:00:00 ┆ Mocha ┆ 5.0 ┆ 2 ┆ Regular ┆ Card ┆ 3 │
└─────────────────────┴────────────┴───────┴──────────┴───────────────┴────────────────┴────────┘
What types of data do we have?
Schema({'date': Datetime(time_unit="us", time_zone=None), 'drink': String, 'price': Float64, 'quantity': Int64, 'customer_type': String, 'payment_method': String, 'rating': Int64})
How big is our dataset?
We have 2000 transactions and 7 columns
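Beyond head() and schema, Polars also offers describe() for quick summary statistics; a minimal optional sketch:

# Optional: summary statistics (count, nulls, mean, min, max, ...) per column
print(df.describe())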
5. Adding new columns

Now the business insights start. Every coffee shop owner wants to know the total revenue from each transaction:
# Calculate total sales amount and add useful date information
df_enhanced = df.with_columns([
    # Calculate revenue per transaction
    (pl.col('price') * pl.col('quantity')).alias('total_sale'),
    # Extract useful date components
    pl.col('date').dt.weekday().alias('day_of_week'),
    pl.col('date').dt.month().alias('month'),
    pl.col('date').dt.hour().alias('hour_of_day')
])
print("Sample of enhanced data:")
print(df_enhanced.head())

Output (your exact numbers may vary):
Sample of enhanced data:
shape: (5, 11)
┌─────────────┬────────────┬───────┬──────────┬───┬────────────┬─────────────┬───────┬─────────────┐
│ date ┆ drink ┆ price ┆ quantity ┆ … ┆ total_sale ┆ day_of_week ┆ month ┆ hour_of_day │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs ┆ str ┆ f64 ┆ i64 ┆ ┆ f64 ┆ i8 ┆ i8 ┆ i8 │
│ ] ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
╞═════════════╪════════════╪═══════╪══════════╪═══╪════════════╪═════════════╪═══════╪═════════════╡
│ 2023-09-11 ┆ Cold Brew ┆ 5.0 ┆ 1 ┆ … ┆ 5.0 ┆ 1 ┆ 9 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-11-27 ┆ Cappuccino ┆ 4.5 ┆ 1 ┆ … ┆ 4.5 ┆ 1 ┆ 11 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-09-01 ┆ Espresso ┆ 4.5 ┆ 1 ┆ … ┆ 4.5 ┆ 5 ┆ 9 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-06-15 ┆ Cappuccino ┆ 5.0 ┆ 1 ┆ … ┆ 5.0 ┆ 4 ┆ 6 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-09-15 ┆ Mocha ┆ 5.0 ┆ 2 ┆ … ┆ 10.0 ┆ 5 ┆ 9 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└─────────────┴────────────┴───────┴──────────┴───┴────────────┴─────────────┴───────┴─────────────┘

What’s happening here:
- with_columns() adds new columns to our data
- pl.col() refers to existing columns
- alias() gives our new columns descriptive names
- dt extracts parts from dates (like getting just the month from a full date)
Think of it as adding calculated fields in a spreadsheet. We are not changing the original data, just adding more information to work with.
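As a quick check (a minimal sketch using the DataFrames built above), you can confirm the original df is untouched:

# with_columns() returned a new DataFrame; df still has its original 7 columns
print(df.width)           # 7
print(df_enhanced.width)  # 11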
6. Grouping data
Let’s answer some interesting questions now.
Question 1: Which drinks are our best sellers?
This code groups all transactions by drink type, then calculates totals and averages for each group. It is like sorting all your receipts into piles by drink type, then totaling up each pile.
drink_performance = (df_enhanced
    .group_by('drink')
    .agg([
        pl.col('total_sale').sum().alias('total_revenue'),
        pl.col('quantity').sum().alias('total_sold'),
        pl.col('rating').mean().alias('avg_rating')
    ])
    .sort('total_revenue', descending=True)
)

print("Drink performance ranking:")
print(drink_performance)

Output:
Drink performance ranking:
shape: (6, 4)
┌────────────┬───────────────┬────────────┬────────────┐
│ drink ┆ total_revenue ┆ total_sold ┆ avg_rating │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 ┆ f64 │
╞════════════╪═══════════════╪════════════╪════════════╡
│ Americano ┆ 2242.0 ┆ 595 ┆ 3.476454 │
│ Mocha ┆ 2204.0 ┆ 591 ┆ 3.492711 │
│ Espresso ┆ 2119.5 ┆ 570 ┆ 3.514793 │
│ Cold Brew ┆ 2035.5 ┆ 556 ┆ 3.475758 │
│ Cappuccino ┆ 1962.5 ┆ 521 ┆ 3.541139 │
│ Latte ┆ 1949.5 ┆ 514 ┆ 3.528846 │
└────────────┴───────────────┴────────────┴────────────┘
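Since the result is sorted by revenue, the best seller sits in the first row. As a small aside (a sketch of our own, foreshadowing the summary section below), row(0) returns that row as a plain Python tuple:

# The first two fields of the top row are the drink name and its total revenue
best_drink, best_revenue = drink_performance.row(0)[:2]
print(f"Top earner: {best_drink} (${best_revenue:.2f})")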
Question 2: What do daily sales look like?

Now let’s find the number of transactions and the corresponding revenue for each day of the week.
daily_patterns = (df_enhanced
    .group_by('day_of_week')
    .agg([
        pl.col('total_sale').sum().alias('daily_revenue'),
        pl.len().alias('number_of_transactions')
    ])
    .sort('day_of_week')
)

print("Daily business patterns:")
print(daily_patterns)

Output:
Daily business patterns:
shape: (7, 3)
┌─────────────┬───────────────┬────────────────────────┐
│ day_of_week ┆ daily_revenue ┆ number_of_transactions │
│ --- ┆ --- ┆ --- │
│ i8 ┆ f64 ┆ u32 │
╞═════════════╪═══════════════╪════════════════════════╡
│ 1 ┆ 2061.0 ┆ 324 │
│ 2 ┆ 1761.0 ┆ 276 │
│ 3 ┆ 1710.0 ┆ 278 │
│ 4 ┆ 1784.0 ┆ 288 │
│ 5 ┆ 1651.5 ┆ 265 │
│ 6 ┆ 1596.0 ┆ 259 │
│ 7 ┆ 1949.5 ┆ 310 │
└─────────────┴───────────────┴────────────────────────┘
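Note that dt.weekday() follows the ISO convention: 1 is Monday and 7 is Sunday, so Mondays and Sundays are the busiest days in this sample. If you want readable labels, here is a minimal sketch (the day_names mapping is our own addition, not part of the original code):

# Map ISO weekday numbers to short names; map_elements runs plain Python per value
day_names = {1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu', 5: 'Fri', 6: 'Sat', 7: 'Sun'}
daily_named = daily_patterns.with_columns(
    pl.col('day_of_week').map_elements(day_names.get, return_dtype=pl.String).alias('day_name')
)
print(daily_named)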
7. Filtering data

Let’s find our high-value transactions:
# Find transactions over $10 (multiple items or expensive drinks)
big_orders = (df_enhanced
    .filter(pl.col('total_sale') > 10.0)
    .sort('total_sale', descending=True)
)

print(f"We have {big_orders.height} orders over $10")
print("Top 5 biggest orders:")
print(big_orders.head())

Output:
We have 204 orders over $10
Top 5 biggest orders:
shape: (5, 11)
┌─────────────┬────────────┬───────┬──────────┬───┬────────────┬─────────────┬───────┬─────────────┐
│ date ┆ drink ┆ price ┆ quantity ┆ … ┆ total_sale ┆ day_of_week ┆ month ┆ hour_of_day │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs ┆ str ┆ f64 ┆ i64 ┆ ┆ f64 ┆ i8 ┆ i8 ┆ i8 │
│ ] ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
╞═════════════╪════════════╪═══════╪══════════╪═══╪════════════╪═════════════╪═══════╪═════════════╡
│ 2023-07-21 ┆ Cappuccino ┆ 5.0 ┆ 3 ┆ … ┆ 15.0 ┆ 5 ┆ 7 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-08-02 ┆ Latte ┆ 5.0 ┆ 3 ┆ … ┆ 15.0 ┆ 3 ┆ 8 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-07-21 ┆ Cappuccino ┆ 5.0 ┆ 3 ┆ … ┆ 15.0 ┆ 5 ┆ 7 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-10-08 ┆ Cappuccino ┆ 5.0 ┆ 3 ┆ … ┆ 15.0 ┆ 7 ┆ 10 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 2023-09-07 ┆ Latte ┆ 5.0 ┆ 3 ┆ … ┆ 15.0 ┆ 4 ┆ 9 ┆ 0 │
│ 00:00:00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└─────────────┴────────────┴───────┴──────────┴───┴────────────┴─────────────┴───────┴─────────────┘
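filter() is not limited to a single condition; expressions can be combined with & (and) or | (or), with each condition wrapped in parentheses. A minimal sketch extending the example above (our own addition):

# Big orders placed by regular customers only: combine two conditions with &
big_regular_orders = df_enhanced.filter(
    (pl.col('total_sale') > 10.0) & (pl.col('customer_type') == 'Regular')
)
print(f"{big_regular_orders.height} big orders came from regular customers")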
8. Analyzing customer behavior

Let’s look at customer patterns:
# Analyze customer behavior by type
customer_analysis = (df_enhanced
    .group_by('customer_type')
    .agg([
        pl.col('total_sale').mean().alias('avg_spending'),
        pl.col('total_sale').sum().alias('total_revenue'),
        pl.len().alias('visit_count'),
        pl.col('rating').mean().alias('avg_satisfaction')
    ])
    .with_columns([
        # Calculate revenue per visit
        (pl.col('total_revenue') / pl.col('visit_count')).alias('revenue_per_visit')
    ])
)

print("Customer behavior analysis:")
print(customer_analysis)

Output:
Customer behavior analysis:
shape: (3, 6)
┌───────────────┬──────────────┬───────────────┬─────────────┬──────────────────┬──────────────────┐
│ customer_type ┆ avg_spending ┆ total_revenue ┆ visit_count ┆ avg_satisfaction ┆ revenue_per_visi │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ t │
│ str ┆ f64 ┆ f64 ┆ u32 ┆ f64 ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ f64 │
╞═══════════════╪══════════════╪═══════════════╪═════════════╪══════════════════╪══════════════════╡
│ Regular ┆ 6.277832 ┆ 6428.5 ┆ 1024 ┆ 3.499023 ┆ 6.277832 │
│ Tourist ┆ 6.185185 ┆ 2505.0 ┆ 405 ┆ 3.518519 ┆ 6.185185 │
│ New ┆ 6.268827 ┆ 3579.5 ┆ 571 ┆ 3.502627 ┆ 6.268827 │
└───────────────┴──────────────┴───────────────┴─────────────┴──────────────────┴──────────────────┘
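As an aside, aggregations can also run inside with_columns(), where a column-wide sum() broadcasts back to every row. That makes it easy to add a revenue-share column; a minimal sketch (our own addition):

# Each customer type's share of total revenue, in percent
customer_share = customer_analysis.with_columns(
    (pl.col('total_revenue') / pl.col('total_revenue').sum() * 100)
    .round(1)
    .alias('revenue_share_pct')
)
print(customer_share.select(['customer_type', 'revenue_share_pct']))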
9. Putting it all together

Let’s create a comprehensive business summary:
# Create a complete business summary
business_summary = {
    'total_revenue': df_enhanced['total_sale'].sum(),
    'total_transactions': df_enhanced.height,
    'average_transaction': df_enhanced['total_sale'].mean(),
    'best_selling_drink': drink_performance.row(0)[0],  # First row, first column
    'customer_satisfaction': df_enhanced['rating'].mean()
}

print("\n=== BEAN THERE COFFEE SHOP - SUMMARY ===")
for key, value in business_summary.items():
    if isinstance(value, float) and key != 'customer_satisfaction':
        print(f"{key.replace('_', ' ').title()}: ${value:.2f}")
    else:
        print(f"{key.replace('_', ' ').title()}: {value}")

Output:
=== BEAN THERE COFFEE SHOP - SUMMARY ===
Total Revenue: $12513.00
Total Transactions: 2000
Average Transaction: $6.26
Best Selling Drink: Americano
Customer Satisfaction: 3.504

10. Conclusion
You have just completed a comprehensive introduction to data analysis with Polars! Through our coffee shop example, (I hope) you have learned how to turn raw transaction data into meaningful business insights.
Remember, mastering data analysis is like learning to cook: you start with basic recipes (like the examples in this guide) and gradually improve. The keys are practice and curiosity.
The next time you analyze a dataset, ask yourself:
- What story does this data tell?
- What patterns might be hiding here?
- What questions can this data answer?
Then use your new Polars skills to find out. Happy analyzing!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.