Ray or Dask? A practical guide for data scientists

Photo by Author | Ideogram

As data scientists, we handle large datasets or complex models that take a long time to run. To save time and get results faster, we use tools that run tasks simultaneously or across multiple machines. Two popular Python libraries for this are Ray and Dask. Both help speed up data processing and model training, but they are suited to different kinds of tasks.

In this article, we will explain what Ray and Dask are and when to choose each one.

. What are Dask and Ray?

Dask is a library used to handle large amounts of data. It is designed to feel familiar to users of pandas, NumPy, or scikit-learn. Dask breaks data and tasks into smaller parts and runs them in parallel. This makes it an excellent fit for data scientists who want to scale up their data analysis without learning many new concepts.

Ray is another popular tool that helps you build and run distributed applications. It is particularly strong in machine learning and AI workloads.

Ray also has additional libraries on top of it, such as:

Ray Tune, for tuning hyperparameters in machine learning
Ray Train, for training models on multiple GPUs
Ray Serve, for deploying models as web services

Ray is a good choice if you want to build scalable machine learning pipelines or deploy AI applications that need to run complex tasks in parallel.

. Feature comparison

A structured comparison of Dask and Ray based on core attributes:

Feature	Dask	Ray
Primary abstraction	DataFrames, arrays, delayed tasks	Remote functions, actors
Best for	Scalable data processing, machine learning pipelines	Distributed machine learning training, tuning, and serving
Ease of use	High for pandas/NumPy users	Moderate, more boilerplate
Ecosystem	Integrates with `scikit-learn`, XGBoost	Built-in libraries: Tune, Serve, RLlib
Scalability	Very good for batch processing	Excellent, more control and flexibility
Scheduling	Work-stealing scheduler	Dynamic, actor-based scheduler
Cluster management	Native, or via Kubernetes or YARN	Ray Dashboard, Kubernetes, AWS, GCP
Community/maturity	Older, mature, widely adopted	Fast-growing, strong machine learning support

. When to use what?

Choose Dask if you:

Use pandas/NumPy and want to scale
Process tabular or array-like data
Perform batch ETL or feature engineering
Need DataFrame or array abstractions with lazy execution
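Lazy execution means Dask records what should be done and only runs it when asked. A minimal sketch using `dask.delayed`, assuming Dask is installed; the `load`, `clean`, and `total` functions are hypothetical stand-ins for real ETL steps.

```python
from dask import delayed

@delayed
def load(n):
    # Stand-in for reading one chunk of data.
    return list(range(n))

@delayed
def clean(values):
    # Stand-in for a cleaning step: keep only even numbers.
    return [v for v in values if v % 2 == 0]

@delayed
def total(parts):
    # Combine the cleaned chunks into a single number.
    return sum(sum(p) for p in parts)

# Nothing runs here; each call only adds a node to the task graph.
parts = [clean(load(n)) for n in (4, 6, 8)]
result = total(parts)

# compute() executes the graph; independent load/clean chains can run in parallel.
answer = result.compute()
print(answer)
```

Because the whole graph is known before anything runs, Dask can schedule the three load-and-clean chains independently and combine them only at the end.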

Choose Ray if you:

Need to run many independent Python functions in parallel
Want to build machine learning pipelines, serve models, or manage long-running tasks
Need microservice-like scaling with stateful tasks

. Ecosystem tools

Both libraries offer or support a range of tools covering the data science life cycle, but with different emphasis:

Task	Dask	Ray
DataFrames	`dask.dataframe`	Modin (built on Ray or Dask)
Arrays	`dask.array`	No native support, rely on NumPy
Hyperparameter tuning	Manual or with Dask-ML	Ray Tune (advanced features)
Machine learning pipelines	`dask-ml`, custom workflows	Ray Train, Ray Tune, Ray AIR
Model serving	Custom Flask/FastAPI setup	Ray Serve
Reinforcement learning	Not supported	RLlib
Dashboard	Built-in, very detailed	Built-in, simplified

. Real-world scenarios

!! Large-scale data cleaning and feature engineering

Use Dask.

Why? Dask integrates smoothly with pandas and NumPy, and many data teams already use these tools. If your dataset is too big to fit in memory, Dask can split it into smaller parts and process those parts in parallel. This helps with tasks such as cleaning data and creating new features.

Example:

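import dask.dataframe as dd
import numpy as np

# Lazily read every matching CSV from S3 (S3 access requires the s3fs package)
df = dd.read_csv('s3://data/large-dataset-*.csv')
# Keep only rows where amount is greater than 100
df = df[df['amount'] > 100]
# Apply a log transform, one partition at a time
df['log_amount'] = df['amount'].map_partitions(np.log)
# Writing the output triggers the computation
df.to_parquet('s3://processed/output/')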

This code reads several large CSV files from an S3 bucket in parallel using Dask. It filters rows where the amount column is greater than 100, applies a log transformation, and saves the result as Parquet files.

!! Parallel hyperparameter tuning for machine learning models

Use Ray.

Why? Ray Tune is well suited to trying different settings while training machine learning models. It integrates with tools like PyTorch and XGBoost, and it can stop bad runs early to save time.

Example:

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    # Model training logic here; it should report an "accuracy" metric
    ...

tune.run(
    train_fn,
    # Run one trial per learning rate
    config={"lr": tune.grid_search([0.01, 0.001, 0.0001])},
    # ASHA stops poorly performing trials early
    scheduler=ASHAScheduler(metric="accuracy", mode="max")
)

This code defines a training function and uses Ray Tune to test different learning rates in parallel. It automatically schedules the trials and finds the best configuration, using the ASHA scheduler to stop weak trials early.
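The early-stopping idea can be sketched in plain Python. The code below is not Ray Tune's implementation: it is a simplified, synchronous version of successive halving, the strategy that ASHA runs asynchronously. The `toy_accuracy` function is a hypothetical stand-in for real training.

```python
import math

def toy_accuracy(lr, epochs):
    """Hypothetical stand-in for training: lr near 0.001 and more epochs score higher."""
    closeness = 1.0 / (1.0 + abs(math.log10(lr) - math.log10(0.001)))
    return closeness * (1.0 - 0.5 ** epochs)

def successive_halving(configs, min_epochs=1, reduction_factor=2):
    """Each round, keep the top 1/reduction_factor of configs and grow the epoch budget."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        # Evaluate every surviving config with the current (small) budget.
        ranked = sorted(
            survivors, key=lambda lr: toy_accuracy(lr, epochs), reverse=True
        )
        # Drop the weaker configs so they stop consuming compute.
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        # Give the remaining configs a larger budget in the next round.
        epochs *= reduction_factor
    return survivors[0]

best_lr = successive_halving([0.1, 0.01, 0.001, 0.0001])
print(best_lr)
```

Weak configurations are dropped after a short budget, so most of the compute goes to the promising ones.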

!! Distributed array computation

Use Dask.

Why? Dask arrays are helpful when working with large sets of numbers. They split an array into blocks and process the blocks in parallel.

Example:

import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x.mean(axis=0).compute()

This code creates a large random array divided into chunks that can be processed in parallel. It then calculates the mean of each column using Dask's parallel computing power.
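To see how chunking works on a smaller scale, here is a minimal sketch, assuming Dask is installed. The array shape and chunk sizes are arbitrary illustrative choices.

```python
import dask.array as da

# An 8x6 array of ones, stored as 4x3 blocks.
x = da.ones((8, 6), chunks=(4, 3))

# numblocks reports how many blocks exist along each axis.
print(x.numblocks)

# The column mean is computed per block, then the partial results are combined.
col_means = x.mean(axis=0)
result = col_means.compute()
print(result.shape)
```

Chunk size is a trade-off: very small chunks add scheduling overhead, while very large chunks limit parallelism and may not fit in memory.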

!! Building an end-to-end machine learning service

Use Ray.

Why? Ray is designed not only for model training but also for serving and life cycle management. With Ray Serve, you can deploy models in production, run preprocessing logic in parallel, and even scale stateful actors.

Example:

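from ray import serve

@serve.deployment
class ModelDeployment:
    def __init__(self):
        # load_model() is a placeholder for your own model-loading code
        self.model = load_model()

    def __call__(self, request_body):
        data = request_body
        # Predict on a single-item batch and return the first result
        return self.model.predict([data])[0]

serve.run(ModelDeployment.bind())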

This code defines a class that loads a machine learning model and exposes it through an API using Ray Serve. The class receives a request, makes a prediction with the model, and returns the result.

. Final recommendations

Use case	Recommended tool
Scalable data analysis (pandas-style)	Dask
Large-scale machine learning training	Ray
Hyperparameter optimization	Ray
Out-of-core DataFrame computation	Dask
Real-time machine learning model serving	Ray
Custom pipelines with high parallelism	Ray
Integration with the PyData stack	Dask

. Conclusion

Ray and Dask are both tools that help data scientists handle large amounts of data and run programs faster. Ray is good for tasks that need a lot of flexibility, such as machine learning projects. Dask is useful if you want to work with large datasets using tools similar to pandas or NumPy.

Which one you choose depends on what your project needs and the type of data you have. It is a good idea to try both on small examples to see which one fits your work better.

Jayita Gulati is a machine learning enthusiast and technical writer driven by a passion for building machine learning models, and holds a master's degree in computer science from the University of Liverpool.
