

Photo by Author | Ideogram
As data scientists, we often handle large datasets or complex models that take a long time to run. To save time and get results faster, we use tools that perform tasks simultaneously or across multiple machines. Two popular Python libraries for this are Ray and Dask. Both help speed up data processing and model training, but they are suited to different kinds of tasks.
In this article, we will explain what Dask and Ray are and when to choose each one.
# What Are Dask and Ray?
Dask is a library for handling large amounts of data. It is designed to feel familiar to users of pandas, NumPy, or scikit-learn. Dask breaks data and tasks into smaller parts and runs them in parallel. This makes it an excellent fit for data scientists who want to scale up their data analysis without learning many new concepts.
Ray is a more general tool that helps you build and run distributed applications. It is particularly strong in machine learning and AI workloads.
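The idea of breaking work into smaller parts and running them in parallel can be sketched with the Python standard library alone. This is not Dask's API or implementation; the function names `split_into_chunks` and `parallel_sum_of_squares`, the chunk size, and the worker count are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def parallel_sum_of_squares(values, chunk_size=1000, max_workers=4):
    """Sum the squares of values by reducing each chunk independently."""
    chunks = split_into_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk is reduced on its own; no chunk needs data from another.
        partial_sums = list(pool.map(lambda chunk: sum(v * v for v in chunk), chunks))
    # Combine the small per-chunk results into the final answer.
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(10_000)))
```

Threads keep the sketch short, but in CPython they will not speed up CPU-bound pure-Python work. Dask applies the same split, apply, and combine pattern and lets you run it on threads, processes, or a cluster.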
Ray also has additional libraries on top of it, such as:
- Ray Tune for tuning hyperparameters in machine learning
- Ray Train for training models on multiple GPUs
- Ray Serve for deploying models as web services
Ray is a good fit if you want to build scalable machine learning pipelines or deploy AI applications that need to run complex tasks in parallel.
# Feature Comparison
A structured comparison of the core attributes of Dask and Ray:
| Feature | Dask | Ray |
|---|---|---|
| Primary abstraction | DataFrames, arrays, delayed tasks | Remote functions, actors |
| Best for | Scalable data processing, machine learning pipelines | Distributed machine learning training, tuning, and serving |
| Ease of use | High for pandas/NumPy users | Moderate, more boilerplate |
| Ecosystem | Integrates with scikit-learn, XGBoost | Built-in libraries: Tune, Serve, RLlib |
| Scalability | Very good for batch processing | Excellent, more control and flexibility |
| Scheduling | Work-stealing scheduler | Dynamic, actor-based scheduler |
| Cluster management | Native or via Kubernetes, YARN | Ray Dashboard, Kubernetes, AWS, GCP |
| Community/maturity | Older, mature, widely adopted | Fast-growing, strong machine learning support |
# When to Use Which?
Choose Dask if you:
- Use pandas or NumPy and want to scale
- Process tabular or array-like data
- Perform batch ETL or feature engineering
- Need DataFrame or array abstractions with lazy execution
Choose Ray if you:
- Need to run many independent Python functions in parallel
- Want to build machine learning pipelines, serve models, or manage long-running tasks
- Need microservice-like scaling with stateful tasks
# Ecosystem Tools
Both libraries offer or support a range of tools that cover the data science lifecycle, but with different emphases:
| Task | Dask | Ray |
|---|---|---|
| DataFrames | dask.dataframe | Modin (built on Ray or Dask) |
| Arrays | dask.array | No native support, relies on NumPy |
| Hyperparameter tuning | Manual or with Dask-ML | Ray Tune (advanced features) |
| Machine learning pipelines | dask-ml, custom workflows | Ray Train, Ray Tune, Ray AIR |
| Model serving | Custom Flask/FastAPI setup | Ray Serve |
| Reinforcement learning | Not supported | RLlib |
| Dashboard | Built-in, very detailed | Built-in, simplified |
# Real-World Scenarios
## Large-Scale Data Cleaning and Feature Engineering
Use Dask.
Why? Dask integrates well with pandas and NumPy, which many data teams already use. If your dataset is too big to fit in memory, Dask can split it into smaller parts and process those parts in parallel. This helps with tasks such as cleaning data and creating new features.
Example:
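```python
import dask.dataframe as dd
import numpy as np

# Lazily read all matching CSV files from S3 as one Dask DataFrame
df = dd.read_csv('s3://data/large-dataset-*.csv')
# Keep only rows where amount is greater than 100
df = df[df['amount'] > 100]
# Apply the log transform partition by partition
df['log_amount'] = df['amount'].map_partitions(np.log)
# Writing the Parquet files triggers the actual computation
df.to_parquet('s3://processed/output/')
```

This code reads several large CSV files from an S3 bucket in parallel using Dask. It filters rows where the `amount` column is greater than 100, applies a log transform, and saves the result as Parquet files.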
## Parallel Hyperparameter Tuning for Machine Learning Models
Use Ray.
Why? Ray Tune is great for trying different settings while training machine learning models. It integrates with tools like PyTorch and XGBoost, and it can stop bad runs early to save time.
Example:
```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    # Model training logic here; it should report an "accuracy" metric
    ...

tune.run(
    train_fn,
    config={"lr": tune.grid_search([0.01, 0.001, 0.0001])},
    scheduler=ASHAScheduler(metric="accuracy", mode="max")
)
```

This code defines a training function and uses Ray Tune to test different learning rates in parallel. The ASHA scheduler evaluates the trials as they run and stops poorly performing ones early, so resources go to the most promising configuration. For the scheduler to work, the training function has to report an `accuracy` metric back to Ray Tune.
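ASHA is an asynchronous variant of successive halving. The early-stopping idea behind it can be illustrated with a simplified, synchronous version in plain Python. This is a sketch only and not how Ray Tune implements its scheduler; `successive_halving`, `toy_accuracy`, and the scoring formula are hypothetical stand-ins for a real training loop.

```python
def successive_halving(configs, evaluate, min_budget=1, reduction_factor=2):
    """Keep the best 1/reduction_factor of configs each round, raising the budget."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving config at the current training budget.
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget), reverse=True)
        keep = max(1, len(scored) // reduction_factor)
        # Poor performers are dropped before they consume a larger budget.
        survivors = scored[:keep]
        budget *= reduction_factor
    return survivors[0]


def toy_accuracy(config, budget):
    """Hypothetical score: highest near lr=0.01 and improving with budget."""
    closeness = 1.0 / (1.0 + abs(config["lr"] - 0.01) * 100)
    return closeness * (1.0 - 1.0 / (budget + 1))


candidates = [{"lr": lr} for lr in (0.1, 0.01, 0.001, 0.0001)]
best = successive_halving(candidates, toy_accuracy)
```

With four candidates and a reduction factor of 2, only two configurations are evaluated at the second, larger budget, which is where the time savings come from.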
## Distributed Array Computation
Use Dask.
Why? Dask arrays are helpful when working with large sets of numbers. They split the array into blocks and process the blocks in parallel.
Example:
```python
import dask.array as da

# 10,000 x 10,000 random array stored as 1,000 x 1,000 chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
# Column means; compute() runs the chunked work in parallel
y = x.mean(axis=0).compute()
```

This code creates a large random array that is divided into chunks that can be processed in parallel. It then calculates the mean of each column using Dask's parallel computing.
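To see why a column mean can be computed block by block, here is a minimal plain-Python sketch. It is not Dask's implementation, it processes blocks sequentially rather than in parallel, and `column_means_blockwise` is a hypothetical name. The point is that each block contributes only a small partial result (column sums and a row count), so the full array never has to be reduced in one piece.

```python
def column_means_blockwise(rows, block_rows):
    """Compute per-column means by combining partial sums from row blocks."""
    n_cols = len(rows[0])
    total_sums = [0.0] * n_cols
    total_count = 0
    for start in range(0, len(rows), block_rows):
        block = rows[start:start + block_rows]
        # Partial result for this block: one sum per column.
        block_sums = [sum(row[c] for row in block) for c in range(n_cols)]
        total_sums = [t + b for t, b in zip(total_sums, block_sums)]
        total_count += len(block)
    # Final combine step: divide the accumulated sums by the total row count.
    return [s / total_count for s in total_sums]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
means = column_means_blockwise(data, block_rows=2)
```

Because the per-block sums are independent of each other, they could be computed on different workers and combined afterwards.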
## Building an End-to-End Machine Learning Service
Use Ray.
Why? Ray is designed not only for model training but also for serving and lifecycle management. With Ray Serve, you can deploy models in production, run preprocessing logic in parallel, and even scale stateful actors.
Example:
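```python
from ray import serve

@serve.deployment
class ModelDeployment:
    def __init__(self):
        # load_model() is a placeholder for your own model-loading code
        self.model = load_model()

    def __call__(self, request_body):
        data = request_body
        # Wrap the input in a list to predict on a single sample
        return self.model.predict([data])[0]

serve.run(ModelDeployment.bind())
```

This code defines a class that loads a machine learning model and exposes it through an API using Ray Serve. The class receives a request, makes a prediction with the model, and returns the result. Note that `load_model` is not defined in the snippet and has to be supplied by you.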
# Final Recommendations
| Use case | Recommended tool |
|---|---|
| Scalable data analysis (pandas-style) | Dask |
| Large-scale machine learning training | Ray |
| Hyperparameter optimization | Ray |
| Out-of-core DataFrame computation | Dask |
| Real-time machine learning model serving | Ray |
| Custom pipelines with high parallelism | Ray |
| Integration with the PyData stack | Dask |
# Conclusion
Ray and Dask are both tools that help data scientists handle large amounts of data and run programs faster. Ray is good for tasks that need a lot of flexibility, such as machine learning projects. Dask is useful if you want to work with large datasets using tools similar to pandas or NumPy.
Which one you choose depends on what your project needs and the type of data you have. It is a good idea to try both on small examples to see which one fits your work better.
Jayita Gulati is a machine learning enthusiast and technical writer driven by a passion for building machine learning models. Gulati holds a master's degree in computer science from the University of Liverpool.